Least Squares Linear Regression

When $f$ is a linear function with respect to the parameters $\boldsymbol\beta$, the problem is referred to as linear least squares regression (or Ordinary Least Squares, OLS). The model can be written as a linear system.

\begin{displaymath}
y_i = \mathbf{x}_i \boldsymbol\beta + \varepsilon_i
\end{displaymath} (3.11)

where $\boldsymbol\beta$ are the unknown parameters to be estimated and $\varepsilon_i$ is zero-mean white Gaussian additive noise. The parameters $\boldsymbol\beta$ are the regression coefficients: they measure the association between the variable $\mathbf {x}$ and the variable $y$.

Each observation is a constraint, and all individual constraints can be collected in matrix form.

\begin{displaymath}
\mathbf{y} = \mathbf{X} \boldsymbol\beta + \boldsymbol\varepsilon
\end{displaymath} (3.12)

$\mathbf{y} \in \mathbb{R}^n$ is the vector of responses (dependent variables); the matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, which collects the independent variables (explanatory variables), is referred to as the design matrix; and $\boldsymbol\varepsilon$ is the vector of additive noise, with zero mean $\E[\boldsymbol\varepsilon]=\mathbf{0}$ and covariance $\boldsymbol\Sigma$. The parameter vector $\boldsymbol\beta$ is called the Linear Projection Coefficient or Linear Predictor. The random variable $\mathbf{y}$ is therefore composed of a deterministic part and a stochastic part.

The objective is to find the hyperplane $\boldsymbol\beta$ in $m$ dimensions that best fits the data $(\mathbf{y},\mathbf{X})$.

The value of $\boldsymbol\beta$ that minimizes the cost function defined in equation (3.6) is, in the case of observation noise with zero mean and constant variance across all samples, the minimum-variance linear unbiased estimator (the Best Linear Unbiased Estimator, BLUE).

Definition 9   The Best Linear Unbiased Estimator (BLUE) of a parameter $\boldsymbol\beta$ based on a dataset $Y$ is
  1. a linear function of $Y$, so that the estimator can be expressed as $\hat{\boldsymbol\beta} = \mathbf{A} Y$;
  2. unbiased, i.e. $\E [\mathbf{A} Y]=\boldsymbol\beta$;
  3. of minimum variance among all possible linear unbiased estimators.

The Gauss-Markov theorem states that, when the observation noise has constant variance (homoscedastic) and is uncorrelated across samples, the least squares estimator is the BLUE, i.e. it has minimum variance among all linear unbiased estimators.

The least squares estimate $\hat{\boldsymbol\beta}$, which minimizes the sum of the squared residuals, is the solution of the linear problem.

\begin{displaymath}
\hat{\boldsymbol\beta} = \argmin_{\mathbf{b}} \Vert \boldsymbol\varepsilon \Vert^2 = \argmin_{\mathbf{b}} \Vert \mathbf{y} - \mathbf{X} \mathbf{b} \Vert^2 = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}
\end{displaymath} (3.13)

The same result was already obtained in section [*] concerning the pseudoinverse of a matrix: an SVD decomposition of the matrix $\mathbf{X}$ also yields the best solution in terms of minimizing the propagation of computational errors.
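As an illustrative numerical sketch (using NumPy; the data and variable names are hypothetical, not taken from the text), the closed-form solution of equation (3.13) can be compared with the pseudoinverse route, which NumPy computes through an SVD:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, m parameters (illustrative values).
n, m = 100, 3
X = rng.normal(size=(n, m))                    # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # y = X beta + eps

# Normal equations: beta_hat = (X^T X)^{-1} X^T y   (eq. 3.13)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse, computed via SVD internally: numerically better conditioned.
beta_pinv = np.linalg.pinv(X) @ y

print(beta_normal, beta_pinv)   # the two estimates agree up to rounding
\end{verbatim}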

The matrix $\mathbf{P}$, defined as

\begin{displaymath}
\mathbf{P} = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} )^{-1} \mathbf{X}^{\top}
\end{displaymath} (3.14)

is a projection matrix that transforms the outputs (response vector) $\mathbf{y}$ into their estimate $\hat{\mathbf{y}}$ (the estimate of the observation without noise):
\begin{displaymath}
\mathbf{P}\mathbf{y} = \mathbf{X} \hat{\boldsymbol\beta} = \hat{\mathbf{y}}
\end{displaymath} (3.15)

Due to this property, $\mathbf{P}$ is referred to as the hat matrix.
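A minimal sketch (again with NumPy and hypothetical data) of the hat matrix of equation (3.14) and of the properties that justify its name:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                     # illustrative design matrix
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=20)

# Hat matrix P = X (X^T X)^{-1} X^T   (eq. 3.14)
P = X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimate
assert np.allclose(P @ y, X @ beta_hat)          # P y = X beta_hat = y_hat  (eq. 3.15)
assert np.allclose(P @ P, P)                     # idempotent: P is a projection
assert np.allclose(P, P.T)                       # symmetric
\end{verbatim}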

In the case of noise with non-constant variance among the observed samples (heteroscedastic), the BLUE choice is weighted least squares regression, with weights

\begin{displaymath}
w_i = \frac{1}{\sigma_i}
\end{displaymath} (3.16)

where the weights $w_i > 0$ account for the uncertainty of each observation $y_i$, such that $1/w_i$ is the standard deviation of the $i$-th measurement. Collecting the weights $w_i$ into a diagonal matrix $\mathbf{W}$ yields a new linear system in which every row effectively has the same observation variance. The solution that minimizes the weighted residual can always be expressed as
\begin{displaymath}
\hat{\boldsymbol\beta} = (\mathbf{W}\mathbf{X})^{+} \mathbf{W} \mathbf{y}
\end{displaymath} (3.17)

with $\mathbf{W}^{\top}\mathbf{W}=\boldsymbol\Sigma^{-1}$, i.e. $\mathbf{W}=\boldsymbol\Sigma^{-1/2}$ for diagonal $\boldsymbol\Sigma$.
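A minimal sketch of the weighted formulation (hypothetical data; the weights follow equation (3.16), so that $\mathbf{W}$ plays the role of $\boldsymbol\Sigma^{-1/2}$ for diagonal $\boldsymbol\Sigma$):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 2
X = rng.normal(size=(n, m))
sigma = rng.uniform(0.1, 2.0, size=n)              # per-sample standard deviations
beta_true = np.array([3.0, -0.5])
y = X @ beta_true + sigma * rng.normal(size=n)     # heteroscedastic noise

W = np.diag(1.0 / sigma)                           # w_i = 1 / sigma_i   (eq. 3.16)

# Weighted least squares: beta_hat = (W X)^+ W y   (eq. 3.17)
beta_wls = np.linalg.pinv(W @ X) @ (W @ y)

# Equivalent normal-equation form with Sigma = diag(sigma_i^2):
Sigma_inv = np.diag(1.0 / sigma**2)
beta_check = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
assert np.allclose(beta_wls, beta_check)
\end{verbatim}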

Generalizing further, when the noise has non-constant variance and is correlated across the observed samples, the BLUE estimate in the linear case must take into account the noise covariance $\boldsymbol\Sigma$:

\begin{displaymath}
\hat{\boldsymbol\beta} = (\mathbf{X}^{\top} \boldsymbol\Sigma^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top}\boldsymbol\Sigma^{-1}\mathbf{y}
\end{displaymath} (3.18)

This estimator is referred to as Generalized Least Squares (GLS).

This estimator attains the minimum variance

\begin{displaymath}
Var[\hat{\boldsymbol\beta}_{GLS}] = (\mathbf{X}^{\top} \boldsymbol\Sigma^{-1} \mathbf{X})^{-1}
\end{displaymath} (3.19)
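A minimal sketch of the GLS estimate of equation (3.18) and of its covariance (3.19), assuming the noise covariance $\boldsymbol\Sigma$ is known (all data are hypothetical):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 2
X = rng.normal(size=(n, m))

# Illustrative full noise covariance: correlated and heteroscedastic.
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)                    # symmetric positive definite

beta_true = np.array([1.5, -2.0])
L = np.linalg.cholesky(Sigma)
y = X @ beta_true + L @ rng.normal(size=n)         # noise with covariance Sigma

Sigma_inv = np.linalg.inv(Sigma)

# GLS estimate: (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y   (eq. 3.18)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

# Covariance of the estimate: (X^T Sigma^{-1} X)^{-1}       (eq. 3.19)
cov_gls = np.linalg.inv(X.T @ Sigma_inv @ X)
print(beta_gls, np.sqrt(np.diag(cov_gls)))
\end{verbatim}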
