Least Squares Linear Regression

When $f$ is a linear function with respect to the parameters $\boldsymbol\beta$, the problem is referred to as linear least squares regression (or Ordinary Least Squares, OLS). The model can be written as a linear system.

\begin{displaymath}
y_i = \mathbf{x}_i \boldsymbol\beta + \varepsilon_i
\end{displaymath} (3.11)

where $\boldsymbol\beta$ are the unknown parameters to be estimated and $\varepsilon_i$ is zero-mean white Gaussian additive noise. The parameters $\boldsymbol\beta$ are the regression coefficients: they measure the association between the variable $\mathbf {x}$ and the variable $y$.

Each observation is a constraint, and all individual constraints can be collected in matrix form.

\begin{displaymath}
\mathbf{y} = \mathbf{X} \boldsymbol\beta + \boldsymbol\varepsilon
\end{displaymath} (3.12)

$\mathbf{y} \in \mathbb{R}^n$ is the vector of responses (dependent variables); the matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, which collects the independent variables (explanatory variables), is referred to as the design matrix; and $\boldsymbol\varepsilon$ is the vector of additive noise, with zero mean $\E[\boldsymbol\varepsilon]=\mathbf{0}$ and covariance $\boldsymbol\Sigma$. The parameter vector $\boldsymbol\beta$ is called the Linear Projection Coefficient or Linear Predictor. The random variable $\mathbf{y}$ is therefore composed of a deterministic part and a stochastic part.

The objective is to find the hyperplane $\boldsymbol\beta$ in $m$ dimensions that best fits the data $(\mathbf{y},\mathbf{X})$.

The value of $\boldsymbol\beta$ that minimizes the cost function defined in equation (3.6) is, in the case of observation noise with zero mean and constant variance across all samples, the minimum-variance linear unbiased estimator (the Best Linear Unbiased Estimator, BLUE).

Definition 9   The Best Linear Unbiased Estimator (BLUE) of a parameter $\boldsymbol\beta$ based on a dataset $Y$ is
  1. a linear function of $Y$, so that the estimator can be expressed as $\hat{\boldsymbol\beta} = \mathbf{A} Y$;
  2. unbiased, i.e. $\E [\mathbf{A} Y]=\boldsymbol\beta$;
  3. of minimum variance among all possible linear unbiased estimators.

The Gauss-Markov theorem states that, when the observation noise has constant variance (homoscedastic) and is uncorrelated across samples, the least squares estimator is the BLUE, i.e. it has minimum variance among all linear unbiased estimators.

The least squares estimate $\hat{\boldsymbol\beta}$, which minimizes the sum of the squared residuals, is the solution of the linear problem.

\begin{displaymath}
\hat{\boldsymbol\beta} = \argmin_{\mathbf{b}} \Vert \boldsymbol\varepsilon \Vert^2 = \argmin_{\mathbf{b}} \Vert \mathbf{y} - \mathbf{X} \mathbf{b} \Vert^2 = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}
\end{displaymath} (3.13)

The same result was already obtained in section [*] concerning the pseudoinverse of a matrix: an SVD decomposition of the matrix $\mathbf{X}$ also yields the best solution in terms of minimizing the propagation of computational errors.
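As an illustrative numerical sketch (using NumPy; the data and variable names are hypothetical, not taken from the text), the closed-form solution of equation (3.13) can be compared with the pseudoinverse route, which NumPy computes through an SVD:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, m parameters (illustrative values).
n, m = 100, 3
X = rng.normal(size=(n, m))                    # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # y = X beta + eps

# Normal equations: beta_hat = (X^T X)^{-1} X^T y   (eq. 3.13)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse, computed via SVD internally: numerically better conditioned.
beta_pinv = np.linalg.pinv(X) @ y

print(beta_normal, beta_pinv)   # the two estimates agree up to rounding
\end{verbatim}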

The matrix $\mathbf{P}$, defined as

\begin{displaymath}
\mathbf{P} = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} )^{-1} \mathbf{X}^{\top}
\end{displaymath} (3.14)

is a projection matrix that transforms the outputs (response vector) $\mathbf{y}$ into their estimate $\hat{\mathbf{y}}$ (the estimate of the observation without noise):
\begin{displaymath}
\mathbf{P}\mathbf{y} = \mathbf{X} \hat{\boldsymbol\beta} = \hat{\mathbf{y}}
\end{displaymath} (3.15)

Due to this property, $\mathbf{P}$ is referred to as the hat matrix.
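A minimal sketch (again with NumPy and hypothetical data) of the hat matrix of equation (3.14) and of the properties that justify its name:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                     # illustrative design matrix
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=20)

# Hat matrix P = X (X^T X)^{-1} X^T   (eq. 3.14)
P = X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimate
assert np.allclose(P @ y, X @ beta_hat)          # P y = X beta_hat = y_hat  (eq. 3.15)
assert np.allclose(P @ P, P)                     # idempotent: P is a projection
assert np.allclose(P, P.T)                       # symmetric
\end{verbatim}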

In the case of noise with non-constant variance among the observed samples (heteroscedastic), the BLUE choice is weighted least squares regression, with weights

\begin{displaymath}
w_i = \frac{1}{\sigma_i}
\end{displaymath} (3.16)

where the weights $w_i > 0$ account for the uncertainty of each observation $y_i$, such that $1/w_i$ is the standard deviation of the $i$-th measurement. Collecting the weights $w_i$ into a diagonal matrix $\mathbf{W}$ yields a new linear system in which every row effectively has the same observation variance. The solution that minimizes the weighted residual can always be expressed as
\begin{displaymath}
\hat{\boldsymbol\beta} = (\mathbf{W}\mathbf{X})^{+} \mathbf{W} \mathbf{y}
\end{displaymath} (3.17)

with $\mathbf{W}^{\top}\mathbf{W}=\boldsymbol\Sigma^{-1}$, i.e. $\mathbf{W}=\boldsymbol\Sigma^{-1/2}$ for diagonal $\boldsymbol\Sigma$.
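A minimal sketch of the weighted formulation (hypothetical data; the weights follow equation (3.16), so that $\mathbf{W}$ plays the role of $\boldsymbol\Sigma^{-1/2}$ for diagonal $\boldsymbol\Sigma$):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 2
X = rng.normal(size=(n, m))
sigma = rng.uniform(0.1, 2.0, size=n)              # per-sample standard deviations
beta_true = np.array([3.0, -0.5])
y = X @ beta_true + sigma * rng.normal(size=n)     # heteroscedastic noise

W = np.diag(1.0 / sigma)                           # w_i = 1 / sigma_i   (eq. 3.16)

# Weighted least squares: beta_hat = (W X)^+ W y   (eq. 3.17)
beta_wls = np.linalg.pinv(W @ X) @ (W @ y)

# Equivalent normal-equation form with Sigma = diag(sigma_i^2):
Sigma_inv = np.diag(1.0 / sigma**2)
beta_check = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
assert np.allclose(beta_wls, beta_check)
\end{verbatim}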

Generalizing further, when the noise has non-constant variance and is correlated across the observed samples, the BLUE estimate in the linear case must take into account the noise covariance $\boldsymbol\Sigma$:

\begin{displaymath}
\hat{\boldsymbol\beta} = (\mathbf{X}^{\top} \boldsymbol\Sigma^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top}\boldsymbol\Sigma^{-1}\mathbf{y}
\end{displaymath} (3.18)

This estimator is referred to as Generalized Least Squares (GLS).

This estimator attains the minimum variance

\begin{displaymath}
Var[\hat{\boldsymbol\beta}_{GLS}] = (\mathbf{X}^{\top} \boldsymbol\Sigma^{-1} \mathbf{X})^{-1}
\end{displaymath} (3.19)
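A minimal sketch of the GLS estimate of equation (3.18) and of its covariance (3.19), assuming the noise covariance $\boldsymbol\Sigma$ is known (all data are hypothetical):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 2
X = rng.normal(size=(n, m))

# Illustrative full noise covariance: correlated and heteroscedastic.
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)                    # symmetric positive definite

beta_true = np.array([1.5, -2.0])
L = np.linalg.cholesky(Sigma)
y = X @ beta_true + L @ rng.normal(size=n)         # noise with covariance Sigma

Sigma_inv = np.linalg.inv(Sigma)

# GLS estimate: (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y   (eq. 3.18)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

# Covariance of the estimate: (X^T Sigma^{-1} X)^{-1}       (eq. 3.19)
cov_gls = np.linalg.inv(X.T @ Sigma_inv @ X)
print(beta_gls, np.sqrt(np.diag(cov_gls)))
\end{verbatim}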
