Least Squares Regression

Let's start by looking at the most common situation in real-world applications: measurements corrupted by additive random noise with a Gaussian (bell-shaped) distribution.

We can describe this with the equation:

\begin{displaymath}
y = f(\mathbf{x}, \boldsymbol\beta) + \varepsilon
\end{displaymath} (3.2)

Here, $f(\mathbf{x}, \boldsymbol\beta)$ is a function, usually nonlinear, that depends on some parameters $\boldsymbol\beta$ and input values $\mathbf{x}$. The term $\varepsilon$ represents the added noise, which is assumed to be Gaussian with zero mean and a certain variance $\sigma^2$.

To estimate the parameters reliably, we need a large number of input samples, written as $\mathbf{x}=\left\{ \mathbf{x}_1 \ldots \mathbf{x}_n \right\}$. Ideally, the number of samples should be much greater than the number of parameters we are trying to estimate.

The function need not be the same for all samples: there can be different functions, in fact observing different quantities, all depending on the same parameters $\boldsymbol\beta$. In this case, equation (3.2) can be generalized as follows:

\begin{displaymath}
y_i = f_i(\boldsymbol\beta) + \varepsilon_i
\end{displaymath} (3.3)

where the subscript $i$ denotes both the type of function and the i-th input sample, which effectively enters the function as a constant parameter.

The vector $\mathbf{r}$ is now introduced: its i-th component, defined as

\begin{displaymath}
r_i = y_i - f_i(\boldsymbol\beta)
\end{displaymath} (3.4)

is the residual associated with the i-th observation (or the i-th function). $r_i$ is a function of $\boldsymbol\beta$ just as $f_i$ is, and it shares its derivatives (up to a sign, with this formalism).

To obtain a maximum likelihood estimator, the quantity to minimize is the negative log likelihood (section 2.8) of the function (3.2). In the case of Gaussian noise, the likelihood function can be expressed as

\begin{displaymath}
\mathcal{L}(r_i \vert \boldsymbol\beta, \sigma) = \frac{1}{\sqrt{2 \pi \sigma_i^2}} e^{ - \frac{r_i^2}{2\sigma_i^2} }
\end{displaymath} (3.5)

for independent observations. By applying the definition of negative log likelihood to the likelihood function, it follows that in the case of Gaussian noise, the maximum likelihood estimator is equivalent to the least squares method.
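
To make the equivalence explicit, consider $n$ independent observations and take the negative logarithm of the product of the likelihoods in (3.5); a short sketch, assuming for simplicity a constant standard deviation $\sigma_i = \sigma$:

\begin{displaymath}
-\log \prod_{i=1}^{n} \mathcal{L}(r_i \vert \boldsymbol\beta, \sigma)
= \sum_{i=1}^{n} \left( \frac{r_i^2}{2\sigma^2} + \frac{1}{2}\log (2 \pi \sigma^2) \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} r_i^2 + \mathrm{const}
\end{displaymath}

Since neither the additive constant nor the positive factor $1/(2\sigma^2)$ depends on $\boldsymbol\beta$, minimizing the negative log likelihood amounts to minimizing the sum of squared residuals.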

Least squares regression is a standard optimization technique for overdetermined systems: it identifies the parameters $\boldsymbol\beta=(\beta_1, \ldots, \beta_m)$ of a function $f(\mathbf{x},\boldsymbol\beta): \mathbb{R}^{m} \mapsto \mathbb{R}^{n}$ that minimize an error $S$, computed as the sum of squares (Sum of Squared Errors) of the residuals $r_i$ over a set of $n$ observations $y_1 \ldots y_n$:

\begin{displaymath}
S(\boldsymbol\beta) = SSE(\boldsymbol\beta) = \mathbf{r} \cdot \mathbf{r} = \sum_{i=1}^{n} { \Vert y_i - f_i(\boldsymbol\beta) \Vert^2 }
\end{displaymath} (3.6)

$S(\boldsymbol\beta)$ is defined as the residual sum of squares or alternatively as the expected squared error.

$S:\mathbb{R}^{m} \mapsto \mathbb{R}$ is a function that is analyzed by varying the parameters $\boldsymbol\beta \in \mathbb{R}^{m}$ to find its minimum value.

\begin{displaymath}
\boldsymbol\beta^{+} = \argmin_{\boldsymbol\beta} S(\boldsymbol\beta)
\end{displaymath} (3.7)

For this reason, it is referred to as the objective function or cost function. A minimum obtained through a procedure such as that described by equation (3.7) is defined as a global minimum.

A global minimum is difficult to identify from a purely computational standpoint; in practice, the available techniques can typically only locate local minima.
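
As an illustration, the following is a minimal sketch that fits a hypothetical exponential model $f(x, \boldsymbol\beta) = \beta_1 e^{-\beta_2 x}$ by minimizing the sum of squared residuals with scipy.optimize.least_squares; the model, data, and starting point are assumptions chosen purely for demonstration, and the solver only guarantees a local minimum.

\begin{verbatim}
# Minimal least squares sketch (illustrative model, not from the text):
# estimate beta = (beta1, beta2) of f(x, beta) = beta1 * exp(-beta2 * x)
# from noisy observations y by minimizing the sum of squared residuals.
import numpy as np
from scipy.optimize import least_squares

def model(beta, x):
    return beta[0] * np.exp(-beta[1] * x)

def residuals(beta, x, y):
    # r_i = y_i - f_i(beta), as in equation (3.4)
    return y - model(beta, x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
beta_true = np.array([2.0, 0.8])
y = model(beta_true, x) + rng.normal(0.0, 0.1, size=x.size)  # additive Gaussian noise

beta0 = np.array([1.0, 1.0])                  # initial guess
result = least_squares(residuals, beta0, args=(x, y))
print(result.x)                               # estimated parameters (a local minimum of S)
\end{verbatim}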

Let us therefore assume that $S(\boldsymbol\beta)$ is differentiable (footnote 3.1), that is, that $f$ is differentiable. The necessary condition for $\boldsymbol\beta$ to be a minimum is that, at that point in the parameter space, the gradient of $S(\boldsymbol\beta)$ vanishes, that is,

\begin{displaymath}
\frac{\partial S(\boldsymbol\beta)}{\partial \beta_j} = 2\, \mathbf{r} \cdot \frac{\partial \mathbf{r}}{\partial \beta_j} = 2 \sum_{i=1}^{n} r_i \frac{\partial r_i(\boldsymbol\beta)}{\partial \beta_j} = 0 \qquad j=1,\ldots,m
\end{displaymath} (3.8)

A sufficient condition for a stationary point ($S'(\boldsymbol\beta)=0$) to be a minimum is that the Hessian $S''(\boldsymbol\beta)$ is positive definite. Clearly, a local minimum only guarantees the existence of a neighborhood of $\boldsymbol\beta$ such that $S(\boldsymbol\beta + \boldsymbol\delta) \geq S(\boldsymbol\beta)$ for every perturbation $\boldsymbol\delta$ within it.
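
For concreteness, differentiating (3.8) once more gives the entries of the Hessian explicitly; this is a standard expansion, sketched here only to make the condition above tangible:

\begin{displaymath}
\frac{\partial^2 S(\boldsymbol\beta)}{\partial \beta_j \partial \beta_k}
= 2 \sum_{i=1}^{n} \left( \frac{\partial r_i}{\partial \beta_j} \frac{\partial r_i}{\partial \beta_k} + r_i \frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k} \right)
\qquad j,k=1,\ldots,m
\end{displaymath}

At a stationary point, this matrix must be positive definite for the point to be a local minimum.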

Everything discussed so far assumes that the noise $\varepsilon$ is additive with constant variance across all samples (homoscedasticity). When the measurement noise is still Gaussian and additive with zero mean but with non-constant variance, each individual observation $y_i$ is an independent random variable with its own variance $\sigma^{2}_i$. Intuitively, the optimal regression in this case must give more weight to samples with low variance and less weight to samples with high variance.

To achieve this, a normalization is employed, similar to that shown in section 2.4.1 and a direct consequence of the likelihood in equation (3.5): instead of the plain sum of squared residuals, one minimizes the weighted sum of squared residuals:

\begin{displaymath}
\chi^{2} = \sum^{n}_{i=1} \frac{ \Vert r_i \Vert^2 } { \sigma_i^2 }
\end{displaymath} (3.9)

The cost function, now a sum of squared random variables with unit variance, follows a chi-squared distribution and is therefore referred to as $\chi^{2}$. When the variance is constant, its minimum coincides with the one obtained previously with ordinary least squares. The condition (3.8) for obtaining the minimum is also modified accordingly:

\begin{displaymath}
\sum_{i=1}^{n} \frac{r_i}{\sigma_i^2} \frac{\partial f_i(\boldsymbol\beta)}{\partial \beta_j} = 0 \qquad j=1,\ldots,m
\end{displaymath} (3.10)
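
In practice, the weighting amounts to dividing each residual by its standard deviation before squaring. Below is a minimal sketch along the lines of the previous example, with per-sample standard deviations assumed known; the model and all numeric values are illustrative assumptions.

\begin{verbatim}
# Weighted least squares sketch: minimize chi^2 = sum (r_i / sigma_i)^2.
# The model and the sigma values are illustrative assumptions.
import numpy as np
from scipy.optimize import least_squares

def model(beta, x):
    return beta[0] * np.exp(-beta[1] * x)

def weighted_residuals(beta, x, y, sigma):
    # scaling r_i by 1/sigma_i yields unit-variance residuals
    return (y - model(beta, x)) / sigma

rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 50)
sigma = 0.05 + 0.2 * x / x.max()          # heteroscedastic noise levels
beta_true = np.array([2.0, 0.8])
y = model(beta_true, x) + rng.normal(0.0, sigma)

beta0 = np.array([1.0, 1.0])
result = least_squares(weighted_residuals, beta0, args=(x, y, sigma))
print(result.x)
\end{verbatim}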

Further generalizing this concept, when the observations are subject to Gaussian noise with a known covariance matrix $\boldsymbol\Sigma$, the Weighted Sum of Squared Errors (WSSE) can ultimately be expressed as

\begin{displaymath}
WSSE(\boldsymbol\beta) = \mathbf{r}^{\top} \boldsymbol\Sigma^{-1} \mathbf{r}
\end{displaymath} (3.11)

It is noteworthy that this formulation of the cost function is equivalent to that of equation (3.6), where, however, the Mahalanobis distance is used instead of the Euclidean distance (section 2.4).

Any weighted least squares problem can be reduced to an unweighted one ($\boldsymbol\Sigma = \mathbf{I}$) by premultiplying the residuals $\mathbf{r}$ (and consequently their derivatives) by a matrix $\mathbf{L}^{\top}$ such that $\boldsymbol\Sigma^{-1} = \mathbf{L} \mathbf{L}^{\top}$, obtained for instance through a Cholesky decomposition when the matrix is not diagonal.
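
A minimal numerical sketch of this whitening step follows; the covariance matrix is an arbitrary illustrative example, and numpy.linalg.cholesky returns the lower-triangular factor $\mathbf{L}$ such that its input equals $\mathbf{L}\mathbf{L}^{\top}$.

\begin{verbatim}
# Whitening sketch: reduce a weighted problem to an unweighted one.
# Sigma is an illustrative covariance matrix, not taken from the text.
import numpy as np

def whiten(residuals, Sigma):
    # factor Sigma^{-1} = L L^T (L lower triangular, via Cholesky)
    L = np.linalg.cholesky(np.linalg.inv(Sigma))
    # premultiplying by L^T gives unit-covariance residuals:
    # ||L^T r||^2 = r^T Sigma^{-1} r (the squared Mahalanobis distance)
    return L.T @ residuals

r = np.array([0.3, -0.1, 0.2])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
r_w = whiten(r, Sigma)
print(np.dot(r_w, r_w))              # equals r^T Sigma^{-1} r
print(r @ np.linalg.inv(Sigma) @ r)  # same value, computed directly
\end{verbatim}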

All these estimators, which take the variance of the observations into account, coincide with the maximum likelihood estimator obtained from the negative log likelihood for the variable $\mathbf{y}$ perturbed by Gaussian noise with zero mean and covariance $\boldsymbol\Sigma$.



Footnotes

3.1
In the literature, the function $S$ is often defined with a scale factor of $1/2$ so that the gradient of $S$ is not biased by the factor $2$ and its sign aligns with that of $f$, simplifying the notation.

