Let us start with the most common situation in real-world applications: the measurements we collect are affected by additive random noise following a Gaussian (bell-shaped) distribution. We can describe this with the equation

$$\tilde{z}_i = f(\boldsymbol{\theta}) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \tag{3.2}$$

where $\boldsymbol{\theta}$ is the vector of parameters to be estimated, $\tilde{z}_i$ the $i$-th noisy observation, and $\epsilon_i$ the additive noise term.
To estimate the parameters reliably, a large number of input samples, denoted $n$, is needed; ideally, $n$ should be much greater than the number of parameters to be estimated.
The function of the parameters need not be the same for all samples: there may be different functions $f_i$, in fact observing different quantities, always as functions of the same parameters $\boldsymbol{\theta}$. In this case, equation (3.2) can be generalized as follows:

$$\tilde{z}_i = f_i(\boldsymbol{\theta}) + \epsilon_i, \qquad i = 1, \ldots, n \tag{3.3}$$
The residual vector $\mathbf{r}(\boldsymbol{\theta})$ is introduced, defined as

$$\mathbf{r}(\boldsymbol{\theta}) = \begin{pmatrix} \tilde{z}_1 - f_1(\boldsymbol{\theta}) \\ \vdots \\ \tilde{z}_n - f_n(\boldsymbol{\theta}) \end{pmatrix} \tag{3.4}$$
To obtain a maximum likelihood estimator, the quantity to minimize is the negative log likelihood (section 2.8) of the model (3.2). In the case of Gaussian noise, the likelihood function can be expressed as

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left( -\frac{\left(\tilde{z}_i - f_i(\boldsymbol{\theta})\right)^2}{2\sigma_i^2} \right) \tag{3.5}$$
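To make the link between the likelihood (3.5) and a sum-of-squares cost explicit, the following short derivation (written with the notation assumed above: independent, zero-mean Gaussian noise with per-sample variance $\sigma_i^2$) shows that maximizing the likelihood is the same as minimizing a sum of squared residuals, each normalized by its own variance:

$$-\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ \ln\!\left(\sqrt{2\pi}\,\sigma_i\right) + \frac{\left(\tilde{z}_i - f_i(\boldsymbol{\theta})\right)^2}{2\sigma_i^2} \right] = \text{const} + \frac{1}{2} \sum_{i=1}^{n} \frac{\left(\tilde{z}_i - f_i(\boldsymbol{\theta})\right)^2}{\sigma_i^2}$$

Only the second term depends on $\boldsymbol{\theta}$; when the variance is the same for all samples it reduces, up to a constant factor, to the plain sum of squared residuals introduced below.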
Least squares regression is a standard optimization technique for overdetermined systems that identifies the parameters $\boldsymbol{\theta}$ of a function $f$ that minimize an error $S(\boldsymbol{\theta})$, calculated as the sum of squares (Sum of Squared Errors) of the residuals $r_i(\boldsymbol{\theta}) = \tilde{z}_i - f_i(\boldsymbol{\theta})$ over a set of $n$ observations $\tilde{z}_i$:

$$S(\boldsymbol{\theta}) = \sum_{i=1}^{n} r_i(\boldsymbol{\theta})^2 = \left\| \mathbf{r}(\boldsymbol{\theta}) \right\|^2 \tag{3.6}$$
$S(\boldsymbol{\theta})$ is a function that is analyzed by varying the parameters $\boldsymbol{\theta}$ in order to find its minimum value.
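As a purely illustrative sketch (the exponential model, the data, and the parameter values below are invented for the example and are not taken from the text), the following Python snippet minimizes the sum of squared residuals for a simple two-parameter model using `scipy.optimize.least_squares`; the solver only needs the residual function and squares and sums the residuals internally:

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical model (not from the text): z = a * exp(b * x),
# with parameters theta = (a, b) to be estimated.
def model(theta, x):
    a, b = theta
    return a * np.exp(b * x)

def residuals(theta, x, z):
    # r_i(theta) = z_i - f_i(theta): one residual per observation
    return z - model(theta, x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
true_theta = np.array([2.0, -1.3])
z = model(true_theta, x) + rng.normal(scale=0.05, size=x.size)  # additive Gaussian noise

# Minimize S(theta) = sum_i r_i(theta)^2 starting from a rough initial guess
fit = least_squares(residuals, x0=np.array([1.0, -1.0]), args=(x, z))
print("estimated parameters:", fit.x)
print("SSE at the minimum:  ", np.sum(fit.fun ** 2))
```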
Identifying a global minimum is challenging from a purely computational standpoint; typically, the available techniques can only locate local minima.
Let us therefore consider (3.1) differentiable, that is, assume that the cost function $S(\boldsymbol{\theta})$ is differentiable.
The necessary condition for $\hat{\boldsymbol{\theta}}$ to be a minimum is that, at that point in the parameter space, the gradient of $S$ vanishes, that is,

$$\nabla S(\boldsymbol{\theta}) \big|_{\hat{\boldsymbol{\theta}}} = \mathbf{0} \tag{3.8}$$

A sufficient condition for a stationary point ($\nabla S(\hat{\boldsymbol{\theta}}) = \mathbf{0}$) to be a minimum is that the Hessian $\nabla^2 S(\hat{\boldsymbol{\theta}})$ is positive definite. Clearly, the existence of a local minimum guarantees only that there exists a neighborhood of $\hat{\boldsymbol{\theta}}$ in which the function satisfies $S(\boldsymbol{\theta}) \geq S(\hat{\boldsymbol{\theta}})$.
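As a minimal numerical sketch of the two conditions just stated (the quadratic cost and its minimum below are invented purely for illustration), the gradient and the Hessian can be approximated by finite differences and checked at a candidate point:

```python
import numpy as np

def numerical_gradient(func, theta, eps=1e-6):
    # Central finite differences of a scalar function of the parameters
    g = np.zeros_like(theta, dtype=float)
    for k in range(theta.size):
        d = np.zeros(theta.size)
        d[k] = eps
        g[k] = (func(theta + d) - func(theta - d)) / (2.0 * eps)
    return g

def numerical_hessian(func, theta, eps=1e-4):
    # Finite differences of the gradient, then symmetrized
    H = np.zeros((theta.size, theta.size))
    for k in range(theta.size):
        d = np.zeros(theta.size)
        d[k] = eps
        H[:, k] = (numerical_gradient(func, theta + d) - numerical_gradient(func, theta - d)) / (2.0 * eps)
    return 0.5 * (H + H.T)

# Toy cost with a known minimum at theta = (1, 2), used only to exercise the checks
def S(theta):
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] - 2.0) ** 2

theta_hat = np.array([1.0, 2.0])
grad = numerical_gradient(S, theta_hat)
hess = numerical_hessian(S, theta_hat)
print("gradient ~ 0:              ", np.allclose(grad, 0.0, atol=1e-6))
print("Hessian positive definite: ", bool(np.all(np.linalg.eigvalsh(hess) > 0)))
```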
All the discussions addressed so far assume that the noise is additive with constant variance across all samples (homoscedasticity). In the case where the measurement noise is still Gaussian and additive with zero mean but with non-constant variance, each individual observation $\tilde{z}_i$ is an independent random variable associated with its own variance $\sigma_i^2$. Intuitively, the optimal regression in this case will need to give more weight to samples with low variance and less weight to samples with high variance.
To achieve this result, a normalization is employed, similar to that shown in section 2.4.1 and a direct consequence of the likelihood in equation (3.5). One should therefore no longer minimize the simple sum of squared residuals, but rather the weighted sum of squared residuals:

$$\chi^2(\boldsymbol{\theta}) = \sum_{i=1}^{n} \frac{r_i(\boldsymbol{\theta})^2}{\sigma_i^2} = \sum_{i=1}^{n} \left( \frac{\tilde{z}_i - f_i(\boldsymbol{\theta})}{\sigma_i} \right)^2 \tag{3.9}$$

The cost function, now a sum of squares of random variables with unit variance, follows a chi-squared distribution and is thus referred to as $\chi^2$. When the variance is constant, the minimum of this cost function coincides with that obtained previously from ordinary least squares. The condition (3.8) for obtaining the minimum is also modified accordingly:

$$\nabla \chi^2(\boldsymbol{\theta}) \big|_{\hat{\boldsymbol{\theta}}} = -2 \sum_{i=1}^{n} \frac{\tilde{z}_i - f_i(\hat{\boldsymbol{\theta}})}{\sigma_i^2} \, \nabla f_i(\hat{\boldsymbol{\theta}}) = \mathbf{0}$$
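A possible implementation sketch of this weighted formulation (again with an invented model and invented per-sample standard deviations) simply divides each residual by its own $\sigma_i$ before handing it to the solver, so that the quantity being minimized is the $\chi^2$ of equation (3.9):

```python
import numpy as np
from scipy.optimize import least_squares

def model(theta, x):
    a, b = theta
    return a * np.exp(b * x)

def weighted_residuals(theta, x, z, sigma):
    # Each residual is normalized by its own standard deviation, so that
    # sum_i (r_i / sigma_i)^2 is the chi-squared cost of equation (3.9)
    return (z - model(theta, x)) / sigma

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 50)
sigma = 0.02 + 0.1 * x            # heteroscedastic noise: grows with x (illustrative)
true_theta = np.array([2.0, -1.3])
z = model(true_theta, x) + rng.normal(scale=sigma)

fit = least_squares(weighted_residuals, x0=np.array([1.0, -1.0]), args=(x, z, sigma))
print("estimated parameters:      ", fit.x)
print("chi-squared at the minimum:", np.sum(fit.fun ** 2))
```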
Further generalizing this concept, when the observations are subject to Gaussian noise with a known covariance matrix $\boldsymbol{\Sigma}$, the Weighted Sum of Squared Errors (WSSE) can ultimately be expressed as

$$\mathrm{WSSE}(\boldsymbol{\theta}) = \mathbf{r}(\boldsymbol{\theta})^\top \boldsymbol{\Sigma}^{-1} \, \mathbf{r}(\boldsymbol{\theta})$$
It is noteworthy that this formulation of the cost function is equivalent to that of equation (3.6), where, however, the Mahalanobis distance (section 2.4) is used instead of the Euclidean distance.
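The equivalence with the squared Mahalanobis distance can be verified numerically; in the sketch below the residual vector and the covariance matrix are generated at random purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(2)

# Illustrative residual vector and symmetric positive definite covariance matrix
r = rng.normal(size=4)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4.0 * np.eye(4)

# WSSE = r^T Sigma^{-1} r, evaluated without forming the explicit inverse
wsse = float(r @ np.linalg.solve(Sigma, r))

# The same quantity is the squared Mahalanobis distance of r from the origin
d = mahalanobis(r, np.zeros(4), np.linalg.inv(Sigma))
print(wsse, d ** 2)  # the two values agree up to numerical precision
```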
Any weighted least squares problem can be reduced to an unweighted one by premultiplying the residuals $\mathbf{r}(\boldsymbol{\theta})$ (and consequently the derivatives) by a matrix $\mathbf{W}$ such that $\mathbf{W}^\top \mathbf{W} = \boldsymbol{\Sigma}^{-1}$, obtained, for instance, through a Cholesky decomposition in the case where the covariance matrix is not diagonal.
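A small sketch of this reduction (with an invented residual vector and covariance matrix) uses the Cholesky factor $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$ and takes $\mathbf{W} = \mathbf{L}^{-1}$, so that the plain sum of squares of the whitened residuals equals the weighted cost:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative residual vector and full covariance matrix (assumptions for the sketch)
r = rng.normal(size=5)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)

# Cholesky factor: Sigma = L @ L.T, hence W = L^{-1} satisfies W.T @ W = Sigma^{-1}
L = np.linalg.cholesky(Sigma)
r_white = np.linalg.solve(L, r)               # whitened residuals W @ r, no explicit inverse

wsse = float(r @ np.linalg.solve(Sigma, r))   # r^T Sigma^{-1} r
sse_white = float(r_white @ r_white)          # plain sum of squares of the whitened residuals
print(np.isclose(wsse, sse_white))            # True: weighted LS == unweighted LS on W r
```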
All these estimators, which take the variance of the observations into account, coincide with minimizing the negative log likelihood of the variable perturbed by zero-mean Gaussian noise with covariance $\boldsymbol{\Sigma}$.