The Maximum Likelihood Estimator

From a statistical perspective, the data vector $\mathbf{x} = \left\{ x_1 \ldots x_n \right\}$ consists of realizations of a random variable drawn from an unknown population. The task of data analysis is to identify the population that most likely generated those samples. In statistics, each population is characterized by a corresponding probability distribution, and each probability distribution is associated with a parameterization $\boldsymbol\vartheta$: varying these parameters generates a different probability distribution.

Let $f( \mathbf{x} \vert \boldsymbol\vartheta)$ be the probability density function (PDF) that indicates the probability of observing $\mathbf {x}$ given a parameterization $\boldsymbol\vartheta$. If the individual observations $x_i$ are statistically independent of each other, the PDF of $\mathbf {x}$ can be expressed as the product of the individual PDFs:

\begin{displaymath}
f( \mathbf{x} = \left\{ x_1 \ldots x_n \right\} \vert \boldsymbol\vartheta) = f_1(x_1 \vert \boldsymbol\vartheta) \, f_2(x_2 \vert \boldsymbol\vartheta) \ldots f_n(x_n \vert \boldsymbol\vartheta)
\end{displaymath} (2.48)

Given a parameterization $\boldsymbol\vartheta$, it is possible to define a specific PDF that indicates how probable certain data are relative to others. In practice we face exactly the inverse problem: the data have already been observed, and we need to identify which $\boldsymbol\vartheta$ generated that specific PDF.

Definition 7   To solve the inverse problem, we introduce the likelihood function $\mathcal{L}: \boldsymbol\vartheta \mapsto [0, \infty)$, defined as
\begin{displaymath}
\mathcal{L}(\boldsymbol\vartheta \vert \mathbf{x} ) = f (\mathbf{x} \vert \boldsymbol\vartheta) = \prod_{i=1}^{n} f_i (x_i \vert \boldsymbol\vartheta)
\end{displaymath} (2.49)

in the case of statistically independent observations.

$\mathcal{L}( \boldsymbol\vartheta \vert \mathbf{x} )$ indicates the likelihood of the parameter $\boldsymbol\vartheta$ following the observation of the events $\mathbf {x}$.
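As a concrete, hedged illustration of Eq. (2.49) (not part of the original derivation): the Python sketch below evaluates the likelihood of hypothetical i.i.d. Gaussian observations over a grid of candidate means, using the independence factorization of Eq. (2.48). The simulated data, the known $\sigma$ and the use of numpy/scipy are assumptions made purely for illustration.

\begin{verbatim}
# Minimal sketch: likelihood L(theta | x) of assumed i.i.d. Gaussian data,
# evaluated over a grid of candidate means (Eq. 2.49).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # "observed" data (simulated here)

sigma = 1.0                                   # assumed known for simplicity
mus = np.linspace(0.0, 4.0, 401)              # candidate parameterizations

# Product of the individual PDFs (Eq. 2.48) for each candidate mu
likelihood = np.array([norm.pdf(x, loc=mu, scale=sigma).prod() for mu in mus])

mu_hat = mus[np.argmax(likelihood)]
print(f"most likely mu on the grid: {mu_hat:.3f} (sample mean {x.mean():.3f})")
\end{verbatim}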

The principle of the maximum likelihood estimator (MLE) $\hat{\boldsymbol\vartheta}_{MLE}$, originally developed by R.A. Fisher in the 1920s, selects the parameterization that best fits the probability distribution of the observed data.

Particularly in the case of a Gaussian probability distribution, an additional definition is useful.

Definition 8   Let $\ell$ be the log-likelihood function, defined as
\begin{displaymath}
\ell = \log \mathcal{L}(\boldsymbol\vartheta \vert x_1 \ldots x_n) = \sum_{i=1}^{n} \log f_i (x_i \vert \boldsymbol\vartheta)
\end{displaymath} (2.50)

where the properties of the logarithm turn the product of Eq. (2.49) into a sum.
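In practice the log-likelihood is also preferred for numerical reasons: the raw product of Eq. (2.49) underflows in double precision for even moderately large samples, while the sum of Eq. (2.50) remains well behaved. A minimal Python sketch, assuming i.i.d. standard Gaussian data (an illustrative choice), makes the point:

\begin{verbatim}
# Minimal sketch: the product of PDFs underflows, the sum of log-PDFs does not.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=2000)

print("product of PDFs :", norm.pdf(x, 0.0, 1.0).prod())     # 0.0 (underflow)
print("sum of log-PDFs :", norm.logpdf(x, 0.0, 1.0).sum())   # finite log-likelihood
\end{verbatim}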

The best estimate of the model parameters is the one that maximizes the likelihood, or equivalently the log-likelihood

\begin{displaymath}
\hat{\boldsymbol\vartheta}_{ML} = \argmax_{\boldsymbol\vartheta} \mathcal{L}(\boldsymbol\vartheta \vert \mathbf{x}) = \argmax_{\boldsymbol\vartheta} \sum_{i=1}^{n} \log f_i (x_i \vert \boldsymbol\vartheta)
\end{displaymath} (2.51)

since the logarithm is a monotonically increasing function.

In the literature, the optimal estimator is often formulated not as the maximization of the likelihood function but as the minimization of its negative

\begin{displaymath}
\hat{\boldsymbol\vartheta}_{ML} = \argmin_{\boldsymbol\vartheta} \left( - \sum_{i=1}^{n} \log f_i (x_i \vert \boldsymbol\vartheta) \right)
\end{displaymath} (2.52)

that is, the minimization of the negative log-likelihood.
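As a hedged illustration of Eq. (2.52), the sketch below minimizes the negative log-likelihood of an assumed Gaussian model numerically with scipy. The simulated data, the Nelder-Mead optimizer and the log-parameterization of $\sigma$ are illustrative choices, not prescriptions.

\begin{verbatim}
# Minimal sketch: numerical minimization of the negative log-likelihood (Eq. 2.52)
# for an assumed Gaussian model with unknown mean and standard deviation.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=0.7, size=200)

def negative_log_likelihood(theta):
    mu, log_sigma = theta                     # optimize log(sigma) to keep sigma > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(negative_log_likelihood, x0=np.zeros(2), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
\end{verbatim}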

This formulation is particularly useful when the noise distribution is Gaussian. Let $(x_i,y_i)$ be the realizations of the random variables. For a generic model $y_i = g(x_i ; \boldsymbol\vartheta) + \epsilon$ with normally distributed noise of zero mean and constant variance $\sigma^2$, the likelihood is given by

\begin{displaymath}
\mathcal{L}(\boldsymbol\vartheta \vert \mathbf{x} ) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_i - g(x_i; \boldsymbol\vartheta ) \right)^2}{2 \sigma^2} \right)
\end{displaymath} (2.53)

Therefore, the maximum likelihood estimate obtained by minimizing the negative log-likelihood can be expressed as
\begin{displaymath}
\hat{\boldsymbol\vartheta}_{ML} = \argmin_{\boldsymbol\vartheta} \sum_{i=1}^{n} \left( y_i - g(x_i ; \boldsymbol\vartheta ) \right)^2
\end{displaymath} (2.54)

In other words, the traditional least-squares solution is the maximum likelihood estimator in the case of additive zero-mean Gaussian noise.
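A minimal sketch of this equivalence, under the illustrative assumptions of a linear model $g(x; \boldsymbol\vartheta) = \vartheta_1 x + \vartheta_2$ and a known noise level: the ordinary least-squares solution and the minimizer of the negative Gaussian log-likelihood coincide up to the optimizer tolerance.

\begin{verbatim}
# Minimal sketch: least squares vs. maximum likelihood for additive Gaussian noise.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Ordinary least squares via a linear solve
A = np.column_stack([x, np.ones_like(x)])
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Maximum likelihood: minimize the negative Gaussian log-likelihood (Eq. 2.52)
sigma = 0.5                                   # assumed known; it only rescales the objective
def nll(theta):
    residuals = y - (theta[0] * x + theta[1])
    return -norm.logpdf(residuals, loc=0.0, scale=sigma).sum()

theta_ml = minimize(nll, x0=np.zeros(2), method="Nelder-Mead").x
print("least squares   :", theta_ls)          # the two estimates agree
print("max. likelihood :", theta_ml)          # up to the optimizer tolerance
\end{verbatim}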

Now let $\boldsymbol\beta$ denote the parameter vector with $m$ components: the $m$ partial derivatives of the log-likelihood form an $m \times 1$ vector.

\begin{displaymath}
\mathbf{u}(\boldsymbol\beta) = \dfrac{\partial \ell(\boldsymbol\beta)}{\partial \boldsymbol\beta} = \begin{bmatrix} \dfrac{\partial \ell}{\partial \beta_1} \\ \vdots \\ \dfrac{\partial \ell}{\partial \beta_m} \end{bmatrix}
\end{displaymath} (2.55)

The vector $\mathbf{u}(\boldsymbol\beta)$ is referred to as the score vector (or Fisher's score function) of the log-likelihood. If the log-likelihood is concave, the maximum likelihood estimator thus identifies the point for which
\begin{displaymath}
\mathbf{u}( \hat{ \boldsymbol\beta } ) = \mathbf{0}
\end{displaymath} (2.56)
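As a numerical check (illustrative, not part of the derivation), the sketch below evaluates the score of an assumed Gaussian model with known $\sigma$ at the closed-form MLE of the mean, using a central finite difference; it vanishes up to rounding, as required by Eq. (2.56).

\begin{verbatim}
# Minimal sketch: the score evaluated at the MLE is numerically zero (Eq. 2.56).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma = 1.0
x = rng.normal(loc=3.0, scale=sigma, size=500)

def log_likelihood(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

mu_hat = x.mean()                             # closed-form MLE of the Gaussian mean
eps = 1e-5
score_at_mle = (log_likelihood(mu_hat + eps) - log_likelihood(mu_hat - eps)) / (2 * eps)
print(f"u(mu_hat) ~ {score_at_mle:.2e}")      # ~ 0, as required by Eq. (2.56)
\end{verbatim}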

The moments of $\mathbf{u}(\boldsymbol\beta)$ satisfy important properties: the expected value of the score is zero, and its variance-covariance matrix is
\begin{displaymath}
\var \left( \mathbf{u}(\boldsymbol\beta) \right) = \E \left[ - \dfrac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} \right] = \mathcal{I}(\boldsymbol\beta)
\end{displaymath} (2.57)

The matrix $\mathcal{I}$, defined as the expectation of the negative Hessian of the log-likelihood, is referred to as the expected Fisher information matrix; the negative Hessian evaluated at the observed data, without taking the expectation, is known as the observed information matrix.
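For a concrete, hedged illustration: in a Gaussian model with known $\sigma$, the information for the mean is $n / \sigma^2$, and the observed and expected information coincide. The sketch below recovers it as the negative second derivative of the log-likelihood via finite differences; the data and the step size are illustrative choices.

\begin{verbatim}
# Minimal sketch: Fisher information for the Gaussian mean, I(mu) = n / sigma^2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sigma, n = 2.0, 400
x = rng.normal(loc=0.5, scale=sigma, size=n)

def log_likelihood(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

mu_hat, eps = x.mean(), 1e-4
second_derivative = (log_likelihood(mu_hat + eps) - 2.0 * log_likelihood(mu_hat)
                     + log_likelihood(mu_hat - eps)) / eps**2

print("negative Hessian (numerical):", -second_derivative)
print("expected information n/s^2  :", n / sigma**2)
\end{verbatim}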


