The Maximum Likelihood Estimator

From a statistical perspective, the data vector $\mathbf{x} = \left\{ x_1 \ldots x_n \right\}$ consists of realizations of a random variable drawn from an unknown population. The task of data analysis is to identify the population that most likely generated those samples. In statistics, each population is characterized by a corresponding probability distribution, and each probability distribution is associated with a parameterization $\boldsymbol\vartheta$: varying these parameters generates a different probability distribution.

Let $f( \mathbf{x} \vert \boldsymbol\vartheta)$ be the probability density function (PDF) that indicates the probability of observing $\mathbf {x}$ given a parameterization $\boldsymbol\vartheta$. If the individual observations $x_i$ are statistically independent of each other, the PDF of $\mathbf {x}$ can be expressed as the product of the individual PDFs:

\begin{displaymath}
f( \mathbf{x} = \left\{ x_1 \ldots x_n \right\} \vert \boldsymbol\vartheta) = f_1(x_1 \vert \boldsymbol\vartheta) \, f_2(x_2 \vert \boldsymbol\vartheta) \ldots f_n(x_n \vert \boldsymbol\vartheta)
\end{displaymath} (2.48)

Given a parameterization $\boldsymbol\vartheta$, it is possible to define a specific PDF that indicates how probable certain data are relative to others. In practice we face exactly the inverse problem: the data have already been observed, and we need to identify which $\boldsymbol\vartheta$ generated that specific PDF.

Definition 7   To solve the inverse problem, we introduce the likelihood function $\mathcal{L}: \boldsymbol\vartheta \mapsto [0, \infty)$, defined as
\begin{displaymath}
\mathcal{L}(\boldsymbol\vartheta \vert \mathbf{x} ) = f (\mathbf{x} \vert \boldsymbol\vartheta) = \prod_{i=1}^{n} f_i (x_i \vert \boldsymbol\vartheta)
\end{displaymath} (2.49)

in the case of statistically independent observations.

$\mathcal{L}( \boldsymbol\vartheta \vert \mathbf{x} )$ indicates the likelihood of the parameter $\boldsymbol\vartheta$ following the observation of the events $\mathbf {x}$.
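As a concrete, hedged illustration of Eq. (2.49) (not part of the original derivation): the Python sketch below evaluates the likelihood of hypothetical i.i.d. Gaussian observations over a grid of candidate means, using the independence factorization of Eq. (2.48). The simulated data, the known $\sigma$ and the use of numpy/scipy are assumptions made purely for illustration.

\begin{verbatim}
# Minimal sketch: likelihood L(theta | x) of assumed i.i.d. Gaussian data,
# evaluated over a grid of candidate means (Eq. 2.49).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # "observed" data (simulated here)

sigma = 1.0                                   # assumed known for simplicity
mus = np.linspace(0.0, 4.0, 401)              # candidate parameterizations

# Product of the individual PDFs (Eq. 2.48) for each candidate mu
likelihood = np.array([norm.pdf(x, loc=mu, scale=sigma).prod() for mu in mus])

mu_hat = mus[np.argmax(likelihood)]
print(f"most likely mu on the grid: {mu_hat:.3f} (sample mean {x.mean():.3f})")
\end{verbatim}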

The principle of the maximum likelihood estimator (MLE) $\hat{\boldsymbol\vartheta}_{MLE}$, originally developed by R.A. Fisher in the 1920s, selects the parameterization that best fits the probability distribution of the observed data.

Particularly in the case of a Gaussian probability distribution, an additional definition is useful.

Definition 8   Let $\ell$ be the log-likelihood function, defined as
\begin{displaymath}
\ell = \log \mathcal{L}(\boldsymbol\vartheta \vert x_1 \ldots x_n) = \sum_{i=1}^{n} \log f_i (x_i \vert \boldsymbol\vartheta)
\end{displaymath} (2.50)

where the properties of the logarithm turn the product of Eq. (2.49) into a sum.
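In practice the log-likelihood is also preferred for numerical reasons: the raw product of Eq. (2.49) underflows in double precision for even moderately large samples, while the sum of Eq. (2.50) remains well behaved. A minimal Python sketch, assuming i.i.d. standard Gaussian data (an illustrative choice), makes the point:

\begin{verbatim}
# Minimal sketch: the product of PDFs underflows, the sum of log-PDFs does not.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=2000)

print("product of PDFs :", norm.pdf(x, 0.0, 1.0).prod())     # 0.0 (underflow)
print("sum of log-PDFs :", norm.logpdf(x, 0.0, 1.0).sum())   # finite log-likelihood
\end{verbatim}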

The best estimate of the model parameters is the one that maximizes the likelihood, or equivalently the log-likelihood

\begin{displaymath}
\hat{\boldsymbol\vartheta}_{ML} = \argmax_{\boldsymbol\vartheta} \mathcal{L}(\boldsymbol\vartheta \vert \mathbf{x}) = \argmax_{\boldsymbol\vartheta} \sum_{i=1}^{n} \log f_i (x_i \vert \boldsymbol\vartheta)
\end{displaymath} (2.51)

since the logarithm is a monotonically increasing function.

In the literature, the optimal estimator is often formulated not as the maximization of the likelihood function but as the minimization of its negative

\begin{displaymath}
\hat{\boldsymbol\vartheta}_{ML} = \argmin_{\boldsymbol\vartheta} \left( - \sum_{i=1}^{n} \log f_i (x_i \vert \boldsymbol\vartheta) \right)
\end{displaymath} (2.52)

that is, the minimization of the negative log-likelihood.
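As a hedged illustration of Eq. (2.52), the sketch below minimizes the negative log-likelihood of an assumed Gaussian model numerically with scipy. The simulated data, the Nelder-Mead optimizer and the log-parameterization of $\sigma$ are illustrative choices, not prescriptions.

\begin{verbatim}
# Minimal sketch: numerical minimization of the negative log-likelihood (Eq. 2.52)
# for an assumed Gaussian model with unknown mean and standard deviation.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=0.7, size=200)

def negative_log_likelihood(theta):
    mu, log_sigma = theta                     # optimize log(sigma) to keep sigma > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(negative_log_likelihood, x0=np.zeros(2), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
\end{verbatim}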

This formulation is particularly useful when the noise distribution is Gaussian. Let $(x_i,y_i)$ be the realizations of the random variables. For a generic model $y_i = g(x_i ; \boldsymbol\vartheta) + \epsilon$ with normally distributed noise of zero mean and constant variance $\sigma^2$, the likelihood is given by

\begin{displaymath}
\mathcal{L}(\boldsymbol\vartheta \vert \mathbf{x} ) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_i - g(x_i; \boldsymbol\vartheta ) \right)^2}{2 \sigma^2} \right)
\end{displaymath} (2.53)

Therefore, the maximum likelihood estimate obtained by minimizing the negative log-likelihood can be expressed as
\begin{displaymath}
\hat{\boldsymbol\vartheta}_{ML} = \argmin_{\boldsymbol\vartheta} \sum_{i=1}^{n} \left( y_i - g(x_i ; \boldsymbol\vartheta ) \right)^2
\end{displaymath} (2.54)

In other words, the traditional least-squares solution is the maximum likelihood estimator in the case of additive zero-mean Gaussian noise.
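A minimal sketch of this equivalence, under the illustrative assumptions of a linear model $g(x; \boldsymbol\vartheta) = \vartheta_1 x + \vartheta_2$ and a known noise level: the ordinary least-squares solution and the minimizer of the negative Gaussian log-likelihood coincide up to the optimizer tolerance.

\begin{verbatim}
# Minimal sketch: least squares vs. maximum likelihood for additive Gaussian noise.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Ordinary least squares via a linear solve
A = np.column_stack([x, np.ones_like(x)])
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Maximum likelihood: minimize the negative Gaussian log-likelihood (Eq. 2.52)
sigma = 0.5                                   # assumed known; it only rescales the objective
def nll(theta):
    residuals = y - (theta[0] * x + theta[1])
    return -norm.logpdf(residuals, loc=0.0, scale=sigma).sum()

theta_ml = minimize(nll, x0=np.zeros(2), method="Nelder-Mead").x
print("least squares   :", theta_ls)          # the two estimates agree
print("max. likelihood :", theta_ml)          # up to the optimizer tolerance
\end{verbatim}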

Now let $\boldsymbol\beta$ denote the parameter vector with $m$ components: the $m$ partial derivatives of the log-likelihood form an $m \times 1$ vector.

\begin{displaymath}
\mathbf{u}(\boldsymbol\beta) = \dfrac{\partial \ell(\boldsymbol\beta)}{\partial \boldsymbol\beta} = \begin{bmatrix} \dfrac{\partial \ell}{\partial \beta_1} \\ \vdots \\ \dfrac{\partial \ell}{\partial \beta_m} \end{bmatrix}
\end{displaymath} (2.55)

The vector $\mathbf{u}(\boldsymbol\beta)$ is referred to as the score vector (or Fisher's score function) of the log-likelihood. If the log-likelihood is concave, the maximum likelihood estimator thus identifies the point for which
\begin{displaymath}
\mathbf{u}( \hat{ \boldsymbol\beta } ) = \mathbf{0}
\end{displaymath} (2.56)
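As a numerical check (illustrative, not part of the derivation), the sketch below evaluates the score of an assumed Gaussian model with known $\sigma$ at the closed-form MLE of the mean, using a central finite difference; it vanishes up to rounding, as required by Eq. (2.56).

\begin{verbatim}
# Minimal sketch: the score evaluated at the MLE is numerically zero (Eq. 2.56).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma = 1.0
x = rng.normal(loc=3.0, scale=sigma, size=500)

def log_likelihood(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

mu_hat = x.mean()                             # closed-form MLE of the Gaussian mean
eps = 1e-5
score_at_mle = (log_likelihood(mu_hat + eps) - log_likelihood(mu_hat - eps)) / (2 * eps)
print(f"u(mu_hat) ~ {score_at_mle:.2e}")      # ~ 0, as required by Eq. (2.56)
\end{verbatim}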

The moments of $\mathbf{u}(\boldsymbol\beta)$ satisfy important properties: the expected value of the score is zero, and its variance-covariance matrix is
\begin{displaymath}
\var \left( \mathbf{u}(\boldsymbol\beta) \right) = \E \left[ - \dfrac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} \right] = \mathcal{I}(\boldsymbol\beta)
\end{displaymath} (2.57)

The matrix $\mathcal{I}$, defined as the expectation of the negative Hessian of the log-likelihood, is referred to as the expected Fisher information matrix; the negative Hessian evaluated at the observed data, without taking the expectation, is known as the observed information matrix.
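For a concrete, hedged illustration: in a Gaussian model with known $\sigma$, the information for the mean is $n / \sigma^2$, and the observed and expected information coincide. The sketch below recovers it as the negative second derivative of the log-likelihood via finite differences; the data and the step size are illustrative choices.

\begin{verbatim}
# Minimal sketch: Fisher information for the Gaussian mean, I(mu) = n / sigma^2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sigma, n = 2.0, 400
x = rng.normal(loc=0.5, scale=sigma, size=n)

def log_likelihood(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

mu_hat, eps = x.mean(), 1e-4
second_derivative = (log_likelihood(mu_hat + eps) - 2.0 * log_likelihood(mu_hat)
                     + log_likelihood(mu_hat - eps)) / eps**2

print("negative Hessian (numerical):", -second_derivative)
print("expected information n/s^2  :", n / sigma**2)
\end{verbatim}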


