Overdetermined linear systems

When analyzing real systems, one frequently encounters the problem of deriving the "solution" of an overdetermined linear system.

The importance of this topic is evident: observations made on a real system are inevitably affected by measurement noise. This noise compromises each individual observation, but fortunately it is usually possible to acquire many more observations than unknowns, thus obtaining an overdetermined system.

Under these conditions, obtaining a solution that minimizes the error requires a numerical regression technique, for example least squares. This first section presents mathematical techniques used widely throughout the book; for further details, one can refer to Chapter 3, which is entirely devoted to this subject.

We therefore have an overdetermined linear system

\begin{displaymath}
\mathbf{A} \mathbf{x}= \mathbf{y}
\end{displaymath} (1.1)

where $\mathbf{A}$ is a rectangular $m \times n$ matrix with $m \geq n$. Since this matrix is rectangular, it has no inverse; however, for every candidate solution $\mathbf{x} \in \mathbb{R}^{n}$ it is still possible to define a value of the error, also known as the residual, that this candidate would entail. In general an overdetermined system has no exact solution, only solutions that minimize the residual under a particular metric.

We define1.1 the error metric as the squared norm of the residual

\begin{displaymath}
\epsilon(\mathbf{x}) = \left\Vert \mathbf{A} \mathbf{x} - \mathbf{y} \right \Vert^{2}
\end{displaymath} (1.2)

The so-called “least squares” solution of the linear system (1.1) is the vector $\mathbf{x}$ that minimizes the Euclidean norm of the residual, i.e. the function (1.2): finding the optimal solution of the system, in the least squares sense, is equivalent to locating the minimum of this error function with respect to $\mathbf{x}$.
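
As a minimal numerical sketch (assuming NumPy is available; the system dimensions, the "true" parameter vector and the noise level below are arbitrary illustrative choices), the least squares solution of (1.1) and its residual can be computed as follows:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

m, n = 20, 3                                  # more observations than unknowns
A = rng.normal(size=(m, n))                   # synthetic observation matrix (arbitrary)
x_true = np.array([1.0, -2.0, 0.5])           # arbitrary "true" parameters
y = A @ x_true + 0.01 * rng.normal(size=m)    # observations corrupted by noise

# Least squares solution: the x that minimizes ||A x - y||^2
x_ls, residual, rank, sv = np.linalg.lstsq(A, y, rcond=None)

print("estimated x:", x_ls)
print("squared residual norm:", np.linalg.norm(A @ x_ls - y) ** 2)
\end{verbatim}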

Multiplying equation (1.1) on the left by $\mathbf{A}^{\top}$ yields a "traditional" square linear system

\begin{displaymath}
\mathbf{A}^{\top}\mathbf{A} \mathbf{x} = \mathbf{A}^{\top} \mathbf{y}
\end{displaymath} (1.3)

This is a first solution method for overdetermined systems and is referred to in the literature as the technique of normal equations: the original problem is transformed into a classical linear system whose coefficient matrix is square and therefore, provided $\mathbf{A}$ has full column rank, invertible with classical techniques.

However, the solution proposed in equation (1.3) is numerically less stable, since $\mathrm{cond}(\mathbf{A}^{\top}\mathbf{A}) \approx \mathrm{cond}(\mathbf{A})^{2}$. Further details on the conditioning of matrices and on the propagation of perturbations in the solution of determined or overdetermined linear systems will be presented in section 2.7.

If the system is well-conditioned, the most stable technique for solving the normal equations is Cholesky factorization.
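
A minimal sketch of this approach, assuming $\mathbf{A}$ has full column rank and using SciPy's cho_factor and cho_solve routines (the function name and interface below are illustrative):

\begin{verbatim}
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def normal_equations_solve(A, y):
    """Solve the overdetermined system A x = y through the normal equations
    A^T A x = A^T y, using a Cholesky factorization of the Gram matrix
    (A is assumed to have full column rank)."""
    AtA = A.T @ A
    Aty = A.T @ y
    c, low = cho_factor(AtA)          # Cholesky factor of A^T A
    return cho_solve((c, low), Aty)   # two triangular solves
\end{verbatim}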

It can be shown that a vector $\mathbf{x}$ that minimizes the function (1.2) exists and is obtained as follows:

\begin{displaymath}
\begin{array}{rcl}
\mathbf{A} \mathbf{x} & = & \mathbf{y} \\
\mathbf{A}^{\top}\mathbf{A} \mathbf{x} & = & \mathbf{A}^{\top} \mathbf{y} \\
\mathbf{x} & = & \left(\mathbf{A}^{\top}\mathbf{A}\right)^{-1}\mathbf{A}^{\top} \mathbf{y} \\
\end{array}\end{displaymath} (1.4)

By construction, $\mathbf{x}$ is the least squares solution of the system (1.1), i.e. the vector that minimizes the function (1.2). The pseudoinverse matrix of $\mathbf{A}$ is denoted as $\mathbf{A}^{+}$ and is given by

\begin{displaymath}
\mathbf{A}^{+} = \left(\mathbf{A}^{\top}\mathbf{A}\right)^{-1}\mathbf{A}^{\top}
\end{displaymath} (1.5)

The matrix $\mathbf{A}^{+}$ defined in (1.5) is called the Moore-Penrose pseudoinverse of $\mathbf{A}$.
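
As an illustrative sketch, the explicit formula (1.5) can be compared against a library pseudoinverse (the matrix used here is an arbitrary full column rank example):

\begin{verbatim}
import numpy as np

def pinv_normal(A):
    """Moore-Penrose pseudoinverse via (A^T A)^{-1} A^T (full column rank assumed)."""
    return np.linalg.inv(A.T @ A) @ A.T

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                              # arbitrary 3x2 example
print(np.allclose(pinv_normal(A), np.linalg.pinv(A)))   # True for well-conditioned A
\end{verbatim}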

The pseudoinverse satisfies the four defining Moore-Penrose properties:

\begin{displaymath}
\mathbf{A}\mathbf{A}^{+}\mathbf{A} = \mathbf{A} \qquad
\mathbf{A}^{+}\mathbf{A}\mathbf{A}^{+} = \mathbf{A}^{+} \qquad
\left(\mathbf{A}\mathbf{A}^{+}\right)^{\top} = \mathbf{A}\mathbf{A}^{+} \qquad
\left(\mathbf{A}^{+}\mathbf{A}\right)^{\top} = \mathbf{A}^{+}\mathbf{A}
\end{displaymath}

It is important to clarify from the outset that in minimizing the quantity (1.2) no assumption has been made regarding the distribution of the noise affecting the entries of $\mathbf{A}$ and $\mathbf{y}$: without such information there is no guarantee that the solution is optimal from a statistical point of view. Without assumptions about the noise distribution, the solution obtained through this minimization is purely algebraic: it minimizes an algebraic error.

It is possible to obtain a statistically better solution when the noise is zero-mean white Gaussian noise and the variance of the noise for each observation is known. In this case a different weight can be assigned to each equation of the system, multiplying each row by an appropriate factor and thus weighting each acquired data point differently.
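
A minimal sketch of this weighting, assuming the noise standard deviation of each observation is known: each row of the system is scaled by the reciprocal of its standard deviation before applying ordinary least squares (the function name and interface are illustrative):

\begin{verbatim}
import numpy as np

def weighted_lstsq(A, y, sigma):
    """Weighted least squares: each equation is divided by its noise standard
    deviation sigma_i, so observations with less noise carry more weight."""
    w = 1.0 / np.asarray(sigma, dtype=float)
    Aw = A * w[:, None]    # scale every row of A
    yw = y * w             # scale the corresponding observations
    x, *_ = np.linalg.lstsq(Aw, yw, rcond=None)
    return x
\end{verbatim}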

A more in-depth discussion of this topic can be found in section 3.2; chapter 2 addresses the general case, in which it is known how the error in the observed data affects the estimation of the parameters.

More stable techniques exist, based on factorizations that derive the solution directly from the matrix $\mathbf{A}$, without explicitly forming $\mathbf{A}^{\top}\mathbf{A}$.

Using, for example, QR factorization, a method well known for its numerical stability, the original problem (1.1) becomes $\mathbf{Q}\mathbf{R} \mathbf{x} = \mathbf{y}$, whose solution follows from $\mathbf{R}\mathbf{x} = \mathbf{Q}^{\top} \mathbf{y}$ by exploiting the orthogonality of the matrix $\mathbf{Q}$. In the QR factorization the relationship $\mathbf{R}^{\top}\mathbf{R}=\mathbf{A}^{\top}\mathbf{A}$ holds, meaning that $\mathbf{R}$ is the Cholesky factor of $\mathbf{A}^{\top}\mathbf{A}$: through this relationship the pseudoinverse can ultimately be obtained explicitly.
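
A sketch of the QR route, assuming NumPy's reduced QR factorization and SciPy's triangular solver are used (the function name is illustrative):

\begin{verbatim}
import numpy as np
from scipy.linalg import solve_triangular

def qr_lstsq(A, y):
    """Least squares via QR: A = Q R, then solve the triangular system
    R x = Q^T y (Q has orthonormal columns, R is upper triangular)."""
    Q, R = np.linalg.qr(A)               # reduced ("economy") factorization
    return solve_triangular(R, Q.T @ y)  # back-substitution
\end{verbatim}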

On the other hand, using the Singular Value Decomposition (SVD), the rectangular matrix $\mathbf{A}$ is decomposed into three matrices with interesting properties. Let $\mathbf{A}=\mathbf{U} \mathbf{S} \mathbf{V}^{*}$ be the singular value decomposition of $\mathbf{A}$. $\mathbf{U}$ is an $m \times n$ matrix with orthonormal columns (depending on the formalism used, full SVD or economy SVD, the dimensions of the matrices vary, and $\mathbf{U}$ becomes an $m \times m$ unitary matrix), $\mathbf{S}$ is a diagonal matrix containing the singular values (the square roots of the eigenvalues of $\mathbf{A}^{\top} \mathbf{A}$), with dimensions, depending on the formalism, $n \times n$ or $m \times n$, and $\mathbf{V}^{*}$ is the conjugate transpose of an $n \times n$ unitary matrix $\mathbf{V}$.

Through a purely algebraic manipulation, it follows that the pseudoinverse of $\mathbf{A}$ can be written as

\begin{displaymath}
\mathbf{A}^{+}=\mathbf{V} \mathbf{S}^{+} \mathbf{U}^{*}
\end{displaymath} (1.6)

where $\mathbf{S}^{+}$, the pseudoinverse of the diagonal matrix $\mathbf{S}$, is the diagonal matrix obtained by taking the reciprocal of each nonzero singular value (zero entries remain zero).
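
A sketch of equation (1.6), assuming NumPy's economy SVD; the tolerance used to decide which singular values are treated as zero is an arbitrary choice:

\begin{verbatim}
import numpy as np

def pinv_svd(A, tol=1e-12):
    """Pseudoinverse via SVD: A = U S V*, hence A+ = V S+ U*, where S+
    holds the reciprocals of the nonzero singular values (zeros stay zero)."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)   # economy SVD
    s_plus = np.divide(1.0, s, out=np.zeros_like(s), where=s > tol * s.max())
    return Vh.conj().T @ np.diag(s_plus) @ U.conj().T
\end{verbatim}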

In summary, the ways to solve an overdetermined linear system are the normal equations (1.3), possibly through a Cholesky factorization of $\mathbf{A}^{\top}\mathbf{A}$; the QR factorization of $\mathbf{A}$; and the singular value decomposition of $\mathbf{A}$.

From a numerical computing perspective, the condition number of the matrix to be inverted can be reduced by scaling the columns of $\mathbf{A}$.
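
A sketch of this idea (normalizing each column of $\mathbf{A}$ to unit norm is one common choice among several), solving the scaled problem and then undoing the scaling on the solution:

\begin{verbatim}
import numpy as np

def scaled_lstsq(A, y):
    """Reduce the condition number by normalizing each column of A to unit
    norm, solve the scaled problem, then undo the scaling on the solution."""
    d = np.linalg.norm(A, axis=0)       # column norms
    d[d == 0] = 1.0                     # protect against all-zero columns
    x_scaled, *_ = np.linalg.lstsq(A / d, y, rcond=None)
    return x_scaled / d                 # solution of the original, unscaled system
\end{verbatim}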

Further details on the Moore-Penrose pseudoinverse can be found in many books, for example in (CM09) or in the foundational text on numerical analysis (GVL96).

Let us now examine the case in which the linear system to be solved is homogeneous.

A homogeneous linear system has the form


\begin{displaymath}
\mathbf{A}\mathbf{x}=0
\end{displaymath} (1.7)

and typically the trivial solution $\mathbf{x}=0$, which is what equation (1.4) yields, is not useful for the problem at hand. In this case it is necessary to find, again in the least squares sense, a non-zero $\mathbf{x} \in \mathbb{R}^{n}$ representing a vector subspace, specifically the kernel of $\mathbf{A}$. The generating vector of this subspace is known up to one or more multiplicative factors (depending on the dimension of the null space). To obtain a unique solution, an additional constraint must be imposed, for instance $\Vert \mathbf{x} \Vert = 1$, which allows the problem to be formalized as
\begin{displaymath}
\hat{\mathbf{x}} = \argmin_{\mathbf{x}} \Vert \mathbf{A}\mathbf{x} \Vert^{2}
\end{displaymath} (1.8)

with the constraint $\Vert \mathbf{x} \Vert = 1$, thus performing a constrained minimization.

In this case, the SVD proves to be an extremely effective and numerically stable technique: a basis of the kernel of $\mathbf{A}$ is given exactly by the columns of $\mathbf{V}$ associated with the zero singular values on the diagonal of $\mathbf{S}$. In practice, due to the presence of noise, there will not be an exactly zero singular value, and the column associated with the smallest singular value must be chosen instead.
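
A sketch of this procedure, assuming NumPy's SVD (which returns the singular values in decreasing order): the unit-norm minimizer of (1.8) is the right singular vector associated with the smallest singular value.

\begin{verbatim}
import numpy as np

def solve_homogeneous(A):
    """Non-trivial least squares solution of A x = 0 with ||x|| = 1:
    the right singular vector associated with the smallest singular value."""
    _, _, Vh = np.linalg.svd(A)
    return Vh[-1, :]   # rows of Vh are the right singular vectors, sorted by
                       # decreasing singular value, so the last row is the minimizer
\end{verbatim}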

The right singular vectors associated with the null singular values of $\mathbf{S}$ thus span the kernel of the matrix, and the number of null singular values gives the dimension of the kernel. It should be noted that in equation (1.6) the presence of zeros on the diagonal of $\mathbf{S}$ was problematic: it is now clear that such a zero indicates that one of the components of the problem is completely uncorrelated with the solution and, as such, can be neglected. This result will indeed be used in section 2.10.1 in the discussion of the PCA algorithm.

The solution subspace, the kernel of $\mathbf{A}$, is therefore

\begin{displaymath}
\mathbf{x} = \sum_{i=1}^{N}{\beta_i \mathbf{v}_i}
\end{displaymath} (1.9)

where the $\mathbf{v}_i$ are the columns of the matrix $\mathbf{V}$, the "right" singular vectors of $\mathbf{A}$ corresponding to the $N$ null singular values (the zero eigenvalues of the matrix $\mathbf{A}^{\top}\mathbf{A}$), and the $\beta_i$ are arbitrary coefficients.

The SVD is one of the most stable and versatile techniques developed for solving linear systems, and it will be used extensively throughout this book.



Footnotes

... define1.1
the motivation for this choice will be discussed in detail in the chapter on model analysis.