Visual Odometry and Bundle Adjustment

Visual Odometry aims to determine the relative pose of a camera (or a stereo pair) moving through space by analyzing two sequential images. The problem of visual odometry for a single camera is typically solved by calculating the essential matrix and subsequently decomposing it. In this case, as previously mentioned, it is not possible to ascertain the scale of the motion, but only to relate the various movements to one another. The situation is different when a stereo pair is available.

Given a series of temporal observations of world points obtained from the three-dimensional reconstruction $(\mathbf{x}_i,\mathbf{x}'_i)$ , it is possible to linearly derive a rigid transformation $(\mathbf{R}, \mathbf{t})$ that transforms the world points at time $t$ to time $t'$ such that they can be expressed with an equation of the form:

$\begin{displaymath} \mathbf{x}'_i = \mathbf{R} \mathbf{x}_i + \mathbf{t} \end{displaymath}$

(9.85)

This approach is general and does not depend on the specific sensor used to obtain the points.

The rigid body transformation performed by the pair of sensors can be derived by minimizing the quantity:

$\begin{displaymath} \sum_i \Vert \mathbf{x}'_i - \mathbf{R} \mathbf{x}_i - \mathbf{t} \Vert^2 \end{displaymath}$

(9.86)

The 12-parameter solution, derived from overdetermined data, will find an absolute minimum but is not the optimal estimator, as it minimizes an algebraic quantity and does not guarantee that the rotation matrix is orthonormal. Starting from the linear solution, the use of a nonlinear minimizer (for example, Levenberg-Marquardt, see section 3.3.5) on the cost function given by equation (9.86) allows for a more precise determination of the 6 parameters (3 rotations and 3 translations). This algorithm is referred to as 3D-to-3D because it derives the motion from pairs of three-dimensional points. As an alternative to the linear solution, a closed-form solution is also possible (Hor87).

The approach presented now is general but poorly suited for the case of world points obtained from a three-dimensional reconstruction from images. The cost function shown indeed optimizes quantities in world coordinates rather than in image coordinates: the noise on the image points propagates non-linearly during the triangulation phase, and therefore it is only in image coordinates that one can assume the noise in point detection to be Gaussian with zero mean. It is thus not possible to create a maximum likelihood estimator using only the points in world coordinates. A more refined approach is the one referred to as 3D-to-2D, where the goal is to minimize the reprojection of a point from the past in image coordinates:

$\begin{displaymath} \sum_i \Vert \mathbf{p}_i - \hat{\mathbf{p}_i} \Vert^2 \end{displaymath}$

(9.87)

where $\hat{\mathbf{p}_1}$ is the rotated and translated projection of the three-dimensional point $\mathbf{x}_i$ obtained from the previous frame. This problem is also known as perspective from n points (PnP) as it is very similar to the previously discussed problem of camera calibration in a static environment.

Clearly, this approach is also affected by the fact that the three-dimensional point $\mathbf{x}_i$ is not a given of the problem but is known with a certain amount of error. For this reason, it is necessary to take an additional step by minimizing both errors in image coordinates (this is the Maximum Likelihood Estimation):

$\begin{displaymath} \sum_i \Vert \mathbf{p}_1 - \hat{\mathbf{p}_1} \Vert^2 + \Ve... ..._1 \Vert^2 + \Vert \mathbf{p}'_2 - \hat{\mathbf{p}}'_2 \Vert^2 \end{displaymath}$

(9.88)

with the constraints set as $\hat{\mathbf{p}_1} = \mathbf{K}_1\mathbf{R}_1 (\hat{\mathbf{x}}_i - \mathbf{t}_1)$ , $\hat{\mathbf{p}_2} = \mathbf{K}_2\mathbf{R}_2 (\hat{\mathbf{x}}_i - \mathbf{t}_2)$ , $\hat{\mathbf{p}'_1} = \mathbf{K}_1\mathbf{R}_1 (\hat{\mathbf{x}'}_i - \mathbf{t}_1)$ , and $\hat{\mathbf{p}'_2} = \mathbf{K}_2\mathbf{R}_2 (\hat{\mathbf{x}'}_i - \mathbf{t}_2)$ , to which the constraint from equation (9.85) is added, while keeping the unknown regarding the actual position of point $\hat{\mathbf{x}}_i$ in both reference frames. In this way, both the displacement performed by the cameras and the three-dimensional coordinate of each individual feature in the world are minimized. Even in this case, the solution to the maximum likelihood problem requires solving a nonlinear problem of considerable size. In the case of a rectified stereo pair, the cost function can be significantly simplified.

Visual odometry is a dead-reckoning algorithm and is therefore subject to drift. It is possible to extend these considerations to the case where multiple time instances are involved in the minimization process rather than just two. In this scenario, the discussion becomes complex as one attempts to minimize drift errors when composing the various transformations. A tutorial that addresses these topics is (SF11).

When addressing the problem from a Bayesian perspective, utilizing equation (9.88), and intending to process all frames simultaneously, the term Bundle Adjustment is preferred over visual odometry.

The concept of Bundle Adjustment, initially introduced in photogrammetry and later adopted by Computer Vision (see the excellent survey (TMHF00)), refers to a multivariable minimization aimed at simultaneously achieving a three-dimensional reconstruction, the relative poses of the cameras in a sequence of images, and potentially the intrinsic parameters of the cameras themselves.

This is an extension of the non-linear techniques that estimate parameters by minimizing a suitable cost function based on the reprojection errors of the identified points, in the same form as equation (9.88).

Since the same feature can be observed from different images, the estimation process conditions all poses, and consequently, the problem cannot be decomposed into $n$ separate visual odometry problems: all images in the sequence must be minimized simultaneously. For this reason, the Bundle Adjustment problem is a high-dimensional problem, certainly non-convex, which requires non-trivial optimization and employs sparse minimization to conserve memory and enhance accuracy.

An alternative approach to Bundle Adjustment, which is certainly not the best maximum likelihood estimator but introduces fewer unknowns, is Pose Graph Optimization (GKSB10). This method utilizes information from the same pose obtained from multiple paths, or by identifying Loops, allowing for the optimization of only the poses relative to those obtained from visual odometry. Let $\mathbf{x} = \left( \mathbf{x}_1, \ldots, \mathbf{x}_n \right)$ be a parameter vector where the element $\mathbf{x}_i$ represents the pose of the i-th node. Let $\mathbf{z}_{ij}$ and $\boldsymbol\Omega_{ij}$ denote the measurement and the precision matrix of the virtual observation of the relative pose between nodes i and j. The objective is to obtain an estimate of the parameters $\mathbf {x}$ given the virtual observations $\mathbf{z}_{ij}$ . Since the relative poses are obtained by comparing two absolute poses, the parameters to be estimated can be defined through the cost function

$\begin{displaymath} \mathbf{e}_{ij} = \mathbf{e}_{ij}\left( \mathbf{x}_i, \math... ...hat{\mathbf{z}}_{ij} \left( \mathbf{x}_i, \mathbf{x}_j \right) \end{displaymath}$

(9.89)

, which measures the error between the measured virtual relative pose $\mathbf{z}_{ij}$ and the predicted one $\hat{\mathbf{z}}_{ij} \left( \mathbf{x}_i, \mathbf{x}_j \right)$ given the configurations $\mathbf{x}_i$ and $\mathbf{x}_j$ to be evaluated for nodes i and j, respectively. By leveraging the information on the precision of the estimate of the individual relative pose, it is possible to define a global cost function

$\begin{displaymath} F(\mathbf{x}) = \sum_{< i,j >} \mathbf{e}^{\top}_{ij} \boldsymbol\Omega_{ij} \mathbf{e}_{ij} \end{displaymath}$

(9.90)

, which is essentially the sum of the individual Mahalanobis distances between all pairs $(i,j)$

for which a relative pose measurement has been made. The function $F(\mathbf{x})$ , minimized with respect to $\mathbf {x}$ , provides the best estimate of the absolute poses of the problem, all without involving the individual elements that comprise the single real observation.

Paolo medici
2025-10-22