Visual Odometry aims to determine the relative pose of a camera (or of a stereo pair) moving through space by analyzing two sequential images. For a single camera, the problem is typically solved by computing the essential matrix and then decomposing it. In this case, as previously mentioned, the absolute scale of the motion cannot be recovered: the individual translations can only be related to one another up to a common scale factor. The situation is different when a stereo pair is available.
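As an illustration of the monocular case, the following is a minimal sketch using OpenCV; the matched points `pts1`, `pts2` and the intrinsic matrix `K` are assumptions of this example, not quantities defined in the text.

```python
import cv2

# Minimal monocular sketch: pts1 and pts2 are Nx2 float arrays of matched
# features in the two images, K is the 3x3 intrinsic matrix (assumed known).
def relative_pose_monocular(pts1, pts2, K):
    # Robustly estimate the essential matrix, rejecting outlier matches.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose E and keep the (R, t) pair that places the points in front
    # of both cameras (cheirality check).
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    # t is a unit vector: the translation is recovered only up to scale.
    return R, t
```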
Given a series of temporal observations of world points $M_i^{t-1}$ and $M_i^{t}$ obtained from the three-dimensional reconstruction, it is possible to linearly derive a rigid transformation $(R, \mathbf{t})$ that transforms the world points at time $t-1$ to time $t$, such that they can be expressed with an equation of the form

$$M_i^{t} = R\, M_i^{t-1} + \mathbf{t}$$

The rigid body transformation performed by the pair of sensors can be derived by minimizing the quantity

$$\sum_i \left\| M_i^{t} - \left( R\, M_i^{t-1} + \mathbf{t} \right) \right\|^2$$
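This 3D-to-3D alignment admits a well-known closed-form solution based on the SVD of the cross-covariance matrix (the Arun/Horn procedure). A minimal sketch, assuming `P_prev` and `P_curr` are corresponding Nx3 arrays of triangulated points:

```python
import numpy as np

def rigid_transform_3d(P_prev, P_curr):
    """Least-squares rigid transform (R, t) with P_curr ~= R @ P_prev + t."""
    c_prev = P_prev.mean(axis=0)
    c_curr = P_curr.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (P_prev - c_prev).T @ (P_curr - c_curr)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    # Guard against a reflection (det = -1).
    if np.linalg.det(R) < 0:
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_curr - R @ c_prev
    return R, t
```

In practice the correspondences come from feature matching, so the estimate is usually wrapped in an outlier-rejection scheme such as RANSAC, since wrong matches dominate the squared error.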
The approach just presented is general, but it is poorly suited to the case of world points obtained from a three-dimensional reconstruction from images.
The cost function shown above indeed optimizes quantities in world coordinates rather than in image coordinates: the noise on the image points propagates non-linearly through the triangulation phase, and it is therefore only in image coordinates that the noise in point detection can be assumed to be Gaussian with zero mean. It is thus not possible to build a maximum likelihood estimator using the points in world coordinates alone. A more refined approach is the one referred to as 3D-to-2D, where the goal is to minimize, in image coordinates, the reprojection error of the points reconstructed at the previous time instant:

$$\hat{g} = \arg\min_{g} \sum_i \left\| m_i^{t} - \pi\!\left( g\, M_i^{t-1} \right) \right\|^2 \tag{9.87}$$

where $\pi(\cdot)$ denotes the projection into the image at time $t$, $m_i^{t}$ the detected image points, and $g$ the rigid transformation to be estimated.
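Equation (9.87) is, in essence, a Perspective-n-Point problem. A possible sketch using OpenCV, where `pts3d_prev` are the points triangulated at time $t-1$ and `pts2d_curr` their detections in the image at time $t$ (hypothetical names):

```python
import cv2
import numpy as np

# pts3d_prev: Nx3 points triangulated at time t-1 (previous-camera frame).
# pts2d_curr: Nx2 detections of the same features in the image at time t.
# K: 3x3 intrinsic matrix; dist: distortion coefficients (None if undistorted).
def pose_3d_to_2d(pts3d_prev, pts2d_curr, K, dist=None):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d_prev.astype(np.float64), pts2d_curr.astype(np.float64),
        K, dist, reprojectionError=2.0, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec, inliers
```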
Clearly, this approach is also affected by the fact that the three-dimensional point is not a given of the problem but is itself known only up to a certain error. For this reason, it is necessary to take an additional step and minimize both errors in image coordinates (this is the Maximum Likelihood estimate):

$$\arg\min_{g,\, M_i} \sum_i \left\| m_i^{t-1} - \pi\!\left( M_i \right) \right\|^2 + \left\| m_i^{t} - \pi\!\left( g\, M_i \right) \right\|^2 \tag{9.88}$$

where the three-dimensional points $M_i$ are now estimated jointly with the motion $g$.
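Minimization of (9.88) can be carried out with a generic non-linear least-squares solver. A sketch, in which the parameter packing, the pinhole `project` helper and the use of `scipy.optimize.least_squares` are assumptions of this example:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(K, R, t, M):
    """Pinhole projection of Nx3 points M into pixel coordinates."""
    P = M @ R.T + t            # transform into the camera frame
    p = P[:, :2] / P[:, 2:3]   # normalized image coordinates
    return p @ K[:2, :2].T + K[:2, 2]

def residuals(params, m_prev, m_curr, K):
    # params = [rotation vector (3), translation (3), 3D points (3N)],
    # with the camera at time t-1 taken as the reference frame.
    n = m_prev.shape[0]
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:6]
    M = params[6:].reshape(n, 3)
    r_prev = project(K, np.eye(3), np.zeros(3), M) - m_prev  # error in image t-1
    r_curr = project(K, R, t, M) - m_curr                    # error in image t
    return np.concatenate([r_prev.ravel(), r_curr.ravel()])

# Starting from an initial guess x0 (e.g. the 3D-to-2D solution plus the
# triangulated points), the refinement would be:
#   sol = least_squares(residuals, x0, args=(m_prev, m_curr, K))
```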
Visual odometry is a dead-reckoning algorithm and is therefore subject to drift. It is possible to extend these considerations to the case where multiple time instances are involved in the minimization process rather than just two. In this scenario, the discussion becomes complex as one attempts to minimize drift errors when composing the various transformations. A tutorial that addresses these topics is (SF11).
When the problem is addressed from a Bayesian perspective, using equation (9.88), and all frames are processed simultaneously, the term Bundle Adjustment is preferred over visual odometry.
The concept of Bundle Adjustment, initially introduced in photogrammetry and later adopted by Computer Vision (see the excellent survey (TMHF00)), refers to a multivariable minimization aimed at simultaneously estimating the three-dimensional reconstruction, the relative poses of the cameras in a sequence of images, and possibly the intrinsic parameters of the cameras themselves.
This is an extension of the non-linear techniques that estimate parameters by minimizing a suitable cost function based on the reprojection errors of the identified points, in the same form as equation (9.88).
Since the same feature can be observed in several images, the estimation process couples all the poses; consequently, the problem cannot be decomposed into separate visual odometry problems, and the reprojection error over all the images in the sequence must be minimized simultaneously. For this reason, Bundle Adjustment is a high-dimensional, certainly non-convex problem, which requires non-trivial optimization and employs sparse minimization to conserve memory and enhance accuracy.
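The sparsity arises because each image observation involves only one camera pose and one three-dimensional point. A sketch, under the assumption that a solver such as `scipy.optimize.least_squares` is used, of how this structure can be encoded as a Jacobian sparsity pattern (the 6-parameter pose plus 3-parameter point layout is an assumption of this example):

```python
import numpy as np
from scipy.sparse import lil_matrix

def ba_sparsity(n_cams, n_pts, cam_idx, pt_idx):
    """Jacobian sparsity pattern: each 2D observation depends only on the
    6 parameters of one camera pose and the 3 coordinates of one point."""
    n_obs = cam_idx.size
    A = lil_matrix((2 * n_obs, 6 * n_cams + 3 * n_pts), dtype=int)
    obs = np.arange(n_obs)
    for k in range(6):
        A[2 * obs, 6 * cam_idx + k] = 1
        A[2 * obs + 1, 6 * cam_idx + k] = 1
    for k in range(3):
        A[2 * obs, 6 * n_cams + 3 * pt_idx + k] = 1
        A[2 * obs + 1, 6 * n_cams + 3 * pt_idx + k] = 1
    return A

# Passing this matrix as jac_sparsity to scipy.optimize.least_squares
# (method='trf') lets the solver exploit the block structure instead of
# forming a dense Jacobian.
```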
An alternative approach to Bundle Adjustment, which is certainly not the best maximum likelihood estimator but introduces fewer unknowns, is Pose Graph Optimization (GKSB10). This method exploits the information provided when the same pose is reached along different paths, or when Loops are identified, and allows optimizing only the poses relative to those obtained from visual odometry. Let $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^{\top}$ be a parameter vector where the element $\mathbf{x}_i$ represents the pose of the $i$-th node. Let $\mathbf{z}_{ij}$ and $\Omega_{ij}$ denote the measurement and the precision matrix of the virtual observation of the relative pose between nodes $i$ and $j$. The objective is to obtain an estimate $\mathbf{x}^{*}$ of the parameters given the virtual observations $\mathbf{z}_{ij}$. Since the relative poses are obtained by comparing two absolute poses, the parameters to be estimated can be defined through the cost function

$$F(\mathbf{x}) = \sum_{\langle i,j \rangle} \mathbf{e}_{ij}(\mathbf{x}_i, \mathbf{x}_j)^{\top}\, \Omega_{ij}\, \mathbf{e}_{ij}(\mathbf{x}_i, \mathbf{x}_j) \tag{9.89}$$

$$\mathbf{x}^{*} = \arg\min_{\mathbf{x}} F(\mathbf{x}) \tag{9.90}$$

where $\mathbf{e}_{ij}(\mathbf{x}_i, \mathbf{x}_j)$ is the error between the relative pose of node $j$ predicted from the estimates $\mathbf{x}_i$, $\mathbf{x}_j$ and the virtual observation $\mathbf{z}_{ij}$.
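A minimal sketch of equations (9.89)-(9.90) for planar poses $(x, y, \theta)$; the edge representation and the use of `scipy.optimize.least_squares` are assumptions of this example, and in practice one pose is held fixed to remove the gauge freedom:

```python
import numpy as np
from scipy.optimize import least_squares

def pose_error(xi, xj, zij):
    """Error e_ij: difference between the relative planar pose (x, y, theta)
    of node j predicted from xi, xj and the virtual observation z_ij."""
    c, s = np.cos(xi[2]), np.sin(xi[2])
    Ri_T = np.array([[c, s], [-s, c]])        # inverse rotation of node i
    dt = Ri_T @ (xj[:2] - xi[:2]) - zij[:2]   # translation error in frame i
    dth = (xj[2] - xi[2] - zij[2] + np.pi) % (2 * np.pi) - np.pi  # wrapped angle
    return np.array([dt[0], dt[1], dth])

def graph_residuals(x_flat, edges):
    # edges: list of (i, j, z_ij, L_ij), with L_ij a square root of the
    # precision matrix (L_ij^T L_ij = Omega_ij), so that the sum of squared
    # residuals reproduces the cost of equation (9.89).
    x = x_flat.reshape(-1, 3)
    return np.concatenate([L @ pose_error(x[i], x[j], z) for i, j, z, L in edges])

# x_opt = least_squares(graph_residuals, x0.ravel(), args=(edges,)).x.reshape(-1, 3)
```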