SURF

The Speeded Up Robust Features algorithm (BETVG08) is inspired by the SIFT algorithm and the scale-space representation theory, proposing an optimized version that utilizes approximate Hessians by employing the integral image, both for detecting keypoints and for extracting their descriptors.

SURF is invariant to translation, scale, and rotation, but there exists a simplified variant, referred to as "U-SURF," which is only invariant to variations in translation and scale. In this case, the area around the identified point is not normalized with respect to rotation when the descriptor is extracted.

In SURF, the characteristic points are detected by calculating local maxima on the determinant of the Hessian image defined as:

\begin{displaymath}
\mathcal{H}(x,y; t)=\begin{bmatrix}
\frac{\partial}{\parti...
...in{bmatrix}D_{xx} & D_{xy} \\
D_{xy} & D_{yy}
\end{bmatrix}\end{displaymath} (5.12)

an image formed by the convolutions of the second-order derivatives of the Gaussian with variance $t=\sigma^2$ and the image at point $(x,y)$. For performance reasons, the derivatives of the Gaussians are quantized to integer values and approximated using rectangular regions (box filters), meaning that certain rectangular areas around the point are positively weighted, while others are negatively weighted, and their sum forms the element of the matrix $\mathcal{H}$.

The bandwidth of these approximate filters can be estimated as

\begin{displaymath}
\sigma = \frac{1.2}{9} l
\end{displaymath} (5.13)

with $l$ being the size of the filter. The filter $9 \times 9$, the smallest possible, for instance, approximates the derivatives of the Gaussian with variance $\sigma=1.2$.

The determinant image is calculated as

\begin{displaymath}
\det(\mathcal{H}) = D_{xx} D_{yy} - \left(w D_{xy} \right)^{2}
\end{displaymath} (5.14)

where $w$ is a factor that accounts for quantization, attempting to compensate for various rounding errors, and is typically set to $w=0.912$ constant. Finally, the determinant is normalized with respect to the scale size involved, allowing for comparisons across different scales.

The image is analyzed across multiple octaves (each octave has a scale factor that is double that of the previous octave). Each octave is divided into an equal number of scale levels. The number of scales per octave is constrained by the inherently quantized nature of the filter, and the approximated Gaussians are not as evenly spaced as in the case of SIFT. In fact, 4 intervals per octave is the only feasible number of subdivisions.

Within each octave, as the scale $s$ and position vary, a Non-Maxima Suppression $3\times 3\times 3$ is performed on the determinant image of $\mathcal{H}$. The local minima/maxima, interpolated through a three-dimensional quadratic as in SIFT, are the interest points identified by SURF. The scale is set equal to the variance of the associated filter $s=\sigma$.

From the identified maxima, using the integral image, the dominant orientation is extracted in the vicinity of the point (within a radius of $6s$ and sampled at a step of $s$). In this case, Haar features of size $4s$ are also utilized and weighted with a Gaussian distribution of $\sigma=2s$.

Through the orientation information, a descriptor is generated based on the directions of the gradients by sampling the area around $20s$, divided into $4 \times 4$ regions and weighting the points with a Gaussian $\sigma=3.3s$. Within each region, $d_x$, $d_y$, $\vert d_x\vert$, and $\vert d_y\vert$ are calculated. Both the orientation and the gradient histogram are extracted at the detection scale of the feature.

Paolo medici
2025-10-22