Descriptors

Another concept with cross-cutting relevance across the topics of computer vision is the descriptor (Visual Descriptor). Descriptors are used in a variety of contexts: to compare characteristic points or to generate the disparity map in stereoscopic vision, to provide a compact representation of an image patch that speeds up its identification or retrieval, and, because this compact representation preserves a significant amount of information, to generate the feature space in classification algorithms.

Depending on the transformations undergone by the image from which the points are to be characterized, the descriptor must satisfy certain invariance properties:

translation
This is the simplest case and is resolved automatically by the feature point extractor;
scale
This transformation, too, is typically handled by the feature point extractor;
brightness
Images may undergo variations in brightness;
rotation
Images may represent the same scene rotated;
perspective
Changes in perspective distort the observed portion of the world in a complex way.

Before the introduction of the compact descriptor concept, the universally adopted method for comparing two feature points was the correlation between the areas surrounding the points:

\begin{displaymath}
d(\mathbf{p}_1,\mathbf{p}_2)=\sum_{\boldsymbol\delta \in \Omega} w_{\boldsymbol\delta} \left( I_1(\mathbf{p}_1 + \boldsymbol\delta) - \bar{I}_1 \right)\left( I_2(\mathbf{p}_2 + \boldsymbol\delta) - \bar{I}_2 \right)
\end{displaymath} (6.1)

with $\Omega$ a fixed-size window centered on the point in the two images and $\bar{I}_n$ the average value of the image within the window $\Omega$. $w_{\boldsymbol\delta}$ is an optional weight (for example, a Gaussian) that assigns different contributions to pixels near or far from the point. The correlation is invariant to changes in brightness but has a high computational cost. In this case, the descriptor is precisely the portion of the image surrounding the identified point (Mor80).
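
As a concrete illustration, here is a minimal NumPy sketch of eq. 6.1. It assumes grayscale images stored as arrays and points given as (row, column) pairs whose windows lie entirely inside the images; the window half-size and the Gaussian weight sigma are illustrative choices, not values from the text.

\begin{verbatim}
import numpy as np

def correlation(I1, p1, I2, p2, half=3, sigma=2.0):
    # Zero-mean weighted correlation (eq. 6.1) between the windows Omega
    # centered on p1 in I1 and on p2 in I2.  'half' (window half-size)
    # and 'sigma' (Gaussian weight) are illustrative parameters.
    (y1, x1), (y2, x2) = p1, p2
    W1 = I1[y1-half:y1+half+1, x1-half:x1+half+1].astype(float)
    W2 = I2[y2-half:y2+half+1, x2-half:x2+half+1].astype(float)
    # optional Gaussian weight w_delta: nearby pixels contribute more
    r = np.arange(-half, half + 1)
    dy, dx = np.meshgrid(r, r, indexing='ij')
    w = np.exp(-(dx**2 + dy**2) / (2.0 * sigma**2))
    # subtracting the window means removes additive brightness changes
    return float(np.sum(w * (W1 - W1.mean()) * (W2 - W2.mean())))
\end{verbatim}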

An approach similar to correlation, not invariant to brightness but computationally more efficient, is the SAD (Sum of Absolute Differences):

\begin{displaymath}
d(\mathbf{p}_1,\mathbf{p}_2)=\sum_{\boldsymbol\delta \in \Omega} \vert I_1(\mathbf{p}_1 + \boldsymbol\delta) - I_2(\mathbf{p}_2 + \boldsymbol\delta) \vert
\end{displaymath} (6.2)

To make the SAD invariant to brightness, comparisons are typically performed not on the original image but on its horizontal and vertical derivative images. This reasoning seems quite straightforward, but it can be generalized further: comparisons can be performed not on the original image but between one or more images extracted with different kernels, each providing the descriptor with a certain level of invariance.
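
The following sketch implements eq. 6.2 together with its derivative-based variant, using SciPy's Sobel filter for the derivative images; the window size and the absence of border handling are illustrative simplifications.

\begin{verbatim}
import numpy as np
from scipy.ndimage import sobel

def sad(I1, p1, I2, p2, half=3):
    # Sum of Absolute Differences (eq. 6.2) over the windows centered
    # on p1 and p2; 'half' is an illustrative parameter.
    (y1, x1), (y2, x2) = p1, p2
    W1 = I1[y1-half:y1+half+1, x1-half:x1+half+1].astype(float)
    W2 = I2[y2-half:y2+half+1, x2-half:x2+half+1].astype(float)
    return float(np.abs(W1 - W2).sum())

def sad_on_derivatives(I1, p1, I2, p2, half=3):
    # Brightness-tolerant variant: compare the horizontal and vertical
    # derivative images instead of the original intensities.
    total = 0.0
    for axis in (0, 1):          # vertical and horizontal derivatives
        total += sad(sobel(I1.astype(float), axis=axis), p1,
                     sobel(I2.astype(float), axis=axis), p2, half)
    return total
\end{verbatim}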

It is also worth noting that comparing pixels between images is in any case an $O(n^2)$ algorithm: performing these comparisons point by point still carries a significant computational cost and requires multiple memory accesses. Modern solutions aim to overcome this limitation by extracting from the neighborhood of the point a descriptor that is smaller than the number of pixels it represents, yet maximizes the information it contains.

Both SIFT (section 5.3) and SURF (section 5.4) extract their descriptors by leveraging the scale and rotation information derived from the image (this information can be extracted independently, and can therefore be applied to any class of descriptor to make it invariant to scale and rotation). The descriptors obtained from SIFT and SURF are different versions of the same concept, the histogram of oriented gradients (section 6.2), which serves as an example of how to compress the variability around a point into a reduced-dimensional space.
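
To convey the idea, here is a minimal single-window sketch of a magnitude-weighted gradient-orientation histogram; the actual SIFT and SURF descriptors are richer (SIFT, for instance, concatenates such histograms over a grid of subregions), and the window size and bin count below are illustrative.

\begin{verbatim}
import numpy as np

def orientation_histogram(I, p, half=8, bins=8):
    # Histogram of gradient orientations around p, each pixel voting
    # with its gradient magnitude.  'half' and 'bins' are illustrative.
    y, x = p
    W = I[y-half:y+half+1, x-half:x+half+1].astype(float)
    gy, gx = np.gradient(W)              # image derivatives
    mag = np.hypot(gx, gy)               # gradient magnitude
    ang = np.arctan2(gy, gx)             # gradient orientation
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi),
                           weights=mag)  # magnitude-weighted votes
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist   # normalization adds robustness
\end{verbatim}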

None of the descriptors in current use employ the image pixels directly as the descriptor; it is easy to see that a sufficiently well-distributed subset of points suffices to obtain an accurate description of the point. In (RD05), a descriptor is built from the 16 pixels located along the discrete circumference of radius 3. This description can be made even more compact by moving to the binary form of the Local Binary Patterns described later, or by dropping the constraint of the circumference, as in Census or BRIEF.
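
A minimal sketch of this ring descriptor, together with a Census-like binary compression of it; the ordering of the 16 circle offsets is an arbitrary illustrative choice.

\begin{verbatim}
import numpy as np

# The 16 offsets (dy, dx) of the discrete circumference of radius 3.
CIRCLE = [(-3,0),(-3,1),(-2,2),(-1,3),(0,3),(1,3),(2,2),(3,1),
          (3,0),(3,-1),(2,-2),(1,-3),(0,-3),(-1,-3),(-2,-2),(-3,-1)]

def ring_descriptor(I, p):
    # The 16 intensities on the radius-3 circle around p, as in (RD05).
    y, x = p
    return np.array([I[y+dy, x+dx] for dy, dx in CIRCLE], dtype=float)

def binary_ring_descriptor(I, p):
    # Census-like binary compression: one bit per circle pixel, set
    # when that pixel is brighter than the center.  The resulting
    # 16-bit strings are compared with the Hamming distance.
    y, x = p
    center = I[y, x]
    bits = 0
    for i, (dy, dx) in enumerate(CIRCLE):
        if I[y+dy, x+dx] > center:
            bits |= 1 << i
    return bits
\end{verbatim}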

Another approach is to appropriately sample the kernel space (GZS11), extracting from $m$ coordinates around the keypoint the values obtained from convolutions of the original image (horizontal and vertical Sobel), in order to create a descriptor consisting of just $2m$ values.
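
A sketch of this sampling idea, with an illustrative offset pattern that is not the one proposed in (GZS11):

\begin{verbatim}
import numpy as np
from scipy.ndimage import sobel

# example: m = 8 sampling coordinates on a small ring around the
# keypoint (an illustrative pattern, not the one from the paper)
OFFSETS = [(-2,-2),(-2,0),(-2,2),(0,-2),(0,2),(2,-2),(2,0),(2,2)]

def sampled_gradient_descriptor(I, p, offsets=OFFSETS):
    # Horizontal and vertical Sobel responses read at m offsets around
    # the keypoint p, giving a descriptor of just 2m values.
    gy = sobel(I.astype(float), axis=0)
    gx = sobel(I.astype(float), axis=1)
    y, x = p
    d = []
    for dy, dx in offsets:               # m sampling coordinates
        d.extend((gx[y+dy, x+dx], gy[y+dy, x+dx]))
    return np.array(d)                   # length 2m
\end{verbatim}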

It is noteworthy that, for purely computational reasons related to resource reuse, a specific descriptor extractor is often associated with each particular feature point extractor.

From this introduction it is clear that describing a keypoint with a smaller, yet sufficiently descriptive, set of data is a useful approach, especially in the context of classification. The concept of a descriptor arises from the attempt to extract local information from the image in a way that preserves a significant portion of the information. This makes it possible to perform (relatively) fast comparisons between points across images, or to use such descriptors as features for training classifiers.


