Classification

Classification techniques and machine learning play a predominant role in Computer Vision. The amount of information that can be extracted from a video sensor far exceeds what can be obtained from other sensors; exploiting this wealth of information, however, requires complex techniques.

As previously mentioned, statistics, classification, and model fitting can be viewed as different facets of a single topic. Statistics seeks the most accurate Bayesian approach to extracting the hidden parameters (of the state or of the model) of a system that may be affected by noise, aiming to return the most probable output given the inputs. Classification, meanwhile, offers techniques and methods for modeling the system efficiently. Finally, if the exact model underlying a physical system were known, any classification problem would reduce to an optimization problem. For these reasons, it is not straightforward to say where one topic ends and another begins.

The classification problem can be reduced to the task of deriving, from a limited number of examples, the parameters of a generic model that generalizes the problem.

A classifier can be viewed in two ways, depending on the type of information that the system aims to provide:

  1. as a "likelihood" function with respect to a model, as in equation (4.1);
  2. as a partitioning of the input space, as in equation (4.2).

In the first case, a classifier is represented by a generic function

\begin{displaymath}
f: \mathbb{R}^{n} \to \mathbb{R}^{m}
\end{displaymath} (4.1)

that associates with the input element $\mathbf{x}$, composed of $n$ features describing the example to be classified, confidence values for the $m$ possible output classes (categories) $\{y_1,\ldots,y_m\}$:

\begin{displaymath}
f(\mathbf{x}) = \left( p(y_1\vert\mathbf{x}), \ldots, p(y_m\vert\mathbf{x}) \right)
\end{displaymath}

that is, the probability that the observed object is precisely $y_i$ given the observed quantity $\mathbf{x}$.

Since there are infinitely many candidate functions and no additional specific information about the structure of the problem is available, the function $f$ cannot be specified exactly; it is instead represented by a parameterized model of the form

\begin{displaymath}
\mathbf{y} = f(\mathbf{x}, \boldsymbol\beta)
\end{displaymath}

where $\mathbf{y} \in \mathbb{R}^{m}$ is the output space, $\mathbf{x} \in \mathbb{R}^{n}$ is the input space, and $\boldsymbol\beta$ are the parameters of the model $f$ to be determined during the training phase.

The training phase is based on a set of examples (the training set) consisting of pairs $(\mathbf{x}_i, \mathbf{y}_i)$; from these examples it must determine the parameters $\boldsymbol\beta$ of the function $f(\mathbf{x}, \boldsymbol\beta)$ that minimize, under a specific metric (the cost function), the error on the training set itself.
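
In compact form, denoting by $L$ the chosen cost function and by $N$ the number of examples, this amounts to the minimization

\begin{displaymath}
\hat{\boldsymbol\beta} = \argmin_{\boldsymbol\beta} \sum_{i=1}^{N} L\left( \mathbf{y}_i, f(\mathbf{x}_i, \boldsymbol\beta) \right)
\end{displaymath}

where, for instance, choosing $L$ as the squared Euclidean distance yields exactly the least squares problems discussed in chapter 3.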

To train the classifier, it is therefore necessary to identify the optimal parameters $\boldsymbol\beta$ that minimize the error in the output space: classification is thus also an optimization problem. For this reason, machine learning, model fitting, and statistics are closely related research areas. The same considerations made for the Kalman filter or the Hough transform, as well as everything discussed in the chapter on fitting models with least squares, can be used for classification, and specific classification algorithms can in turn be employed, for instance, to fit a curve to a series of noisy observations.

It is generally not feasible to produce a complete training set: it is not always possible to obtain every type of input-output association so as to systematically map the entire input space onto the output space. Even if this were possible, the memory required to represent such associations in the form of a lookup table would be prohibitive. These are the main reasons for the use of models in classification.

The fact that the training set cannot cover all possible input-output combinations, combined with a model optimized on such incomplete data, can lead to poor generalization: elements not present in the training set may be misclassified because of excessive adaptation to the training set (the overfitting problem). This phenomenon is typically caused by an optimization phase that focuses on reducing the error on the outputs rather than on generalizing the problem.
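
As an illustration of the issue, the optimization only sees the empirical error computed on the $N$ training examples, whereas generalization concerns the expected error over the whole input space, a quantity that is not available during training:

\begin{displaymath}
\frac{1}{N} \sum_{i=1}^{N} L\left( \mathbf{y}_i, f(\mathbf{x}_i, \boldsymbol\beta) \right)
\quad \textrm{versus} \quad
\mathrm{E}_{(\mathbf{x},\mathbf{y})} \left[ L\left( \mathbf{y}, f(\mathbf{x}, \boldsymbol\beta) \right) \right]
\end{displaymath}

A model with enough parameters can make the first quantity arbitrarily small while the second one remains large.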

Returning to the two ways of viewing a classifier, it often proves simpler and more general to derive directly from the input data a surface in $\mathbb{R}^n$ that separates the categories in the $n$-dimensional input space. A new function $g$ can be defined that associates a single output label with each group of inputs, in the form

\begin{displaymath}
g: \mathbb{R}^{n} \to \mathbb{Y}=\{y_1, \ldots, y_m \}
\end{displaymath} (4.2)

This represents the second way to view a classifier.

The expression (4.1) can always be converted into the form (4.2) by a winner-takes-all vote, selecting the class with the highest confidence:

\begin{displaymath}
g(\mathbf{x}) = \argmax_{y_i} p(y_i\vert\mathbf{x})
\end{displaymath} (4.3)

The classifier, from this perspective, is a function that directly returns the symbol most similar to the provided input. The training set must now associate each input (each element of the space) with exactly one output class $y \in \mathbb{Y}$. Typically, this way of viewing a classifier allows for a reduction in computational complexity and resource utilization.

While function (4.1) represents a transfer function, a response, function (4.2) can be viewed as a partitioning of the space $\mathbb{R}^{n}$ in which regions, generally very complex and non-contiguous in the input space, are associated with a single class.

For the reasons stated previously, it is practically impossible to obtain an optimal classifier (except for problems of very low dimensionality or for simple, perfectly known models). However, several general-purpose classifiers exist that can be considered sub-optimal depending on the problem and on the required performance.

In the case of classifiers of type (4.2), the challenge is to obtain an optimal partitioning of the space; a set of fast primitives is therefore required that does not consume too much memory when $n$ is large. In case (4.1), on the other hand, a function that models the problem very well is explicitly required, while avoiding over-specialization.

Many kinds of information (features) can be extracted from an image for classification purposes. Using the grayscale/color values of the image directly is rarely done in practical applications, because such values are typically influenced by the scene's brightness and, above all, because they would form a very large input space that is difficult to manage.

Therefore, it is essential to extract from the portion of the image to be classified the critical information (features) that best describes its appearance. For this reason, the entire theory presented in section 6 is widely used in machine learning.

Indeed, both Haar features, valued for their extraction speed, and Histograms of Oriented Gradients (HOG, section 6.2), favored for their accuracy, are extensively used. As a compromise and generalization of these two classes of features, Integral Channel Features (ICF, section 6.3) have more recently been proposed.

To reduce the complexity of the classification problem, it can be divided into multiple layers to be addressed independently: the first layer transforms the input space into the feature space, while a second layer performs the actual classification starting from the feature space.

Based on this division into layers, classification techniques can be grouped into three main categories:

Rule-based learning
In this case, both the feature space and the parameters of the classification function are determined by a human user, without utilizing any dataset or training examples;
Machine Learning
The transformation from the input space to the feature space is chosen by the user from a finite set of functions, while the extraction of the model parameters is left to the computer, which analyzes the provided examples (see the sketch after this list);
Representation learning
Both the transformation to feature space and the extraction of model parameters are performed by the computer.
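
As a purely illustrative sketch of the Machine Learning case, assuming the scikit-image and scikit-learn libraries and a hypothetical set of labeled image patches, the two layers remain explicit: the user chooses HOG (section 6.2) as the transformation to the feature space, while the parameters of a linear classifier are estimated from the examples.

\begin{verbatim}
import numpy as np
from skimage.feature import hog       # first layer: input space -> feature space
from sklearn.svm import LinearSVC     # second layer: classifier in the feature space

# Hypothetical toy data: 40 grayscale 64x64 patches divided into two classes.
rng = np.random.default_rng(0)
patches = rng.random((40, 64, 64))
labels = np.repeat([0, 1], 20)

def extract_features(images):
    # Map every patch to its HOG descriptor (the feature space).
    return np.array([hog(img, orientations=9,
                         pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

X = extract_features(patches)          # layer 1, chosen by the user
clf = LinearSVC().fit(X, labels)       # layer 2, parameters learned from the examples
print(clf.predict(X[:5]))              # output labels, in the form (4.2)
\end{verbatim}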

Recently, Representation Learning techniques built from multiple cascaded layers (Deep Learning) have been very successful in solving complex classification problems.

Among the techniques for transforming the input space into the feature space, Principal Component Analysis (PCA, section 2.10.1), a linear unsupervised technique, deserves mention: it reduces the number of inputs to the classifier by removing linearly dependent or irrelevant components, thereby lowering the dimensionality of the problem while preserving as much information as possible.
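
As a sketch, if $\bar{\mathbf{x}}$ denotes the mean of the training samples and $\mathbf{W}_k$ the matrix whose columns are the $k$ eigenvectors of the sample covariance matrix associated with the largest eigenvalues, each input is replaced by its projection

\begin{displaymath}
\mathbf{z} = \mathbf{W}_k^{\top} \left( \mathbf{x} - \bar{\mathbf{x}} \right)
\end{displaymath}

with $k < n$, so that the classifier operates on $\mathbf{z} \in \mathbb{R}^{k}$ rather than on $\mathbf{x} \in \mathbb{R}^{n}$.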

As for the models and modeling techniques, the most widely used are:

Regression
a regression of data to a model is essentially a classifier. Consequently, all the theory presented in chapter 3 can and should be applied to classify data;
Neural Networks
Neural networks allow the generation of functions of the type (4.1) by concatenating sums, multiplications, and strongly nonlinear functions such as sigmoids. Regression techniques enable the estimation of the parameters of this generic model;
Bayesian Classifiers
It is possible to use Bayes' theorem directly as a classifier or to combine multiple classifiers in order to maximize the a posteriori probability of identifying the correct class (section 4.2);
Decision Tree
classifiers cascaded with other classifiers, where each node represents a test on some attribute extracted from the input data;
Decision Stump
a degenerate decision tree (a single node) that partitions the feature space with a simple threshold, making it the simplest classifier of type (4.2) and a typical example of a weak classifier (see the sketch after this list);
Ensemble Learning
Multiple weak classifiers can be related to one another (Ensemble Learning, see section 4.6) in order to maximize some global metric (for example, the margin of separation between classes). These are not true classifiers but rather techniques for combining several simple classifiers into a complex classifier (an ensemble).
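
As a minimal sketch of the decision stump mentioned above, for a two-class problem the classifier reduces to comparing a single feature $x_j$ of the input vector with a threshold $\tau$:

\begin{displaymath}
g(\mathbf{x}) = \left\{ \begin{array}{ll} y_1 & \textrm{if } x_j > \tau \\ y_2 & \textrm{otherwise} \end{array} \right.
\end{displaymath}

where the index $j$ and the threshold $\tau$ are the only parameters to learn; ensemble techniques (section 4.6) combine many of these weak decisions to build the final classifier.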

Figure fig_tm: Example of Template Matching. The approach works well on the source image, but it cannot be extended to other images, especially under significant variations in brightness and scale.


