One of the ensemble classifiers that has attracted the most interest from researchers in recent years is undoubtedly AdaBoost (FS95). The fundamental idea behind AdaBoost is to build a list of classifiers iteratively, assigning a weight to each new classifier according to its ability to recognize the samples that were misclassified by the classifiers already involved in the training process. All these classifiers cast their votes with the weights assigned to them, and the final decision is taken by weighted majority vote.
Boosting techniques allow for the generation of a classifier in the form of an additive model:

$$F(\mathbf{x}) = \sum_{t=1}^{T} c_t f_t(\mathbf{x}) \qquad (4.52)$$
Let us examine the case of binary classification, and let $(\mathbf{x}_i, y_i)$, with $y_i \in \{-1, +1\}$ and $i = 1, \ldots, n$, be the set of $n$ samples available for training.

A common choice in the context of model fitting is to use least squares regression (an optimal metric in the presence of Gaussian noise, for example) to obtain the additive model $F(\mathbf{x})$, by minimizing the quantity $\sum_{i=1}^{n} \left( y_i - F(\mathbf{x}_i) \right)^2$. However, extensive experimentation has shown that the quadratic cost function is not the optimal choice in classification problems.
The AdaBoost approach suggests combining all these classifiers so as to minimize a different, more favorable cost function, namely the exponential loss:

$$J(F) = \sum_{i=1}^{n} e^{-y_i F(\mathbf{x}_i)} \qquad (4.53)$$
Since the global minimization of the function (4.53) is usually impossible, one proceeds with a stagewise (greedy) strategy: the classifiers already in the ensemble are kept fixed, and at each step a single new term is added to the additive model. Under these considerations, the objective of the training process is reduced to identifying an additional classifier $f$, together with its coefficient $c$, that minimizes the quantity

$$J(F + cf) = \sum_{i=1}^{n} e^{-y_i \left( F(\mathbf{x}_i) + c f(\mathbf{x}_i) \right)} \qquad (4.54)$$
AdaBoost is a technique that addresses all these requirements.
Let us assume that we have at our disposal $T$ binary classifiers $f_t$, each of which, by evaluating the feature $\mathbf{x}$, with $t = 1, \ldots, T$, returns an opinion $f_t(\mathbf{x}) \in \{-1, +1\}$.
Let $F(\mathbf{x})$ be the function defined as

$$F(\mathbf{x}) = \sum_{t=1}^{T} c_t f_t(\mathbf{x}) \qquad (4.55)$$

The objective is to obtain a strong classifier $H(\mathbf{x})$ as a weighted linear sum of the classifiers $f_t(\mathbf{x})$, whose sign determines the global hypothesis:

$$H(\mathbf{x}) = \operatorname{sign} F(\mathbf{x}) \qquad (4.56)$$
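As an illustration of this weighted vote, the following minimal sketch (with hypothetical coefficients and votes) shows how a single high-confidence classifier can overrule the simple majority:

```python
import numpy as np

c = np.array([0.9, 0.4, 0.3])    # classifier coefficients c_t (hypothetical)
votes = np.array([+1, -1, -1])   # weak hypotheses f_t(x) for one sample x
F = np.dot(c, votes)             # F(x) = sum_t c_t f_t(x) = 0.9 - 0.4 - 0.3
H = 1 if F >= 0 else -1          # H(x) = sign F(x)
print(F, H)                      # F is positive: H = +1 despite two votes for -1
```

Here two of the three weak classifiers vote $-1$, yet the strong classifier outputs $+1$ because the first classifier carries more weight than the other two combined.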
To enable the assignment of a score to each classifier, every input sample must be assigned a weight $w_i$: the higher the weight, the more the sample has been misclassified up to this point in the training; the lower the weight, the more often it has been classified correctly. In the first iteration all weights are set equal to $w_i = 1/n$, yielding a uniform statistical distribution. Variants such as Asymmetric AdaBoost assign different initial weights to the various categories involved.
Let $u_i = y_i f(\mathbf{x}_i)$ be the function that expresses the success ($u_i = +1$) or failure ($u_i = -1$) of the classifier $f$ in evaluating the sample $\mathbf{x}_i$. Given the weights associated with each sample, it is possible for each classifier to calculate $W_e$, the sum of the weights associated with failures, and $W_c$, the sum of the weights associated with correct classifications, through the definition of $u_i$, in compact form as follows:

$$W_c = \sum_{i \,:\, u_i = +1} w_i \qquad W_e = \sum_{i \,:\, u_i = -1} w_i \qquad (4.57)$$
Let $\varepsilon$ be the measure of the classifier's error, calculated as

$$\varepsilon = \frac{W_e}{W_c + W_e} \qquad (4.58)$$

which, when the weights are normalized so that $W_c + W_e = 1$, reduces to

$$\varepsilon = W_e \qquad (4.59)$$
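A minimal NumPy sketch (on hypothetical toy data) of how $W_c$, $W_e$ and the error of equation (4.58) are computed:

```python
import numpy as np

# Hypothetical toy data: labels y, weak-classifier outputs f(x), sample weights w.
y = np.array([+1, +1, -1, -1, +1])
f = np.array([+1, -1, -1, +1, +1])   # the weak classifier's votes
w = np.full(5, 1.0 / 5)              # uniform initial weights, w_i = 1/n

u = y * f                            # u_i = +1 on success, -1 on failure
W_c = w[u == +1].sum()               # total weight of correctly classified samples
W_e = w[u == -1].sum()               # total weight of misclassified samples
eps = W_e / (W_c + W_e)              # error, eq. (4.58)
print(W_c, W_e, eps)
```

With the uniform weights above, the two misclassified samples give $W_e = 0.4$ and hence $\varepsilon = 0.4$, consistent with (4.59) since the weights are normalized.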
The iterations of the AdaBoost algorithm are as follows: at each step $t$, the weak classifier $f_t$ with the smallest weighted error is selected, its coefficient $c_t$ is computed, and the sample weights are updated according to

$$w_i^{(t+1)} = \frac{w_i^{(t)} \, e^{-c_t y_i f_t(\mathbf{x}_i)}}{Z_t} \qquad (4.60)$$

With this choice of weights, the cost (4.53), seen as a function of the coefficient $c$, takes the form

$$Z(c) = W_c e^{-c} + W_e e^{c} \qquad (4.61)$$
The normalization parameter is given by

$$Z_t = \sum_{i=1}^{n} w_i^{(t)} \, e^{-c_t y_i f_t(\mathbf{x}_i)} = W_c e^{-c_t} + W_e e^{c_t} \qquad (4.62)$$
The optimal choice of $c_t$ (and consequently that of $f_t$) is the one where the function (4.61) reaches its minimum, specifically at

$$c_t = \frac{1}{2} \ln \frac{W_c}{W_e} = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t} \qquad (4.63)$$
Choosing the weight $c_t$ from equation (4.63), which minimizes $Z$, after each iteration of AdaBoost the weights associated with correctly identified samples are decreased by a factor of $e^{-c_t}$, that is $\sqrt{\varepsilon_t / (1 - \varepsilon_t)}$, while the weights associated with samples incorrectly classified by the hypothesis $f_t$ are increased by a factor of $e^{c_t}$, that is $\sqrt{(1 - \varepsilon_t) / \varepsilon_t}$.
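One consequence of this choice, easy to verify numerically, is that after the update the misclassified samples carry exactly as much total weight as the correctly classified ones, forcing the next weak learner to do better than chance on the hard cases. A minimal sketch (NumPy, hypothetical data):

```python
import numpy as np

# Hypothetical toy data: labels, one weak classifier's votes, uniform weights.
y = np.array([+1, +1, -1, -1, +1])
f = np.array([+1, -1, -1, +1, +1])
w = np.full(5, 1.0 / 5)

u = y * f                             # +1 on success, -1 on failure
eps = w[u == -1].sum()                # error (weights are already normalized)
c = 0.5 * np.log((1 - eps) / eps)     # optimal coefficient, eq. (4.63)

w_new = w * np.exp(-c * u)            # shrink correct, grow misclassified
w_new /= w_new.sum()                  # normalize by Z
print(w_new[u == +1].sum(), w_new[u == -1].sum())
```

Both printed sums equal $1/2$: correct samples were scaled by $\sqrt{\varepsilon/(1-\varepsilon)}$ and incorrect ones by $\sqrt{(1-\varepsilon)/\varepsilon}$, exactly balancing the two groups.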
This algorithm is what is referred to in the literature as AdaBoost.M1 or Discrete AdaBoost (FHT00). The hypotheses $f_t(\mathbf{x})$ used by AdaBoost are features that can take on only the values $\{-1, +1\}$.
The intuitive functioning of AdaBoost is quite straightforward: as each new classifier is added to the ensemble, AdaBoost focuses on the input patterns that have so far been misclassified the most.
AdaBoost admits several interpretations: as a classifier that maximizes the margin, as logistic regression on an additive model, as a gradient-descent minimizer that proceeds in discrete steps, and as regression using Newton's method.
AdaBoost, like the SVM, aims to maximize the separation margin between the classes, albeit using different metrics. In this way, both methods turn out to be less sensitive to problems such as overfitting.
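Putting the pieces together, the whole procedure can be sketched compactly in Python. This is an illustrative implementation of Discrete AdaBoost with decision-stump weak learners (the stump search and all function names are assumptions for the sketch, not the book's notation):

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the threshold stump with the smallest weighted error.
    Returns (feature index, threshold, polarity, weighted error)."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(s * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[3]:
                    best = (j, thr, s, err)
    return best

def adaboost(X, y, T=10):
    """Discrete AdaBoost.M1: returns a list of (c_t, stump) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # uniform initial weights
    ensemble = []
    for _ in range(T):
        j, thr, s, eps = fit_stump(X, y, w)    # weak learner with lowest error
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against eps in {0, 1}
        c = 0.5 * np.log((1 - eps) / eps)      # coefficient, eq. (4.63)
        pred = np.where(s * (X[:, j] - thr) >= 0, 1, -1)
        w = w * np.exp(-c * y * pred)          # reweight samples, eq. (4.60)
        w /= w.sum()                           # normalize by Z_t
        ensemble.append((c, (j, thr, s)))
    return ensemble

def predict(ensemble, X):
    """Strong hypothesis H(x) = sign(sum_t c_t f_t(x)), eq. (4.56)."""
    F = np.zeros(len(X))
    for c, (j, thr, s) in ensemble:
        F += c * np.where(s * (X[:, j] - thr) >= 0, 1, -1)
    return np.where(F >= 0, 1, -1)
```

On a toy one-dimensional dataset such as `X = [[0], [1], [2], [3]]` with labels `[-1, -1, +1, +1]`, a single stump already separates the two classes, and the trained ensemble reproduces the labels exactly.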
Paolo Medici