ADAptive BOOSTing

One of the ensemble classifiers that has garnered significant interest from researchers in recent years is undoubtedly AdaBoost (FS95). The fundamental idea behind AdaBoost is to build a list of classifiers iteratively, assigning to each new classifier a weight based on its ability to recognize the samples that were not correctly identified by the classifiers already involved in the training process. All these classifiers cast their votes with the weights assigned to them, and the final decision is made by weighted majority vote.

Boosting techniques generate a classifier in the form of an additive model:

\begin{displaymath}
F_T(\mathbf{x}) = f_1(\mathbf{x}) + f_2(\mathbf{x}) + \ldots + f_T(\mathbf{x}) = \sum_{t=1}^{T} f_t(\mathbf{x})
\end{displaymath} (4.52)

where $f_1,\ldots,f_T$ are the individual classifiers.

Let us examine the case of binary classification, and let $S=\{(\mathbf{x}_1,y_1), \ldots, (\mathbf{x}_m, y_m)\} \subset \mathbb{X} \times \{-1,+1\}$ be the set of $m$ samples available for training.

A common choice in model fitting is least squares regression (an optimal metric in the presence of Gaussian noise, for example): the additive model $F_T(\mathbf{x})$ is obtained by minimizing the quantity $\sum_i \left( y_i - F_T(\mathbf{x}_i) \right)^2$. However, numerous experiments have shown that the quadratic cost function is not the best choice for classification problems.

The AdaBoost approach instead requires the combination of all these classifiers to minimize a different, more suitable cost function, the exponential loss:

\begin{displaymath}
\min_{F} \sum_{i=1}^{m} e^{-y_i F(\mathbf{x}_i)}
\end{displaymath} (4.53)
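As a minimal numerical illustration of the quantity in (4.53) (the toy data and the model $F$ below are arbitrary choices made only for this example, not part of the method):

\begin{verbatim}
import numpy as np

# Toy data (illustrative values only): samples x_i and labels y_i in {-1,+1}.
X = np.array([0.2, 1.5, -0.7, 2.1])
y = np.array([+1, +1, -1, -1])

# A hypothetical additive model F_T(x); here simply the identity score,
# so that sign(F) is the predicted class.
def F(x):
    return x

# Exponential loss of eq. (4.53): small when y_i * F(x_i) is large and
# positive, i.e. when samples are classified correctly with a large margin.
exp_loss = np.sum(np.exp(-y * F(X)))
print(exp_loss)
\end{verbatim}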

Since the global minimization of the function (4.53) over all the terms $f_t$ jointly is usually infeasible, there are two possible ways to proceed: re-optimize all the classifiers already present in the model at every step, or proceed greedily (stagewise), adding one classifier at a time while leaving the previous ones untouched.

AdaBoost addresses the classification problem using the second, stagewise approach.

Under these considerations, the objective of the training process is reduced to identifying an additional classifier $f(\mathbf{x})$ that minimizes the quantity

\begin{displaymath}
f_{T+1} = \argmin_{f} \sum_{i=1}^{m} e^{-y_i \left( F_T(\mathbf{x}_i) + f(\mathbf{x}_i) \right)} =
\argmin_{f} \sum_{i=1}^{m} w_i e^{-y_i f(\mathbf{x}_i)}
\end{displaymath} (4.54)

at each iteration, having defined $w_i = e^{-y_i F_T(\mathbf{x}_i) }$ and exploiting the properties of the exponential function.
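Explicitly, the step only uses the fact that the contribution of the already-trained part $F_T$ does not depend on the new classifier $f$:

\begin{displaymath}
e^{-y_i \left( F_T(\mathbf{x}_i) + f(\mathbf{x}_i) \right)} = e^{-y_i F_T(\mathbf{x}_i)} \, e^{-y_i f(\mathbf{x}_i)} = w_i \, e^{-y_i f(\mathbf{x}_i)}
\end{displaymath}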

AdaBoost is precisely a technique that implements this stagewise minimization.

Let us assume that we have at our disposal a set $\mathcal{H}=\{h_1, \ldots, h_T\}$ of binary classifiers, each of which, evaluated on the sample $\mathbf{x}_i$, with $1 \leq i \leq m$, returns an opinion $h_t(\mathbf{x}_i) \in \{-1,+1\}$.

Let the function $F_T(\mathbf{x};\boldsymbol\alpha)$, defined as

\begin{displaymath}
F_T(\mathbf{x};\alpha_1, \ldots, \alpha_T) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})
\end{displaymath} (4.55)

be a function whose sign represents the classification hypothesis and whose magnitude reflects the quality of the prediction. The model described by equation (4.55) is referred to as the Extended Additive Model or Adaptive Basis-Function Model.

The objective is to obtain a strong classifier $H(\mathbf{x}_i)$ as a weighted linear sum of the classifiers $h_t$, whose sign determines the global hypothesis:

\begin{displaymath}
H(\mathbf{x}_i) = \sgn \left( \sum^{T}_{t=1} \alpha_t h_t(\mathbf{x}_i) \right) = \sgn F_T(\mathbf{x}_i;\boldsymbol\alpha)
\end{displaymath} (4.56)

This is a weighted majority voting scheme: the hypothesis that receives the largest weighted vote from the classifiers, each with its own weight $\alpha_t$, is chosen as the winner. It is precisely these constants $\alpha_t$, the weights assigned to each classifier, that are the result provided by this training technique.
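A minimal sketch of this voting scheme, assuming the weak classifiers are available as callables that return $\pm 1$ (the function and argument names are illustrative, not part of the method):

\begin{verbatim}
import numpy as np

def adaboost_predict(x, classifiers, alphas):
    # classifiers: list of callables h_t(x) returning -1 or +1
    # alphas:      list of the corresponding weights alpha_t
    score = sum(a * h(x) for a, h in zip(alphas, classifiers))
    # The sign of F_T(x; alpha) is the global hypothesis H(x) of eq. (4.56).
    return np.sign(score)
\end{verbatim}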

To enable the assignment of a score to each classifier, every input sample $\mathbf{x}_i$ must carry a weight $w_i$: the higher the weight, the more the sample has been misclassified up to that point of the training; the lower the weight, the more consistently it has been classified correctly. At the first iteration all weights are set to $w_i^{(0)}=1/m$, so that they form a uniform probability distribution over the samples. Variants such as Asymmetric AdaBoost assign different initial weights to the different classes involved.

Let $u_i=y_i h_t(\mathbf{x}_i)$ be the quantity that expresses the success ($+1$) or failure ($-1$) of the classifier $h_t$ in evaluating the sample $\mathbf{x}_i$. Given the weights associated with the samples, for each classifier it is possible to compute $W_{-}$, the sum of the weights of the misclassified samples, and $W_{+}$, the sum of the weights of the correctly classified ones; using the definition of $u_i$, this can be written in compact form as follows:

\begin{displaymath}
W_b = \sum_{ u_i = b } w_i
\end{displaymath} (4.57)

with $b=+1$ representing success and $b=-1$ representing failure.

Let $\epsilon_t$ be the error of the classifier $h_t$, calculated as

\begin{displaymath}
\epsilon_t = \sum_{y_i \neq h_t(\mathbf{x}_i) } w_i^{(t)} = \sum_{u_i=-1} w_i^{(t)} = W_{-}
\end{displaymath} (4.58)

i.e., the sum of the weights associated only with the misclassified samples, and let
\begin{displaymath}
r_t = W_{+}-W_{-} = \sum_{i=1}^{m} w^{(t)}_i u_i
\end{displaymath} (4.59)

be the weighted average, using the weights $w_i$, of the classification performances $u_i$.
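In code, the quantities of equations (4.57)-(4.59) can be computed directly from the sample weights; a sketch with illustrative names:

\begin{verbatim}
import numpy as np

def weak_learner_stats(w, u):
    # w: sample weights w_i^(t), assumed normalized so that they sum to 1
    # u: u_i = y_i * h_t(x_i), +1 on success, -1 on failure
    W_plus  = np.sum(w[u == +1])   # eq. (4.57) with b = +1
    W_minus = np.sum(w[u == -1])   # eq. (4.57) with b = -1
    eps_t   = W_minus              # eq. (4.58): weighted error
    r_t     = W_plus - W_minus     # eq. (4.59): weighted performance
    return W_plus, W_minus, eps_t, r_t
\end{verbatim}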

The iterations of the AdaBoost algorithm are as follows:

  1. An Oracle provides a classifier $h_t$ (the choice is essentially left to the user, aiming to select the classifier that minimizes the error $\epsilon_t$, although it is not mandatory for it to be the best);
  2. The error $\epsilon_t$ produced by the classifier $h_t$ on the input samples is calculated. When it is not possible to find a classifier for which $\epsilon_t < 1/2$, the training cannot proceed and must therefore be terminated;
  3. Given the error, the classifier $h_t$ is assigned a weight $\alpha_t$, calculated as described subsequently;
  4. For each sample $\mathbf{x}_i$, the associated distribution weight $w_i^{(t+1)}$ is updated through the function
    \begin{displaymath}
w_i^{(t+1)} = \frac{1}{Z_t} w_i^{(t)} e^{ - \alpha_t u_i } = \frac{1}{Z_t} w_i^{(t)} e^{ - y_i f_t (\mathbf{x}_i) }
\end{displaymath} (4.60)

The weight of the samples that were correctly classified is multiplied by $e^{-\alpha_t}$, and thus decreased, while the weight of the misclassified samples is multiplied by $e^{\alpha_t}$, and thus increased. $Z_t$ is a normalization factor chosen such that $\sum_i w_i^{(t+1)}=1$, but it also carries an important meaning, explained immediately below.
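Putting the four steps together, a compact sketch of the training loop (the pool of weak classifiers and its interface are assumptions made for this example; the oracle of step 1 is replaced by an exhaustive search over the pool):

\begin{verbatim}
import numpy as np

def adaboost_train(X, y, pool, T):
    # X:    (m, d) array of samples
    # y:    (m,) array of labels in {-1, +1}
    # pool: list of candidate weak classifiers, h(X) -> array of {-1, +1}
    # T:    maximum number of boosting rounds
    m = X.shape[0]
    w = np.full(m, 1.0 / m)              # uniform initial distribution w_i^(0)
    classifiers, alphas = [], []

    for t in range(T):
        # Steps 1-2: pick the weak classifier with the lowest weighted error.
        errors = [np.sum(w[h(X) != y]) for h in pool]
        best = int(np.argmin(errors))
        h_t, eps = pool[best], errors[best]
        if eps >= 0.5:                   # nothing better than chance: stop
            break
        eps = max(eps, 1e-12)            # avoid log(0) for a perfect classifier

        # Step 3: weight of the classifier, eq. (4.63).
        alpha = 0.5 * np.log((1.0 - eps) / eps)

        # Step 4: update and renormalize the sample distribution, eq. (4.60);
        # the division by the sum plays the role of Z_t.
        u = y * h_t(X)                   # +1 on success, -1 on failure
        w = w * np.exp(-alpha * u)
        w = w / np.sum(w)

        classifiers.append(h_t)
        alphas.append(alpha)

    return classifiers, alphas
\end{verbatim}

The two lists returned here can be fed directly to a voting function like the one sketched after equation (4.56).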

The normalization parameter $Z_t$ is given by

\begin{displaymath}
Z_t = \sum_{i=1}^{m} w_i^{(t)} e^{ - \alpha_t u_i } = e^{ - \alpha_t} W_{+} + e^{ \alpha_t} W_{-}
\end{displaymath} (4.61)

and, as an important result of AdaBoost, it can be shown that the training classification error is upper-bounded by
\begin{displaymath}
\frac{1}{m} \left| \{ i : H(\mathbf{x}_i) \neq y_i \} \right| \leq \prod_{t=1}^{T} Z_t
\end{displaymath} (4.62)

For this reason, $Z_t$ is precisely the quantity to minimize in order to obtain the optimal classifier. A direct consequence of this result is that AdaBoost can be viewed as a framework that minimizes $\prod_t Z_t$.
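A sketch of why the bound (4.62) holds, following the standard argument: repeatedly applying the update (4.60) starting from $w_i^{(0)}=1/m$ gives

\begin{displaymath}
w_i^{(T)} = \frac{ w_i^{(0)} \, e^{-y_i F_T(\mathbf{x}_i)} }{ \prod_{t=1}^{T} Z_t } = \frac{ e^{-y_i F_T(\mathbf{x}_i)} }{ m \prod_{t=1}^{T} Z_t }
\end{displaymath}

and, since a misclassification $H(\mathbf{x}_i) \neq y_i$ implies $y_i F_T(\mathbf{x}_i) \leq 0$ and hence $e^{-y_i F_T(\mathbf{x}_i)} \geq 1$, while the weights $w_i^{(T)}$ sum to one,

\begin{displaymath}
\frac{1}{m} \left| \{ i : H(\mathbf{x}_i) \neq y_i \} \right| \leq \frac{1}{m} \sum_{i=1}^{m} e^{-y_i F_T(\mathbf{x}_i)} = \prod_{t=1}^{T} Z_t
\end{displaymath}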

The optimal choice of $\alpha_t$ (and consequently that of $h_t$) is the one where the function (4.61) reaches its minimum, specifically at

\begin{displaymath}
\alpha_t = \frac{1}{2} \log \left( \frac{1-\epsilon_t}{\epsilon_t} \right) = \frac{1}{2} \log \frac{ W_{+} }{ W_{-} } = \frac{1}{2} \log \left( \frac{1+r_t}{1-r_t} \right)
\end{displaymath} (4.63)

With this particular choice of $\alpha_t$, $Z_t$ also reaches its minimum and is equal to
\begin{displaymath}
Z_t = 2 \sqrt{ \epsilon_t (1-\epsilon_t) } = 2 \sqrt{ W_{-} W_{+} }
\end{displaymath} (4.64)

From equation (4.64), it follows that $Z_t$ is minimized by selecting the classifier $h_t$ with the lowest value of $\epsilon_t$, which corresponds to the maximum of $W_{+}$.
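For completeness, the value (4.63) is obtained by setting the derivative of (4.61) with respect to $\alpha_t$ to zero:

\begin{displaymath}
\frac{\partial Z_t}{\partial \alpha_t} = -e^{-\alpha_t} W_{+} + e^{\alpha_t} W_{-} = 0 \quad \Longrightarrow \quad \alpha_t = \frac{1}{2} \log \frac{ W_{+} }{ W_{-} }
\end{displaymath}

and substituting this value back into (4.61) yields exactly (4.64).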

With the weight of equation (4.63) that minimizes $Z_t$, after each iteration of AdaBoost the weights of the correctly identified samples are decreased by the factor $\exp(-\alpha_t) = \sqrt{ W_{-} / W_{+} }$, while the weights of the samples misclassified by the hypothesis $h_t$ are increased by the factor $\exp(\alpha_t) = \sqrt{ W_{+} / W_{-} }$.
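As a numerical illustration (the value is chosen arbitrarily): for a weak classifier with $\epsilon_t = W_{-} = 0.2$, and hence $W_{+} = 0.8$, equation (4.63) gives $\alpha_t = \frac{1}{2}\log 4 = \log 2 \approx 0.69$; the weights of the misclassified samples are multiplied by $e^{\alpha_t} = \sqrt{W_{+}/W_{-}} = 2$, those of the correctly classified samples by $e^{-\alpha_t} = 0.5$, and the bound factor of equation (4.64) is $Z_t = 2\sqrt{0.2 \cdot 0.8} = 0.8$.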

This algorithm is what the literature refers to as AdaBoost.M1 or Discrete AdaBoost (FHT00). The hypotheses $h_t(\mathbf{x})$ used by Discrete AdaBoost are binary functions that can take only the values $\{+1,-1\}$.

The intuitive behavior of AdaBoost is quite straightforward: as each new classifier is added to the ensemble, AdaBoost concentrates on the input patterns that have so far been misclassified the most.

AdaBoost admits several interpretations: as a classifier that maximizes the margin, as logistic regression for an additive model, as gradient descent with discrete steps, and as stagewise regression based on Newton's method.

AdaBoost, like the SVM, aims to maximize the separation margin between classes, albeit using different metrics. In this way, both methods turn out to be less sensitive to problems such as overfitting.
