
AdaBoost and Its Variants

Figure 4.7: Comparison of loss functions: 0/1-loss, logistic, exponential, and quadratic

The problem of Boosting can be generalized and viewed as the search for predictors $f_t(\mathbf{x})$ that minimize the global cost function:

\begin{displaymath}
\sum_{i=1}^{m} \phi \left( y_i \left( f_1(\mathbf{x}_i) + \ldots + f_n(\mathbf{x}_i) \right) \right)
\end{displaymath} (4.65)

where $\phi \in \mathcal{C}^1$ is a convex, non-increasing function with $\lim_{z \to \infty} \phi(z)=0$.

From an analytical perspective, AdaBoost is an example of a coordinate-wise gradient descent optimizer that minimizes the potential function $\phi(z)=e^{-z}$, optimizing one coefficient $\alpha_t$ at a time (LS10), as can be seen from equation (4.54).
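
To make this coordinate-descent reading concrete, the following is a minimal sketch in Python (not code from the text) of Discrete AdaBoost as greedy minimization of the exponential potential; the function name adaboost_exponential and the convention that the weak hypotheses are passed as callables returning labels in $\{-1,+1\}$ are assumptions of the example.

\begin{verbatim}
import numpy as np

def adaboost_exponential(X, y, hypotheses, T=10):
    """Discrete AdaBoost as coordinate descent on phi(z) = exp(-z).

    X: (m, d) samples, y: (m,) labels in {-1, +1},
    hypotheses: list of callables h(X) -> array of labels in {-1, +1}."""
    m = len(y)
    F = np.zeros(m)                     # additive model F_t(x_i)
    w = np.full(m, 1.0 / m)             # sample weights w_i^(t)
    model = []
    for t in range(T):
        # choose the coordinate (hypothesis) with the lowest weighted error
        errors = [np.sum(w * (h(X) != y)) for h in hypotheses]
        k = int(np.argmin(errors))
        h, eps = hypotheses[k], errors[k]
        if eps <= 0 or eps >= 0.5:      # perfect, or no better than chance
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # closed-form step
        F += alpha * h(X)
        w = np.exp(-y * F)              # weights follow the potential exp(-z)
        w /= w.sum()
        model.append((alpha, h))
    return model
\end{verbatim}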

A non-exhaustive list of AdaBoost variants, which sheds light on some peculiarities of the technique, follows:

AdaBoost with Abstention

AdaBoost can also be extended to classifiers with abstention, whose possible outputs are $h_t(x_i) \in \{-1,0,+1\}$. Extending definition (4.57), for simplicity let $W_{-}$, $W_{0}$, and $W_{+}$ denote the total weight of the samples on which the classifier $h_t$ fails, abstains, and succeeds, respectively.

In this case, $Z_t$ reaches its minimum for the same value of $\alpha_t$ as in the case without abstention, see (4.63), and with this choice $Z_t$ becomes

\begin{displaymath}
Z_t = W_0 + 2 \sqrt{ W_{-} W_{+}}
\end{displaymath} (4.66)

However, there exists a more conservative choice of $\alpha_t$, proposed by Freund and Schapire,

\begin{displaymath}
\alpha_t = \frac{1}{2} \log \left( \frac{W_{+} + 1/2 W_0}{W_{-} + 1/2 W_0} \right)
\end{displaymath} (4.67)

that allows an upper bound to be placed on $Z_t$.
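
As a worked illustration, the following sketch computes both choices of $\alpha_t$ from the aggregated weights $W_{+}$, $W_{-}$, and $W_{0}$; the function name alpha_abstention and the boolean switch are assumptions of the example.

\begin{verbatim}
import numpy as np

def alpha_abstention(W_plus, W_minus, W_zero, conservative=True):
    """alpha_t for a weak learner with outputs in {-1, 0, +1}.

    W_plus, W_minus, W_zero: total weight of successes, failures
    and abstentions of the classifier h_t."""
    if conservative:
        # smoothed choice (4.67): half of the abstention mass is shared
        # between the two sides, keeping the logarithm bounded
        return 0.5 * np.log((W_plus + 0.5 * W_zero) /
                            (W_minus + 0.5 * W_zero))
    # same alpha_t as without abstention, for which
    # Z_t = W_0 + 2 sqrt(W_- W_+), equation (4.66)
    return 0.5 * np.log(W_plus / W_minus)
\end{verbatim}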

Real AdaBoost

Real AdaBoost generalizes the previous case and, above all, the additive model itself (FHT00). Instead of using dichotomous hypotheses $h_t(x)$ with an associated weight $\alpha_t$, it directly seeks the function $f_t(x)$ that minimizes equation (4.54).

Real AdaBoost allows the use of weak classifiers that provide a class-probability estimate $p_t(x) = P[y=1 \vert x, w^{(t)} ] \in [0,1]$, the probability that the class $y$ is indeed $+1$ given the observation $x$.

Given the estimate $p_t(x)$, the function $f_t(x)$ that minimizes equation (4.54) is

\begin{displaymath}
f_t(x) = \frac{1}{2} \log \frac{ P[y=+1 \vert x, w^{(t)}] }{ P[y=-1 \vert x, w^{(t)}] } = \frac{1}{2} \log \frac{p_t(x)}{1-p_t(x)}
\end{displaymath} (4.68)

This result is half of the logit (log-odds) transformation. Since the objective remains the minimization of the exponential cost function, the weight update still follows equation (4.60).
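
A single Real AdaBoost step under these definitions could be sketched as follows; the clipping constant eps is an assumption added only to keep the logarithm finite, and the function name real_adaboost_step is illustrative.

\begin{verbatim}
import numpy as np

def real_adaboost_step(p_t, y, w, eps=1e-12):
    """One Real AdaBoost step from a class-probability estimate.

    p_t: (m,) estimates of P[y=+1 | x_i, w^(t)],
    y: (m,) labels in {-1, +1}, w: (m,) current sample weights."""
    p = np.clip(p_t, eps, 1.0 - eps)   # guard the logarithm (added here)
    f = 0.5 * np.log(p / (1.0 - p))    # half logit, equation (4.68)
    w_new = w * np.exp(-y * f)         # exponential update, equation (4.60)
    return f, w_new / w_new.sum()
\end{verbatim}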

Both Discrete and Real AdaBoost, by selecting a weak classifier that satisfies equation (4.68), ensure that the algorithm asymptotically converges to

\begin{displaymath}
\lim_{T \to \infty} F_T(x) = \frac{1}{2} \log \frac{P[y=+1\vert x]}{P[y=-1\vert x]}
\end{displaymath} (4.69)

demonstrating how the AdaBoost algorithm is an iterative procedure that combines multiple weak classifiers to approximate a Bayesian classifier.

Real AdaBoost can also be used with a discrete classifier such as the Decision Stump. Applying equation (4.68) directly to the two possible output states of the Decision Stump (the minimum of $Z_t$ is still straightforward to obtain algebraically), the classifier's responses must take the values

\begin{displaymath}
f(x) = \left\{ \begin{array}{ll}
\frac{1}{2} \log \frac{W_{TP}}{W_{FP}} & x > \theta \\
\frac{1}{2} \log \frac{W_{FN}}{W_{TN}} & x \leq \theta
\end{array}\right.
\end{displaymath} (4.70)

where the values $W_{*}$ denote the sums of the weights associated with False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN). With this choice of values, $Z_t$ takes on the significant value
\begin{displaymath}
Z_t = 2 \left( \sqrt{W_{TP}W_{FP}} + \sqrt{W_{FN}W_{TN}} \right)
\end{displaymath} (4.71)

which serves as the metric for selecting the best feature $x$ and threshold $\theta$.
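
A possible sketch of this selection criterion on a single feature, using $Z_t$ from (4.71) as the score, is given below; the function name best_stump_real and the scan over the unique values of $x$ as candidate thresholds are assumptions of the example.

\begin{verbatim}
import numpy as np

def best_stump_real(x, y, w):
    """Threshold selection for a decision stump on one feature x by
    minimizing Z_t = 2 (sqrt(W_TP W_FP) + sqrt(W_FN W_TN)), eq. (4.71).

    x: (m,) feature values, y: (m,) labels in {-1, +1}, w: (m,) weights."""
    best_Z, best_theta = np.inf, None
    for theta in np.unique(x):
        pos = x > theta                         # region predicted as +1
        W_TP = w[pos & (y == +1)].sum()
        W_FP = w[pos & (y == -1)].sum()
        W_FN = w[~pos & (y == +1)].sum()
        W_TN = w[~pos & (y == -1)].sum()
        Z = 2.0 * (np.sqrt(W_TP * W_FP) + np.sqrt(W_FN * W_TN))
        if Z < best_Z:
            best_Z, best_theta = Z, theta
    return best_Z, best_theta
\end{verbatim}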

Gentle AdaBoost

The weights associated with outliers in Real AdaBoost can become very large because of the logarithm in equation (4.68). In such cases it becomes necessary to make the regression more "gentle."

Gentle AdaBoost further generalizes the concept of Ensemble Learning to additive models (FHT00) by employing regression steps typical of Newton's method.

The hypothesis $f_t(x)$, to be added to the additive model at iteration $t$, is selected from all possible hypotheses $f_k$ as the one that optimizes a weighted least squares regression

\begin{displaymath}
f_t = \argmin_{f_k} \sum_i w_i (y_i - f_k(x_i))^2
\end{displaymath} (4.72)

while at each iteration the AdaBoost weight update (4.60), i.e., the one derived from the exponential loss function, is still used.

Gentle AdaBoost can also be used with the Decision Stump. In this case, the minimizer of (4.72) for this decision algorithm takes the remarkable form

\begin{displaymath}
f(x) = \left\{ \begin{array}{ll}
\frac{W_{TP} - W_{FP}}{W_{TP} + W_{FP}} & x > \theta \\
\frac{W_{FN} - W_{TN}}{W_{TN} + W_{FN}} & x \leq \theta
\end{array}\right.
\end{displaymath} (4.73)
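
The stump responses (4.73) are simply the weighted means of the labels on each side of the threshold, as the following sketch illustrates; the small constant eps added to the denominators only guards against empty sides and, like the function name gentle_stump, is an assumption of the example.

\begin{verbatim}
import numpy as np

def gentle_stump(x, y, w, theta, eps=1e-12):
    """Gentle AdaBoost responses of a decision stump with threshold theta,
    i.e. the weighted least-squares minimizer (4.73) on each side."""
    pos = x > theta
    W_TP = w[pos & (y == +1)].sum()
    W_FP = w[pos & (y == -1)].sum()
    W_FN = w[~pos & (y == +1)].sum()
    W_TN = w[~pos & (y == -1)].sum()
    # weighted means of y on each side; eps only guards empty sides
    f_plus = (W_TP - W_FP) / (W_TP + W_FP + eps)    # x > theta
    f_minus = (W_FN - W_TN) / (W_TN + W_FN + eps)   # x <= theta
    return f_plus, f_minus
\end{verbatim}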

LogitBoost

For historical reasons, AdaBoost does not exhibit an explicit statistical formalism. A first observation is that the output of the AdaBoost classifier is not a probability, since it is not constrained to $[0,1]$. Besides this issue, which is partially addressed by Real AdaBoost, minimizing the loss function (4.53) does not look like a statistical approach, unlike maximizing a likelihood. However, it can be shown that the cost function of AdaBoost maximizes a function very similar to the Bernoulli log-likelihood.

For these reasons, it is possible to extend AdaBoost to the theory of logistic regression, as described in section 3.7.

The additive logistic regression takes the form

\begin{displaymath}
\log \frac{P[y=+1\vert x]}{P[y=-1\vert x]} = F_T(x) = \sum_{t=1}^{T} f_t(x)
\end{displaymath} (4.74)

an interesting expression when compared to that of AdaBoost in equation (4.69). By inverting equation (4.74), we obtain the logistic relationship
\begin{displaymath}
p(x) = P[y=+1\vert x] = \frac{e^{F_T(x)} }{1 + e^{F_T(x)} } = \frac{1}{1+e^{-F_T(x)}}
\end{displaymath} (4.75)

which associates a probability estimate with the additive model $F(x)$.

The problem then becomes one of finding an appropriate loss function for this representation, specifically identifying a variant of AdaBoost that precisely maximizes the Bernoulli log-likelihood (FHT00).

Maximizing the likelihood of (4.75) is equivalent to minimizing the log-loss

\begin{displaymath}
\sum_{i=1}^{m} \log \left( 1 + \exp ( -y_i F_T(x_i) ) \right)
\end{displaymath} (4.76)

LogitBoost first extends AdaBoost to the problem of logistic optimization of a function $F_T(x)$ under the cost function $\phi(z)=\log \left(1 + e^{-z} \right)$, maximizing the Bernoulli log-likelihood using Newton-type iterations.

The working response $z_i$ and the weight $w_i$ associated with each sample are derived directly from the probability estimate

\begin{displaymath}
\begin{array}{l}
z_i = \frac{y_i^{*} - p(x_i)}{p(x_i) (1 - p(x_i) )} \\
w_i = p(x_i) (1 - p(x_i) )
\end{array}\end{displaymath} (4.77)

with $y^{*} \in \left\{0,1\right\}$, and choosing the hypothesis $f_t(x)$ as the least-squares regression of $z_i$ on $x_i$ with weights $w_i$. The updated estimate of $p(x_i)$ is obtained directly from equation (4.75).
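
A single LogitBoost iteration under the parameterization (4.75), with a simple linear weighted least-squares regressor standing in for the weak learner, could be written as follows; the names logitboost_step and fit_weighted_ls, and the use of a linear regressor, are assumptions of this sketch.

\begin{verbatim}
import numpy as np

def fit_weighted_ls(X, z, w):
    """Weighted least-squares fit of z on X with a linear model plus
    intercept (a simple stand-in for the regression weak learner)."""
    A = np.hstack([X, np.ones((len(z), 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], z * sw, rcond=None)
    return lambda Xq: np.hstack([Xq, np.ones((len(Xq), 1))]) @ coef

def logitboost_step(X, y01, F):
    """One LogitBoost iteration with y* in {0, 1} and p(x) as in (4.75).

    F: (m,) current additive model F(x_i); returns the updated F."""
    p = 1.0 / (1.0 + np.exp(-F))                  # equation (4.75)
    w = np.clip(p * (1.0 - p), 1e-12, None)       # weights, equation (4.77)
    z = (y01 - p) / w                             # working response (4.77)
    f = fit_weighted_ls(X, z, w)                  # regress z on x, weights w
    return F + f(X)
\end{verbatim}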

Asymmetric-AdaBoost

Asymmetric-AdaBoost introduces a variant in the weight update rule (VJ01). The issue with AdaBoost is that it does not allow direct control over the weight given to classification errors on the different classes, nor does it allow explicit minimization of the number of false positives, since it focuses only on the overall classification error. The Asymmetric-AdaBoost variants instead modify, at each iteration $t$, the weights associated with positive and negative samples by cost factors $c^{(t)}_{+}$ and $c^{(t)}_{-}$, respectively.
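
A minimal sketch of this asymmetric reweighting is shown below; the function name asymmetric_reweight is illustrative, and the cost factors are simply applied per class before renormalization.

\begin{verbatim}
import numpy as np

def asymmetric_reweight(w, y, c_plus, c_minus):
    """Scale the weights of positive and negative samples by the cost
    factors c_+^(t) and c_-^(t), then renormalize."""
    w = w * np.where(y == +1, c_plus, c_minus)
    return w / w.sum()
\end{verbatim}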

Cascade

Independently of the use of cascaded classifiers (VJ02), in this variant the weights are multiplied by a factor $\beta_t = \epsilon_t/(1-\epsilon_t) = W_{-}/W_{+}$ only when a sample is correctly classified; otherwise, the weights remain unchanged. The weight associated with each classifier is $\alpha_t = - \log \beta_t$, which is twice the weight assigned by AdaBoost.M1.
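
A sketch of this update rule follows; the function name viola_jones_update is illustrative.

\begin{verbatim}
import numpy as np

def viola_jones_update(w, y, h_x, eps_t):
    """Weight update used with cascades: multiply by beta_t = eps/(1-eps)
    only the correctly classified samples; alpha_t = -log(beta_t)."""
    beta = eps_t / (1.0 - eps_t)
    w = w * np.where(h_x == y, beta, 1.0)
    alpha = -np.log(beta)
    return w / w.sum(), alpha
\end{verbatim}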

MAdaBoost

The MAdaBoost algorithm features a distinct weight update mechanism aimed at reducing the contribution of outliers (or overly hard examples) during training. The weight $w^{(t)}_i$ that a sample can assume is capped at $w^{(0)}_i$, the weight assigned at the start of the algorithm.

This behavior can be represented by a cost function of the form

\begin{displaymath}
\phi(z)=\left\{\begin{array}{ll}
1-z & z \leq 0 \\
e^{-z} & z > 0 \\
\end{array}\right.
\end{displaymath} (4.78)
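
A sketch of the capped weight rule and of the cost function (4.78) could look as follows; the names madaboost_weights and phi_madaboost are illustrative.

\begin{verbatim}
import numpy as np

def madaboost_weights(margins, w0):
    """Capped weights: exp(-z_i) is never allowed to exceed 1, so no
    sample weight exceeds its initial value w0_i.

    margins: (m,) values z_i = y_i F(x_i); w0: (m,) initial weights."""
    return np.minimum(np.exp(-margins), 1.0) * w0

def phi_madaboost(z):
    """Piecewise cost function of equation (4.78)."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 1.0 - z, np.exp(-np.maximum(z, 0.0)))
\end{verbatim}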
