The problem of Boosting can be generalized and viewed as the search for the predictor $F(\mathbf{x})$ that minimizes the global cost function
$J(F) = E\left[ e^{-y F(\mathbf{x})} \right]$ | (4.65) |
From an analytical perspective, AdaBoost is an example of a coordinate-wise gradient descent optimizer that minimizes the potential function $e^{-y F(\mathbf{x})}$, optimizing one coefficient $\alpha_t$ at a time (LS10), as can be seen from equation (4.54).
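As an illustration of this coordinate-wise view, the following is a minimal sketch of Discrete AdaBoost: at every round the coefficient $\alpha_t$ of the newly selected hypothesis is the closed-form minimizer of the exponential cost (4.65) restricted to that coordinate. The weak-learner interface (callables returning labels in $\{-1,+1\}$) and all identifiers are illustrative assumptions, not taken from the text.

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Minimal sketch of Discrete AdaBoost: at each round one coefficient
    alpha_t of the additive model F(x) = sum_t alpha_t h_t(x) is chosen in
    closed form so as to minimize the exponential cost (4.65) along that
    coordinate. `weak_learners` is assumed to be a list of callables
    mapping X to labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # sample weights, kept normalized
    ensemble = []                      # list of (alpha_t, h_t)
    for _ in range(T):
        # pick the hypothesis with the smallest weighted error
        errors = [np.sum(w[h(X) != y]) for h in weak_learners]
        best = int(np.argmin(errors))
        h, eps = weak_learners[best], float(errors[best])
        if eps >= 0.5:                 # no weak learner better than chance
            break
        eps = max(eps, 1e-12)          # numerical guard
        # closed-form coordinate step minimizing sum_i w_i e^{-alpha y_i h(x_i)}
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        ensemble.append((alpha, h))
        # multiplicative exponential reweighting, then renormalization
        w = w * np.exp(-alpha * y * h(X))
        w = w / w.sum()
    return ensemble

def predict(ensemble, X):
    """Sign of the additive model F(x) built by `adaboost`."""
    F = sum(alpha * h(X) for alpha, h in ensemble)
    return np.sign(F)
```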
What follows is a non-exhaustive list of variants of AdaBoost that sheds light on some peculiarities of this technique:
AdaBoost can also be extended to classifiers with abstention, whose possible outputs are $\{-1, 0, +1\}$. Expanding the definition (4.57), for simplicity let $W_-$ denote the total weight of the failures, $W_0$ that of the abstentions, and $W_+$ that of the successes of the classifier $h_t$.
In this case $Z_t$ also reaches its minimum for the same choice of $\alpha_t$ as in the case without abstention, see (4.63), and with such a choice
$Z_t = W_0 + 2\sqrt{W_+ W_-}$ | (4.66) |
holds true.
However, there exists a more conservative choice of $\alpha_t$, proposed by Freund and Schapire,
$\alpha_t = \frac{1}{2} \ln \frac{W_+ + \frac{1}{2} W_0}{W_- + \frac{1}{2} W_0}$ | (4.67) |
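A small numerical sketch of the abstention case may help: it splits the sample weight into $W_-$, $W_0$ and $W_+$ and computes both the $Z$-minimizing coefficient and the more conservative one of (4.67) as reconstructed above. Function and variable names are illustrative.

```python
import numpy as np

def abstention_alpha(h_out, y, w):
    """Sketch for a weak classifier with abstention, h_out in {-1, 0, +1}.
    Splits the total sample weight into failures (W-), abstentions (W0)
    and successes (W+), then returns the two choices of alpha discussed
    in the text: the Z-minimizing one and the more conservative one."""
    margin = h_out * y                    # +1 success, 0 abstention, -1 failure
    W_plus = w[margin > 0].sum()
    W_zero = w[margin == 0].sum()
    W_minus = w[margin < 0].sum()

    eps = 1e-12                           # numerical guard against empty bins
    alpha_min_Z = 0.5 * np.log((W_plus + eps) / (W_minus + eps))
    Z = W_zero + 2.0 * np.sqrt(W_plus * W_minus)      # value of (4.66)

    # conservative variant: abstentions counted half as success, half as failure
    alpha_conservative = 0.5 * np.log((W_plus + 0.5 * W_zero + eps)
                                      / (W_minus + 0.5 * W_zero + eps))
    return alpha_min_Z, alpha_conservative, Z
```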
Real AdaBoost generalizes the previous case and, above all, generalizes the same extended additive model (FHT00). Instead of using dichotomous hypotheses and associating a weight $\alpha_t$ with them, it directly seeks the feature $f_t(\mathbf{x})$ that minimizes equation (4.54).
Real AdaBoost allows the use of weak classifiers that provide a probability estimate $p_t(\mathbf{x}) = \hat{P}_w\left(y = +1 \mid \mathbf{x}\right)$, the probability that the class $y$ is indeed $+1$ given the observation of the feature $\mathbf{x}$.
Given the probability estimate $p_t(\mathbf{x})$, the feature $f_t(\mathbf{x})$ that minimizes equation (4.54) is
$f_t(\mathbf{x}) = \frac{1}{2} \log \frac{p_t(\mathbf{x})}{1 - p_t(\mathbf{x})}$ | (4.68) |
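This expression follows from a short pointwise minimization of the exponential cost; a sketch of the derivation, in the notation used above, is

```latex
% At a fixed x the expected exponential cost of adding f(x) is
%   E_w[ e^{-y f(x)} | x ] = p(x) e^{-f(x)} + (1 - p(x)) e^{f(x)}.
% Setting its derivative with respect to f(x) to zero gives (4.68):
\begin{aligned}
-\,p(\mathbf{x})\,e^{-f(\mathbf{x})} + \bigl(1 - p(\mathbf{x})\bigr)\,e^{f(\mathbf{x})} &= 0 \\
e^{2 f(\mathbf{x})} &= \frac{p(\mathbf{x})}{1 - p(\mathbf{x})} \\
f(\mathbf{x}) &= \frac{1}{2} \log \frac{p(\mathbf{x})}{1 - p(\mathbf{x})}
\end{aligned}
```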
Both Discrete and Real AdaBoost, by selecting at each round a weak classifier that satisfies equation (4.68), ensure that AdaBoost asymptotically converges to half the log-odds of the class posterior, $F(\mathbf{x}) \to \frac{1}{2} \log \frac{P(y = +1 \mid \mathbf{x})}{P(y = -1 \mid \mathbf{x})}$.
Real AdaBoost can also be used with a discrete classifier such as the Decision Stump. By directly applying equation (4.68) to the two possible output states of the Decision Stump (it is still straightforward to obtain the minimum of (4.54) algebraically), the classifier's responses must take the values
$f_t(\mathbf{x}) = \frac{1}{2} \log \frac{W_+^{(1)}}{W_-^{(1)}}$ for $\mathbf{x}$ in the first branch | (4.70) |
$f_t(\mathbf{x}) = \frac{1}{2} \log \frac{W_+^{(2)}}{W_-^{(2)}}$ for $\mathbf{x}$ in the second branch | (4.71) |
where $W_+^{(k)}$ and $W_-^{(k)}$ denote the total weight of the positive and negative samples falling in branch $k$ of the stump.
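The following sketch trains such a Real AdaBoost stump on a scalar feature with a fixed threshold, using the branch-restricted weights of (4.70)-(4.71); the threshold search is omitted and all names are illustrative.

```python
import numpy as np

def real_adaboost_stump(x, y, w, threshold, eps=1e-12):
    """Sketch of a Real AdaBoost decision stump on a scalar feature `x`
    with a fixed `threshold`: each branch outputs half the log-ratio of
    the weighted positives and negatives that fall into it, cf. (4.70)-(4.71)."""
    left = x <= threshold
    outputs = {}
    for name, mask in (("left", left), ("right", ~left)):
        W_plus = w[mask & (y == +1)].sum()
        W_minus = w[mask & (y == -1)].sum()
        outputs[name] = 0.5 * np.log((W_plus + eps) / (W_minus + eps))

    def f(x_new):
        # real-valued response of the trained stump
        return np.where(x_new <= threshold, outputs["left"], outputs["right"])
    return f
```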
Gentle AdaBoost further generalizes the concept of Ensemble Learning to additive models (FHT00) by employing regression steps typical of Newton's method, updating the model as $F(\mathbf{x}) \leftarrow F(\mathbf{x}) + f_t(\mathbf{x})$.
The hypothesis $f_t(\mathbf{x})$, to be added to the additive model at iteration $t$, is selected from all possible hypotheses $f$ as the one that minimizes the weighted least-squares regression
$f_t = \arg\min_{f} \sum_i w_i \left( y_i - f(\mathbf{x}_i) \right)^2$ | (4.72) |
Gentle AdaBoost can also be used with the Decision Stump. In this case the minimum of (4.72) takes a remarkably simple form: each branch of the stump outputs the weighted mean of the labels that fall into it,
$f_t(\mathbf{x}) = \frac{W_+^{(k)} - W_-^{(k)}}{W_+^{(k)} + W_-^{(k)}}$ for $\mathbf{x}$ in branch $k$ | (4.73) |
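A minimal sketch of one Gentle AdaBoost round with such a stump follows: each branch returns the weighted mean of its labels, the minimizer of (4.72), and the sample weights are then updated with the usual exponential rule. The fixed threshold and all identifiers are illustrative assumptions.

```python
import numpy as np

def gentle_adaboost_stump_round(x, y, w, threshold):
    """Sketch of one Gentle AdaBoost round with a decision stump on a scalar
    feature: each branch returns the weighted mean of the labels falling
    into it (the minimizer of the weighted least-squares cost (4.72)),
    and the sample weights are then updated with the exponential rule."""
    left = x <= threshold
    c = {}
    for name, mask in (("left", left), ("right", ~left)):
        c[name] = np.sum(w[mask] * y[mask]) / max(np.sum(w[mask]), 1e-12)

    f_x = np.where(left, c["left"], c["right"])   # f_t(x_i) for every sample
    w_new = w * np.exp(-y * f_x)                  # reweight as in AdaBoost
    w_new /= w_new.sum()
    return c, w_new
```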
For historical reasons, AdaBoost does not explicitly exhibit a statistical formalism. The first observation is that the output of the AdaBoost classifier is not a probability, since it is not constrained between $0$ and $1$. Beyond this issue, which is partially addressed by Real AdaBoost, minimizing the loss function (4.53) does not appear to be a statistical approach, unlike maximizing a likelihood. However, it is possible to show that minimizing the cost function of AdaBoost is very close to maximizing the Bernoulli log-likelihood.
For these reasons, it is possible to extend AdaBoost to the theory of logistic regression, as described in section 3.7.
The additive logistic regression model takes the form
$p(\mathbf{x}) = P\left(y = +1 \mid \mathbf{x}\right) = \frac{e^{F(\mathbf{x})}}{e^{F(\mathbf{x})} + e^{-F(\mathbf{x})}}$ | (4.75) |
The problem then becomes one of finding an appropriate loss function for this representation, specifically identifying a variant of AdaBoost that precisely maximizes the Bernoulli log-likelihood (FHT00).
Maximizing the likelihood of (4.75) is equivalent to minimizing the log-loss
$L(F) = E\left[ \log\left( 1 + e^{-2 y F(\mathbf{x})} \right) \right]$ | (4.76) |
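This equivalence can be checked directly, assuming labels $y \in \{-1, +1\}$ and the model (4.75):

```latex
% Negative Bernoulli log-likelihood of the model (4.75):
%   y = +1:  -log p(x)       = log( (e^{F} + e^{-F}) / e^{F}  ) = log( 1 + e^{-2F(x)} )
%   y = -1:  -log (1 - p(x)) = log( (e^{F} + e^{-F}) / e^{-F} ) = log( 1 + e^{+2F(x)} )
% Both cases collapse into a single expression,
\begin{equation*}
-\log P(y \mid \mathbf{x}) \;=\; \log\!\left( 1 + e^{-2\,y\,F(\mathbf{x})} \right),
\end{equation*}
% so maximizing the Bernoulli log-likelihood is the same as minimizing (4.76).
```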
LogitBoost extends AdaBoost to the problem of the logistic optimization of a function $F(\mathbf{x})$ under the cost function (4.76), maximizing the Bernoulli log-likelihood by means of Newton-type iterations.
The weights associated with each sample are derived directly from the probability estimate
$w_i = p(\mathbf{x}_i) \left( 1 - p(\mathbf{x}_i) \right)$ | (4.77) |
This behavior can be represented by a weighted least-squares cost function of the form
$\sum_i w_i \left( z_i - f(\mathbf{x}_i) \right)^2, \qquad z_i = \frac{y_i^* - p(\mathbf{x}_i)}{p(\mathbf{x}_i)\left(1 - p(\mathbf{x}_i)\right)}$ | (4.78) |
where $y_i^* = (y_i + 1)/2 \in \{0, 1\}$ and $z_i$ is the working response of the Newton step.
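The following is a minimal sketch of one LogitBoost Newton step under these definitions: the weights follow (4.77), the working response is fitted by weighted least squares with a stump on a scalar feature, and the $1/2$ step size of FHT00 is used. The fixed threshold and all identifiers are illustrative.

```python
import numpy as np

def logitboost_round(x, y, F, threshold):
    """Sketch of one LogitBoost Newton step on a scalar feature `x`:
    labels y in {-1,+1}, F is the current additive model evaluated on the
    samples. Weights follow (4.77); the working response is fitted by
    weighted least squares with a stump at `threshold`."""
    y01 = (y + 1) / 2                          # map labels to {0, 1}
    p = 1.0 / (1.0 + np.exp(-2.0 * F))         # probability under model (4.75)
    w = np.clip(p * (1.0 - p), 1e-12, None)    # Newton weights, cf. (4.77)
    z = (y01 - p) / w                          # working response of the Newton step

    # weighted least-squares fit of z with a decision stump on x, cf. (4.78)
    left = x <= threshold
    c_left = np.sum(w[left] * z[left]) / (np.sum(w[left]) + 1e-12)
    c_right = np.sum(w[~left] * z[~left]) / (np.sum(w[~left]) + 1e-12)
    f_x = np.where(left, c_left, c_right)

    return F + 0.5 * f_x                       # FHT00 uses a 1/2 step size
```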