Empirical Risk Minimization

The Soft Margin constraint (4.26) can be rewritten as
\begin{displaymath}
y_i f(\mathbf{x}_i) \ge 1 - \xi_i
\end{displaymath} (4.37)

where $f(\mathbf{x}_i)$ may also be a decision function expressed through a generic kernel. This inequality is equivalent to
\begin{displaymath}
\xi_i \geq \max \left(0, 1 - y_i f(\mathbf{x}_i) \right)
\end{displaymath} (4.38)

since $\xi_i \geq 0$: at the optimum each slack variable takes the smallest feasible value, so the inequality holds with equality. The function appearing on the right-hand side of (4.38) is referred to as the hinge loss
\begin{displaymath}
\ell(y, \hat{y}) = \max(0, 1 - y \hat{y} )
\end{displaymath} (4.39)

and has the advantage of being convex, while being non-differentiable only at $y \hat{y} = 1$. The hinge loss is an upper bound on the 0/1 loss: a misclassified sample incurs a hinge loss of at least 1, and correctly classified samples are still penalized if they fall inside the margin.
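
As a quick numerical check, the following short Python/NumPy sketch (with made-up decision values $\hat{y}_i = f(\mathbf{x}_i)$) evaluates the hinge loss (4.39) alongside the 0/1 loss and shows that the former dominates the latter on every sample.

\begin{verbatim}
import numpy as np

def hinge_loss(y, y_hat):
    # Hinge loss (4.39): zero once the margin y*y_hat reaches 1,
    # linear penalty otherwise.
    return np.maximum(0.0, 1.0 - y * y_hat)

def zero_one_loss(y, y_hat):
    # 0/1 loss: 1 on a sign disagreement, 0 otherwise.
    return (y * y_hat <= 0).astype(float)

y     = np.array([+1, +1, -1, -1])
y_hat = np.array([2.0, 0.3, -0.5, 0.8])   # example decision values f(x_i)
print(hinge_loss(y, y_hat))                # [0.  0.7 0.5 1.8]
print(zero_one_loss(y, y_hat))             # [0. 0. 0. 1.]
\end{verbatim}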

The SVM training problem in the case of non-linearly separable data is therefore equivalent to an unconstrained optimization problem over $\mathbf{w}$ of the form

\begin{displaymath}
\min_{\mathbf{w} \in \mathbb{R}^{d}} \Vert \mathbf{w} \Vert^2 + C \sum_{i=1}^{N} \ell \left( y_i, f(\mathbf{x}_i) \right)
\end{displaymath} (4.40)

The objective function is again composed of two clearly distinct parts: the first is a Tikhonov (L2) regularization term, and the second is the empirical risk measured by the hinge loss. The SVM can therefore be viewed as a linear classifier that minimizes the hinge loss with L2 regularization.
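
Since (4.40) is unconstrained, it can be attacked directly with subgradient methods. The sketch below (Python/NumPy, with a hypothetical helper name train_linear_svm and toy data) minimizes the objective for the linear case $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, omitting the bias term for brevity; it is a didactic illustration, not a replacement for a dedicated solver.

\begin{verbatim}
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=200):
    # Subgradient descent on (4.40) with f(x) = w.x and no bias:
    #     min_w ||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1            # samples with non-zero hinge loss
        # d/dw ||w||^2 = 2w; d/dw hinge = -y_i x_i on the active samples.
        grad = 2.0 * w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

# Toy usage on a linearly separable set with labels in {-1, +1}.
X = np.array([[ 2.0,  2.0], [ 1.5,  1.8],
              [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
w = train_linear_svm(X, y, C=10.0)
print(np.sign(X @ w))                   # expected: [ 1.  1. -1. -1.]
\end{verbatim}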

The input data $\mathbf{x}_i$ can fall into three different categories:
