SVM and Kernel Functions

Despite the soft margin, some problems are intrinsically non-separable in the input space. However, from knowledge of the problem it is often possible to infer a nonlinear transformation $\phi:\mathcal{X} \to \mathcal{F}$ that maps the input space $\mathcal{X}$ into a feature space $\mathcal{F}$ in which a separating hyperplane discriminates the categories better. The discriminant function in the space $\mathcal{F}$ is given by

\begin{displaymath}
f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b
\end{displaymath} (4.30)

To enable separation, the space $\mathcal{F}$ typically has a higher dimension than the space $\mathcal{X}$. This increase in dimension raises the computational complexity of the problem and its demand for resources. Kernel methods address this issue.
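
As a purely illustrative check (not part of the original treatment), the sketch below verifies numerically that, for a two-dimensional input and the degree-2 polynomial kernel introduced later in equation (4.35), the explicit map into the 6-dimensional space $\mathcal{F}$ and the kernel evaluation in $\mathcal{X}$ yield the same inner product; the helper name phi2 is hypothetical.

\begin{verbatim}
import numpy as np

def phi2(x):
    # Explicit degree-2 polynomial feature map for a 2-D input:
    # <phi2(x), phi2(x')> = (x . x' + 1)^2
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1]])

x  = np.array([0.5, -1.0])
xp = np.array([2.0,  0.3])

explicit = phi2(x) @ phi2(xp)      # inner product computed in F (6-D)
implicit = (x @ xp + 1.0) ** 2     # same value, computed in X (2-D)
assert np.isclose(explicit, implicit)
\end{verbatim}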

The vector $\mathbf{w}$ is a linear combination of the training samples (in the hard-margin case, of the support vectors only):

\begin{displaymath}
\mathbf{w} = \sum_i \alpha_i \phi(\mathbf{x}_i)
\end{displaymath} (4.31)

The discriminant function thus takes the form
\begin{displaymath}
\begin{array}{rl}
f(\mathbf{x}) & = \sum_i \alpha_i \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}) + b \\
 & = \sum_i \alpha_i k(\mathbf{x}, \mathbf{x}_i) + b
\end{array}\end{displaymath} (4.32)

where $k(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$ is the kernel function: the inner product in $\mathcal{F}$ is evaluated without ever computing $\phi$ explicitly.
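
A minimal sketch of equation (4.32) follows, assuming the coefficients $\alpha_i$, the retained samples and the bias $b$ are already available from training; the function and variable names are illustrative, not a reference implementation.

\begin{verbatim}
def decision_function(x, alphas, support_vectors, b, kernel):
    # f(x) = sum_i alpha_i k(x, x_i) + b          (equation 4.32)
    return sum(a * kernel(x, xi)
               for a, xi in zip(alphas, support_vectors)) + b
\end{verbatim}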

When evaluating the discriminant function it is therefore necessary to keep the support vectors (at least those with a non-negligible coefficient $\alpha_i$). In effect, a kernel SVM selects certain samples of the training set and uses them to measure how close the sample under evaluation lies to each of them.

The bias is obtained directly from equation (4.32) by averaging over the support vectors:

\begin{displaymath}
b = \E[ y_j - \sum_i \alpha_i k(\mathbf{x}_j, \mathbf{x}_i) ]
\end{displaymath} (4.33)
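
The averaging of equation (4.33) can be sketched as follows, assuming the signed coefficients $\alpha_i$, the support vectors and their labels $y_j$ are given; the names are again illustrative.

\begin{verbatim}
import numpy as np

def bias(alphas, support_vectors, labels, kernel):
    # b = E_j[ y_j - sum_i alpha_i k(x_j, x_i) ]  (equation 4.33)
    K = np.array([[kernel(xj, xi) for xi in support_vectors]
                  for xj in support_vectors])
    return np.mean(labels - K @ alphas)
\end{verbatim}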

The most commonly used kernels, owing to their ease of evaluation, are Gaussian (RBF) kernels of the form

\begin{displaymath}
k(\mathbf{x},\mathbf{x}') = e^{-\gamma \Vert\mathbf{x}-\mathbf{x}'\Vert^2 }
\end{displaymath} (4.34)

with $\gamma$ a parameter to be tuned, and polynomial kernels of degree $d$ of the form
\begin{displaymath}
k(\mathbf{x},\mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + 1)^d
\end{displaymath} (4.35)

where in the case of $d=1$, the formulation reduces to the linear case.
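
Both kernels translate directly into code; in the sketch below, gamma and d stand for the hyperparameters of equations (4.34) and (4.35), and the function names are illustrative.

\begin{verbatim}
import numpy as np

def gaussian_kernel(x, xp, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)       (equation 4.34)
    d = np.asarray(x) - np.asarray(xp)
    return np.exp(-gamma * np.dot(d, d))

def polynomial_kernel(x, xp, d=2):
    # k(x, x') = (x^T x' + 1)^d                   (equation 4.35)
    return (np.dot(x, xp) + 1.0) ** d
\end{verbatim}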

The use of kernel functions, combined with the possibility of precomputing all the combinations $k(\mathbf{x}_i,\mathbf{x}_j)$, makes it possible to expose a common interface for linear and nonlinear training while essentially maintaining the same performance.
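
One way to realize such an interface, sketched below under the assumption that all training samples fit in memory, is to precompute the symmetric Gram matrix $K_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)$ once and hand it to the same dual solver used in the linear case (for instance, scikit-learn's SVC accepts a precomputed kernel matrix).

\begin{verbatim}
import numpy as np

def gram_matrix(X, kernel):
    # K[i, j] = k(x_i, x_j), exploiting symmetry to halve the work
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])
    return K
\end{verbatim}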

It is noteworthy that the prediction $f(\mathbf{x})=\mathbf{w}^{\top} \phi(\mathbf{x}) + b$ can be read as a linear model applied to the feature vector

\begin{displaymath}
\phi(\mathbf{x}) = \left[ k(\mathbf{x}, \mathbf{x}_1), \ldots, k(\mathbf{x}, \mathbf{x}_n) \right]
\end{displaymath} (4.36)

where the $\mathbf{x}_i$ are a subset of the training data. Models written in this form effectively perform template matching between the sample $\mathbf{x}$ to be evaluated and the prototypes $\mathbf{x}_i$.
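
In code, this reading amounts to building the feature vector of equation (4.36) and applying an ordinary linear model to it; the sketch below is only illustrative and assumes the prototypes and the weights (the coefficients $\alpha_i$) come from training.

\begin{verbatim}
import numpy as np

def kernel_features(x, prototypes, kernel):
    # phi(x) = [k(x, x_1), ..., k(x, x_n)]        (equation 4.36)
    return np.array([kernel(x, xi) for xi in prototypes])

# Linear prediction on the kernel features, i.e. template matching
# against the prototypes:  f(x) = w . phi(x) + b  with w = alpha.
def predict(x, w, b, prototypes, kernel):
    return np.dot(w, kernel_features(x, prototypes, kernel)) + b
\end{verbatim}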
