
Support Vector Machines

Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.


Why This Matters

[Figure: maximum-margin separator $w \cdot x + b = 0$ between classes $+1$ and $-1$; the margin is $2/\|w\|$. Circled points are support vectors (on the margin boundary); only they determine the decision boundary.]

Support vector machines were the dominant classification method from the mid-1990s through the early 2010s, and for good reason: they have a clean mathematical formulation rooted in convex optimization, come with strong generalization guarantees via margin theory, and extend to nonlinear classification through the kernel trick. Even in the deep learning era, SVMs remain the clearest example of how optimization geometry connects to statistical generalization.

Mental Model

You have two classes of points in $\mathbb{R}^d$. You want to find a separating hyperplane that is as far as possible from both classes. The "margin" is the width of the gap between the classes. Maximizing the margin is equivalent to minimizing $\|w\|$ subject to all points being correctly classified. This is a convex quadratic program, and its dual reveals that the solution depends only on dot products between data points, opening the door to kernels.

Hard Margin SVM

Definition

Separating Hyperplane

A hyperplane in $\mathbb{R}^d$ defined by weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$. A point $x_i$ with label $y_i \in \{-1, +1\}$ is correctly classified if $y_i(w^T x_i + b) > 0$.

Definition

Functional Margin

The functional margin of a point $(x_i, y_i)$ is $y_i(w^T x_i + b)$. Correct classification means positive functional margin. The geometric margin is the functional margin divided by $\|w\|$, giving the actual Euclidean distance from the point to the hyperplane.

Definition

Hard Margin SVM

For linearly separable data $\{(x_i, y_i)\}_{i=1}^n$, the hard margin SVM solves:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 \;\; \forall i$$

The margin (distance between the two class boundaries) is $\frac{2}{\|w\|}$. Maximizing this margin is equivalent to minimizing $\|w\|^2$.
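To make the optimization concrete, here is a minimal sketch that solves this primal directly with SciPy's general-purpose SLSQP solver on a small hand-made dataset. This is purely illustrative; dedicated SVM solvers use specialized algorithms, not a generic constrained minimizer.

```python
# Sketch: hard margin SVM primal solved as a generic constrained QP.
# The data here is a toy example; variable names are illustrative.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w  # (1/2)||w||^2; the bias b = theta[2] is unpenalized

# One inequality constraint per point: y_i(w^T x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda t, i=i: y[i] * (t[:2] @ X[i] + t[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```

On this data the solver should recover $w = (\tfrac{1}{2}, \tfrac{1}{2})$ and $b = 0$, giving margin $2/\|w\| = 2\sqrt{2}$.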

The Dual Formulation

Theorem

Hard Margin SVM Dual Problem

Statement

The Lagrangian dual of the hard margin SVM is:

$$\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

subject to $\alpha_i \geq 0$ for all $i$ and $\sum_i \alpha_i y_i = 0$.

At optimality, $w^* = \sum_i \alpha_i^* y_i x_i$, and $\alpha_i^* > 0$ only for support vectors: points lying exactly on the margin boundary.

Intuition

The dual says: find weights $\alpha_i$ on training points such that the resulting classifier $w = \sum \alpha_i y_i x_i$ maximizes margin. Most $\alpha_i$ are zero. Only the points closest to the decision boundary (the support vectors) have nonzero weight, so the classifier depends on only a few critical training examples.
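The sparsity claim can be checked numerically. The sketch below maximizes the dual objective with a generic solver (again SLSQP, for illustration only) on a toy dataset and shows that only the two points nearest the boundary receive nonzero $\alpha_i$:

```python
# Sketch: solve the hard margin dual as a generic QP and inspect alpha.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * X) @ (y[:, None] * X).T  # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # negate the dual objective so we can use a minimizer
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(4),
    bounds=[(0, None)] * 4,                       # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum alpha_i y_i = 0
)
alpha = res.x
w = (alpha * y) @ X  # recover w from stationarity: w = sum_i alpha_i y_i x_i
print("alpha =", alpha.round(3), "w =", w.round(3))
```

Only $(1,1)$ and $(-1,-1)$ get $\alpha_i = 0.25$; the farther points get $\alpha_i = 0$, and $w = (\tfrac{1}{2}, \tfrac{1}{2})$ matches the primal solution.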

Proof Sketch

Form the Lagrangian $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i[y_i(w^T x_i + b) - 1]$. Set $\nabla_w L = 0$ to get $w = \sum_i \alpha_i y_i x_i$. Set $\partial L / \partial b = 0$ to get $\sum_i \alpha_i y_i = 0$. Substitute back into $L$ to eliminate $w$ and $b$, yielding the dual. Strong duality holds because the primal is convex and Slater's condition is satisfied (any strictly feasible point suffices).

Why It Matters

The dual formulation is essential for two reasons: (1) the data appears only through dot products $x_i^T x_j$, enabling the kernel trick, and (2) the dual variables directly identify the support vectors.

Failure Mode

The hard margin SVM has no solution if the data is not linearly separable. Even a single mislabeled point or outlier makes the primal infeasible. This motivates the soft margin formulation.

KKT Conditions

The Karush-Kuhn-Tucker conditions for the hard margin SVM are:

  1. Stationarity: $w = \sum_i \alpha_i y_i x_i$
  2. Primal feasibility: $y_i(w^T x_i + b) \geq 1$
  3. Dual feasibility: $\alpha_i \geq 0$
  4. Complementary slackness: $\alpha_i[y_i(w^T x_i + b) - 1] = 0$

Complementary slackness is the key insight: either $\alpha_i = 0$ (the point is not a support vector) or $y_i(w^T x_i + b) = 1$ (the point is exactly on the margin boundary). There is no middle ground.

Soft Margin SVM and Hinge Loss

Definition

Soft Margin SVM

When data is not linearly separable, introduce slack variables $\xi_i \geq 0$:

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

The parameter $C > 0$ trades off margin width against margin violations. Large $C$ penalizes violations heavily (close to hard margin); small $C$ allows more violations for a wider margin.

Definition

Hinge Loss

The soft margin SVM is equivalent to minimizing the hinge loss:

$$\min_{w} \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(w^T x_i + b))$$

where $\lambda = 1/(nC)$. The hinge loss is convex, non-differentiable at $y \cdot f(x) = 1$, and zero for correctly classified points with margin $\geq 1$. This is a regularized ERM problem with hinge loss.
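A minimal sketch of this unconstrained view in code: plain subgradient descent on the regularized hinge loss. The data, step size, and iteration count are illustrative choices, not tuned recommendations.

```python
# Sketch: subgradient descent on (lambda/2)||w||^2 + mean hinge loss.
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters, labels +1 and -1 (toy data)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

lam, eta = 0.01, 0.1
w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1  # points with nonzero hinge loss (subgradient active)
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
    grad_b = -y[viol].sum() / len(y)
    w -= eta * grad_w
    b -= eta * grad_b

acc = (np.sign(X @ w + b) == y).mean()
print("train accuracy:", acc)
```

Because the hinge loss is non-differentiable at margin 1, this uses a subgradient rather than a gradient; stochastic variants of exactly this update are the basis of Pegasos-style SVM training.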

The Kernel Trick

Since the dual depends on data only through dot products $x_i^T x_j$, we can replace these with a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ where $\phi$ maps to a (potentially infinite-dimensional) feature space. The classifier becomes:

$$f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i, x) + b$$

We never compute $\phi(x)$ explicitly; we only evaluate the kernel. Common kernels include:

  • Polynomial: $k(x, z) = (x^T z + c)^d$
  • Gaussian RBF: $k(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$

The RBF kernel maps to an infinite-dimensional feature space, yet we can compute the kernel in $O(d)$ time.
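One way to see the trick concretely: in 2D, the degree-2 polynomial kernel $(x^T z + 1)^2$ agrees with an explicit 6-dimensional feature map, so evaluating the kernel and taking the explicit inner product give the same number. A small check (the specific vectors are arbitrary):

```python
# Sketch: the kernel trick for the degree-2 polynomial kernel in 2D.
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    return (x @ z + c) ** d

def phi(x):
    # Explicit feature map matching (x^T z + 1)^2 for 2D inputs:
    # phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, sqrt(2)x1x2, x2^2)
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # both equal 4.0
```

The kernel evaluation costs $O(d)$; computing $\phi$ explicitly costs $O(d^2)$ here and is impossible for the RBF kernel, whose feature space is infinite-dimensional.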

Theorem

Representer Theorem for SVMs

Statement

For any regularized risk minimization problem of the form $\min_{f \in \mathcal{H}_k} \frac{\lambda}{2}\|f\|_{\mathcal{H}_k}^2 + \frac{1}{n}\sum_i \ell(y_i, f(x_i))$, the minimizer has the representation $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$.

Intuition

You never need to search over the full (infinite-dimensional) RKHS. The optimal function is always a finite linear combination of kernel evaluations at the training points. This is why the kernel trick is computationally feasible.

Proof Sketch

Decompose $f = f_\parallel + f_\perp$ where $f_\parallel$ lies in $\text{span}\{k(x_i, \cdot)\}$ and $f_\perp$ is orthogonal. Then $f(x_i) = f_\parallel(x_i)$ for all training points (by the reproducing property), but $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$. So $f_\perp$ only increases the regularizer without affecting the loss.

Why It Matters

The representer theorem reduces an infinite-dimensional optimization to a finite-dimensional one (finding $n$ coefficients $\alpha_i$), making kernel methods computationally tractable.
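A sketch of the theorem in action, using kernel ridge regression, where with squared loss the coefficients have the closed form $\alpha = (K + \gamma I)^{-1} y$ (up to how the regularization constant is scaled). The data and hyperparameters below are illustrative:

```python
# Sketch: representer theorem via kernel ridge regression on toy 1D data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

def rbf(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between row sets A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

gamma = 0.1  # regularization constant (illustrative choice)
K = rbf(X, X)
alpha = np.linalg.solve(K + gamma * np.eye(len(X)), y)  # n coefficients

# Per the theorem, predictions are finite combinations of kernel
# evaluations at the training points: f(x) = sum_i alpha_i k(x_i, x).
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf(X_test, X) @ alpha
print(y_pred.round(2))
```

Even though the RBF hypothesis space is infinite-dimensional, fitting reduces to solving one $n \times n$ linear system for the $\alpha_i$.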

Failure Mode

The kernel matrix $K_{ij} = k(x_i, x_j)$ is $n \times n$. For large $n$, storing and inverting this matrix becomes prohibitive ($O(n^2)$ memory, $O(n^3)$ solve time). This is why SVMs do not scale as well as SGD-based methods to very large datasets.

Connection to Regularization

The soft margin SVM is equivalent to regularized ERM with hinge loss:

$$\min_{w} \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_{\text{hinge}}(y_i, w^T x_i)$$

This is the same framework as ridge regression (squared loss + L2 regularization) or regularized logistic regression (logistic loss + L2). The only difference is the loss function. The hinge loss is special because it creates sparsity in the dual: most $\alpha_i = 0$, giving the "support vector" property.

Canonical Examples

Example

Hard margin in 2D

Consider two classes: $\{(1,1), (2,2)\}$ with label $+1$ and $\{(-1,-1), (-2,-2)\}$ with label $-1$. The optimal separating hyperplane passes through the origin with $w \propto (1,1)$. In the canonical scaling where support vectors have functional margin exactly 1, $w = (\tfrac{1}{2}, \tfrac{1}{2})$ and $b = 0$, so the margin is $\frac{2}{\|w\|} = 2\sqrt{2}$. The support vectors are $(1,1)$ and $(-1,-1)$, the closest points to the boundary from each class.
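A quick numeric check of this example, plugging the canonical-scale solution into the definitions above:

```python
# Sketch: verify the stated hard margin solution for the 2D toy example.
import numpy as np

X = np.array([[1, 1], [2, 2], [-1, -1], [-2, -2]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

w, b = np.array([0.5, 0.5]), 0.0  # canonical-scale solution, w proportional to (1,1)
func_margins = y * (X @ w + b)
print(func_margins)           # [1. 2. 1. 2.] -- rows 0 and 2 sit exactly at 1
print(2 / np.linalg.norm(w))  # geometric margin 2/||w|| = 2*sqrt(2)
```

The support vectors $(1,1)$ and $(-1,-1)$ achieve functional margin exactly 1 (complementary slackness), while the farther points have strictly larger margin and play no role in determining $w$.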

Common Confusions

Watch Out

SVMs do not maximize accuracy. They maximize margin

An SVM does not directly maximize classification accuracy on the training set. Many hyperplanes may achieve 100% training accuracy on separable data. The SVM picks the one with the largest margin. The theoretical justification is that large margin implies small generalization error (via VC dimension bounds on the margin classifier class).

Watch Out

The kernel trick is not specific to SVMs

Any algorithm whose computation depends only on dot products between data points can be kernelized. This includes kernel PCA, kernel ridge regression, and kernel k-means. SVMs popularized the kernel trick, but it is a general principle.

Why SVMs Were Dominant Pre-Deep-Learning

From roughly 1995 to 2012, SVMs were the go-to method for classification:

  • Convex optimization: a single global optimum, no local minima
  • Strong theory: margin-based generalization bounds
  • Kernels: nonlinear classification without manual feature engineering
  • Sparsity: the solution depends on few support vectors

Deep learning overtook SVMs because neural networks scale better to massive datasets (SGD is $O(n)$ per epoch vs $O(n^2)$ to $O(n^3)$ for kernel SVMs) and learn hierarchical features automatically.

Summary

  • Hard margin SVM: minimize $\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$
  • Margin $= 2/\|w\|$; maximizing margin = minimizing $\|w\|$
  • Dual depends on data only through dot products $\to$ kernel trick
  • Support vectors: points with $\alpha_i > 0$, lying on the margin
  • Soft margin: slack variables $\xi_i$, parameter $C$ controls tradeoff
  • Hinge loss $\max(0, 1 - yf(x))$: convex surrogate for 0-1 loss

Exercises

ExerciseCore

Problem

Derive the dual of the hard margin SVM. Start from the primal $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$, introduce Lagrange multipliers $\alpha_i \geq 0$, and eliminate $w$ and $b$.

ExerciseAdvanced

Problem

Show that for the soft margin SVM, the dual variables satisfy $0 \leq \alpha_i \leq C$. What does $\alpha_i = C$ mean geometrically?

References

Canonical:

  • Vapnik, The Nature of Statistical Learning Theory (1995), Chapters 5 and 10
  • Cristianini & Shawe-Taylor, An Introduction to Support Vector Machines (2000)

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 15-16

  • Boyd & Vandenberghe, Convex Optimization (2004), for duality theory

  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 12

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 6-7

Last reviewed: April 2026
