ML Methods
Support Vector Machines
Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.
Why This Matters
Support vector machines were the dominant classification method from the mid-1990s through the early 2010s, and for good reason: they have a clean mathematical formulation rooted in convex optimization, come with strong generalization guarantees via margin theory, and extend to nonlinear classification through the kernel trick. Even in the deep learning era, SVMs remain the clearest example of how optimization geometry connects to statistical generalization.
Mental Model
You have two classes of points in $\mathbb{R}^d$. You want to find a separating hyperplane that is as far as possible from both classes. The "margin" is the width of the gap between the classes. Maximizing the margin is equivalent to minimizing $\|w\|^2$ subject to all points being correctly classified with functional margin at least 1. This is a convex quadratic program, and its dual reveals that the solution depends only on dot products between data points, opening the door to kernels.
Hard Margin SVM
Separating Hyperplane
A hyperplane in $\mathbb{R}^d$ is defined by a weight vector $w$ and bias $b$: $\{x : w^\top x + b = 0\}$. A point $x_i$ with label $y_i \in \{-1, +1\}$ is correctly classified if $y_i(w^\top x_i + b) > 0$.
Functional Margin
The functional margin of a point $(x_i, y_i)$ is $\hat{\gamma}_i = y_i(w^\top x_i + b)$. Correct classification means positive functional margin. The geometric margin is the functional margin divided by $\|w\|$, giving the actual Euclidean distance from the point to the hyperplane.
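The two margins are direct to compute. A small numeric illustration — the points and the candidate hyperplane $(w, b)$ below are made up for this sketch:

```python
import numpy as np

# Hypothetical candidate hyperplane and two labeled points (one per class).
w = np.array([1.0, 1.0])
b = 0.0
X = np.array([[2.0, 2.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])  # labels in {-1, +1}

functional_margin = y * (X @ w + b)                       # y_i (w^T x_i + b)
geometric_margin = functional_margin / np.linalg.norm(w)  # Euclidean distance to hyperplane
```

Both points have positive functional margin, so both are correctly classified; dividing by $\|w\| = \sqrt{2}$ converts the score into an actual distance.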
Hard Margin SVM
For linearly separable data $\{(x_i, y_i)\}_{i=1}^n$, the hard margin SVM solves:
$$\min_{w,\,b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n.$$
The margin (distance between the two class boundaries) is $2/\|w\|$. Maximizing this margin is equivalent to minimizing $\|w\|^2$.
The Dual Formulation
Hard Margin SVM Dual Problem
Statement
The Lagrangian dual of the hard margin SVM is:
$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
subject to $\alpha_i \ge 0$ for all $i$ and $\sum_{i=1}^n \alpha_i y_i = 0$.
At optimality, $w^* = \sum_{i=1}^n \alpha_i^* y_i x_i$, and $\alpha_i^* > 0$ only for support vectors: points lying exactly on the margin boundary.
Intuition
The dual says: find weights $\alpha_i$ on training points such that the resulting classifier maximizes margin. Most $\alpha_i$ are zero. Only the points closest to the decision boundary (the support vectors) have nonzero weight, so the classifier depends on only a few critical training examples.
Proof Sketch
Form the Lagrangian $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^\top x_i + b) - 1 \right]$. Set $\nabla_w L = 0$ to get $w = \sum_i \alpha_i y_i x_i$. Set $\partial L / \partial b = 0$ to get $\sum_i \alpha_i y_i = 0$. Substitute back into $L$ to eliminate $w$ and $b$, yielding the dual. Strong duality holds because the primal is convex and Slater's condition is satisfied (any strictly feasible point suffices).
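The dual can be solved numerically with very little machinery. Below is a sketch using projected gradient ascent; for simplicity it drops the bias term $b$, which removes the equality constraint $\sum_i \alpha_i y_i = 0$ and leaves only a box constraint $0 \le \alpha_i \le C$ (the soft-margin box; for hard margin, take $C$ large). This simplification, and all numbers in it, are for illustration only, not the full formulation:

```python
import numpy as np

def dual_svm_no_bias(X, y, C=10.0, lr=0.01, steps=2000):
    """Projected gradient ascent on the bias-free SVM dual:
    max sum(a) - 0.5 a^T Q a,  0 <= a_i <= C,  Q_ij = y_i y_j x_i.x_j.
    A teaching sketch, not a production QP solver."""
    Z = y[:, None] * X            # rows are y_i x_i
    Q = Z @ Z.T                   # Gram matrix of the dual quadratic
    a = np.zeros(len(y))
    for _ in range(steps):
        a += lr * (1.0 - Q @ a)   # gradient of the dual objective
        a = np.clip(a, 0.0, C)    # project onto the box constraint
    w = (a[:, None] * Z).sum(axis=0)   # recover w = sum_i a_i y_i x_i
    return a, w

# Toy separable data: one point per class.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha, w = dual_svm_no_bias(X, y)
```

On this symmetric toy problem the iteration converges to $\alpha = (1/4, 1/4)$ and $w = (1/2, 1/2)$, and both points sit exactly on the margin boundary, as complementary slackness predicts.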
Why It Matters
The dual formulation is essential for two reasons: (1) the data appears only through dot products $x_i^\top x_j$, enabling the kernel trick, and (2) the dual variables $\alpha_i$ directly identify the support vectors.
Failure Mode
The hard margin SVM has no solution if the data is not linearly separable. Even a single mislabeled point or outlier makes the primal infeasible. This motivates the soft margin formulation.
KKT Conditions
The Karush-Kuhn-Tucker conditions for the hard margin SVM are:
- Stationarity: $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$
- Primal feasibility: $y_i(w^\top x_i + b) \ge 1$
- Dual feasibility: $\alpha_i \ge 0$
- Complementary slackness: $\alpha_i \left[ y_i(w^\top x_i + b) - 1 \right] = 0$
Complementary slackness is the key insight: either $\alpha_i = 0$ (the point is not a support vector) or $y_i(w^\top x_i + b) = 1$ (the point is exactly on the margin boundary). There is no middle ground.
Soft Margin SVM and Hinge Loss
Soft Margin SVM
When data is not linearly separable, introduce slack variables $\xi_i \ge 0$:
$$\min_{w,\,b,\,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
The parameter $C > 0$ trades off margin width against margin violations. Large $C$ penalizes violations heavily (close to hard margin); small $C$ allows more violations for a wider margin.
Hinge Loss
The soft margin SVM is equivalent to minimizing the regularized hinge loss:
$$\min_{w,\,b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \max\left(0,\ 1 - y_i(w^\top x_i + b)\right)$$
where $\ell_{\text{hinge}}(y, f(x)) = \max(0, 1 - y f(x))$. The hinge loss is convex, non-differentiable at $y f(x) = 1$, and zero for correctly classified points with margin $\ge 1$. This is a regularized ERM problem with hinge loss.
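The hinge-loss view suggests a direct training algorithm: batch subgradient descent on the regularized objective. A minimal sketch — the data, step size, and epoch count are illustrative choices, and real solvers are considerably more refined:

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i(w.x_i + b))
    by batch subgradient descent. A teaching sketch, not a tuned solver."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1    # points where the hinge subgradient is nonzero
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# Toy separable data in 2D.
X = np.array([[2.0, 2.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient_descent(X, y)
```

Because the hinge loss is non-differentiable at the kink $y f(x) = 1$, the method uses a subgradient: the term for point $i$ contributes only when its margin is below 1, which is exactly the `viol` mask above.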
The Kernel Trick
Since the dual depends on data only through dot products $x_i^\top x_j$, we can replace these with a kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi$ maps to a (potentially infinite-dimensional) feature space. The classifier becomes:
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^n \alpha_i y_i \, k(x_i, x) + b \right).$$
We never compute $\phi$ explicitly; we only evaluate the kernel. Common kernels include:
- Polynomial: $k(x, x') = (x^\top x' + c)^d$
- Gaussian RBF: $k(x, x') = \exp\left( -\|x - x'\|^2 / (2\sigma^2) \right)$
The RBF kernel maps to an infinite-dimensional feature space, yet we can compute the kernel in $O(d)$ time.
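Both kernels above can be evaluated on whole data matrices at once. A short sketch (function names and default parameters are my own):

```python
import numpy as np

def polynomial_kernel(X1, X2, degree=3, c=1.0):
    """(x.x' + c)^degree, evaluated for all pairs of rows of X1 and X2."""
    return (X1 @ X2.T + c) ** degree

def rbf_kernel(X1, X2, sigma=1.0):
    """exp(-||x - x'||^2 / (2 sigma^2)) for all pairs, without forming phi.
    Uses ||x - x'||^2 = ||x||^2 + ||x'||^2 - 2 x.x'."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
K_rbf = rbf_kernel(X, X)
K_lin = polynomial_kernel(X, X, degree=1, c=0.0)  # degree 1, c=0 recovers plain dot products
```

Each entry costs $O(d)$ even though the RBF feature map $\phi$ is infinite-dimensional; the diagonal of an RBF kernel matrix is always 1, since every point is at distance zero from itself.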
Representer Theorem for SVMs
Statement
For any regularized risk minimization problem of the form $\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$ over an RKHS $\mathcal{H}$ with kernel $k$, the minimizer has the representation $f^* = \sum_{i=1}^n \alpha_i \, k(x_i, \cdot)$.
Intuition
You never need to search over the full (infinite-dimensional) RKHS. The optimal function is always a finite linear combination of kernel evaluations at the training points. This is why the kernel trick is computationally feasible.
Proof Sketch
Decompose $f = f_{\parallel} + f_{\perp}$, where $f_{\parallel}$ lies in $\operatorname{span}\{k(x_i, \cdot)\}_{i=1}^n$ and $f_{\perp}$ is orthogonal to it. Then $f_{\perp}(x_i) = \langle f_{\perp}, k(x_i, \cdot) \rangle = 0$ for all training points (by the reproducing property), but $\|f\|_{\mathcal{H}}^2 = \|f_{\parallel}\|_{\mathcal{H}}^2 + \|f_{\perp}\|_{\mathcal{H}}^2 \ge \|f_{\parallel}\|_{\mathcal{H}}^2$. So $f_{\perp}$ only increases the regularizer without affecting the loss.
Why It Matters
The representer theorem reduces an infinite-dimensional optimization to a finite-dimensional one (finding coefficients $\alpha_1, \dots, \alpha_n$), making kernel methods computationally tractable.
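For squared loss the reduction is especially clean: the coefficients solve a single $n \times n$ linear system. A sketch of this kernel ridge regression solve, assuming the kernel matrix $K$ has already been computed:

```python
import numpy as np

def kernel_ridge_coefficients(K, y, lam):
    """By the representer theorem, the minimizer of
    (1/n) sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
    is f = sum_i alpha_i k(x_i, .), with alpha solving
    (K + n*lam*I) alpha = y."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

# Sanity check with K = I (orthonormal kernel sections), n = 2, lam = 0.5:
# (I + I) alpha = y, so alpha = y / 2.
alpha = kernel_ridge_coefficients(np.eye(2), np.array([1.0, -1.0]), lam=0.5)
```

Setting the gradient of the objective in $\alpha$ to zero gives $(K + n\lambda I)\alpha = y$, so the infinite-dimensional problem collapses to one dense linear solve.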
Failure Mode
The kernel matrix $K$ is $n \times n$. For large $n$, storing and inverting this matrix becomes prohibitive ($O(n^2)$ memory, $O(n^3)$ solve time). This is why SVMs do not scale as well as SGD-based methods to very large datasets.
Connection to Regularization
The soft margin SVM is equivalent to regularized ERM with hinge loss:
$$\min_{w,\,b} \ \frac{1}{n} \sum_{i=1}^n \max\left(0,\ 1 - y_i(w^\top x_i + b)\right) + \lambda \|w\|^2, \qquad \lambda = \frac{1}{2nC}.$$
This is the same framework as ridge regression (squared loss + L2 regularization) or regularized logistic regression (logistic loss + L2). The only difference is the loss function. The hinge loss is special because it creates sparsity in the dual: most $\alpha_i = 0$, giving the "support vector" property.
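The source of that sparsity is visible by evaluating the losses as functions of the margin $m = y f(x)$: the hinge loss is exactly zero past $m = 1$, while the logistic loss is positive for every finite margin. A quick spot check:

```python
import numpy as np

m = np.array([-1.0, 0.0, 1.0, 2.0])   # margins y * f(x)
hinge = np.maximum(0.0, 1.0 - m)      # exactly zero once m >= 1
logistic = np.log1p(np.exp(-m))       # strictly positive for every finite m
squared = (1.0 - m) ** 2              # penalizes even confident correct points (m > 1)
```

Points with zero hinge loss exert no pull on the solution, which is why their dual variables vanish; logistic loss never reaches zero, so every point keeps some influence.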
Canonical Examples
Hard margin in 2D
Consider two classes in $\mathbb{R}^2$: points $(1,1)$ and $(2,2)$ with label $+1$, and points $(-1,-1)$ and $(-2,-2)$ with label $-1$. The optimal separating hyperplane passes through the origin with $w = (1/2, 1/2)$ and $b = 0$. The margin is $2/\|w\| = 2\sqrt{2}$, and the support vectors are $(1,1)$ and $(-1,-1)$, the closest points to the boundary from each class.
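A candidate solution like this can be verified numerically. Assuming the symmetric configuration with $(1,1)$, $(2,2)$ labeled $+1$ and $(-1,-1)$, $(-2,-2)$ labeled $-1$, we check feasibility, margin width, and which constraints are tight:

```python
import numpy as np

w = np.array([0.5, 0.5])
b = 0.0
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

margins = y * (X @ w + b)        # all >= 1; exactly 1 on the support vectors
width = 2.0 / np.linalg.norm(w)  # margin width 2/||w|| = 2*sqrt(2)
```

The margins come out to $(1, 2, 1, 2)$: the constraints hold everywhere, and only the two closest points are tight, confirming they are the support vectors.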
Common Confusions
SVMs do not maximize accuracy; they maximize margin
An SVM does not directly maximize classification accuracy on the training set. Many hyperplanes may achieve 100% training accuracy on separable data. The SVM picks the one with the largest margin. The theoretical justification is that large margin implies small generalization error (via VC dimension bounds on the margin classifier class).
The kernel trick is not specific to SVMs
Any algorithm whose computation depends only on dot products between data points can be kernelized. This includes kernel PCA, kernel ridge regression, and kernel k-means. SVMs popularized the kernel trick, but it is a general principle.
Why SVMs Were Dominant Pre-Deep-Learning
From roughly 1995 to 2012, SVMs were the go-to method for classification:
- Convex optimization: a single global optimum, no local minima
- Strong theory: margin-based generalization bounds
- Kernels: nonlinear classification without manual feature engineering
- Sparsity: the solution depends on few support vectors
Deep learning overtook SVMs because neural networks scale better to massive datasets (SGD costs $O(n)$ per epoch vs $O(n^2)$–$O(n^3)$ for kernel SVMs) and learn hierarchical features automatically.
Summary
- Hard margin SVM: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$
- Margin $= 2/\|w\|$; maximizing margin $=$ minimizing $\|w\|^2$
- Dual depends on data only through dot products $x_i^\top x_j$, enabling the kernel trick
- Support vectors: points with $\alpha_i > 0$, lying on the margin boundary
- Soft margin: slack variables $\xi_i \ge 0$; parameter $C$ controls the tradeoff
- Hinge loss $\max(0, 1 - y f(x))$: convex surrogate for 0-1 loss
Exercises
Problem
Derive the dual of the hard margin SVM. Start from the primal $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$, introduce Lagrange multipliers $\alpha_i \ge 0$, and eliminate $w$ and $b$.
Problem
Show that for the soft margin SVM, the dual variables satisfy $0 \le \alpha_i \le C$. What does $\alpha_i = C$ mean geometrically?
Related Comparisons
References
Canonical:
- Vapnik, The Nature of Statistical Learning Theory (1995), Chapters 5 and 10
- Cristianini & Shawe-Taylor, An Introduction to Support Vector Machines (2000)
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 15-16
- Boyd & Vandenberghe, Convex Optimization (2004), for duality theory
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 12
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 7
Next Topics
The natural next step from SVMs:
- Kernels and RKHS: the full theory of reproducing kernel Hilbert spaces
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)