
Support Vector Machines

Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.


Why This Matters

[Figure: maximum-margin separator $w \cdot x + b = 0$ between classes $+1$ and $-1$; the margin is $2/\|w\|$. Circled points are support vectors (on the margin boundary); only they determine the decision boundary.]

Support vector machines were the dominant classification method from the mid-1990s through the early 2010s, and for good reason: they have a clean mathematical formulation rooted in convex optimization, come with strong generalization guarantees via margin theory, and extend to nonlinear classification through the kernel trick. Even in the deep learning era, SVMs remain the clearest example of how optimization geometry connects to statistical generalization.

Mental Model

You have two classes of points in $\mathbb{R}^d$. You want to find a separating hyperplane that is as far as possible from both classes. The "margin" is the width of the gap between the classes. Maximizing the margin is equivalent to minimizing $\|w\|$ subject to all points being correctly classified. This is a convex quadratic program, and its dual reveals that the solution depends only on dot products between data points, opening the door to kernels.

Hard Margin SVM

Definition

Separating Hyperplane

A hyperplane in $\mathbb{R}^d$ defined by weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$. A point $x_i$ with label $y_i \in \{-1, +1\}$ is correctly classified if $y_i(w^T x_i + b) > 0$.

Definition

Functional Margin

The functional margin of a point $(x_i, y_i)$ is $y_i(w^T x_i + b)$. Correct classification means positive functional margin. The geometric margin is the functional margin divided by $\|w\|$, giving the actual Euclidean distance from the point to the hyperplane.

Definition

Hard Margin SVM

For linearly separable data $\{(x_i, y_i)\}_{i=1}^n$, the hard margin SVM solves:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 \;\; \forall i$$

The margin (distance between the two class boundaries) is $\frac{2}{\|w\|}$. Maximizing this margin is equivalent to minimizing $\|w\|^2$.
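To make the optimization concrete, here is a minimal sketch that solves this primal directly with SciPy's general-purpose SLSQP solver on a small hand-made dataset. This is purely illustrative; dedicated SVM solvers use specialized algorithms, not a generic constrained minimizer.

```python
# Sketch: hard margin SVM primal solved as a generic constrained QP.
# The data here is a toy example; variable names are illustrative.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w  # (1/2)||w||^2; the bias b = theta[2] is unpenalized

# One inequality constraint per point: y_i(w^T x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda t, i=i: y[i] * (t[:2] @ X[i] + t[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```

On this data the solver should recover $w = (\tfrac{1}{2}, \tfrac{1}{2})$ and $b = 0$, giving margin $2/\|w\| = 2\sqrt{2}$.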

The Dual Formulation

Theorem

Hard Margin SVM Dual Problem

Statement

The Lagrangian dual of the hard margin SVM is:

$$\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

subject to $\alpha_i \geq 0$ for all $i$ and $\sum_i \alpha_i y_i = 0$.

At optimality, $w^* = \sum_i \alpha_i^* y_i x_i$, and $\alpha_i^* > 0$ only for support vectors: points lying exactly on the margin boundary.

Intuition

The dual says: find weights $\alpha_i$ on training points such that the resulting classifier $w = \sum \alpha_i y_i x_i$ maximizes margin. Most $\alpha_i$ are zero. Only the points closest to the decision boundary (the support vectors) have nonzero weight, so the classifier depends on only a few critical training examples.
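The sparsity claim can be checked numerically. The sketch below maximizes the dual objective with a generic solver (again SLSQP, for illustration only) on a toy dataset and shows that only the two points nearest the boundary receive nonzero $\alpha_i$:

```python
# Sketch: solve the hard margin dual as a generic QP and inspect alpha.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * X) @ (y[:, None] * X).T  # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # negate the dual objective so we can use a minimizer
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(4),
    bounds=[(0, None)] * 4,                       # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum alpha_i y_i = 0
)
alpha = res.x
w = (alpha * y) @ X  # recover w from stationarity: w = sum_i alpha_i y_i x_i
print("alpha =", alpha.round(3), "w =", w.round(3))
```

Only $(1,1)$ and $(-1,-1)$ get $\alpha_i = 0.25$; the farther points get $\alpha_i = 0$, and $w = (\tfrac{1}{2}, \tfrac{1}{2})$ matches the primal solution.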

Proof Sketch

Form the Lagrangian $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i[y_i(w^T x_i + b) - 1]$. Set $\nabla_w L = 0$ to get $w = \sum_i \alpha_i y_i x_i$. Set $\partial L / \partial b = 0$ to get $\sum_i \alpha_i y_i = 0$. Substitute back into $L$ to eliminate $w$ and $b$, yielding the dual. Strong duality holds because the primal is convex and Slater's condition is satisfied (any strictly feasible point suffices).

Why It Matters

The dual formulation is essential for two reasons: (1) the data appears only through dot products $x_i^T x_j$, enabling the kernel trick, and (2) the dual variables directly identify the support vectors.

Failure Mode

The hard margin SVM has no solution if the data is not linearly separable. Even a single mislabeled point or outlier makes the primal infeasible. This motivates the soft margin formulation.

KKT Conditions

The Karush-Kuhn-Tucker conditions for the hard margin SVM are:

  1. Stationarity: $w = \sum_i \alpha_i y_i x_i$
  2. Primal feasibility: $y_i(w^T x_i + b) \geq 1$
  3. Dual feasibility: $\alpha_i \geq 0$
  4. Complementary slackness: $\alpha_i[y_i(w^T x_i + b) - 1] = 0$

Complementary slackness is the key insight: either $\alpha_i = 0$ (the point is not a support vector) or $y_i(w^T x_i + b) = 1$ (the point is exactly on the margin boundary). There is no middle ground.

Soft Margin SVM and Hinge Loss

Definition

Soft Margin SVM

When data is not linearly separable, introduce slack variables $\xi_i \geq 0$:

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

The parameter $C > 0$ trades off margin width against margin violations. Large $C$ penalizes violations heavily (close to hard margin); small $C$ allows more violations for a wider margin.

Definition

Hinge Loss

The soft margin SVM is equivalent to minimizing the hinge loss:

$$\min_{w} \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(w^T x_i + b))$$

where $\lambda = 1/(nC)$. The hinge loss is convex, non-differentiable at $y \cdot f(x) = 1$, and zero for correctly classified points with margin $\geq 1$. This is a regularized ERM problem with hinge loss.
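A minimal sketch of this unconstrained view in code: plain subgradient descent on the regularized hinge loss. The data, step size, and iteration count are illustrative choices, not tuned recommendations.

```python
# Sketch: subgradient descent on (lambda/2)||w||^2 + mean hinge loss.
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters, labels +1 and -1 (toy data)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

lam, eta = 0.01, 0.1
w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1  # points with nonzero hinge loss (subgradient active)
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
    grad_b = -y[viol].sum() / len(y)
    w -= eta * grad_w
    b -= eta * grad_b

acc = (np.sign(X @ w + b) == y).mean()
print("train accuracy:", acc)
```

Because the hinge loss is non-differentiable at margin 1, this uses a subgradient rather than a gradient; stochastic variants of exactly this update are the basis of Pegasos-style SVM training.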

The Kernel Trick

Since the dual depends on data only through dot products $x_i^T x_j$, we can replace these with a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ where $\phi$ maps to a (potentially infinite-dimensional) feature space. The classifier becomes:

$$f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i, x) + b$$

We never compute $\phi(x)$ explicitly; we only evaluate the kernel. Common kernels include:

  • Polynomial: $k(x, z) = (x^T z + c)^d$
  • Gaussian RBF: $k(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$

The RBF kernel maps to an infinite-dimensional feature space, yet we can compute the kernel in $O(d)$ time.
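One way to see the trick concretely: in 2D, the degree-2 polynomial kernel $(x^T z + 1)^2$ agrees with an explicit 6-dimensional feature map, so evaluating the kernel and taking the explicit inner product give the same number. A small check (the specific vectors are arbitrary):

```python
# Sketch: the kernel trick for the degree-2 polynomial kernel in 2D.
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    return (x @ z + c) ** d

def phi(x):
    # Explicit feature map matching (x^T z + 1)^2 for 2D inputs:
    # phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, sqrt(2)x1x2, x2^2)
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # both equal 4.0
```

The kernel evaluation costs $O(d)$; computing $\phi$ explicitly costs $O(d^2)$ here and is impossible for the RBF kernel, whose feature space is infinite-dimensional.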

Theorem

Representer Theorem for SVMs

Statement

For any regularized risk minimization problem of the form $\min_{f \in \mathcal{H}_k} \frac{\lambda}{2}\|f\|_{\mathcal{H}_k}^2 + \frac{1}{n}\sum_i \ell(y_i, f(x_i))$, the minimizer has the representation $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$.

Intuition

You never need to search over the full (infinite-dimensional) RKHS. The optimal function is always a finite linear combination of kernel evaluations at the training points. This is why the kernel trick is computationally feasible.

Proof Sketch

Decompose $f = f_\parallel + f_\perp$ where $f_\parallel$ lies in $\text{span}\{k(x_i, \cdot)\}$ and $f_\perp$ is orthogonal. Then $f(x_i) = f_\parallel(x_i)$ for all training points (by the reproducing property), but $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$. So $f_\perp$ only increases the regularizer without affecting the loss.

Why It Matters

The representer theorem reduces an infinite-dimensional optimization to a finite-dimensional one (finding $n$ coefficients $\alpha_i$), making kernel methods computationally tractable.
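A sketch of the theorem in action, using kernel ridge regression, where with squared loss the coefficients have the closed form $\alpha = (K + \gamma I)^{-1} y$ (up to how the regularization constant is scaled). The data and hyperparameters below are illustrative:

```python
# Sketch: representer theorem via kernel ridge regression on toy 1D data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

def rbf(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between row sets A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

gamma = 0.1  # regularization constant (illustrative choice)
K = rbf(X, X)
alpha = np.linalg.solve(K + gamma * np.eye(len(X)), y)  # n coefficients

# Per the theorem, predictions are finite combinations of kernel
# evaluations at the training points: f(x) = sum_i alpha_i k(x_i, x).
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf(X_test, X) @ alpha
print(y_pred.round(2))
```

Even though the RBF hypothesis space is infinite-dimensional, fitting reduces to solving one $n \times n$ linear system for the $\alpha_i$.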

Failure Mode

The kernel matrix $K_{ij} = k(x_i, x_j)$ is $n \times n$. For large $n$, storing and inverting this matrix becomes prohibitive ($O(n^2)$ memory, $O(n^3)$ solve time). This is why SVMs do not scale as well as SGD-based methods to very large datasets.

Connection to Regularization

The soft margin SVM is equivalent to regularized ERM with hinge loss:

$$\min_{w} \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_{\text{hinge}}(y_i, w^T x_i)$$

This is the same framework as ridge regression (squared loss + L2 regularization) or regularized logistic regression (logistic loss + L2). The only difference is the loss function. The hinge loss is special because it creates sparsity in the dual: most $\alpha_i = 0$, giving the "support vector" property.

Canonical Examples

Example

Hard margin in 2D

Consider two classes: $\{(1,1), (2,2)\}$ with label $+1$ and $\{(-1,-1), (-2,-2)\}$ with label $-1$. The optimal separating hyperplane passes through the origin with $w \propto (1,1)$. In the canonical scaling where support vectors have functional margin exactly 1, $w = (\tfrac{1}{2}, \tfrac{1}{2})$ and $b = 0$, so the margin is $\frac{2}{\|w\|} = 2\sqrt{2}$. The support vectors are $(1,1)$ and $(-1,-1)$, the closest points to the boundary from each class.
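A quick numeric check of this example, plugging the canonical-scale solution into the definitions above:

```python
# Sketch: verify the stated hard margin solution for the 2D toy example.
import numpy as np

X = np.array([[1, 1], [2, 2], [-1, -1], [-2, -2]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

w, b = np.array([0.5, 0.5]), 0.0  # canonical-scale solution, w proportional to (1,1)
func_margins = y * (X @ w + b)
print(func_margins)           # [1. 2. 1. 2.] -- rows 0 and 2 sit exactly at 1
print(2 / np.linalg.norm(w))  # geometric margin 2/||w|| = 2*sqrt(2)
```

The support vectors $(1,1)$ and $(-1,-1)$ achieve functional margin exactly 1 (complementary slackness), while the farther points have strictly larger margin and play no role in determining $w$.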

Common Confusions

Watch Out

SVMs do not maximize accuracy. They maximize margin

An SVM does not directly maximize classification accuracy on the training set. Many hyperplanes may achieve 100% training accuracy on separable data. The SVM picks the one with the largest margin. The theoretical justification is that large margin implies small generalization error (via VC dimension bounds on the margin classifier class).

Watch Out

The kernel trick is not specific to SVMs

Any algorithm whose computation depends only on dot products between data points can be kernelized. This includes kernel PCA, kernel ridge regression, and kernel k-means. SVMs popularized the kernel trick, but it is a general principle.

Why SVMs Were Dominant Pre-Deep-Learning

From roughly 1995 to 2012, SVMs were the go-to method for classification:

  • Convex optimization: a single global optimum, no local minima
  • Strong theory: margin-based generalization bounds
  • Kernels: nonlinear classification without manual feature engineering
  • Sparsity: the solution depends on few support vectors

Deep learning overtook SVMs because neural networks scale better to massive datasets (SGD is $O(n)$ per epoch vs $O(n^2)$ to $O(n^3)$ for kernel SVMs) and learn hierarchical features automatically.

Summary

  • Hard margin SVM: minimize $\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$
  • Margin $= 2/\|w\|$; maximizing margin = minimizing $\|w\|$
  • Dual depends on data only through dot products $\to$ kernel trick
  • Support vectors: points with $\alpha_i > 0$, lying on the margin
  • Soft margin: slack variables $\xi_i$, parameter $C$ controls tradeoff
  • Hinge loss $\max(0, 1 - yf(x))$: convex surrogate for 0-1 loss

Exercises

ExerciseCore

Problem

Derive the dual of the hard margin SVM. Start from the primal $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$, introduce Lagrange multipliers $\alpha_i \geq 0$, and eliminate $w$ and $b$.

ExerciseAdvanced

Problem

Show that for the soft margin SVM, the dual variables satisfy $0 \leq \alpha_i \leq C$. What does $\alpha_i = C$ mean geometrically?

References

Canonical:

  • Vapnik, The Nature of Statistical Learning Theory (1995), Chapters 5 and 10
  • Cristianini & Shawe-Taylor, An Introduction to Support Vector Machines (2000)

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 15-16

  • Boyd & Vandenberghe, Convex Optimization (2004), for duality theory

  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 12

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 6-7

Last reviewed: April 2026
