
ML Methods

The Kernel Trick

Any algorithm whose computation accesses data only through inner products can be lifted to a feature space implicitly defined by a kernel, without ever computing the features. This page works through the polynomial-kernel inner-product equivalence, the infinite-feature interpretation of the Gaussian RBF, Mercer's condition, and four canonical applications: kernel SVM, kernel ridge regression, kernel PCA, and kernel k-means.


Why This Matters

The kernel trick is the move that turns linear algorithms into nonlinear ones at no algorithmic cost. Take any method that computes only with inner products $\langle x_i, x_j\rangle$ (SVM, ridge regression, PCA, k-means, perceptron) and replace those inner products with a kernel function $k(x_i, x_j)$ that secretly computes inner products in some higher-dimensional feature space. The algorithm runs unchanged; it just thinks in a richer geometry.

The trick has two practical wins and one theoretical payoff:

  • Practical: a polynomial kernel of degree $p$ in $\mathbb R^d$ computes in $O(d)$ time, but operates as if the data had been lifted to $\binom{d+p}{p}$ features. The Gaussian RBF lifts to an infinite-dimensional space and still computes in $O(d)$ time per pair.
  • Theoretical: Mercer's condition tells you exactly which $k$'s are inner products in some space (the positive semi-definite ones), and the RKHS theory gives a precise account of the function space those kernels span.

This page focuses on the trick itself with the algebra worked out end-to-end. For the full theory of reproducing kernel Hilbert spaces, see kernels and RKHS.

Mental Model

A linear classifier in $\mathbb R^d$ separates points with a hyperplane $w^\top x = 0$. If the data is not linearly separable, you can lift it: define $\phi: \mathbb R^d \to \mathbb R^D$ for some larger $D$, classify in feature space with $w^\top \phi(x) = 0$, and project back. The new decision boundary in $\mathbb R^d$ is nonlinear.

Two problems with the explicit lift:

  1. $D$ can be huge (or infinite for some useful $\phi$).
  2. Computing $\phi(x)$ costs $O(D)$ per training point, so the algorithm does at least $O(nD)$ work before it even starts forming inner products.

The kernel trick observes: many algorithms never need $\phi(x)$ itself. They only need pairwise inner products $\phi(x_i)^\top \phi(x_j)$. If we can compute $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ directly from $x_i$ and $x_j$, without constructing $\phi(x_i)$ or $\phi(x_j)$, we get the nonlinear lift for free.

The polynomial kernel $k(x, x') = (x^\top x' + c)^p$ is the canonical illustration. Expanding $(x^\top x')^2$ in $\mathbb R^2$ gives back exactly the inner product of two specific 3-dimensional vectors, at $O(d)$ cost instead of $O(d^2)$.

Formal Setup

Definition

The Kernel Trick

A learning algorithm is kernelizable if every operation involving the inputs $x_i \in \mathcal X$ can be written using only inner products $\langle x_i, x_j \rangle$ (or, equivalently, pairwise squared distances $\|x_i - x_j\|^2$). Given a kernel $k: \mathcal X \times \mathcal X \to \mathbb R$, a symmetric positive semi-definite function, the kernel trick replaces every $\langle x_i, x_j\rangle$ in the algorithm with $k(x_i, x_j)$.

The result is an algorithm that operates as if it had access to the lifted features $\phi(x)$, where $\phi$ is the (possibly infinite-dimensional) feature map associated with $k$ by Mercer's theorem, without ever computing $\phi(x)$ explicitly.

The kernel trick is a meta-pattern, not an algorithm. You can kernelize SVM, ridge regression, PCA, k-means, perceptron, k-nearest-neighbors, and more, each producing a "kernel version" of the original method.
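As a concrete instance of the meta-pattern, here is a minimal sketch (assuming NumPy; not part of the original page) of the kernel perceptron: the weight vector $w = \sum_i \alpha_i y_i \phi(x_i)$ is never formed, and every decision value touches the data only through kernel evaluations.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, n_epochs=10):
    """Kernel perceptron: w = sum_i alpha_i y_i phi(x_i) is kept implicit."""
    n = len(X)
    alpha = np.zeros(n)                                   # per-point mistake counts
    K = np.array([[kernel(a, b) for b in X] for a in X])  # Gram matrix
    for _ in range(n_epochs):
        for i in range(n):
            f_i = np.sum(alpha * y * K[:, i])             # decision value via kernels only
            if y[i] * f_i <= 0:                           # mistake: bump alpha_i
                alpha[i] += 1
    return alpha

def perceptron_predict(x_new, X, y, alpha, kernel):
    k_new = np.array([kernel(xi, x_new) for xi in X])
    return np.sign(np.sum(alpha * y * k_new))
```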

The Polynomial Kernel: Worked Inner-Product Equivalence

This is the worked algebra that makes the trick concrete. Take $\mathcal X = \mathbb R^2$ and the degree-2 polynomial kernel

$$k(x, x') = (x^\top x')^2.$$

Claim. $k(x, x') = \phi(x)^\top \phi(x')$ where $\phi: \mathbb R^2 \to \mathbb R^3$ is

$$\phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt 2 \, x_1 x_2 \\ x_2^2 \end{pmatrix}.$$

Proof by direct expansion.

$$k(x, x') = (x^\top x')^2 = (x_1 x'_1 + x_2 x'_2)^2 = (x_1 x'_1)^2 + 2 (x_1 x'_1)(x_2 x'_2) + (x_2 x'_2)^2.$$

Recombine:

$$= x_1^2 (x'_1)^2 + 2 x_1 x_2 x'_1 x'_2 + x_2^2 (x'_2)^2.$$

Now match this against the inner product of the proposed feature vectors:

$$\phi(x)^\top \phi(x') = x_1^2 (x'_1)^2 + \sqrt 2 \cdot \sqrt 2 \, x_1 x_2 \, x'_1 x'_2 + x_2^2 (x'_2)^2 = x_1^2 (x'_1)^2 + 2 x_1 x_2 x'_1 x'_2 + x_2^2 (x'_2)^2.$$

These are identical. The $\sqrt 2$ factor in $\phi$ is exactly what produces the $2$ in the cross-term of the kernel expansion. The trick is computing $k(x, x') = (x_1 x'_1 + x_2 x'_2)^2$ in $O(d) = O(2)$ operations (one inner product and one squaring), while the explicit feature inner product $\phi(x)^\top \phi(x')$ would cost $O(D) = O(3)$. The asymptotics matter when $d$ is larger.
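A quick numerical check of the identity above (a minimal sketch assuming NumPy); the last line also evaluates the implicit feature count $\binom{d+p}{p}$ quoted in the theorem below.

```python
import numpy as np
from math import comb

def poly2_kernel(x, xp):
    """k(x, x') = (x^T x')^2, computed in O(d)."""
    return float(x @ xp) ** 2

def phi(x):
    """Explicit degree-2 feature map for d = 2: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(x, xp))   # (1*3 + 2*(-1))^2 = 1.0
print(phi(x) @ phi(xp))      # identical, via the explicit 3-D lift

# Implicit feature count for d = 100, p = 4 (theorem below): about 4.6 million
print(comb(100 + 4, 4))
```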

Theorem

Polynomial Kernel Equals Inner Product in a Polynomial Feature Space

Statement

For $x, x' \in \mathbb R^d$, the polynomial kernel $k(x, x') = (x^\top x' + c)^p$ with $c \ge 0$ and integer $p \ge 1$ satisfies $k(x, x') = \phi_p(x)^\top \phi_p(x')$, where $\phi_p: \mathbb R^d \to \mathbb R^D$ with $D = \binom{d + p}{p}$ is the feature map containing all monomials in $x_1, \dots, x_d, \sqrt c$ of total degree exactly $p$, each weighted by the square root of its multinomial coefficient.

So computing $k(x, x')$ costs $O(d)$ (one inner product plus a power), while explicit features cost $O(D)$. For $d = 100$ and $p = 4$, $D = \binom{104}{4} \approx 4.6 \times 10^6$, more than four orders of magnitude larger than $d$.

Intuition

A polynomial of degree $\le p$ in $d$ variables has $\binom{d+p}{p}$ monomials. The polynomial kernel implicitly represents the full degree-$p$ polynomial space, so a linear classifier in feature space becomes a degree-$p$ polynomial classifier in $\mathbb R^d$. The $\sqrt c$ trick (interpreting the constant $c$ as a fictitious feature) is what lets all degrees up to $p$ appear simultaneously, not just degree $p$.

Proof Sketch

Use the multinomial theorem on $(x_1 x'_1 + x_2 x'_2 + \cdots + x_d x'_d + c)^p$. Each term is of the form $\binom{p}{\alpha_1, \alpha_2, \dots, \alpha_{d+1}} (x_1 x'_1)^{\alpha_1} \cdots (x_d x'_d)^{\alpha_d} c^{\alpha_{d+1}}$ with $\sum_i \alpha_i = p$. Rewrite the multinomial coefficient as $\binom{p}{\alpha} = \left(\sqrt{\binom{p}{\alpha}}\right)^2$ and absorb a $\sqrt{\binom{p}{\alpha}}$ into each of the two product factors. Then the term equals $\left(\sqrt{\binom{p}{\alpha}}\, x_1^{\alpha_1} \cdots x_d^{\alpha_d}\, c^{\alpha_{d+1}/2}\right) \cdot \left(\sqrt{\binom{p}{\alpha}}\, (x'_1)^{\alpha_1} \cdots (x'_d)^{\alpha_d}\, c^{\alpha_{d+1}/2}\right)$, which is the product of two coordinates of $\phi_p(x)$ and $\phi_p(x')$. Summing over all multi-indices reconstructs $(x^\top x' + c)^p$.

Why It Matters

This is the concrete demonstration of the kernel trick. Every other kernel-trick claim ultimately reduces to "a kernel computes an inner product in a feature space implicitly," and the polynomial kernel is where the algebra is fully visible. Once you have worked through it, the Gaussian RBF (below, infinite-dimensional features) and the abstract Mercer theory feel like extensions rather than mysteries.

Failure Mode

The polynomial kernel of degree $p$ produces features that grow combinatorially with $d$ and $p$. For very large $p$, the kernel matrix can become ill-conditioned (high-degree monomials have wildly different scales), and numerical inversion of $K + \lambda I$ in kernel ridge regression suffers. Practical use: $p \le 4$ or so for most tasks, with feature normalization before applying $k$. For higher nonlinearity, the Gaussian RBF is usually a better choice.
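A hedged illustration of the conditioning issue on synthetic data (the exact numbers depend on the data and on $\lambda$): the condition number of $K + \lambda I$ grows rapidly with the polynomial degree when the inputs are not normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy inputs, deliberately not normalized
lam = 1e-6

for p in (2, 4, 8, 12):
    K = (X @ X.T + 1.0) ** p             # polynomial-kernel Gram matrix
    cond = np.linalg.cond(K + lam * np.eye(len(X)))
    print(f"degree {p:2d}: cond(K + lambda*I) ~ {cond:.2e}")
```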

The Gaussian RBF Kernel: Infinite Feature Space

The Gaussian kernel

$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

also corresponds to an inner product in a feature space, except that the feature space is infinite-dimensional. The cleanest way to see this is via the Taylor expansion of the exponential. Expand $-\|x - x'\|^2/(2\sigma^2)$ as $-\|x\|^2/(2\sigma^2) - \|x'\|^2/(2\sigma^2) + x^\top x'/\sigma^2$ and factor:

$$k(x, x') = e^{-\|x\|^2/(2\sigma^2)}\, e^{-\|x'\|^2/(2\sigma^2)} \cdot e^{x^\top x'/\sigma^2}.$$

Now Taylor-expand the second exponential:

$$e^{x^\top x'/\sigma^2} = \sum_{m=0}^\infty \frac{(x^\top x')^m}{m! \, \sigma^{2m}} = \sum_{m=0}^\infty \frac{1}{m! \, \sigma^{2m}} \langle \phi_m(x), \phi_m(x') \rangle$$

where $\phi_m$ is the degree-$m$ polynomial feature map from the previous section. So

$$k(x, x') = \left\langle e^{-\|x\|^2/(2\sigma^2)} \begin{pmatrix} \phi_0(x)/(\sigma^0\sqrt{0!}) \\ \phi_1(x)/(\sigma^1\sqrt{1!}) \\ \phi_2(x)/(\sigma^2\sqrt{2!}) \\ \vdots \end{pmatrix}, \;\; e^{-\|x'\|^2/(2\sigma^2)} \begin{pmatrix} \phi_0(x')/(\sigma^0\sqrt{0!}) \\ \phi_1(x')/(\sigma^1\sqrt{1!}) \\ \phi_2(x')/(\sigma^2\sqrt{2!}) \\ \vdots \end{pmatrix} \right\rangle.$$

The feature vector is infinite-dimensional, with one block per polynomial degree: the degree-$m$ block is scaled by $1/(\sigma^m\sqrt{m!})$ and the whole vector is modulated by the Gaussian envelope. Computing this feature vector explicitly is impossible; computing the kernel takes $O(d)$ operations.
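The expansion can be checked numerically by truncating the sum after $M$ degree blocks; this minimal sketch (assuming NumPy) shows the truncated feature-space inner product converging to the exact kernel value as $M$ grows.

```python
import math
import numpy as np

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def rbf_truncated(x, xp, sigma=1.0, M=10):
    """Keep only the first M polynomial-degree blocks of the feature expansion."""
    envelope = np.exp(-(x @ x + xp @ xp) / (2 * sigma ** 2))
    series = sum((x @ xp) ** m / (math.factorial(m) * sigma ** (2 * m)) for m in range(M))
    return envelope * series

x = np.array([0.3, -0.7, 1.1])
xp = np.array([0.5, 0.2, -0.4])
print("exact:", rbf(x, xp))
for M in (2, 4, 8, 16):
    print(f"M = {M:2d}:", rbf_truncated(x, xp, M=M))   # converges to the exact value
```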

The Gaussian RBF is called universal (Steinwart 2001): its associated RKHS is dense in $C(\mathcal X)$ for any compact $\mathcal X$. So linear methods in the Gaussian RBF feature space can approximate any continuous function uniformly. This is the theoretical reason RBF SVMs are the default choice when you don't know much about your data.

Mercer's Condition

For a function $k(x, x')$ to be a valid kernel (an inner product in some feature space), it must be symmetric and positive semi-definite.

Theorem

Mercer's Condition for Validity of a Kernel

Statement

A symmetric function $k: \mathcal X \times \mathcal X \to \mathbb R$ is a kernel (i.e. there exists a feature map $\phi$ with $k(x, x') = \langle \phi(x), \phi(x')\rangle$) if and only if it is positive semi-definite: for every finite collection of points $x_1, \dots, x_n$ and coefficients $\alpha_1, \dots, \alpha_n \in \mathbb R$,

$$\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \;\ge\; 0.$$

Equivalently, the Gram matrix $K \in \mathbb R^{n \times n}$ with entries $K_{ij} = k(x_i, x_j)$ is positive semi-definite for every finite sample.

Intuition

If $k(x, x') = \phi(x)^\top \phi(x')$, then for any $\alpha$, $\sum_{ij} \alpha_i \alpha_j k(x_i, x_j) = \sum_{ij} \alpha_i \alpha_j \phi(x_i)^\top \phi(x_j) = \big\|\sum_i \alpha_i \phi(x_i)\big\|^2 \ge 0$. The PSD property is automatic for inner products. Mercer's theorem says the converse: every symmetric PSD function arises as an inner product in some Hilbert space (the RKHS).

Proof Sketch

The forward direction (kernel implies PSD) follows from the inner-product factorization. The reverse direction is Mercer's theorem: a continuous PSD kernel $k$ on a compact domain has a spectral decomposition $k(x, x') = \sum_j \lambda_j \phi_j(x) \phi_j(x')$ with $\lambda_j \ge 0$ and orthonormal $\phi_j$. The feature map is $\phi(x) = (\sqrt{\lambda_1}\, \phi_1(x), \sqrt{\lambda_2}\, \phi_2(x), \dots)$, and the Hilbert space of square-summable sequences gives the feature space. The full proof uses the spectral theorem for compact self-adjoint operators; see kernels and RKHS for details.

Why It Matters

Mercer's condition is the practical test for whether a similarity function you've defined can be plugged into a kernel SVM, kernel ridge regression, or any other kernel method. Many domain-specific similarities (string kernels, graph kernels, sequence-alignment kernels) fail PSD-ness, at which point the algorithm becomes ill-defined: the kernel SVM dual may have no solution, or kernel PCA may have negative eigenvalues that the method can't interpret. PSD-ness is the floor below which kernel methods break.

Failure Mode

The hyperbolic tangent kernel $k(x, x') = \tanh(\kappa\, x^\top x' + c)$ is not PSD for all $\kappa, c$. Many SVM libraries accept it anyway, because the resulting optimization is still well-defined (the SVM is solved via a QP that does not strictly require PSD-ness of $K$). The downside is that the optimization can fail or return solutions that don't generalize. The "sigmoid kernel" works in practice for some problems but loses the theoretical guarantees that PSD kernels carry.
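A small empirical check (the parameter choices $\kappa = 2$, $c = -1$ are illustrative, and the exact outcome depends on the sample): the smallest eigenvalue of a sigmoid-kernel Gram matrix is typically negative on random data, while an RBF Gram matrix stays PSD up to round-off.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

def min_eigenvalue(kernel, X):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh(K).min()

sigmoid = lambda a, b: np.tanh(2.0 * (a @ b) - 1.0)          # illustrative kappa, c
gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

print("sigmoid kernel, smallest eigenvalue:", min_eigenvalue(sigmoid, X))  # typically < 0
print("RBF kernel,     smallest eigenvalue:", min_eigenvalue(gauss, X))    # >= 0 up to round-off
```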

Four Canonical Applications

The same trick, four different host algorithms:

Kernel SVM

The SVM dual depends on data only through dot products $x_i^\top x_j$:

$$\max_\alpha \sum_i \alpha_i - \tfrac12 \sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0.$$

Replace $x_i^\top x_j$ with $k(x_i, x_j)$, solve the QP, and predict via $\hat f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$. The result is a nonlinear classifier: a polynomial decision boundary with the polynomial kernel, Gaussian "bumps" with the RBF kernel, and so on.
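A minimal sketch using scikit-learn (assuming it is installed; the `gamma` and `C` values are illustrative): the built-in RBF kernel, and the equivalent route through an explicitly precomputed Gram matrix, which is how a custom Mercer kernel would be plugged in.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Built-in RBF kernel
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)
print("RBF SVM train accuracy:", clf.score(X, y))

# Same model via an explicit Gram matrix: kernel="precomputed" accepts any Mercer kernel
def rbf_gram(A, B, gamma=2.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

clf_pre = SVC(kernel="precomputed", C=1.0).fit(rbf_gram(X, X), y)
print("precomputed train accuracy:", clf_pre.score(rbf_gram(X, X), y))
```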

Kernel Ridge Regression

Ridge regression solves $\min_w \|y - Xw\|^2 + \lambda \|w\|^2$, with closed-form solution $\hat w = (X^\top X + \lambda I)^{-1} X^\top y$ and prediction $\hat y_* = x_*^\top \hat w$. By the matrix inversion lemma, this can be rewritten using only kernel evaluations:

$$\hat y_* = k_*^\top (K + \lambda I)^{-1} y,$$

where $K_{ij} = k(x_i, x_j)$ and $(k_*)_i = k(x_i, x_*)$. This is equivalent to ridge regression in the feature space defined by $k$. The Bayesian view generalizes this to a full posterior, leading to Gaussian processes.
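Kernel ridge regression is short enough to write directly from the formula above; a minimal NumPy sketch on a toy 1-D regression problem (the RBF bandwidth and $\lambda$ are illustrative choices):

```python
import numpy as np

def rbf_gram(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

lam = 1e-2
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lambda I)^{-1} y

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
y_pred = rbf_gram(X_test, X) @ alpha                   # k_*^T (K + lambda I)^{-1} y
print(np.column_stack([X_test[:, 0], y_pred, np.sin(X_test[:, 0])]))
```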

Kernel PCA

Standard PCA finds the top eigenvectors of $X^\top X / n$. Kernel PCA finds the top eigenvectors of the kernel matrix $K$ (after centering). The principal components are nonlinear functions of $x$, useful for nonlinear dimensionality reduction and as a preprocessing step for clustering or visualization. The trick is the same: kernel matrix replaces covariance matrix.
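A minimal sketch of kernel PCA from the Gram matrix alone, assuming NumPy: double-center $K$, eigendecompose, and scale the top eigenvectors by $\sqrt{\lambda}$ to get feature-space projections.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Feature-space PCA given only the (uncentered) Gram matrix K."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one          # double-center K
    eigval, eigvec = np.linalg.eigh(Kc)                 # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:n_components]
    # Projection of phi(x_i) onto component j is sqrt(lambda_j) * v_j[i]
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
Z = kernel_pca(K, n_components=2)
print(Z.shape)   # (100, 2): a nonlinear 2-D embedding
```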

Kernel k-Means

K-means clusters by minimizing the within-cluster sum of squared distances. In feature space, $\|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) + k(x_j, x_j) - 2 k(x_i, x_j)$ depends only on kernel values, and point-to-centroid distances expand the same way, so k-means can be run entirely on the kernel matrix. This is closely related to spectral clustering, which clusters the eigenvectors of the (normalized) kernel matrix instead of the raw data. With an RBF kernel, this lets k-means discover non-convex cluster boundaries that ordinary k-means cannot.
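A minimal kernel k-means sketch (Lloyd iterations written entirely against the Gram matrix; it is initialization-sensitive, so restarts help in practice). The point-to-centroid distance used below expands just like the pairwise distance above.

```python
import numpy as np

def kernel_kmeans(K, n_clusters=2, n_iter=50, seed=0):
    """Lloyd-style k-means written entirely against the Gram matrix K."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - (2/m) sum_{j in c} K_ij + (1/m^2) sum_{j,l in c} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 / m * K[:, mask].sum(axis=1)
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric rings; with an RBF kernel the clusters can follow the rings
# (bandwidth and initialization matter, so restarts help in practice)
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.vstack([np.c_[np.cos(theta), np.sin(theta)],
               3 * np.c_[np.cos(theta), np.sin(theta)]])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * 0.5 ** 2))
print(np.bincount(kernel_kmeans(K, n_clusters=2), minlength=2))
```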

When the Kernel Trick Stops Helping

The kernel trick converts the cost of working in feature space (linear in $D$) into the cost of working with the Gram matrix. For large datasets ($n \gtrsim 10^5$), the $O(n^2)$ memory to form $K$ and the $O(n^3)$ time to factor it become prohibitive, exactly the regime where deep learning's $O(n)$-per-epoch SGD wins. The trick is a great move when $d$ (or $D$) is large and $n$ is moderate; it becomes a burden when $n$ is large.

Practical fixes:

  • Nyström approximation: approximate $K$ by a low-rank matrix built from a subsample of $m \ll n$ "landmark" points. Cost drops to $O(n m^2)$.
  • Random Fourier features (Rahimi & Recht 2007): approximate $k(x, x') \approx z(x)^\top z(x')$ with an explicit $D'$-dimensional random feature map $z$ (see the sketch after this list). Cost drops back to $O(n D')$ but you give up exactness.
  • Inducing points and sparse GPs: the GP analog of Nyström.
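A minimal sketch of random Fourier features for the Gaussian RBF kernel (the Rahimi & Recht 2007 construction, written here with illustrative parameters): sample frequencies from the kernel's Fourier transform and random phases, then take cosines.

```python
import numpy as np

def random_fourier_features(X, n_features=2000, sigma=1.0, seed=0):
    """z(x) with E[z(x)^T z(x')] = exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))  # omega ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)                  # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Z = random_fourier_features(X, n_features=2000, sigma=1.0)

K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
K_approx = Z @ Z.T
print("max abs error:", np.abs(K_exact - K_approx).max())   # shrinks roughly as 1/sqrt(D')
```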

The fact that finite-dimensional random feature approximations of the Gaussian RBF kernel work well in practice is what unifies kernel methods with two-layer neural networks (Rahimi-Recht 2008) and motivates the neural tangent kernel story for wide networks.

Common Confusions

Watch Out

The kernel trick is not the same as the RKHS theory

The kernel trick is a computational move: replace inner products with kernel evaluations. The RKHS theory is the theoretical framework that explains why this works: every PSD kernel defines a unique Hilbert space of functions, with the representer theorem guaranteeing that optimal predictors live in a finite-dimensional subspace spanned by the training points. You can use the trick without knowing the theory (most practitioners do), but the theory is what guarantees the trick generalizes, gives risk bounds, and connects kernels to Gaussian processes, neural tangent kernels, and approximation theory.

Watch Out

A bigger feature space is not always better

The Gaussian RBF lifts to infinite features and is "universal," but that does not make it the right choice for every problem. A kernel that lifts too aggressively can overfit by representing too many functions; a kernel that lifts too conservatively can underfit. The right kernel matches the geometry of the data, and is often something simple (linear for high-dimensional sparse data like text, polynomial for low-dimensional structured data, RBF as the default for everything else). The bandwidth $\sigma$ of the RBF kernel is a regularizer in disguise.

Watch Out

The kernel trick costs O(n^2) in space and O(n^3) in time

The implicit-feature gain is real, but it comes paired with a Gram-matrix cost. For $n = 10^5$, the Gram matrix has $10^{10}$ entries (tens of gigabytes in floats), and inverting it takes on the order of $10^{15}$ floating-point operations. This is why kernel methods lost ground to deep learning around $n = 10^6$: the kernel trick scales the wrong way in $n$, while SGD on a neural network scales linearly. Random Fourier features and Nyström approximation rescue moderately large kernel problems but never restore the linear-in-$n$ scaling of SGD-based methods.

Watch Out

Not every similarity function is a kernel

"Cosine similarity is a kernel" is true (k(x,x)=xx/xxk(x, x') = x^\top x' / \|x\|\|x'\| is symmetric PSD). "Manhattan-distance similarity is a kernel" is false (k(x,x)=xx1k(x, x') = -\|x - x'\|_1 violates PSD-ness in general). When designing custom similarities, check Mercer's condition: enumerate small Gram matrices and verify PSD-ness, or construct kk from known-PSD building blocks (sums, products, and positive scalar multiples of PSD kernels are PSD; pre-composition with any function is PSD). Verifying PSD-ness on a small test set before plugging into a kernel method saves debugging time downstream.

Summary

  • The kernel trick replaces $\langle x_i, x_j\rangle$ with $k(x_i, x_j)$ in any inner-product-based algorithm, implicitly lifting data to a (possibly infinite-dimensional) feature space.
  • Worked example: $k(x, x') = (x^\top x')^2 = \phi(x)^\top \phi(x')$ with $\phi(x) = (x_1^2, \sqrt 2 \, x_1 x_2, x_2^2)$. The $\sqrt 2$ factor compensates for the cross-term coefficient in the expansion.
  • The polynomial kernel of degree $p$ in $\mathbb R^d$ implicitly uses $\binom{d+p}{p}$ features.
  • The Gaussian RBF kernel implicitly uses infinitely many features (Taylor expansion of $\exp$).
  • Mercer's condition says $k$ is a kernel iff it is symmetric and positive semi-definite; equivalently, every Gram matrix is PSD.
  • Canonical applications: SVM, ridge regression, PCA, k-means.
  • The trick is great for moderate $n$ but scales as $O(n^3)$ in time; for large $n$, random feature maps or deep networks win.

Exercises

ExerciseCore

Problem

For the polynomial kernel $k(x, x') = (x^\top x' + 1)^2$ on $\mathbb R^2$, construct the feature map $\phi$ explicitly and verify $k(x, x') = \phi(x)^\top \phi(x')$ by direct expansion.

ExerciseCore

Problem

Show that the kernel sum $k(x, x') = k_1(x, x') + k_2(x, x')$ and the kernel product $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ are both valid kernels whenever $k_1, k_2$ are valid kernels.

ExerciseAdvanced

Problem

Show that the Gaussian RBF kernel $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$ is positive semi-definite. (You may use the fact that the Fourier transform of a PSD function on $\mathbb R^d$ is a non-negative function, or construct a feature map directly via the Taylor expansion.)

ExerciseResearch

Problem

The random Fourier features method (Rahimi & Recht 2007) approximates the Gaussian RBF kernel by an explicit finite-dimensional feature map. Sketch the construction: how do you choose the random features so that $\mathbb E[z(x)^\top z(x')] = k(x, x')$? What is the convergence rate as a function of the number of features $D'$?

References

Canonical:

  • Schölkopf, B., & Smola, A.J. (2002). Learning with Kernels. MIT Press. Ch. 2 (the kernel trick and its computational consequences).
  • Aizerman, M.A., Braverman, E.M., & Rozonoer, L.I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning." Automation and Remote Control, 25:821–837. (The original kernel-trick paper, decades before SVM.)
  • Boser, B.E., Guyon, I.M., & Vapnik, V.N. (1992). "A Training Algorithm for Optimal Margin Classifiers." COLT, 144–152. (The first kernel SVM.)
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Ch. 12.

Current:

  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. Ch. 16 (kernel methods).
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 17 (kernels and Gaussian processes).
  • Rahimi, A., & Recht, B. (2007). "Random Features for Large-Scale Kernel Machines." NeurIPS, 1177–1184. (Random Fourier features; the bridge from kernels to neural networks.)
  • Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. Ch. 4 (covariance functions / kernels).

Next Topics

  • Kernels and RKHS: the theoretical framework that makes the kernel trick rigorous — covering Mercer's theorem, the reproducing property, and the representer theorem.
  • Gaussian processes for ML: the Bayesian counterpart of kernel ridge regression with full posterior over functions.
  • Bayesian linear regression: the finite-feature, fully derived posterior that the kernel trick lifts into infinite dimensions.

Last reviewed: May 10, 2026
