ML Methods
The Kernel Trick
Any algorithm whose computation accesses data only through inner products can be lifted to a feature space implicitly defined by a kernel, without ever computing the features. Worked polynomial-kernel inner-product equivalence, infinite-feature interpretation of the Gaussian RBF, Mercer's condition, and the four canonical applications: kernel SVM, kernel ridge regression, kernel PCA, kernel k-means.
Why This Matters
The kernel trick is the move that turns linear algorithms into nonlinear ones at no algorithmic cost. Take any method that computes only with inner products (SVM, ridge regression, PCA, k-means, perceptron) and replace those inner products with a kernel function that secretly computes inner products in some higher-dimensional feature space. The algorithm runs unchanged; it just thinks in a richer geometry.
The trick has two practical wins and one theoretical payoff:
- Practical: a polynomial kernel of degree $p$ on $\mathbb{R}^d$ computes in $O(d)$ time, but operates as if the data had been lifted to $\binom{d+p}{p}$ features. The Gaussian RBF lifts to an infinite-dimensional space and still computes in $O(d)$ time per pair.
- Theoretical: Mercer's condition tells you exactly which $k$'s are inner products in some space (the symmetric positive semi-definite ones), and RKHS theory gives a precise account of the function space those kernels span.
This page focuses on the trick itself with the algebra worked out end-to-end. For the full theory of reproducing kernel Hilbert spaces, see kernels and RKHS.
Mental Model
A linear classifier in $\mathbb{R}^d$ separates points with a hyperplane $\{x : w^\top x + b = 0\}$. If the data is not linearly separable, you can lift it: define $\phi : \mathbb{R}^d \to \mathbb{R}^D$ for some larger $D$, classify in feature space with $\operatorname{sign}(w^\top \phi(x) + b)$, and project back. The new decision boundary in $\mathbb{R}^d$ is nonlinear.
Two problems with the explicit lift:
- $D$ can be huge (or infinite for some useful $\phi$).
- Computing $\phi(x_i)$ for each of the $n$ training points costs $O(nD)$, and the algorithm needs $O(n^2)$ pairwise inner products at minimum.
The kernel trick observes: many algorithms never need $\phi(x)$ itself. They only need pairwise inner products $\langle \phi(x_i), \phi(x_j) \rangle$. If we can compute $k(x, z) = \langle \phi(x), \phi(z) \rangle$ directly from $x$ and $z$, without constructing $\phi(x)$ or $\phi(z)$, we get the nonlinear lift for free.
The polynomial kernel is the canonical illustration. Expanding $(x^\top z)^2$ for $x, z \in \mathbb{R}^2$ gives back exactly the inner product of two specific 3-dimensional vectors, at cost $O(d)$ instead of $O(d^2)$ in general dimension $d$.
Formal Setup
The Kernel Trick
A learning algorithm is kernelizable if every operation involving inputs $x_i$ can be written using only inner products $\langle x_i, x_j \rangle$ (or, equivalently, pairwise squared distances $\|x_i - x_j\|^2$). Given a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is a symmetric, positive semi-definite function, the kernel trick replaces every $\langle x_i, x_j \rangle$ in the algorithm with $k(x_i, x_j)$.
The result is an algorithm that operates as if it had access to the lifted features $\phi(x_i)$, where $\phi$ is the (possibly infinite-dimensional) feature map associated with $k$ by Mercer's theorem, without ever computing $\phi$ explicitly.
The kernel trick is a meta-pattern, not an algorithm. You can kernelize SVM, ridge regression, PCA, k-means, perceptron, k-nearest-neighbors, and more, each producing a "kernel version" of the original method.
The Polynomial Kernel: Worked Inner-Product Equivalence
This is the worked algebra that makes the trick concrete. Take $x, z \in \mathbb{R}^2$ and the degree-2 polynomial kernel

$$k(x, z) = (x^\top z)^2.$$

Claim. $k(x, z) = \phi(x)^\top \phi(z)$ where $\phi : \mathbb{R}^2 \to \mathbb{R}^3$ is

$$\phi(x) = \big(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2\big).$$

Proof by direct expansion.

$$(x^\top z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2\, x_1 z_1 x_2 z_2 + x_2^2 z_2^2.$$

Recombine:

$$= x_1^2 z_1^2 + x_2^2 z_2^2 + (\sqrt{2}\, x_1 x_2)(\sqrt{2}\, z_1 z_2).$$

Now match this against the inner product of the proposed feature vectors:

$$\phi(x)^\top \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\, x_1 x_2 z_1 z_2.$$

These are identical. The factor $\sqrt{2}$ in $\phi$ is exactly what produces the coefficient $2$ in the cross-term of the kernel expansion. The trick is that computing $k(x, z)$ takes $O(d)$ operations (one inner product and one squaring), while the explicit feature inner product would cost $O(d^2)$ for degree 2. The asymptotics matter when $d$ is larger than this toy example's $d = 2$.
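The identity can be checked numerically in a few lines. A minimal sketch in NumPy; the helper name `phi` is ours, mirroring the feature map above:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for x in R^2: (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = (x @ z) ** 2          # O(d): one inner product, one squaring
explicit_value = phi(x) @ phi(z)     # inner product of explicit features

print(kernel_value, explicit_value)  # both 1.0
```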
Polynomial Kernel Equals Inner Product in a Polynomial Feature Space
Statement
For $x, z \in \mathbb{R}^d$, the polynomial kernel $k(x, z) = (x^\top z + c)^p$ with $c \ge 0$ and integer $p \ge 1$ satisfies $k(x, z) = \langle \phi(x), \phi(z) \rangle$, where $\phi : \mathbb{R}^d \to \mathbb{R}^D$ with $D = \binom{d+p}{p}$ is the feature map containing all monomials in $(x_1, \ldots, x_d, \sqrt{c})$ of total degree exactly $p$, each weighted by the square root of its multinomial coefficient.
So computing $k(x, z)$ costs $O(d)$ (one inner product plus a power), while explicit features cost $O\big(\binom{d+p}{p}\big)$. The gap is enormous even for moderate sizes: e.g. $d = 100$ and $p = 3$ give $\binom{103}{3} = 176{,}851$ features, three orders of magnitude more than $d$.
Intuition
A polynomial of degree $p$ in $d$ variables has $\binom{d+p}{p}$ monomials. The polynomial kernel implicitly represents the full degree-$p$ polynomial space, so a linear classifier in feature space becomes a degree-$p$ polynomial classifier in $\mathbb{R}^d$. The constant $c$ (interpreted as a fictitious feature $\sqrt{c}$ appended to $x$) is what lets all degrees up to $p$ appear simultaneously, not just degree exactly $p$.
Proof Sketch
Use the multinomial theorem on $(x^\top z + c)^p = \big(\sum_{i=1}^{d} x_i z_i + \sqrt{c}\,\sqrt{c}\big)^p$. Each term has the form $\binom{p}{a_1, \ldots, a_{d+1}} \prod_{i=1}^{d} (x_i z_i)^{a_i}\, c^{a_{d+1}}$ with $\sum_i a_i = p$. Rewrite the multinomial coefficient as $\sqrt{\binom{p}{a}} \cdot \sqrt{\binom{p}{a}}$ and absorb one square-root factor into each of the two product factors. Then the term equals $\Big(\sqrt{\binom{p}{a}}\, \prod_i x_i^{a_i}\, \sqrt{c}^{\,a_{d+1}}\Big)\Big(\sqrt{\binom{p}{a}}\, \prod_i z_i^{a_i}\, \sqrt{c}^{\,a_{d+1}}\Big)$, which is the product of matching coordinates of $\phi(x)$ and $\phi(z)$. Summing over all multi-indices reconstructs $\langle \phi(x), \phi(z) \rangle$.
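For small $d$ and $p$ the multinomial construction can be verified by brute force. A sketch, assuming NumPy; `poly_features` is an illustrative helper that enumerates the weighted monomials:

```python
import itertools
import math
import numpy as np

def poly_features(x, c, p):
    """All monomials of (x_1, ..., x_d, sqrt(c)) of total degree exactly p,
    each weighted by the square root of its multinomial coefficient."""
    u = np.append(x, math.sqrt(c))
    feats = []
    for idx in itertools.combinations_with_replacement(range(len(u)), p):
        counts = np.bincount(idx, minlength=len(u))          # multi-index a
        multinom = math.factorial(p) / math.prod(math.factorial(int(a)) for a in counts)
        feats.append(math.sqrt(multinom) * np.prod(u ** counts))
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.normal(size=4), rng.normal(size=4)
c, p = 1.0, 3
print((x @ z + c) ** p)                                  # kernel side: O(d)
print(poly_features(x, c, p) @ poly_features(z, c, p))   # feature side: O(binom(d+p, p))
```

The two printed values agree to floating-point precision, which is exactly the statement of the theorem.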
Why It Matters
This is the concrete demonstration of the kernel trick. Every other kernel-trick claim ultimately reduces to "a kernel computes an inner product in a feature space implicitly," and the polynomial kernel is where the algebra is fully visible. Once you have worked through it, the Gaussian RBF (below, infinite-dimensional features) and the abstract Mercer theory feel like extensions rather than mysteries.
Failure Mode
The polynomial kernel of degree $p$ produces features that grow combinatorially with $d$ and $p$. For very large $p$, the kernel matrix can become ill-conditioned (high-degree monomials have wildly different scales), and numerical inversion of $K + \lambda I$ in kernel ridge regression suffers. Practical use: $p \le 5$ or so for most tasks, with feature normalization before applying $k$. For higher nonlinearity, the Gaussian RBF is usually a better choice.
The Gaussian RBF Kernel: Infinite Feature Space
The Gaussian kernel

$$k(x, z) = \exp\!\Big(-\frac{\|x - z\|^2}{2\sigma^2}\Big)$$

also corresponds to an inner product in a feature space, except that feature space is infinite-dimensional. The cleanest way to see this is via the Taylor expansion of the exponential. Expand $\|x - z\|^2$ as $\|x\|^2 - 2\, x^\top z + \|z\|^2$ and factor:

$$k(x, z) = e^{-\|x\|^2 / 2\sigma^2}\, e^{-\|z\|^2 / 2\sigma^2}\, e^{x^\top z / \sigma^2}.$$

Now Taylor-expand the last exponential:

$$e^{x^\top z / \sigma^2} = \sum_{j=0}^{\infty} \frac{(x^\top z)^j}{\sigma^{2j}\, j!},$$

where $(x^\top z)^j = \langle \phi_j(x), \phi_j(z) \rangle$ with $\phi_j$ the degree-$j$ homogeneous polynomial feature map from the previous section. So

$$k(x, z) = \Big\langle\, e^{-\|x\|^2 / 2\sigma^2} \Big(\tfrac{\phi_j(x)}{\sigma^{j} \sqrt{j!}}\Big)_{j \ge 0},\;\; e^{-\|z\|^2 / 2\sigma^2} \Big(\tfrac{\phi_j(z)}{\sigma^{j} \sqrt{j!}}\Big)_{j \ge 0} \Big\rangle.$$

The feature vector is infinite-dimensional, with one block per polynomial degree $j$, weighted by $1 / (\sigma^{j} \sqrt{j!})$ and modulated by the Gaussian envelope $e^{-\|x\|^2 / 2\sigma^2}$. Computing this feature vector explicitly is impossible; computing the kernel takes $O(d)$ operations.
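A quick numerical sanity check: truncating the Taylor series at a finite degree already reproduces the kernel to near machine precision. A minimal sketch in NumPy (the truncation level `J` is an arbitrary choice):

```python
import math
import numpy as np

def rbf(x, z, sigma=1.0):
    return math.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def rbf_truncated(x, z, sigma=1.0, J=20):
    # envelope * sum_{j<J} (x.z)^j / (sigma^(2j) j!), per the factorization above
    envelope = math.exp(-(x @ x + z @ z) / (2 * sigma**2))
    series = sum((x @ z) ** j / (sigma ** (2 * j) * math.factorial(j)) for j in range(J))
    return envelope * series

x = np.array([0.3, -0.7])
z = np.array([1.1, 0.4])
print(rbf(x, z), rbf_truncated(x, z))  # agree to ~machine precision at J = 20
```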
The Gaussian RBF is called universal (Steinwart 2001): its associated RKHS is dense in $C(\mathcal{X})$ for any compact $\mathcal{X} \subset \mathbb{R}^d$. So linear methods in the Gaussian RBF feature space can approximate any continuous function uniformly. This is the theoretical reason RBF SVMs are the default choice when you don't know much about your data.
Mercer's Condition
For a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ to be a valid kernel (an inner product in some feature space), it must be symmetric and positive semi-definite.
Mercer's Condition for Validity of a Kernel
Statement
A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel (i.e. there exists a feature map $\phi$ with $k(x, z) = \langle \phi(x), \phi(z) \rangle$) if and only if it is positive semi-definite: for every finite collection of points $x_1, \ldots, x_m$ and coefficients $c_1, \ldots, c_m \in \mathbb{R}$,

$$\sum_{i=1}^{m} \sum_{j=1}^{m} c_i c_j\, k(x_i, x_j) \;\ge\; 0.$$

Equivalently, the Gram matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ is positive semi-definite for every finite sample.
Intuition
If $k(x, z) = \langle \phi(x), \phi(z) \rangle$, then for any $c_1, \ldots, c_m$, $\sum_{i,j} c_i c_j\, k(x_i, x_j) = \big\|\sum_i c_i\, \phi(x_i)\big\|^2 \ge 0$. The PSD property is automatic for inner products. Mercer's theorem says the converse: every symmetric PSD function arises as an inner product in some Hilbert space (the RKHS).
Proof Sketch
The forward direction (kernel implies PSD) follows from the inner-product factorization. The reverse direction is Mercer's theorem: a continuous PSD kernel on a compact domain has a spectral decomposition $k(x, z) = \sum_i \lambda_i\, e_i(x)\, e_i(z)$ with $\lambda_i \ge 0$ and orthonormal eigenfunctions $e_i$. The feature map is $\phi(x) = \big(\sqrt{\lambda_i}\, e_i(x)\big)_i$, and the Hilbert space $\ell^2$ of square-summable sequences gives the feature space. The full proof uses the spectral theorem for compact self-adjoint operators; see kernels and RKHS for details.
Why It Matters
Mercer's condition is the practical test for whether a similarity function you've defined can be plugged into a kernel SVM, kernel ridge regression, or any other kernel method. Many candidate domain-specific similarities (proposed string, graph, and sequence-alignment scores) fail PSD-ness, at which point the algorithm becomes ill-defined: the kernel SVM dual may have no solution, or kernel PCA may produce negative eigenvalues that the method can't interpret. PSD-ness is the floor below which kernel methods break.
Failure Mode
The hyperbolic tangent kernel $k(x, z) = \tanh(\alpha\, x^\top z + \beta)$ is not PSD for all choices of $(\alpha, \beta)$. Many SVM libraries accept it anyway, because the resulting optimization is still well-defined (the SVM is solved via a QP that does not strictly require PSD-ness of $K$). The downside is that the optimization can fail or return solutions that don't generalize. The "sigmoid kernel" works in practice for some problems but loses the theoretical guarantees that PSD kernels carry.
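Both halves of this section are easy to probe empirically: on a random sample, the RBF Gram matrix has non-negative eigenvalues while the sigmoid one typically does not. A sketch in NumPy (the parameter values are arbitrary choices for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq = np.sum(X**2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T    # pairwise squared distances

K_rbf = np.exp(-D2 / 2.0)                       # Gaussian RBF, sigma = 1
K_sig = np.tanh(2.0 * X @ X.T - 1.0)            # sigmoid "kernel", alpha = 2, beta = -1

print(np.linalg.eigvalsh(K_rbf).min())  # >= 0 up to round-off: PSD
print(np.linalg.eigvalsh(K_sig).min())  # typically < 0 on such samples: not PSD
```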
Four Canonical Applications
The same trick, four different host algorithms:
Kernel SVM
The SVM dual depends on the data only through dot products $x_i^\top x_j$:

$$\max_{\alpha}\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0.$$

Replace $x_i^\top x_j$ with $k(x_i, x_j)$, solve the QP, and predict via $f(x) = \operatorname{sign}\big(\sum_i \alpha_i y_i\, k(x_i, x) + b\big)$. The result is a nonlinear classifier: polynomial decision boundary with the polynomial kernel, Gaussian "bumps" with the RBF kernel, and so on.
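As an illustration, a minimal fit with scikit-learn's `SVC`, which solves this dual; the dataset and hyperparameters are arbitrary choices for the sketch:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# Identical QP solver, different kernel: swapping k changes the geometry.
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print(linear.score(X, y))  # well below 1.0: the moons are not linearly separable
print(rbf.score(X, y))     # near 1.0: Gaussian "bumps" carve a nonlinear boundary
```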
Kernel Ridge Regression
Ridge regression solves $\min_w \|Xw - y\|^2 + \lambda \|w\|^2$, with closed-form solution $w^\star = (X^\top X + \lambda I)^{-1} X^\top y$. Predictions are $\hat{y}(x) = x^\top w^\star$. By the matrix inversion lemma, this can be rewritten using only kernel evaluations:

$$\hat{y}(x) = k(x)^\top (K + \lambda I)^{-1} y,$$

where $K_{ij} = k(x_i, x_j)$ and $k(x)_i = k(x_i, x)$. This is equivalent to ridge regression in the feature space defined by $k$. The Bayesian view generalizes to give a full posterior, leading to Gaussian processes.
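The closed form translates directly into NumPy. A minimal sketch (RBF kernel; the bandwidth, regularization strength, and toy data are arbitrary):

```python
import numpy as np

def rbf_gram(A, B, sigma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 0.1
alpha = np.linalg.solve(rbf_gram(X, X) + lam * np.eye(len(X)), y)  # (K + lam I)^{-1} y

X_test = np.linspace(-3, 3, 5)[:, None]
y_hat = rbf_gram(X_test, X) @ alpha       # k(x)^T (K + lam I)^{-1} y
print(np.c_[np.sin(X_test[:, 0]), y_hat]) # predictions track sin(x)
```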
Kernel PCA
Standard PCA finds the top eigenvectors of the covariance matrix $\frac{1}{n} \sum_i x_i x_i^\top$. Kernel PCA finds the top eigenvectors of the kernel matrix $K$ (after centering). The principal components are nonlinear functions of $x$, useful for nonlinear dimensionality reduction and as a preprocessing step for clustering or visualization. The trick is the same: kernel matrix replaces covariance matrix.
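A sketch of the computation in NumPy; the only subtlety is double-centering $K$, which implements mean-centering of the implicit features (the kernel and sample here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)  # RBF Gram matrix

n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                      # center the implicit features: phi(x) - mean(phi)

vals, vecs = np.linalg.eigh(Kc)     # eigenvalues in ascending order
# Scores of the training points on the top-2 kernel principal components
top = vecs[:, -2:] * np.sqrt(np.clip(vals[-2:], 0, None))
print(top.shape)                    # (80, 2): a nonlinear embedding of the sample
```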
Kernel k-Means
K-means clusters by minimizing the within-cluster sum of squared distances. The squared distance $\|\phi(x) - \mu_c\|^2 = k(x, x) - \frac{2}{|C_c|} \sum_{i \in C_c} k(x, x_i) + \frac{1}{|C_c|^2} \sum_{i,j \in C_c} k(x_i, x_j)$ depends only on kernel values, so k-means can be run entirely on the kernel matrix. The result is closely related to spectral clustering: cluster the eigenvectors of the (normalized) kernel matrix instead of the raw data. With an RBF kernel, this lets k-means discover non-convex cluster boundaries that ordinary k-means cannot.
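A sketch of the assignment step in NumPy, computing cluster distances from the Gram matrix alone; the initialization and kernel bandwidth are arbitrary:

```python
import numpy as np

def kernel_kmeans_assign(K, labels, k):
    """One assignment step of kernel k-means, using only the Gram matrix K:
    reassign each point to the cluster c minimizing ||phi(x) - mu_c||^2."""
    n = K.shape[0]
    dist = np.full((n, k), np.inf)
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue  # empty cluster: leave its distances at +inf
        # ||phi(x_i) - mu_c||^2 = K_ii - (2/|C|) sum_j K_ij + (1/|C|^2) sum_jl K_jl
        dist[:, c] = (np.diag(K)
                      - 2 * K[:, idx].mean(axis=1)
                      + K[np.ix_(idx, idx)].mean())
    return dist.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 0.5)  # RBF Gram matrix

labels = rng.integers(0, 2, size=60)   # random initial assignment
for _ in range(10):                    # iterate assignments to a local optimum
    labels = kernel_kmeans_assign(K, labels, k=2)
print(np.bincount(labels, minlength=2))
```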
When the Kernel Trick Stops Helping
The kernel trick converts the cost of working in feature space (linear in $D$) into the cost of working with the Gram matrix (quadratic in $n$ to store, cubic in $n$ to factor). For large datasets ($n \gtrsim 10^5$), the cost of forming and inverting $K$ becomes prohibitive, exactly the regime where deep learning's $O(n)$-per-epoch SGD wins. The trick is a great move when $D$ (or the input dimension $d$) is large and $n$ is moderate; it becomes a burden when $n$ is large.
Practical fixes:
- Nyström approximation: approximate $K$ by a low-rank factorization built from a subsample of $m \ll n$ "landmark" points. Cost drops to $O(nm^2)$.
- Random Fourier features (Rahimi & Recht 2007): approximate $k$ with an explicit $R$-dimensional random feature map $z(\cdot)$ such that $z(x)^\top z(y) \approx k(x, y)$; see the sketch after this list. Cost drops back to $O(nR)$, but you give up exactness.
- Inducing points and sparse GPs: the GP analog of Nyström.
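A sketch of the random Fourier feature construction for the RBF kernel, assuming NumPy (the feature count $R$ and bandwidth are arbitrary):

```python
import numpy as np

def rff(X, R=2000, sigma=1.0, rng=None):
    """Random features z(x) with E[z(x) . z(y)] = exp(-||x - y||^2 / (2 sigma^2)).
    W is drawn from the kernel's spectral density N(0, I / sigma^2); b is a random phase."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, R))
    b = rng.uniform(0.0, 2.0 * np.pi, size=R)
    return np.sqrt(2.0 / R) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Z = rff(X)
sq = np.sum(X**2, axis=1)
K_exact = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)
print(np.abs(Z @ Z.T - K_exact).max())  # approximation error shrinks like O(1/sqrt(R))
```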
The fact that finite-dimensional random feature approximations of the Gaussian RBF kernel work well in practice is what unifies kernel methods with two-layer neural networks (Rahimi-Recht 2008) and motivates the neural tangent kernel story for wide networks.
Common Confusions
The kernel trick is not the same as the RKHS theory
The kernel trick is a computational move: replace inner products with kernel evaluations. The RKHS theory is the theoretical framework that explains why this works: every PSD kernel defines a unique Hilbert space of functions, with the representer theorem guaranteeing that optimal predictors live in a finite-dimensional subspace spanned by the training points. You can use the trick without knowing the theory (most practitioners do), but the theory is what guarantees the trick generalizes, gives risk bounds, and connects kernels to Gaussian processes, neural tangent kernels, and approximation theory.
A bigger feature space is not always better
The Gaussian RBF lifts to infinite features and is "universal," but that does not make it the right choice for every problem. A kernel that lifts too aggressively can overfit by representing too many functions; a kernel that lifts too conservatively can underfit. The right kernel matches the geometry of the data, and is often something simple (linear for high-dimensional sparse data like text, polynomial for low-dimensional structured data, RBF as the default for everything else). The bandwidth of the RBF kernel is a regularizer in disguise.
The kernel trick costs O(n^2) in space and O(n^3) in time
The implicit-feature gain is real, but it comes paired with a Gram-matrix cost. For $n = 10^6$ points, the Gram matrix has $10^{12}$ entries (8 terabytes in 64-bit floats), and inverting it takes on the order of $n^3 = 10^{18}$ floating-point operations. This is why kernel methods lost ground to deep learning at this scale: the kernel trick scales the wrong way in $n$, while SGD on a neural network scales linearly. Random Fourier features and Nyström approximation rescue moderately large kernel problems but never fully restore the $O(n)$ scaling of SGD-based methods.
Not every similarity function is a kernel
"Cosine similarity is a kernel" is true ( is symmetric PSD). "Manhattan-distance similarity is a kernel" is false ( violates PSD-ness in general). When designing custom similarities, check Mercer's condition: enumerate small Gram matrices and verify PSD-ness, or construct from known-PSD building blocks (sums, products, and positive scalar multiples of PSD kernels are PSD; pre-composition with any function is PSD). Verifying PSD-ness on a small test set before plugging into a kernel method saves debugging time downstream.
Summary
- The kernel trick replaces $\langle x_i, x_j \rangle$ with $k(x_i, x_j)$ in any inner-product-based algorithm, implicitly lifting data to a (possibly infinite-dimensional) feature space.
- Worked example: $(x^\top z)^2 = \phi(x)^\top \phi(z)$ with $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$. The factor $\sqrt{2}$ compensates for the coefficient $2$ on the cross-term in the expansion.
- The polynomial kernel of degree $p$ in $\mathbb{R}^d$ implicitly uses $\binom{d+p}{p}$ features.
- The Gaussian RBF kernel implicitly uses infinitely many features (Taylor expansion of $e^{x^\top z / \sigma^2}$).
- Mercer's condition says $k$ is a kernel iff it is symmetric and positive semi-definite; equivalently, every Gram matrix is PSD.
- Canonical applications: SVM, ridge regression, PCA, k-means.
- The trick is great for moderate $n$ but scales as $O(n^3)$ in time; for large $n$, random feature maps or deep networks win.
Exercises
Problem
For the polynomial kernel $k(x, z) = (x^\top z + 1)^2$ on $\mathbb{R}^2$, construct the feature map $\phi : \mathbb{R}^2 \to \mathbb{R}^6$ explicitly and verify $k(x, z) = \phi(x)^\top \phi(z)$ by direct expansion.
Problem
Show that the kernel sum $k_1 + k_2$ and the kernel product $k_1 \cdot k_2$ are both valid kernels whenever $k_1, k_2$ are valid kernels.
Problem
Show that the Gaussian RBF kernel is positive semi-definite. (You may use the fact that a function on $\mathbb{R}^d$ whose Fourier transform is non-negative is PSD, or construct a feature map directly via the Taylor expansion.)
Problem
The random Fourier features method (Rahimi & Recht 2007) approximates the Gaussian RBF kernel by an explicit finite-dimensional feature map. Sketch the construction: how do you choose the random features so that $\mathbb{E}\big[z(x)^\top z(y)\big] = k(x, y)$? What is the convergence rate as a function of the number of features $R$?
References
Canonical:
- Schölkopf, B., & Smola, A.J. (2002). Learning with Kernels. MIT Press. Ch. 2 (the kernel trick and its computational consequences).
- Aizerman, M.A., Braverman, E.M., & Rozonoer, L.I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning." Automation and Remote Control, 25:821–837. (The original kernel-trick paper, decades before SVM.)
- Boser, B.E., Guyon, I.M., & Vapnik, V.N. (1992). "A Training Algorithm for Optimal Margin Classifiers." COLT, 144–152. (The first kernel SVM.)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Ch. 12.
Current:
- Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. Ch. 16 (kernel methods).
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 17 (kernels and Gaussian processes).
- Rahimi, A., & Recht, B. (2007). "Random Features for Large-Scale Kernel Machines." NeurIPS, 1177–1184. (Random Fourier features; the bridge from kernels to neural networks.)
- Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. Ch. 4 (covariance functions / kernels).
Next Topics
- Kernels and RKHS: the theoretical framework that makes the kernel trick rigorous — covering Mercer's theorem, the reproducing property, and the representer theorem.
- Gaussian processes for ML: the Bayesian counterpart of kernel ridge regression with full posterior over functions.
- Bayesian linear regression: the finite-feature, fully derived posterior that the kernel trick lifts into infinite dimensions.
Last reviewed: May 10, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0 · tier 1)
- Convex Optimization Basics (layer 1 · tier 1)
- Ridge Regression (layer 1 · tier 1)
- Bayesian Linear Regression (layer 2 · tier 1)
- Support Vector Machines (layer 2 · tier 1)
Derived topics
- Gram Matrices and Kernel Matrices (layer 1 · tier 1)
- Gaussian Process Regression (layer 3 · tier 2)
- Kernels and Reproducing Kernel Hilbert Spaces (layer 3 · tier 2)
- Gaussian Processes for Machine Learning (layer 4 · tier 3)