

Kernels and Reproducing Kernel Hilbert Spaces

Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.

Advanced · Tier 2 · Stable · ~70 min

Why This Matters

Kernel methods solved one of the earliest fundamental problems in machine learning: how to learn nonlinear decision boundaries using algorithms designed for linear models. The kernel trick replaces inner products $\langle x, x' \rangle$ with kernel evaluations $k(x, x')$, implicitly mapping data into a high-dimensional (potentially infinite-dimensional) feature space without ever computing the feature vectors.

But kernels are more than a computational trick. The theory of Reproducing Kernel Hilbert Spaces (RKHS) provides a rigorous framework for function-space learning: instead of optimizing over a finite-dimensional parameter vector, you optimize over an infinite-dimensional space of functions, and the representer theorem guarantees that the solution is finite-dimensional anyway.

Understanding RKHS is also essential for modern theory: Gaussian processes are the Bayesian counterpart of kernel methods, neural tangent kernels connect deep learning to kernel regression, and RKHS norms appear in interpolation theory and approximation bounds.

Mental Model

Imagine you want to learn a nonlinear function of $x \in \mathbb{R}^d$. One approach: map $x$ to a feature vector $\phi(x)$ in a higher-dimensional space and learn a linear function there. If $\phi(x)$ includes all monomials of degree $\leq p$, then a linear function in feature space is a polynomial of degree $\leq p$ in the original space.

The problem: $\phi(x)$ can be enormous (or infinite-dimensional). The kernel trick observes that many algorithms only access the data through inner products $\langle \phi(x), \phi(x') \rangle$. If you can compute $k(x, x') = \langle \phi(x), \phi(x') \rangle$ directly (without constructing $\phi$), you get nonlinear learning at linear cost.
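As a small illustration (the kernel and data below are arbitrary choices, not from the text), many algorithms only need feature-space distances, and $\|\phi(x) - \phi(x')\|^2 = k(x,x) - 2k(x,x') + k(x',x')$ lets you compute them from kernel evaluations alone:

```python
import numpy as np

# Minimal sketch: feature-space distances without ever constructing phi.

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def feature_space_dist_sq(x, z, kernel):
    """||phi(x) - phi(z)||^2 computed purely from kernel evaluations."""
    return kernel(x, x) - 2 * kernel(x, z) + kernel(z, z)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(feature_space_dist_sq(x, z, rbf_kernel))
```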

An RKHS is the function space associated with a kernel. It is the space of functions that "the kernel can represent," equipped with a norm that measures function complexity.

Formal Setup

Definition

Positive Definite Kernel

A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite (p.d.) kernel if it is symmetric ($k(x, x') = k(x', x)$) and for any $n$ points $x_1, \ldots, x_n \in \mathcal{X}$ and any $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$:

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \geq 0$$

Equivalently, the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semidefinite for every finite collection of points.
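A quick numerical sanity check of this property, using an assumed RBF kernel and random data:

```python
import numpy as np

# The Gram matrix of a p.d. kernel is positive semidefinite, so its smallest
# eigenvalue should be >= 0 up to floating-point error.

def gram_matrix(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gram_matrix(X, rbf)

print(np.linalg.eigvalsh(K).min())   # expected: >= -1e-10
```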

Definition

Reproducing Kernel Hilbert Space (RKHS)

Given a p.d. kernel $k$, the RKHS $\mathcal{H}_k$ is the unique Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ satisfying:

  1. Containment: $k(\cdot, x) \in \mathcal{H}_k$ for all $x \in \mathcal{X}$
  2. Reproducing property: $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$

The reproducing property says: evaluating $f$ at $x$ is the same as taking an inner product with the "kernel function centered at $x$." This is what makes the kernel trick work.

Definition

RKHS Norm

The RKHS norm $\|f\|_{\mathcal{H}_k}$ measures the complexity of $f$ within the function space. For $f = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$:

$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha$$

More generally, in the Mercer expansion $f = \sum_j c_j \phi_j$, we have $\|f\|_{\mathcal{H}_k}^2 = \sum_j c_j^2 / \lambda_j$ (see Mercer's Theorem below). The finite-span formula is the specialization when $f$ lies in $\mathrm{span}\{k(\cdot, x_i)\}$.

Functions with small RKHS norm are "smooth" relative to the kernel. The RKHS norm plays the role of the regularizer in kernel methods.
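A short sketch of the finite-span norm formula; the kernel, points, and coefficients below are arbitrary choices for illustration:

```python
import numpy as np

# For f = sum_i alpha_i k(., x_i), the squared RKHS norm is alpha^T K alpha.

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
alpha = rng.normal(size=10)

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
print(alpha @ K @ alpha)   # non-negative, since K is positive semidefinite
```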

Main Theorems

Theorem

Mercer's Theorem

Statement

Let $k$ be a continuous positive definite kernel on a compact set $\mathcal{X} \subset \mathbb{R}^d$. Then there exist orthonormal eigenfunctions $\{\phi_j\}_{j=1}^\infty$ and non-negative eigenvalues $\{\lambda_j\}_{j=1}^\infty$ (with $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$) such that:

$$k(x, x') = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(x')$$

The convergence is absolute and uniform on $\mathcal{X} \times \mathcal{X}$. The RKHS consists of functions $f = \sum_j c_j \phi_j$ with $\|f\|_{\mathcal{H}_k}^2 = \sum_j c_j^2/\lambda_j < \infty$.

Intuition

Mercer's theorem says every continuous p.d. kernel has a spectral decomposition: a "basis" of features ordered by importance. The feature map $\phi(x) = (\sqrt{\lambda_1}\phi_1(x), \sqrt{\lambda_2}\phi_2(x), \ldots)$ explicitly constructs the (possibly infinite-dimensional) feature space in which $k(x, x') = \langle \phi(x), \phi(x') \rangle$.

The eigenvalues $\lambda_j$ control the "effective dimensionality" of the kernel. Fast eigenvalue decay (e.g., exponential for the Gaussian kernel) means the kernel effectively lives in a low-dimensional space despite having infinitely many features.
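One way to see this decay empirically (a sketch, under the assumption that Gram-matrix eigenvalues scaled by $1/n$ approximate the Mercer eigenvalues under the sampling distribution):

```python
import numpy as np

# Empirical Mercer spectrum of the Gaussian kernel: eigenvalues of K / n
# approximate lambda_j and fall off roughly geometrically.

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))

K = np.exp(-((X - X.T) ** 2) / 2.0)            # Gaussian kernel, sigma = 1
lam = np.linalg.eigvalsh(K)[::-1] / len(X)      # approximate eigenvalues, descending

print(lam[:8])                                  # rapid decay over the first few indices
print(lam[1:8] / lam[:7])                       # successive ratios stay well below 1
```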

Proof Sketch

Define the integral operator $T_k: L^2(\mathcal{X}) \to L^2(\mathcal{X})$ by $(T_k f)(x) = \int k(x, x') f(x') \, d\mu(x')$. By positive definiteness, $T_k$ is a positive, self-adjoint, compact operator (compactness follows from continuity of $k$ and compactness of $\mathcal{X}$). By the spectral theorem for compact self-adjoint operators, $T_k$ has a countable set of non-negative eigenvalues $\lambda_j$ with orthonormal eigenfunctions $\phi_j$. The kernel expansion $k(x,x') = \sum_j \lambda_j \phi_j(x)\phi_j(x')$ follows from the spectral decomposition of $T_k$.

Why It Matters

Mercer's theorem justifies the "implicit feature space" interpretation of kernels. It also connects kernel methods to approximation theory: the eigenvalue decay rate of $k$ determines the approximation power of the RKHS, which in turn determines the generalization rate of kernel learning. Faster decay means the effective dimension is lower, so fewer samples are needed.

Failure Mode

Mercer's theorem requires compactness of $\mathcal{X}$ and continuity of $k$. For non-compact domains, you need the more general theory of positive definite functions and the Moore-Aronszajn theorem, which guarantees the existence of an RKHS for any p.d. kernel, not just continuous ones on compact sets.

Theorem

The Representer Theorem

Statement

Consider the regularized empirical risk minimization problem over an RKHS $\mathcal{H}_k$:

$$\min_{f \in \mathcal{H}_k} \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i) + g(\|f\|_{\mathcal{H}_k})$$

where $g: [0, \infty) \to \mathbb{R}$ is a strictly increasing function. Then every minimizer $f^*$ has the finite representation:

$$f^*(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

for some coefficients $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$.

Intuition

Even though $\mathcal{H}_k$ is infinite-dimensional, the optimal function is a finite linear combination of kernel evaluations at the training points. The regularizer penalizes complexity (RKHS norm), so any component of $f$ orthogonal to the span of $\{k(\cdot, x_i)\}$ contributes to the norm without affecting the training loss. A strictly increasing regularizer therefore "kills" this orthogonal component at optimality.

This reduces an infinite-dimensional optimization problem to a finite-dimensional one: find $\alpha \in \mathbb{R}^n$ minimizing $\frac{1}{n}\sum_i \ell(\sum_j \alpha_j k(x_i, x_j), y_i) + g(\sqrt{\alpha^\top K \alpha})$.
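A minimal sketch of this reduction, with an assumed logistic loss, RBF kernel, and synthetic data (none of these choices come from the text):

```python
import numpy as np
from scipy.optimize import minimize

# Representer-theorem reduction: optimize over alpha in R^n instead of over functions.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=50))          # labels in {-1, +1}

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # Gram matrix
lam = 0.1

def objective(alpha):
    f_vals = K @ alpha                              # f(x_i) = sum_j alpha_j k(x_i, x_j)
    loss = np.mean(np.logaddexp(0.0, -y * f_vals))  # logistic loss log(1 + exp(-y f))
    return loss + lam * alpha @ K @ alpha           # RKHS-norm regularizer

alpha_star = minimize(objective, np.zeros(len(X))).x
print(objective(alpha_star))
```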

Proof Sketch

Decompose any $f \in \mathcal{H}_k$ as $f = f_\parallel + f_\perp$, where $f_\parallel = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$ is in the span of the kernel functions at the training points and $f_\perp$ is orthogonal to that span.

By the reproducing property: $f(x_i) = \langle f, k(\cdot, x_i)\rangle = \langle f_\parallel, k(\cdot, x_i)\rangle = f_\parallel(x_i)$. So $f_\perp$ does not affect any $f(x_i)$ and hence does not affect the loss.

By the Pythagorean theorem: $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$. Since $g$ is strictly increasing, $g(\|f\|) \geq g(\|f_\parallel\|)$, with equality only when $f_\perp = 0$.

Setting $f_\perp = 0$ therefore leaves the loss unchanged and never increases the regularizer (it strictly decreases it whenever $f_\perp \neq 0$), so every minimizer satisfies $f^* = f^*_\parallel$.

Why It Matters

The representer theorem is what makes kernel methods computationally feasible. Without it, optimizing over an RKHS would require searching through an infinite-dimensional space. With it, you solve an $n$-dimensional problem in $\alpha$ (an $n \times n$ linear system for squared loss). The cost is $O(n^3)$ for exact methods, which is why kernel methods scale poorly to large $n$ (a separate problem from the theoretical power of the method).

Failure Mode

The representer theorem requires a strictly increasing regularizer. Without regularization ($g = 0$), the theorem does not apply: there may be minimizers with nonzero orthogonal components. Also, the representation $f^* = \sum_i \alpha_i k(\cdot, x_i)$ involves $n$ terms, so the model complexity grows with the dataset size. This is both a feature (the model adapts) and a bug (prediction cost is $O(n)$ per test point).

Standard Kernels

The most common kernels and their properties:

Linear kernel: $k(x, x') = x^\top x'$. RKHS = linear functions. Mercer eigenvalues depend on the data distribution. No nonlinearity.

Polynomial kernel: $k(x, x') = (x^\top x' + c)^p$ with $c \geq 0$. For $c > 0$, the RKHS on a compact set equals the polynomials of degree $\leq p$ in $d$ variables (dimension $\binom{d+p}{p}$), equipped with a specific weighted-coefficient norm derived from multinomial coefficients (Steinwart & Christmann 2008, Example 4.11). For $c = 0$ the kernel is homogeneous and the RKHS consists of homogeneous polynomials of degree exactly $p$.

Gaussian (RBF) kernel: $k(x, x') = \exp(-\|x - x'\|^2/(2\sigma^2))$. The RKHS is infinite-dimensional. On compact domains or under a Gaussian input measure the Mercer eigenvalues decay geometrically ($\lambda_j \sim B^j$ for some $B \in (0,1)$; see Rasmussen & Williams 2006, Section 4.3, eq. 4.42; Zhu et al. 1998). Universal (Steinwart 2001; Micchelli, Xu & Zhang 2006): the RKHS is dense in $C(\mathcal{X})$ in the sup-norm, so any continuous function on a compact $\mathcal{X}$ can be approximated arbitrarily well by RKHS elements. Universal does not mean the RKHS contains every continuous function. The bandwidth $\sigma$ controls smoothness.

Laplacian kernel: $k(x, x') = \exp(-\|x - x'\|/\sigma)$. This is the Matérn kernel with $\nu = 1/2$ up to rescaling. The RKHS on $\mathbb{R}^d$ is the Sobolev space $H^{(d+1)/2}$ (up to norm equivalence; Wendland 2005, Ch. 10; Fasshauer 2011). Sample functions are continuous but generally non-differentiable in the classical sense; they have square-integrable weak derivatives of fractional order. Used when the target function is rough.
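For concreteness, the four kernels above written as plain functions on single points (a minimal sketch; parameter defaults are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, p=3):
    return (x @ z + c) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def laplacian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / sigma)
```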

Connection to SVMs

The support vector machine (SVM) is regularized ERM with hinge loss in an RKHS:

$$\min_{f \in \mathcal{H}_k} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i f(x_i)) + \lambda\|f\|_{\mathcal{H}_k}^2$$

By the representer theorem, $f^* = \sum_i \alpha_i k(\cdot, x_i)$. The dual problem (via Lagrangian duality from convex optimization) produces the "support vector" formulation: at optimality, the solution is typically sparse, with most $\alpha_i = 0$. The nonzero $\alpha_i$ correspond to the support vectors. For soft-margin SVM with box constraint $0 \leq \alpha_i \leq C$, the KKT conditions split these into (i) margin points with $0 < \alpha_i < C$ lying exactly on the margin ($y_i f(x_i) = 1$), and (ii) margin violators with $\alpha_i = C$ lying inside the margin or misclassified ($y_i f(x_i) < 1$). The "closest to the decision boundary" picture applies only to hard-margin SVM on separable data (Cortes & Vapnik 1995). Sparsity of the dual solution is a consequence of the hinge loss's non-differentiable kink at margin $= 1$ and the KKT complementary-slackness conditions (Steinwart & Christmann 2008, Thm. 5.28).

This sparsity is a bonus: prediction cost is $O(|\mathrm{SV}| \cdot d)$ rather than $O(n \cdot d)$, where $|\mathrm{SV}|$ is the number of support vectors.
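A quick illustration of this sparsity using scikit-learn's SVC (the data and hyperparameters are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.svm import SVC

# Fit an RBF-kernel SVM on a circularly separable problem and count support vectors.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X))   # typically far fewer than n
```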

Common Confusions

Watch Out

The representer theorem is structural, not a recommendation to use kernel methods

The representer theorem says: "if you regularize with RKHS norm and your loss only depends on function values at training points, then the optimum is a kernel expansion." It does not say kernel methods are the best approach. In high dimensions with large datasets, neural networks typically outperform kernel methods despite lacking a representer theorem. The theorem is about the structure of the solution, not its quality.

Confusing "the solution has a nice form" with "the method is good" is a common error. The representer theorem is a mathematical fact, not practical advice.

Watch Out

The kernel trick is not about avoiding computation of $\phi(x)$

A common oversimplification: "the kernel trick avoids computing the high-dimensional feature map." More precisely, the kernel trick avoids the explicit dependence on feature dimension. The cost instead depends on $n$ (number of training points) through the $n \times n$ Gram matrix. For large $n$, this can be worse than working with explicit features when the feature dimension is moderate. The kernel trick trades feature dimension for sample size.
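A back-of-the-envelope comparison (my own illustration, with arbitrary sizes): the Gram matrix has $n^2$ entries, while explicit degree-2 polynomial features have $n \cdot D$ entries with $D = O(d^2)$.

```python
# Memory footprint: Gram matrix vs explicit degree-2 polynomial features.
n, d = 5000, 10
D = d * (d + 1) // 2 + d + 1              # monomials of degree <= 2, including a bias term

print("Gram matrix entries:     ", n * n)   # 25,000,000
print("Explicit feature entries:", n * D)   # 330,000
```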

Watch Out

RKHS norm is not just any smoothness measure

The RKHS norm $\|f\|_{\mathcal{H}_k}$ measures smoothness relative to the kernel. For the Gaussian kernel, large RKHS norm means the function has high-frequency content. For the polynomial kernel, it means large coefficients on high-degree monomials. Different kernels induce different notions of "complexity." Choosing the wrong kernel means the RKHS norm does not correspond to the relevant notion of smoothness for your problem.

Canonical Examples

Example

Kernel ridge regression

Kernel ridge regression minimizes $\frac{1}{n}\sum_i (y_i - f(x_i))^2 + \lambda\|f\|_{\mathcal{H}_k}^2$. By the representer theorem, $f^* = \sum_j \alpha_j k(\cdot, x_j)$ and $f^*(x_i) = \sum_j \alpha_j k(x_i, x_j) = (K\alpha)_i$. Substituting:

$$\min_\alpha \frac{1}{n}\|y - K\alpha\|^2 + \lambda \alpha^\top K \alpha$$

Taking the gradient and setting it to zero: $\alpha^* = (K + n\lambda I)^{-1} y$. This is solvable in $O(n^3)$ time. The solution is the Bayesian posterior mean under a Gaussian process prior with covariance kernel $k$.
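A minimal implementation of the closed form above (the RBF kernel, synthetic 1-D data, and the value of $\lambda$ are assumptions for illustration):

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix k(a_i, b_j) for the Gaussian kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 1e-2
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)   # alpha* = (K + n lam I)^{-1} y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(rbf_gram(X_test, X) @ alpha)    # predictions f*(x) = sum_j alpha_j k(x, x_j)
```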

Example

Polynomial kernel for quadratic features

With $k(x, x') = (x^\top x')^2$ on $\mathbb{R}^2$, the feature map is $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^\top$. Computing $k(x, x')$ costs $O(d)$ (one inner product), while computing $\langle\phi(x), \phi(x')\rangle$ explicitly costs $O(d^2)$. For the degree-$p$ polynomial kernel in $d$ dimensions, the feature space has dimension $\binom{d+p}{p}$, which grows rapidly, but $k(x,x')$ always costs $O(d)$.
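A short numerical check of this identity:

```python
import numpy as np

# For k(x, z) = (x^T z)^2 on R^2, phi(x) = (x1^2, x2^2, sqrt(2) x1 x2) reproduces the kernel.

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z) ** 2)    # both equal 1.0: (1*3 + 2*(-1))^2 = 1
```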

Exercises

ExerciseCore

Problem

Verify the reproducing property for the linear kernel $k(x, x') = x^\top x'$. What is the RKHS? What is the RKHS norm of $f(x) = w^\top x$?

ExerciseCore

Problem

Show that the Gram matrix $K_{ij} = k(x_i, x_j)$ of a p.d. kernel is positive semidefinite. Why does this imply $\alpha^\top K\alpha \geq 0$ for all $\alpha \in \mathbb{R}^n$?

ExerciseAdvanced

Problem

Prove that without a regularizer ($g = 0$), the representer theorem does not hold. Construct a loss function and kernel where some minimizer of $\frac{1}{n}\sum_i \ell(f(x_i), y_i)$ over $\mathcal{H}_k$ is not in the span of $\{k(\cdot, x_i)\}$.


References

Canonical:

  • Schölkopf & Smola, Learning with Kernels (2002), Chapters 1-4
  • Aronszajn, "Theory of Reproducing Kernels" (1950), Trans. AMS
  • Cortes & Vapnik, "Support-Vector Networks" (1995), Machine Learning
  • Rasmussen & Williams, Gaussian Processes for Machine Learning (2006), Section 4.3
  • Wendland, Scattered Data Approximation (2005), Chapter 10

Current:

  • Steinwart & Christmann, Support Vector Machines (2008), Chapters 4-5 (Example 4.11, Theorem 5.28)

  • Steinwart, "On the influence of the kernel on the consistency of support vector machines" (2001), JMLR

  • Micchelli, Xu & Zhang, "Universal Kernels" (2006), JMLR

  • Fasshauer, "Positive Definite Kernels: Past, Present and Future" (2011), Dolomites Research Notes on Approximation

  • Zhu, Williams, Rohwer & Morciniec, "Gaussian regression and optimal finite dimensional linear models" (1998), Neural Networks and Machine Learning

  • Bartlett & Mendelson, "Rademacher and Gaussian Complexities" (2002), JMLR (for RKHS Rademacher bounds)

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3

Last reviewed: April 2026
