

Kernels and Reproducing Kernel Hilbert Spaces

Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.

Advanced · Tier 2 · Stable · ~70 min

Why This Matters

Kernel methods solved one of the earliest fundamental problems in machine learning: how to learn nonlinear decision boundaries using algorithms designed for linear models. The kernel trick replaces inner products $\langle x, x' \rangle$ with kernel evaluations $k(x, x')$, implicitly mapping data into a high-dimensional (potentially infinite-dimensional) feature space without ever computing the feature vectors.

But kernels are more than a computational trick. The theory of Reproducing Kernel Hilbert Spaces (RKHS) provides a rigorous framework for function-space learning: instead of optimizing over a finite-dimensional parameter vector, you optimize over an infinite-dimensional space of functions, and the representer theorem guarantees that the solution is finite-dimensional anyway.

Understanding RKHS is also essential for modern theory: Gaussian processes are the Bayesian counterpart of kernel methods, neural tangent kernels connect deep learning to kernel regression, and RKHS norms appear in interpolation theory and approximation bounds.

Mental Model

Imagine you want to learn a nonlinear function of $x \in \mathbb{R}^d$. One approach: map $x$ to a feature vector $\phi(x)$ in a higher-dimensional space and learn a linear function there. If $\phi(x)$ includes all monomials of degree $\leq p$, then a linear function in feature space is a polynomial of degree $\leq p$ in the original space.

The problem: $\phi(x)$ can be enormous (or infinite-dimensional). The kernel trick observes that many algorithms only access the data through inner products $\langle \phi(x), \phi(x') \rangle$. If you can compute $k(x, x') = \langle \phi(x), \phi(x') \rangle$ directly (without constructing $\phi$), you get nonlinear learning at linear cost.
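As a small illustration (the kernel and data below are arbitrary choices, not from the text), many algorithms only need feature-space distances, and $\|\phi(x) - \phi(x')\|^2 = k(x,x) - 2k(x,x') + k(x',x')$ lets you compute them from kernel evaluations alone:

```python
import numpy as np

# Minimal sketch: feature-space distances without ever constructing phi.

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def feature_space_dist_sq(x, z, kernel):
    """||phi(x) - phi(z)||^2 computed purely from kernel evaluations."""
    return kernel(x, x) - 2 * kernel(x, z) + kernel(z, z)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(feature_space_dist_sq(x, z, rbf_kernel))
```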

An RKHS is the function space associated with a kernel. It is the space of functions that "the kernel can represent," equipped with a norm that measures function complexity.

Formal Setup

Definition

Positive Definite Kernel

A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite (p.d.) kernel if it is symmetric ($k(x, x') = k(x', x)$) and for any $n$ points $x_1, \ldots, x_n \in \mathcal{X}$ and any $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$:

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \geq 0$$

Equivalently, the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semidefinite for every finite collection of points.
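A quick numerical sanity check of this property, using an assumed RBF kernel and random data:

```python
import numpy as np

# The Gram matrix of a p.d. kernel is positive semidefinite, so its smallest
# eigenvalue should be >= 0 up to floating-point error.

def gram_matrix(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gram_matrix(X, rbf)

print(np.linalg.eigvalsh(K).min())   # expected: >= -1e-10
```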

Definition

Reproducing Kernel Hilbert Space (RKHS)

Given a p.d. kernel $k$, the RKHS $\mathcal{H}_k$ is the unique Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ satisfying:

  1. Containment: $k(\cdot, x) \in \mathcal{H}_k$ for all $x \in \mathcal{X}$
  2. Reproducing property: $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$

The reproducing property says: evaluating $f$ at $x$ is the same as taking an inner product with the "kernel function centered at $x$." This is what makes the kernel trick work.

Definition

RKHS Norm

The RKHS norm $\|f\|_{\mathcal{H}_k}$ measures the complexity of $f$ within the function space. For $f = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$:

$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha$$

More generally, in the Mercer expansion $f = \sum_j c_j \phi_j$, we have $\|f\|_{\mathcal{H}_k}^2 = \sum_j c_j^2 / \lambda_j$ (see Mercer's Theorem below). The finite-span formula is the specialization when $f$ lies in $\mathrm{span}\{k(\cdot, x_i)\}$.

Functions with small RKHS norm are "smooth" relative to the kernel. The RKHS norm plays the role of the regularizer in kernel methods.
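A short sketch of the finite-span norm formula; the kernel, points, and coefficients below are arbitrary choices for illustration:

```python
import numpy as np

# For f = sum_i alpha_i k(., x_i), the squared RKHS norm is alpha^T K alpha.

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
alpha = rng.normal(size=10)

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
print(alpha @ K @ alpha)   # non-negative, since K is positive semidefinite
```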

Main Theorems

Theorem

Mercer's Theorem

Statement

Let $k$ be a continuous positive definite kernel on a compact set $\mathcal{X} \subset \mathbb{R}^d$. Then there exist orthonormal eigenfunctions $\{\phi_j\}_{j=1}^\infty$ and non-negative eigenvalues $\{\lambda_j\}_{j=1}^\infty$ (with $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$) such that:

$$k(x, x') = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(x')$$

The convergence is absolute and uniform on $\mathcal{X} \times \mathcal{X}$. The RKHS consists of functions $f = \sum_j c_j \phi_j$ with $\|f\|_{\mathcal{H}_k}^2 = \sum_j c_j^2/\lambda_j < \infty$.

Intuition

Mercer's theorem says every continuous p.d. kernel has a spectral decomposition: a "basis" of features ordered by importance. The feature map $\phi(x) = (\sqrt{\lambda_1}\phi_1(x), \sqrt{\lambda_2}\phi_2(x), \ldots)$ explicitly constructs the (possibly infinite-dimensional) feature space in which $k(x, x') = \langle \phi(x), \phi(x') \rangle$.

The eigenvalues $\lambda_j$ control the "effective dimensionality" of the kernel. Fast eigenvalue decay (e.g., exponential for the Gaussian kernel) means the kernel effectively lives in a low-dimensional space despite having infinitely many features.
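One way to see this decay empirically (a sketch, under the assumption that Gram-matrix eigenvalues scaled by $1/n$ approximate the Mercer eigenvalues under the sampling distribution):

```python
import numpy as np

# Empirical Mercer spectrum of the Gaussian kernel: eigenvalues of K / n
# approximate lambda_j and fall off roughly geometrically.

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))

K = np.exp(-((X - X.T) ** 2) / 2.0)            # Gaussian kernel, sigma = 1
lam = np.linalg.eigvalsh(K)[::-1] / len(X)      # approximate eigenvalues, descending

print(lam[:8])                                  # rapid decay over the first few indices
print(lam[1:8] / lam[:7])                       # successive ratios stay well below 1
```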

Proof Sketch

Define the integral operator $T_k: L^2(\mathcal{X}) \to L^2(\mathcal{X})$ by $(T_k f)(x) = \int k(x, x') f(x') \, d\mu(x')$. By positive definiteness, $T_k$ is a positive, self-adjoint, compact operator (compactness follows from continuity of $k$ and compactness of $\mathcal{X}$). By the spectral theorem for compact self-adjoint operators, $T_k$ has a countable set of non-negative eigenvalues $\lambda_j$ with orthonormal eigenfunctions $\phi_j$. The kernel expansion $k(x,x') = \sum_j \lambda_j \phi_j(x)\phi_j(x')$ follows from the spectral decomposition of $T_k$.

Why It Matters

Mercer's theorem justifies the "implicit feature space" interpretation of kernels. It also connects kernel methods to approximation theory: the eigenvalue decay rate of $k$ determines the approximation power of the RKHS, which in turn determines the generalization rate of kernel learning. Faster decay means the effective dimension is lower, so fewer samples are needed.

Failure Mode

Mercer's theorem requires compactness of $\mathcal{X}$ and continuity of $k$. For non-compact domains, you need the more general theory of positive definite functions and the Moore-Aronszajn theorem, which guarantees the existence of an RKHS for any p.d. kernel, not just continuous ones on compact sets.

Theorem

The Representer Theorem

Statement

Consider the regularized empirical risk minimization problem over an RKHS $\mathcal{H}_k$:

$$\min_{f \in \mathcal{H}_k} \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i) + g(\|f\|_{\mathcal{H}_k})$$

where $g: [0, \infty) \to \mathbb{R}$ is a strictly increasing function. Then every minimizer $f^*$ has the finite representation:

$$f^*(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

for some coefficients $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$.

Intuition

Even though $\mathcal{H}_k$ is infinite-dimensional, the optimal function is a finite linear combination of kernel evaluations at the training points. The regularizer penalizes complexity (RKHS norm), so any component of $f$ orthogonal to the span of $\{k(\cdot, x_i)\}$ contributes to the norm without affecting the training loss. A strictly increasing regularizer therefore "kills" this orthogonal component at optimality.

This reduces an infinite-dimensional optimization problem to a finite-dimensional one: find $\alpha \in \mathbb{R}^n$ minimizing $\frac{1}{n}\sum_i \ell(\sum_j \alpha_j k(x_i, x_j), y_i) + g(\sqrt{\alpha^\top K \alpha})$.
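A minimal sketch of this reduction, with an assumed logistic loss, RBF kernel, and synthetic data (none of these choices come from the text):

```python
import numpy as np
from scipy.optimize import minimize

# Representer-theorem reduction: optimize over alpha in R^n instead of over functions.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=50))          # labels in {-1, +1}

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # Gram matrix
lam = 0.1

def objective(alpha):
    f_vals = K @ alpha                              # f(x_i) = sum_j alpha_j k(x_i, x_j)
    loss = np.mean(np.logaddexp(0.0, -y * f_vals))  # logistic loss log(1 + exp(-y f))
    return loss + lam * alpha @ K @ alpha           # RKHS-norm regularizer

alpha_star = minimize(objective, np.zeros(len(X))).x
print(objective(alpha_star))
```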

Proof Sketch

Decompose any $f \in \mathcal{H}_k$ as $f = f_\parallel + f_\perp$, where $f_\parallel = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$ is in the span of the kernel functions at the training points and $f_\perp$ is orthogonal to that span.

By the reproducing property: $f(x_i) = \langle f, k(\cdot, x_i)\rangle = \langle f_\parallel, k(\cdot, x_i)\rangle = f_\parallel(x_i)$. So $f_\perp$ does not affect any $f(x_i)$ and hence does not affect the loss.

By the Pythagorean theorem: $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$. Since $g$ is strictly increasing, $g(\|f\|) \geq g(\|f_\parallel\|)$, with equality only when $f_\perp = 0$.

Setting $f_\perp = 0$ therefore leaves the loss unchanged and never increases the regularizer (it strictly decreases it whenever $f_\perp \neq 0$), so every minimizer satisfies $f^* = f^*_\parallel$.

Why It Matters

The representer theorem is what makes kernel methods computationally feasible. Without it, optimizing over an RKHS would require searching through an infinite-dimensional space. With it, you solve an $n$-dimensional problem in $\alpha$ (an $n \times n$ linear system for squared loss). The cost is $O(n^3)$ for exact methods, which is why kernel methods scale poorly to large $n$ (a separate problem from the theoretical power of the method).

Failure Mode

The representer theorem requires a strictly increasing regularizer. Without regularization ($g = 0$), the theorem does not apply: there may be minimizers with nonzero orthogonal components. Also, the representation $f^* = \sum_i \alpha_i k(\cdot, x_i)$ involves $n$ terms, so the model complexity grows with the dataset size. This is both a feature (the model adapts) and a bug (prediction cost is $O(n)$ per test point).

Standard Kernels

The most common kernels and their properties:

Linear kernel: $k(x, x') = x^\top x'$. RKHS = linear functions. Mercer eigenvalues depend on the data distribution. No nonlinearity.

Polynomial kernel: $k(x, x') = (x^\top x' + c)^p$ with $c \geq 0$. For $c > 0$, the RKHS on a compact set equals the polynomials of degree $\leq p$ in $d$ variables (dimension $\binom{d+p}{p}$), equipped with a specific weighted-coefficient norm derived from multinomial coefficients (Steinwart & Christmann 2008, Example 4.11). For $c = 0$ the kernel is homogeneous and the RKHS consists of homogeneous polynomials of degree exactly $p$.

Gaussian (RBF) kernel: $k(x, x') = \exp(-\|x - x'\|^2/(2\sigma^2))$. The RKHS is infinite-dimensional. On compact domains or under a Gaussian input measure the Mercer eigenvalues decay geometrically ($\lambda_j \sim B^j$ for some $B \in (0,1)$; see Rasmussen & Williams 2006, Section 4.3, eq. 4.42; Zhu et al. 1998). Universal (Steinwart 2001; Micchelli, Xu & Zhang 2006): the RKHS is dense in $C(\mathcal{X})$ in the sup-norm, so any continuous function on a compact $\mathcal{X}$ can be approximated arbitrarily well by RKHS elements. Universal does not mean the RKHS contains every continuous function. The bandwidth $\sigma$ controls smoothness.

Laplacian kernel: $k(x, x') = \exp(-\|x - x'\|/\sigma)$. This is the Matérn kernel with $\nu = 1/2$ up to rescaling. The RKHS on $\mathbb{R}^d$ is the Sobolev space $H^{(d+1)/2}$ (up to norm equivalence; Wendland 2005, Ch. 10; Fasshauer 2011). Sample functions are continuous but generally non-differentiable in the classical sense; they have square-integrable weak derivatives of fractional order. Used when the target function is rough.
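For concreteness, the four kernels above written as plain functions on single points (a minimal sketch; parameter defaults are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, p=3):
    return (x @ z + c) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def laplacian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / sigma)
```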

Connection to SVMs

The support vector machine (SVM) is regularized ERM with hinge loss in an RKHS:

$$\min_{f \in \mathcal{H}_k} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i f(x_i)) + \lambda\|f\|_{\mathcal{H}_k}^2$$

By the representer theorem, $f^* = \sum_i \alpha_i k(\cdot, x_i)$. The dual problem (via Lagrangian duality from convex optimization) produces the "support vector" formulation: at optimality, the solution is typically sparse, with most $\alpha_i = 0$. The nonzero $\alpha_i$ correspond to the support vectors. For soft-margin SVM with box constraint $0 \leq \alpha_i \leq C$, the KKT conditions split these into (i) margin points with $0 < \alpha_i < C$ lying exactly on the margin ($y_i f(x_i) = 1$), and (ii) margin violators with $\alpha_i = C$ lying inside the margin or misclassified ($y_i f(x_i) < 1$). The "closest to the decision boundary" picture applies only to hard-margin SVM on separable data (Cortes & Vapnik 1995). Sparsity of the dual solution is a consequence of the hinge loss's non-differentiable kink at margin $= 1$ and the KKT complementary-slackness conditions (Steinwart & Christmann 2008, Thm. 5.28).

This sparsity is a bonus: prediction cost is $O(|\mathrm{SV}| \cdot d)$ rather than $O(n \cdot d)$, where $|\mathrm{SV}|$ is the number of support vectors.
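A quick illustration of this sparsity using scikit-learn's SVC (the data and hyperparameters are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.svm import SVC

# Fit an RBF-kernel SVM on a circularly separable problem and count support vectors.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X))   # typically far fewer than n
```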

Common Confusions

Watch Out

The representer theorem is structural, not a recommendation to use kernel methods

The representer theorem says: "if you regularize with RKHS norm and your loss only depends on function values at training points, then the optimum is a kernel expansion." It does not say kernel methods are the best approach. In high dimensions with large datasets, neural networks typically outperform kernel methods despite lacking a representer theorem. The theorem is about the structure of the solution, not its quality.

Confusing "the solution has a nice form" with "the method is good" is a common error. The representer theorem is a mathematical fact, not practical advice.

Watch Out

The kernel trick is not about avoiding computation of $\phi(x)$

A common oversimplification: "the kernel trick avoids computing the high-dimensional feature map." More precisely, the kernel trick avoids the explicit dependence on feature dimension. The cost instead depends on $n$ (number of training points) through the $n \times n$ Gram matrix. For large $n$, this can be worse than working with explicit features when the feature dimension is moderate. The kernel trick trades feature dimension for sample size.
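A back-of-the-envelope comparison (my own illustration, with arbitrary sizes): the Gram matrix has $n^2$ entries, while explicit degree-2 polynomial features have $n \cdot D$ entries with $D = O(d^2)$.

```python
# Memory footprint: Gram matrix vs explicit degree-2 polynomial features.
n, d = 5000, 10
D = d * (d + 1) // 2 + d + 1              # monomials of degree <= 2, including a bias term

print("Gram matrix entries:     ", n * n)   # 25,000,000
print("Explicit feature entries:", n * D)   # 330,000
```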

Watch Out

RKHS norm is not just any smoothness measure

The RKHS norm $\|f\|_{\mathcal{H}_k}$ measures smoothness relative to the kernel. For the Gaussian kernel, large RKHS norm means the function has high-frequency content. For the polynomial kernel, it means large coefficients on high-degree monomials. Different kernels induce different notions of "complexity." Choosing the wrong kernel means the RKHS norm does not correspond to the relevant notion of smoothness for your problem.

Canonical Examples

Example

Kernel ridge regression

Kernel ridge regression minimizes $\frac{1}{n}\sum_i (y_i - f(x_i))^2 + \lambda\|f\|_{\mathcal{H}_k}^2$. By the representer theorem, $f^* = \sum_j \alpha_j k(\cdot, x_j)$ and $f^*(x_i) = \sum_j \alpha_j k(x_i, x_j) = (K\alpha)_i$. Substituting:

$$\min_\alpha \frac{1}{n}\|y - K\alpha\|^2 + \lambda \alpha^\top K \alpha$$

Taking the gradient and setting it to zero: $\alpha^* = (K + n\lambda I)^{-1} y$. This is solvable in $O(n^3)$ time. The solution is the Bayesian posterior mean under a Gaussian process prior with covariance kernel $k$.
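A minimal implementation of the closed form above (the RBF kernel, synthetic 1-D data, and the value of $\lambda$ are assumptions for illustration):

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix k(a_i, b_j) for the Gaussian kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 1e-2
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)   # alpha* = (K + n lam I)^{-1} y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(rbf_gram(X_test, X) @ alpha)    # predictions f*(x) = sum_j alpha_j k(x, x_j)
```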

Example

Polynomial kernel for quadratic features

With $k(x, x') = (x^\top x')^2$ on $\mathbb{R}^2$, the feature map is $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^\top$. Computing $k(x, x')$ costs $O(d)$ (one inner product), while computing $\langle\phi(x), \phi(x')\rangle$ explicitly costs $O(d^2)$. For the degree-$p$ polynomial kernel in $d$ dimensions, the feature space has dimension $\binom{d+p}{p}$, which grows rapidly, but $k(x,x')$ always costs $O(d)$.
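A short numerical check of this identity:

```python
import numpy as np

# For k(x, z) = (x^T z)^2 on R^2, phi(x) = (x1^2, x2^2, sqrt(2) x1 x2) reproduces the kernel.

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z) ** 2)    # both equal 1.0: (1*3 + 2*(-1))^2 = 1
```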

Exercises

ExerciseCore

Problem

Verify the reproducing property for the linear kernel $k(x, x') = x^\top x'$. What is the RKHS? What is the RKHS norm of $f(x) = w^\top x$?

ExerciseCore

Problem

Show that the Gram matrix $K_{ij} = k(x_i, x_j)$ of a p.d. kernel is positive semidefinite. Why does this imply $\alpha^\top K\alpha \geq 0$ for all $\alpha \in \mathbb{R}^n$?

ExerciseAdvanced

Problem

Prove that without a regularizer ($g = 0$), the representer theorem does not hold. Construct a loss function and kernel where some minimizer of $\frac{1}{n}\sum_i \ell(f(x_i), y_i)$ over $\mathcal{H}_k$ is not in the span of $\{k(\cdot, x_i)\}$.


References

Canonical:

  • Schölkopf & Smola, Learning with Kernels (2002), Chapters 1-4
  • Aronszajn, "Theory of Reproducing Kernels" (1950), Trans. AMS
  • Cortes & Vapnik, "Support-Vector Networks" (1995), Machine Learning
  • Rasmussen & Williams, Gaussian Processes for Machine Learning (2006), Section 4.3
  • Wendland, Scattered Data Approximation (2005), Chapter 10

Current:

  • Steinwart & Christmann, Support Vector Machines (2008), Chapters 4-5 (Example 4.11, Theorem 5.28)

  • Steinwart, "On the influence of the kernel on the consistency of support vector machines" (2001), JMLR

  • Micchelli, Xu & Zhang, "Universal Kernels" (2006), JMLR

  • Fasshauer, "Positive Definite Kernels: Past, Present and Future" (2011), Dolomites Research Notes on Approximation

  • Zhu, Williams, Rohwer & Morciniec, "Gaussian regression and optimal finite dimensional linear models" (1998), Neural Networks and Machine Learning

  • Bartlett & Mendelson, "Rademacher and Gaussian Complexities" (2002), JMLR (for RKHS Rademacher bounds)

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3

Last reviewed: April 2026
