

Gaussian Processes for Machine Learning

A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.



Why This Matters

A Gaussian process is a distribution over functions. Instead of fitting a single function to data (as in regression or neural networks), a GP maintains a distribution over all functions consistent with the data. This gives you two things most ML models do not: a prediction and a calibrated measure of how uncertain that prediction is.

GPs are the gold standard for uncertainty quantification in regression. They are widely used in Bayesian optimization (where you need to balance exploration and exploitation), in scientific modeling (where error bars matter), and as a theoretical bridge between kernel methods and neural networks via the neural tangent kernel.

The connection to kernel methods is deep: the GP posterior mean is exactly the kernel ridge regression solution. This means GPs give a Bayesian interpretation to kernel methods and add uncertainty on top.

Mental Model

Imagine the space of all smooth functions from $\mathbb{R}^d$ to $\mathbb{R}$. A GP defines a probability distribution over this space. Before seeing any data, you have a prior: many functions are plausible. After observing data points $(x_i, y_i)$, the posterior concentrates on functions that pass near the data. At points far from any observation, the posterior is wide (high uncertainty). Near observations, it is narrow (low uncertainty).

Formal Setup and Notation

Definition

Gaussian Process

A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by:

  • A mean function $m(x) = \mathbb{E}[f(x)]$
  • A covariance function (kernel) $k(x, x') = \text{Cov}(f(x), f(x'))$

We write $f \sim \mathcal{GP}(m, k)$. For any finite set of inputs $X = \{x_1, \ldots, x_n\}$:

$$\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top \sim \mathcal{N}(\mathbf{m}, K)$$

where $\mathbf{m}_i = m(x_i)$ and $K_{ij} = k(x_i, x_j)$.

Common choices for the kernel:

  • Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Produces infinitely smooth functions. The length scale $\ell$ controls how quickly correlations decay.
  • Matérn kernel: a family $k(x, x')$ indexed by a smoothness parameter $\nu$. At $\nu = 1/2$ it gives the Ornstein-Uhlenbeck process; as $\nu \to \infty$ it recovers the squared exponential.
  • Linear kernel: $k(x, x') = x^\top x'$. Gives Bayesian linear regression.
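As a concrete reference, the three kernels above can be sketched for scalar inputs in a few lines. This is a minimal illustration, not a library API; the function names are ours, and the Matérn case is shown only at $\nu = 1/2$, where it has a simple closed form.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, ell=1.0):
    # Squared exponential: infinitely smooth sample paths.
    sq = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq / (2 * ell**2))

def matern12_kernel(x1, x2, sigma_f=1.0, ell=1.0):
    # Matern nu = 1/2 (Ornstein-Uhlenbeck): rough, non-differentiable paths.
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma_f**2 * np.exp(-d / ell)

def linear_kernel(x1, x2):
    # Recovers Bayesian linear regression (no bias term).
    return x1[:, None] * x2[None, :]
```

Each function returns the full Gram matrix between two sets of inputs, which is the building block for everything that follows.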

Core Definitions

Definition

Observation Model

We observe noisy function values:

$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$

Given $n$ observations $\mathbf{y} = [y_1, \ldots, y_n]^\top$ at inputs $X = [x_1, \ldots, x_n]$, the joint distribution of the observed values and the function value at a new test point $x_*$ is:

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mathbf{m} \\ m_* \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix} \right)$$

where $\mathbf{k}_* = [k(x_1, x_*), \ldots, k(x_n, x_*)]^\top$ and $k_{**} = k(x_*, x_*)$.

Main Theorems

Theorem

GP Posterior Predictive Distribution

Statement

Given training data $(X, \mathbf{y})$ and a test input $x_*$, the posterior predictive distribution is Gaussian:

$$f_* \mid X, \mathbf{y}, x_* \sim \mathcal{N}(\bar{f}_*, \text{Var}(f_*))$$

with:

$$\bar{f}_* = m(x_*) + \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m})$$

$$\text{Var}(f_*) = k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$

The posterior mean $\bar{f}_*$ is the best prediction. The posterior variance $\text{Var}(f_*)$ quantifies uncertainty at $x_*$.

Intuition

The posterior mean is a weighted combination of the observed values, where the weights come from the kernel similarities to the test point, adjusted by the training data correlations. The posterior variance starts at the prior variance $k_{**}$ and is reduced by the information provided by nearby training points. Far from any training point, $\text{Var}(f_*) \approx k_{**}$ (back to the prior). Near training points, the variance shrinks.

Proof Sketch

This follows directly from the formula for conditioning in a multivariate Gaussian. If $[a, b]^\top \sim \mathcal{N}(\mu, \Sigma)$ with block structure, then $a \mid b \sim \mathcal{N}(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b), \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})$. Apply this with $a = f_*$ and $b = \mathbf{y}$.
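The conditioning formula can be sanity-checked numerically: the conditional covariance $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ is the Schur complement of $\Sigma_{bb}$, and a standard block-inverse identity says it equals the inverse of the $(a,a)$ block of the precision matrix $\Sigma^{-1}$. A small sketch with a random SPD covariance and our own choice of partition:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)          # random symmetric positive definite covariance

# Partition: a = first coordinate, b = remaining two.
S_aa, S_ab = Sigma[:1, :1], Sigma[:1, 1:]
S_ba, S_bb = Sigma[1:, :1], Sigma[1:, 1:]

# Conditional covariance from the Gaussian conditioning formula.
cond_cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)

# Block-inverse identity: equals the inverse of the (a,a) precision block.
Lambda = np.linalg.inv(Sigma)
assert np.allclose(cond_cov, np.linalg.inv(Lambda[:1, :1]))
```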

Why It Matters

This is one of the few cases in machine learning where you get a closed-form posterior distribution over predictions. No MCMC, no variational inference, no approximations (for Gaussian likelihood). The uncertainty estimates are exact and calibrated under the model assumptions.
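Because everything is closed form, a full GP regressor fits in a dozen lines. The sketch below assumes a zero prior mean and a squared exponential kernel with $\sigma_f = 1$, and solves the linear systems via a Cholesky factor for numerical stability; `gp_posterior` is our own helper name, not a library call.

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

def gp_posterior(X, y, X_test, sigma_n=0.1):
    """Posterior mean and variance of f_* under a zero-mean GP prior."""
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
    K_s = rbf(X, X_test)                          # k_* for each test point
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                          # k_*^T (K + s^2 I)^-1 y
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)              # k_** - k_*^T (K + s^2 I)^-1 k_*
    return mean, var
```

With a single observation $(0, 1)$ and small noise, the mean at $x_* = 0$ is pulled close to 1 and the variance collapses, while far from the data the posterior reverts to the prior (mean 0, variance 1).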

Failure Mode

The posterior is exact only for Gaussian likelihoods. For classification (Bernoulli likelihood) or robust regression (heavy-tailed noise), the posterior is no longer Gaussian and you need approximations (Laplace, EP, or variational methods). Also, the uncertainty is calibrated under the model, which may not match reality if the kernel is misspecified.

Proposition

GP Posterior Mean Equals Kernel Ridge Regression

Statement

The GP posterior mean function $\bar{f}(x) = \mathbf{k}(x)^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ (taking a zero prior mean) is identical to the kernel ridge regression solution:

$$\hat{f} = \arg\min_{f \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

with regularization parameter $\lambda = \sigma_n^2/n$ and the RKHS norm induced by the kernel $k$.

Intuition

The GP and kernel ridge regression give the same point predictions. The GP adds uncertainty quantification on top. This means every time you do kernel ridge regression, you are implicitly computing the MAP estimate of a GP model. The GP perspective just gives you error bars for free.

Proof Sketch

By the representer theorem, the kernel ridge regression solution has the form $\hat{f}(x) = \sum_i \alpha_i k(x, x_i)$. Setting the gradient of the regularized objective to zero gives $\alpha = (K + n\lambda I)^{-1} \mathbf{y}$. With $\lambda = \sigma_n^2/n$, this is $(K + \sigma_n^2 I)^{-1} \mathbf{y}$, matching the GP posterior mean weights.
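The equivalence is easy to verify numerically: the GP weight vector $(K + \sigma_n^2 I)^{-1}\mathbf{y}$ and the kernel ridge weights $(K + n\lambda I)^{-1}\mathbf{y}$ with $\lambda = \sigma_n^2/n$ agree to machine precision. A quick check on a toy problem of our own construction:

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=8)
y = np.sin(X) + 0.1 * rng.standard_normal(8)
n, sigma_n2 = len(X), 0.1

K = rbf(X, X)
alpha_gp = np.linalg.solve(K + sigma_n2 * np.eye(n), y)    # GP posterior mean weights
lam = sigma_n2 / n                                         # matching ridge parameter
alpha_krr = np.linalg.solve(K + n * lam * np.eye(n), y)    # kernel ridge weights
assert np.allclose(alpha_gp, alpha_krr)                    # identical point predictions
```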

Why It Matters

This connection is one of the most important bridges in ML theory. It unifies the frequentist (regularization) and Bayesian (GP prior) perspectives. It also means that theoretical results for kernel methods (such as generalization bounds) transfer to GPs, and conversely that the GP posterior endows kernel ridge regression with a principled notion of uncertainty.

Failure Mode

The equivalence holds for the posterior mean only. Kernel ridge regression gives no uncertainty estimates. If you need error bars, you need the full GP posterior, not just the mean.

Marginal Likelihood for Hyperparameter Selection

The kernel has hyperparameters (length scale $\ell$, signal variance $\sigma_f^2$, noise variance $\sigma_n^2$). The GP framework provides a principled way to set them: maximize the marginal likelihood (also called the evidence):

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log(2\pi)$$

The first term penalizes data misfit. The second term penalizes model complexity (it is the log-determinant of the covariance matrix, which grows with the number of effective parameters). This automatic Occam's razor is one of the most appealing features of GPs.
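In code, all three terms can be computed stably from one Cholesky factor, using $\log|K + \sigma_n^2 I| = 2\sum_i \log L_{ii}$. A sketch under the same zero-mean RBF assumptions as before (the helper name is ours):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

def log_marginal_likelihood(X, y, sigma_n=0.1, ell=1.0):
    n = len(X)
    K = rbf(X, X, ell) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                  # penalizes misfit to the data
    complexity = -np.sum(np.log(np.diag(L)))     # -0.5 log|K + sigma_n^2 I|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)
```

Hyperparameters are then chosen by maximizing this quantity over $(\ell, \sigma_f, \sigma_n)$, typically by gradient ascent or a coarse grid search.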

Computational Cost

The bottleneck is inverting the $n \times n$ matrix $K + \sigma_n^2 I$:

  • Exact inference: $O(n^3)$ time and $O(n^2)$ memory. This limits exact GPs to roughly $n \leq 10{,}000$ on modern hardware.
  • Sparse approximations: inducing-point methods reduce the cost to $O(nm^2)$, where $m \ll n$ is the number of inducing points.
  • Structured kernels: if the kernel has special structure (e.g., a stationary kernel on a grid), Toeplitz or Kronecker structure can be exploited for $O(n \log n)$ inference.

Canonical Examples

Example

GP regression on 1D data

Observe 10 noisy points from $f(x) = \sin(x)$ on $[0, 2\pi]$. Using a squared exponential kernel, the GP posterior mean closely tracks the sine curve where data is dense and reverts to the prior mean (zero) where data is sparse. The 95% credible interval (the error bars) is narrow near observations and wide in data-sparse regions. This is the textbook illustration of GP uncertainty quantification.
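This picture can be reproduced in a few lines (our own script; plotting omitted, and the noise level $\sigma_n = 0.1$ is an illustrative choice):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, size=10)        # 10 noisy observations of sin(x)
y = np.sin(X) + 0.1 * rng.standard_normal(10)
X_test = np.linspace(0, 2 * np.pi, 100)

K = rbf(X, X) + 0.01 * np.eye(10)             # K + sigma_n^2 I
K_s = rbf(X, X_test)
mean = K_s.T @ np.linalg.solve(K, y)          # posterior mean over the grid
var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
band = 1.96 * np.sqrt(np.maximum(var, 0))     # 95% credible interval half-width
lower, upper = mean - band, mean + band
```

Plotting `mean` with the `(lower, upper)` band against `np.sin(X_test)` shows the interval pinching near observations and opening up in the gaps.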

Common Confusions

Watch Out

GPs are nonparametric, but they have hyperparameters

A GP is nonparametric in the sense that the number of effective parameters grows with the data (the predictive function depends on all $n$ training points). But the kernel has a fixed, small number of hyperparameters. These control the shape of the prior over functions (smoothness, length scale, amplitude), not individual function values. Tuning them via marginal likelihood is model selection, not parameter estimation.

Watch Out

GP uncertainty is model uncertainty, not aleatoric uncertainty

The posterior variance $\text{Var}(f_*)$ captures epistemic uncertainty (what we do not know about $f$ given limited data). The noise variance $\sigma_n^2$ captures aleatoric uncertainty (irreducible noise in the observations). The total predictive variance is $\text{Var}(f_*) + \sigma_n^2$. Confusing the two leads to misinterpretation of error bars.

Summary

  • A GP is a distribution over functions: $f \sim \mathcal{GP}(m, k)$
  • Posterior is closed-form Gaussian for a Gaussian likelihood
  • Posterior mean $\mathbf{k}_*^\top(K + \sigma_n^2 I)^{-1}\mathbf{y}$ equals the kernel ridge regression solution
  • Posterior variance gives calibrated uncertainty (wide far from data, narrow near data)
  • Marginal likelihood provides automatic hyperparameter selection with a built-in Occam's razor
  • Main limitation: $O(n^3)$ exact inference cost

Exercises

ExerciseCore

Problem

You have a GP with zero mean and squared exponential kernel with $\ell = 1$, $\sigma_f^2 = 1$, and $\sigma_n^2 = 0.1$. You observe a single point $(x_1, y_1) = (0, 1)$. What are the posterior mean and variance at $x_* = 0$? At $x_* = 10$?

ExerciseAdvanced

Problem

Prove that the GP posterior variance $\text{Var}(f_*) = k_{**} - \mathbf{k}_*^\top(K + \sigma_n^2 I)^{-1}\mathbf{k}_*$ is always non-negative. Why is this not obvious from the formula?

ExerciseResearch

Problem

The marginal likelihood $\log p(\mathbf{y} \mid X)$ balances data fit and model complexity. Show that as $\sigma_n^2 \to 0$ (no noise), the data-fit term diverges to $-\infty$ for data not exactly on a sample path of the GP, while the complexity term remains finite. What does this say about GP regression with zero noise?

References

Canonical:

  • Rasmussen & Williams, Gaussian Processes for Machine Learning (2006)
  • Williams & Rasmussen, "Gaussian processes for regression" (NeurIPS 1996)

Current:

  • Hensman, Fusi & Lawrence, "Gaussian processes for big data" (UAI 2013)
  • Wilson & Nickisch, "Kernel interpolation for scalable structured GPs" (ICML 2015)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)
  • Bishop, Pattern Recognition and Machine Learning (2006)

Last reviewed: April 2026
