Modern Generalization
Gaussian Processes for Machine Learning
A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.
Why This Matters
A Gaussian process is a distribution over functions. Instead of fitting a single function to data (as in regression or neural networks), a GP maintains a distribution over all functions consistent with the data. This gives you two things most ML models do not: a prediction and a calibrated measure of how uncertain that prediction is.
GPs are the gold standard for uncertainty quantification in regression. They are widely used in Bayesian optimization (where you need to balance exploration and exploitation), in scientific modeling (where error bars matter), and as a theoretical bridge between kernel methods and neural networks via the neural tangent kernel.
The connection to kernel methods is deep: the GP posterior mean is exactly the kernel ridge regression solution. This means GPs give a Bayesian interpretation to kernel methods and add uncertainty on top.
Mental Model
Imagine the space of all smooth functions from $\mathbb{R}^d$ to $\mathbb{R}$. A GP defines a probability distribution over this space. Before seeing any data, you have a prior: many functions are plausible. After observing data points $(x_1, y_1), \dots, (x_n, y_n)$, the posterior concentrates on functions that pass near the data. At points far from any observation, the posterior is wide (high uncertainty). Near observations, it is narrow (low uncertainty).
Formal Setup and Notation
Gaussian Process
A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by:
- A mean function $m(x) = \mathbb{E}[f(x)]$
- A covariance function (kernel) $k(x, x') = \mathrm{Cov}(f(x), f(x'))$
We write $f \sim \mathcal{GP}(m, k)$. For any finite set of inputs $x_1, \dots, x_n$:
$$(f(x_1), \dots, f(x_n)) \sim \mathcal{N}(\boldsymbol{\mu}, K)$$
where $\mu_i = m(x_i)$ and $K_{ij} = k(x_i, x_j)$.
Common choices for the kernel:
- Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Produces infinitely smooth functions. Length scale $\ell$ controls how quickly correlations decay.
- Matern kernel: a family with a smoothness parameter $\nu$. At $\nu = 1/2$ it gives the Ornstein-Uhlenbeck process; as $\nu \to \infty$ it recovers the squared exponential.
- Linear kernel: $k(x, x') = x^\top x'$. Gives Bayesian linear regression.
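As a concrete sketch (not tied to any particular library), the three kernels above can be written in a few lines of NumPy for 1D inputs; the parameter names `lengthscale` and `signal_var` are illustrative:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """Squared exponential kernel: infinitely smooth sample paths."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sqdist / lengthscale**2)

def matern12_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """Matern kernel with nu = 1/2 (Ornstein-Uhlenbeck): rough sample paths."""
    dist = np.abs(x1[:, None] - x2[None, :])
    return signal_var * np.exp(-dist / lengthscale)

def linear_kernel(x1, x2):
    """Linear kernel: sample paths are straight lines through the origin."""
    return np.outer(x1, x2)
```

Each function returns the Gram matrix $K_{ij} = k(x_i, x_j)$ for two batches of scalar inputs; shrinking the length scale makes correlations decay faster and sample paths wiggle more.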
Core Definitions
Observation Model
We observe noisy function values (taking $m \equiv 0$ for simplicity):
$$y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
Given observations $\mathbf{y} = (y_1, \dots, y_n)^\top$ at inputs $x_1, \dots, x_n$, the joint distribution of the observed values and the function value at a new test point $x_*$ is:
$$\begin{pmatrix} \mathbf{y} \\ f(x_*) \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k(x_*, x_*) \end{pmatrix}\right)$$
where $(\mathbf{k}_*)_i = k(x_i, x_*)$ and $K_{ij} = k(x_i, x_j)$.
Main Theorems
GP Posterior Predictive Distribution
Statement
Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ and a test input $x_*$, the posterior predictive distribution is Gaussian:
$$f(x_*) \mid \mathcal{D} \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
with:
$$\mu_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$
The posterior mean $\mu_*$ is the best prediction. The posterior variance $\sigma_*^2$ quantifies uncertainty at $x_*$.
Intuition
The posterior mean is a weighted combination of the observed values, where the weights come from the kernel similarities to the test point, adjusted by the training-data correlations. The posterior variance starts at the prior variance $k(x_*, x_*)$ and is reduced by the information provided by nearby training points. Far from any training point, $\sigma_*^2 \approx k(x_*, x_*)$ (back to the prior). Near training points, the variance shrinks.
Proof Sketch
This follows directly from the formula for conditioning in a multivariate Gaussian. If $\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}\right)$ with block structure, then $\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left(\boldsymbol{\mu}_b + \Sigma_{ba}\Sigma_{aa}^{-1}(\mathbf{a} - \boldsymbol{\mu}_a),\; \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}\right)$. Apply this with $\mathbf{a} = \mathbf{y}$ and $\mathbf{b} = f(x_*)$.
Why It Matters
This is one of the few cases in machine learning where you get a closed-form posterior distribution over predictions. No MCMC, no variational inference, no approximations (for Gaussian likelihood). The uncertainty estimates are exact and calibrated under the model assumptions.
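The closed-form formulas above translate directly into code. A minimal NumPy sketch, assuming a 1D RBF kernel with unit signal variance and using a Cholesky factorization instead of an explicit matrix inverse (names are illustrative):

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    """Closed-form GP posterior mean and variance at the test inputs."""
    n = len(X_train)
    K = rbf(X_train, X_train) + noise_var * np.eye(n)   # K + sigma_n^2 I
    k_star = rbf(X_train, X_test)                       # cross-covariances
    L = np.linalg.cholesky(K)                           # stable alternative to inv
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = k_star.T @ alpha                  # k_*^T (K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, k_star)
    var = 1.0 - np.sum(v**2, axis=0)         # prior variance k(x*, x*) = 1 here
    return mean, var
```

The identity $\mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_* = \|L^{-1}\mathbf{k}_*\|^2$ (with $K + \sigma_n^2 I = LL^\top$) is what lets the variance be computed from the triangular solve.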
Failure Mode
The posterior is exact only for Gaussian likelihoods. For classification (Bernoulli likelihood) or robust regression (heavy-tailed noise), the posterior is no longer Gaussian and you need approximations (Laplace, EP, or variational methods). Also, the uncertainty is calibrated under the model, which may not match reality if the kernel is misspecified.
GP Posterior Mean Equals Kernel Ridge Regression
Statement
The GP posterior mean function is identical to the kernel ridge regression solution:
$$\mu_*(\cdot) = \hat{f}_{\mathrm{KRR}}(\cdot) = \operatorname*{arg\,min}_{f \in \mathcal{H}_k} \; \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$
with regularization parameter $\lambda = \sigma_n^2$ and RKHS norm $\|\cdot\|_{\mathcal{H}_k}$ induced by the kernel $k$.
Intuition
The GP and kernel ridge regression give the same point predictions. The GP adds uncertainty quantification on top. This means every time you do kernel ridge regression, you are implicitly computing the MAP estimate of a GP model. The GP perspective just gives you error bars for free.
Proof Sketch
By the representer theorem, the kernel ridge regression solution has the form $\hat{f}(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$. Setting the gradient of the regularized objective to zero gives $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$. With $\lambda = \sigma_n^2$, this is $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1} \mathbf{y}$, matching the GP posterior mean weights.
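The equivalence is easy to check numerically. This sketch (illustrative data and an RBF kernel with length scale 0.5) recovers the KRR coefficients from the stationarity condition of the regularized objective and compares them with the GP posterior-mean weights:

```python
import numpy as np

X = np.linspace(0.0, 4.0, 5)
y = np.sin(X)
noise_var = 0.05                       # plays the role of lambda

def rbf(a, b, ls=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

K = rbf(X, X)

# KRR: minimize ||y - K a||^2 + lam * a^T K a.
# Stationarity: K (K + lam I) a = K y, i.e. (K^T K + lam K) a = K^T y.
a_krr = np.linalg.solve(K.T @ K + noise_var * K, K.T @ y)

# GP posterior mean weights: alpha = (K + sigma_n^2 I)^{-1} y.
a_gp = np.linalg.solve(K + noise_var * np.eye(len(X)), y)

assert np.allclose(a_krr, a_gp, atol=1e-8)   # identical coefficient vectors
```

The two solves take different routes (one through the optimization objective, one through Gaussian conditioning) but land on the same coefficients, so the predictions agree at every test point.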
Why It Matters
This connection is one of the most important bridges in ML theory. It unifies the frequentist (regularization) and Bayesian (GP prior) perspectives. It also means that theoretical results for kernel methods (like generalization bounds) apply to GPs, and the Bayesian uncertainty from GPs can be layered on top of kernel ridge regression predictions.
Failure Mode
The equivalence holds for the posterior mean only. Kernel ridge regression gives no uncertainty estimates. If you need error bars, you need the full GP posterior, not just the mean.
Marginal Likelihood for Hyperparameter Selection
The kernel has hyperparameters $\theta$ (length scale $\ell$, signal variance $\sigma_f^2$, noise variance $\sigma_n^2$). The GP framework provides a principled way to set them: maximize the marginal likelihood (also called the evidence):
$$\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2} \mathbf{y}^\top (K_\theta + \sigma_n^2 I)^{-1} \mathbf{y} - \tfrac{1}{2} \log \det(K_\theta + \sigma_n^2 I) - \tfrac{n}{2} \log 2\pi$$
The first term penalizes data misfit. The second term penalizes model complexity (it is the log-determinant of the covariance matrix, which grows with the number of effective parameters). This automatic Occam's razor is one of the most appealing features of GPs.
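As a sketch, the log marginal likelihood for a zero-mean GP with an RBF kernel can be evaluated term by term; the Cholesky factor gives the log-determinant cheaply. The grid of candidate length scales and the data below are illustrative:

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    """log p(y | X, theta): data fit + complexity penalty + constant."""
    n = len(X)
    sqdist = (X[:, None] - X[None, :]) ** 2
    K = signal_var * np.exp(-0.5 * sqdist / lengthscale**2) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                  # -1/2 y^T K^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))     # -1/2 log det K
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

# Hyperparameter selection by grid search over the length scale:
X = np.linspace(0, 6, 30)
y = np.sin(X)
grid = [0.1, 0.3, 1.0, 3.0, 10.0]
best = max(grid, key=lambda ls: log_marginal_likelihood(X, y, ls, 1.0, 0.01))
```

In practice the marginal likelihood is maximized by gradient ascent on $\theta$ rather than grid search, but the trade-off is the same: a tiny length scale improves the data-fit term while inflating the complexity term.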
Computational Cost
The bottleneck is inverting the matrix :
- Exact inference: $O(n^3)$ time and $O(n^2)$ memory. This limits exact GPs to roughly $n \approx 10^4$ on modern hardware.
- Sparse approximations: inducing point methods reduce cost to $O(nm^2)$ where $m \ll n$ is the number of inducing points.
- Structured kernels: if the kernel has special structure (e.g., stationary kernel on a grid), Toeplitz or Kronecker structure can be exploited for inference.
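A sketch of the inducing-point idea, using a plain Nystrom factorization of the Gram matrix (the inducing inputs `Z` and the jitter value are illustrative): only an $n \times m$ block and an $m \times m$ block are ever formed, so the dominant solve costs $O(nm^2)$ instead of $O(n^3)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 30
X = np.sort(rng.uniform(0, 10, n))       # n training inputs
Z = np.linspace(0, 10, m)                # m << n inducing inputs

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

K_nz = rbf(X, Z)                             # n x m, O(nm) storage
K_zz = rbf(Z, Z) + 1e-6 * np.eye(m)          # m x m, with jitter for stability
# Nystrom approximation: K ~= K_nz K_zz^{-1} K_zn.
W = np.linalg.solve(K_zz, K_nz.T)            # m x n triangularizable solve
approx_diag = np.sum(K_nz.T * W, axis=0)     # diagonal of the approximate Gram
```

When the inducing inputs cover the data region densely relative to the length scale, the low-rank approximation reproduces the full kernel matrix closely; the quality degrades as $m$ shrinks.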
Canonical Examples
GP regression on 1D data
Observe 10 noisy points from $y = \sin(x) + \varepsilon$ on $[0, 2\pi]$. Using a squared exponential kernel, the GP posterior mean closely tracks the sine curve where data is dense and reverts to the prior mean (zero) where data is sparse. The 95% credible interval ($\mu_* \pm 1.96\,\sigma_*$) is narrow near observations and wide in data-sparse regions. This is the textbook illustration of GP uncertainty quantification.
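The textbook picture can be reproduced in a few lines; this sketch assumes unit signal variance, length scale 1, and a small noise level (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 2 * np.pi, 10))        # 10 noisy observations
y = np.sin(X) + 0.1 * rng.standard_normal(10)
noise_var = 0.01

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X_grid = np.linspace(-3, 2 * np.pi + 3, 200)      # extends beyond the data
L = np.linalg.cholesky(rbf(X, X) + noise_var * np.eye(10))
k_star = rbf(X, X_grid)
mean = k_star.T @ np.linalg.solve(L.T, np.linalg.solve(L, y))
v = np.linalg.solve(L, k_star)
std = np.sqrt(1.0 - np.sum(v**2, axis=0))
lower, upper = mean - 1.96 * std, mean + 1.96 * std   # 95% credible band
```

Plotting `mean` with the band `[lower, upper]` over `X_grid` gives the classic figure: a tight band threading the observations, ballooning back to the prior at the edges.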
Common Confusions
GPs are nonparametric, but they have hyperparameters
A GP is nonparametric in the sense that the number of effective parameters grows with the data (the predictive function depends on all training points). But the kernel has a fixed, small number of hyperparameters. These control the shape of the prior over functions (smoothness, length scale, amplitude), not individual function values. Tuning them via marginal likelihood is model selection, not parameter estimation.
GP uncertainty is model uncertainty, not aleatoric uncertainty
The posterior variance $\sigma_*^2$ captures epistemic uncertainty (what we do not know about $f$ given limited data). The noise variance $\sigma_n^2$ captures aleatoric uncertainty (irreducible noise in the observations). The total predictive variance for a new observation is $\sigma_*^2 + \sigma_n^2$. Confusing the two leads to misinterpretation of error bars.
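A tiny numerical illustration of the decomposition, using a single hypothetical observation at $x = 0$ under an RBF kernel with unit signal variance (all values illustrative):

```python
import numpy as np

noise_var = 0.1                              # aleatoric: fixed observation noise
x_train, y_train = 0.0, 1.0                  # hypothetical single observation

def k(a, b):
    return np.exp(-0.5 * (a - b) ** 2)       # RBF, unit signal variance

def predictive_variances(x_star):
    k_s = k(x_train, x_star)
    # Epistemic: posterior variance of f(x*) given the one observation.
    epistemic = k(x_star, x_star) - k_s**2 / (k(x_train, x_train) + noise_var)
    # Total: variance of a new noisy observation y* at x*.
    total = epistemic + noise_var
    return epistemic, total

ep_near, tot_near = predictive_variances(0.0)   # at the observed input
ep_far, tot_far = predictive_variances(5.0)     # far from the data
```

Epistemic uncertainty shrinks near the observation and returns to the prior variance far away, while the aleatoric floor `noise_var` is added everywhere and never shrinks.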
Summary
- A GP is a distribution over functions: $f \sim \mathcal{GP}(m, k)$
- Posterior is closed-form Gaussian for Gaussian likelihood
- Posterior mean equals kernel ridge regression
- Posterior variance gives calibrated uncertainty (wide far from data, narrow near data)
- Marginal likelihood provides automatic hyperparameter selection with built-in Occam's razor
- Main limitation: $O(n^3)$ exact inference cost
Exercises
Problem
You have a GP with zero mean and a squared exponential kernel with length scale $\ell$, signal variance $\sigma_f^2$, and noise variance $\sigma_n^2$. You observe a single point $(x_1, y_1)$. What are the posterior mean and variance at $x_* = x_1$? At a test point many length scales away from $x_1$?
Problem
Prove that the GP posterior variance $\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$ is always non-negative. Why is this not obvious from the formula?
Problem
The marginal likelihood balances data fit and model complexity. Show that as $\sigma_n^2 \to 0$ (no noise), the data-fit term diverges to $-\infty$ for data not exactly on a sample path of the GP, while the complexity term remains finite. What does this say about GP regression with zero noise?
References
Canonical:
- Rasmussen & Williams, Gaussian Processes for Machine Learning (2006)
- Williams & Rasmussen, "Gaussian processes for regression" (NeurIPS 1996)
Current:
- Hensman, Fusi, Lawrence, "Gaussian processes for big data" (UAI 2013)
- Wilson & Nickisch, "Kernel interpolation for scalable structured GPs" (ICML 2015)
- Murphy, Machine Learning: A Probabilistic Perspective (2012)
- Bishop, Pattern Recognition and Machine Learning (2006)
Next Topics
The natural next step from Gaussian processes:
- Neural tangent kernel: the infinite-width limit of neural networks is a GP
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)