

Gaussian Processes for Machine Learning

A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.



Why This Matters

A Gaussian process is a distribution over functions. Instead of fitting a single function to data (as in regression or neural networks), a GP maintains a distribution over all functions consistent with the data. This gives you two things most ML models do not: a prediction and a calibrated measure of how uncertain that prediction is.

GPs are the gold standard for uncertainty quantification in regression. They are widely used in Bayesian optimization (where you need to balance exploration and exploitation), in scientific modeling (where error bars matter), and as a theoretical bridge between kernel methods and neural networks via the neural tangent kernel.

The connection to kernel methods is deep: the GP posterior mean is exactly the kernel ridge regression solution. This means GPs give a Bayesian interpretation to kernel methods and add uncertainty on top.

Mental Model

Imagine the space of all smooth functions from $\mathbb{R}^d$ to $\mathbb{R}$. A GP defines a probability distribution over this space. Before seeing any data, you have a prior: many functions are plausible. After observing data points $(x_i, y_i)$, the posterior concentrates on functions that pass near the data. At points far from any observation, the posterior is wide (high uncertainty). Near observations, it is narrow (low uncertainty).

Formal Setup and Notation

Definition

Gaussian Process

A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by:

  • A mean function $m(x) = \mathbb{E}[f(x)]$
  • A covariance function (kernel) $k(x, x') = \text{Cov}(f(x), f(x'))$

We write $f \sim \mathcal{GP}(m, k)$. For any finite set of inputs $X = \{x_1, \ldots, x_n\}$:

$$\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top \sim \mathcal{N}(\mathbf{m}, K)$$

where $\mathbf{m}_i = m(x_i)$ and $K_{ij} = k(x_i, x_j)$.

Common choices for the kernel:

  • Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Produces infinitely smooth functions. The length scale $\ell$ controls how quickly correlations decay.
  • Matérn kernel: a family $k(x, x')$ indexed by a smoothness parameter $\nu$. At $\nu = 1/2$ it gives the Ornstein-Uhlenbeck process; as $\nu \to \infty$ it recovers the squared exponential.
  • Linear kernel: $k(x, x') = x^\top x'$. Gives Bayesian linear regression.
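As a concrete reference, the three kernels above can be sketched for scalar inputs in a few lines. This is a minimal illustration, not a library API; the function names are ours, and the Matérn case is shown only at $\nu = 1/2$, where it has a simple closed form.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, ell=1.0):
    # Squared exponential: infinitely smooth sample paths.
    sq = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq / (2 * ell**2))

def matern12_kernel(x1, x2, sigma_f=1.0, ell=1.0):
    # Matern nu = 1/2 (Ornstein-Uhlenbeck): rough, non-differentiable paths.
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma_f**2 * np.exp(-d / ell)

def linear_kernel(x1, x2):
    # Recovers Bayesian linear regression (no bias term).
    return x1[:, None] * x2[None, :]
```

Each function returns the full Gram matrix between two sets of inputs, which is the building block for everything that follows.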

Core Definitions

Definition

Observation Model

We observe noisy function values:

$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$

Given $n$ observations $\mathbf{y} = [y_1, \ldots, y_n]^\top$ at inputs $X = [x_1, \ldots, x_n]$, the joint distribution of the observed values and the function value at a new test point $x_*$ is:

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mathbf{m} \\ m_* \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix} \right)$$

where $\mathbf{k}_* = [k(x_1, x_*), \ldots, k(x_n, x_*)]^\top$ and $k_{**} = k(x_*, x_*)$.

Main Theorems

Theorem

GP Posterior Predictive Distribution

Statement

Given training data $(X, \mathbf{y})$ and a test input $x_*$, the posterior predictive distribution is Gaussian:

$$f_* \mid X, \mathbf{y}, x_* \sim \mathcal{N}(\bar{f}_*, \text{Var}(f_*))$$

with:

$$\bar{f}_* = m(x_*) + \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m})$$

$$\text{Var}(f_*) = k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$

The posterior mean $\bar{f}_*$ is the best prediction. The posterior variance $\text{Var}(f_*)$ quantifies uncertainty at $x_*$.

Intuition

The posterior mean is a weighted combination of the observed values, where the weights come from the kernel similarities to the test point, adjusted by the training data correlations. The posterior variance starts at the prior variance $k_{**}$ and is reduced by the information provided by nearby training points. Far from any training point, $\text{Var}(f_*) \approx k_{**}$ (back to the prior). Near training points, the variance shrinks.

Proof Sketch

This follows directly from the formula for conditioning in a multivariate Gaussian. If $[a, b]^\top \sim \mathcal{N}(\mu, \Sigma)$ with block structure, then $a \mid b \sim \mathcal{N}(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b), \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})$. Apply this with $a = f_*$ and $b = \mathbf{y}$.
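The conditioning formula can be sanity-checked numerically: the conditional covariance $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ is the Schur complement of $\Sigma_{bb}$, and a standard block-inverse identity says it equals the inverse of the $(a,a)$ block of the precision matrix $\Sigma^{-1}$. A small sketch with a random SPD covariance and our own choice of partition:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)          # random symmetric positive definite covariance

# Partition: a = first coordinate, b = remaining two.
S_aa, S_ab = Sigma[:1, :1], Sigma[:1, 1:]
S_ba, S_bb = Sigma[1:, :1], Sigma[1:, 1:]

# Conditional covariance from the Gaussian conditioning formula.
cond_cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)

# Block-inverse identity: equals the inverse of the (a,a) precision block.
Lambda = np.linalg.inv(Sigma)
assert np.allclose(cond_cov, np.linalg.inv(Lambda[:1, :1]))
```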

Why It Matters

This is one of the few cases in machine learning where you get a closed-form posterior distribution over predictions. No MCMC, no variational inference, no approximations (for Gaussian likelihood). The uncertainty estimates are exact and calibrated under the model assumptions.
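Because everything is closed form, a full GP regressor fits in a dozen lines. The sketch below assumes a zero prior mean and a squared exponential kernel with $\sigma_f = 1$, and solves the linear systems via a Cholesky factor for numerical stability; `gp_posterior` is our own helper name, not a library call.

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

def gp_posterior(X, y, X_test, sigma_n=0.1):
    """Posterior mean and variance of f_* under a zero-mean GP prior."""
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
    K_s = rbf(X, X_test)                          # k_* for each test point
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                          # k_*^T (K + s^2 I)^-1 y
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)              # k_** - k_*^T (K + s^2 I)^-1 k_*
    return mean, var
```

With a single observation $(0, 1)$ and small noise, the mean at $x_* = 0$ is pulled close to 1 and the variance collapses, while far from the data the posterior reverts to the prior (mean 0, variance 1).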

Failure Mode

The posterior is exact only for Gaussian likelihoods. For classification (Bernoulli likelihood) or robust regression (heavy-tailed noise), the posterior is no longer Gaussian and you need approximations (Laplace, EP, or variational methods). Also, the uncertainty is calibrated under the model, which may not match reality if the kernel is misspecified.

Proposition

GP Posterior Mean Equals Kernel Ridge Regression

Statement

The GP posterior mean function $\bar{f}(x) = \mathbf{k}(x)^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ (taking a zero prior mean) is identical to the kernel ridge regression solution:

$$\hat{f} = \arg\min_{f \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

with regularization parameter $\lambda = \sigma_n^2/n$ and the RKHS norm induced by the kernel $k$.

Intuition

The GP and kernel ridge regression give the same point predictions. The GP adds uncertainty quantification on top. This means every time you do kernel ridge regression, you are implicitly computing the MAP estimate of a GP model. The GP perspective just gives you error bars for free.

Proof Sketch

By the representer theorem, the kernel ridge regression solution has the form $\hat{f}(x) = \sum_i \alpha_i k(x, x_i)$. Setting the gradient of the regularized objective to zero gives $\alpha = (K + n\lambda I)^{-1} \mathbf{y}$. With $\lambda = \sigma_n^2/n$, this is $(K + \sigma_n^2 I)^{-1} \mathbf{y}$, matching the GP posterior mean weights.
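The equivalence is easy to verify numerically: the GP weight vector $(K + \sigma_n^2 I)^{-1}\mathbf{y}$ and the kernel ridge weights $(K + n\lambda I)^{-1}\mathbf{y}$ with $\lambda = \sigma_n^2/n$ agree to machine precision. A quick check on a toy problem of our own construction:

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=8)
y = np.sin(X) + 0.1 * rng.standard_normal(8)
n, sigma_n2 = len(X), 0.1

K = rbf(X, X)
alpha_gp = np.linalg.solve(K + sigma_n2 * np.eye(n), y)    # GP posterior mean weights
lam = sigma_n2 / n                                         # matching ridge parameter
alpha_krr = np.linalg.solve(K + n * lam * np.eye(n), y)    # kernel ridge weights
assert np.allclose(alpha_gp, alpha_krr)                    # identical point predictions
```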

Why It Matters

This connection is one of the most important bridges in ML theory. It unifies the frequentist (regularization) and Bayesian (GP prior) perspectives. It also means that theoretical results for kernel methods (such as generalization bounds) transfer to GPs, and conversely that the GP posterior endows kernel ridge regression with a principled notion of uncertainty.

Failure Mode

The equivalence holds for the posterior mean only. Kernel ridge regression gives no uncertainty estimates. If you need error bars, you need the full GP posterior, not just the mean.

Marginal Likelihood for Hyperparameter Selection

The kernel has hyperparameters (length scale $\ell$, signal variance $\sigma_f^2$, noise variance $\sigma_n^2$). The GP framework provides a principled way to set them: maximize the marginal likelihood (also called the evidence):

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log(2\pi)$$

The first term penalizes data misfit. The second term penalizes model complexity (it is the log-determinant of the covariance matrix, which grows with the number of effective parameters). This automatic Occam's razor is one of the most appealing features of GPs.
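In code, all three terms can be computed stably from one Cholesky factor, using $\log|K + \sigma_n^2 I| = 2\sum_i \log L_{ii}$. A sketch under the same zero-mean RBF assumptions as before (the helper name is ours):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

def log_marginal_likelihood(X, y, sigma_n=0.1, ell=1.0):
    n = len(X)
    K = rbf(X, X, ell) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                  # penalizes misfit to the data
    complexity = -np.sum(np.log(np.diag(L)))     # -0.5 log|K + sigma_n^2 I|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)
```

Hyperparameters are then chosen by maximizing this quantity over $(\ell, \sigma_f, \sigma_n)$, typically by gradient ascent or a coarse grid search.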

Computational Cost

The bottleneck is inverting the $n \times n$ matrix $K + \sigma_n^2 I$:

  • Exact inference: $O(n^3)$ time and $O(n^2)$ memory. This limits exact GPs to roughly $n \leq 10{,}000$ on modern hardware.
  • Sparse approximations: inducing-point methods reduce the cost to $O(nm^2)$, where $m \ll n$ is the number of inducing points.
  • Structured kernels: if the kernel has special structure (e.g., a stationary kernel on a grid), Toeplitz or Kronecker structure can be exploited for $O(n \log n)$ inference.

Canonical Examples

Example

GP regression on 1D data

Observe 10 noisy points from $f(x) = \sin(x)$ on $[0, 2\pi]$. Using a squared exponential kernel, the GP posterior mean closely tracks the sine curve where data is dense and reverts to the prior mean (zero) where data is sparse. The 95% credible interval (the error bars) is narrow near observations and wide in data-sparse regions. This is the textbook illustration of GP uncertainty quantification.
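This picture can be reproduced in a few lines (our own script; plotting omitted, and the noise level $\sigma_n = 0.1$ is an illustrative choice):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, size=10)        # 10 noisy observations of sin(x)
y = np.sin(X) + 0.1 * rng.standard_normal(10)
X_test = np.linspace(0, 2 * np.pi, 100)

K = rbf(X, X) + 0.01 * np.eye(10)             # K + sigma_n^2 I
K_s = rbf(X, X_test)
mean = K_s.T @ np.linalg.solve(K, y)          # posterior mean over the grid
var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
band = 1.96 * np.sqrt(np.maximum(var, 0))     # 95% credible interval half-width
lower, upper = mean - band, mean + band
```

Plotting `mean` with the `(lower, upper)` band against `np.sin(X_test)` shows the interval pinching near observations and opening up in the gaps.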

Common Confusions

Watch Out

GPs are nonparametric, but they have hyperparameters

A GP is nonparametric in the sense that the number of effective parameters grows with the data (the predictive function depends on all $n$ training points). But the kernel has a fixed, small number of hyperparameters. These control the shape of the prior over functions (smoothness, length scale, amplitude), not individual function values. Tuning them via marginal likelihood is model selection, not parameter estimation.

Watch Out

GP uncertainty is model uncertainty, not aleatoric uncertainty

The posterior variance $\text{Var}(f_*)$ captures epistemic uncertainty (what we do not know about $f$ given limited data). The noise variance $\sigma_n^2$ captures aleatoric uncertainty (irreducible noise in the observations). The total predictive variance is $\text{Var}(f_*) + \sigma_n^2$. Confusing the two leads to misinterpretation of error bars.

Summary

  • A GP is a distribution over functions: $f \sim \mathcal{GP}(m, k)$
  • Posterior is closed-form Gaussian for a Gaussian likelihood
  • Posterior mean $\mathbf{k}_*^\top(K + \sigma_n^2 I)^{-1}\mathbf{y}$ equals the kernel ridge regression solution
  • Posterior variance gives calibrated uncertainty (wide far from data, narrow near data)
  • Marginal likelihood provides automatic hyperparameter selection with a built-in Occam's razor
  • Main limitation: $O(n^3)$ exact inference cost

Exercises

ExerciseCore

Problem

You have a GP with zero mean and squared exponential kernel with $\ell = 1$, $\sigma_f^2 = 1$, and $\sigma_n^2 = 0.1$. You observe a single point $(x_1, y_1) = (0, 1)$. What are the posterior mean and variance at $x_* = 0$? At $x_* = 10$?

ExerciseAdvanced

Problem

Prove that the GP posterior variance $\text{Var}(f_*) = k_{**} - \mathbf{k}_*^\top(K + \sigma_n^2 I)^{-1}\mathbf{k}_*$ is always non-negative. Why is this not obvious from the formula?

ExerciseResearch

Problem

The marginal likelihood $\log p(\mathbf{y} \mid X)$ balances data fit and model complexity. Show that as $\sigma_n^2 \to 0$ (no noise), the data-fit term diverges to $-\infty$ for data not exactly on a sample path of the GP, while the complexity term remains finite. What does this say about GP regression with zero noise?

References

Canonical:

  • Rasmussen & Williams, Gaussian Processes for Machine Learning (2006)
  • Williams & Rasmussen, "Gaussian processes for regression" (NeurIPS 1996)

Current:

  • Hensman, Fusi & Lawrence, "Gaussian processes for big data" (UAI 2013)
  • Wilson & Nickisch, "Kernel interpolation for scalable structured GPs" (ICML 2015)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)
  • Bishop, Pattern Recognition and Machine Learning (2006)

Last reviewed: April 2026
