Beta. Content is under active construction and has not been peer-reviewed.


Gaussian Process Regression

Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.


Why This Matters

Gaussian process regression gives you two things that most regression methods do not: a posterior mean (point prediction) and a posterior variance (uncertainty estimate) at every test point, both in closed form. The uncertainty is calibrated: it is large where training data is sparse and small where data is dense. This is a Bayesian approach: place a prior over functions, condition on data, and read off the posterior.

This built-in uncertainty quantification makes GPs the standard surrogate model in Bayesian optimization, where you need to decide where to evaluate an expensive function next. The acquisition function depends on both the predicted value and the uncertainty, and GPs provide both.

Formal Setup

Assume we observe $n$ training points $(X, \mathbf{y})$ where $X = [x_1, \ldots, x_n]^\top$ and $\mathbf{y} = [y_1, \ldots, y_n]^\top$, with $y_i = f(x_i) + \varepsilon_i$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$.

Place a GP prior on $f$: $f \sim \mathcal{GP}(0, k(x, x'))$, where $k$ is the kernel (covariance function).

Definition

Kernel Matrix

The kernel matrix (Gram matrix) is $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$. For a test point $x_*$, define the vector $\mathbf{k}_* = [k(x_*, x_1), \ldots, k(x_*, x_n)]^\top \in \mathbb{R}^n$ and the scalar $k_{**} = k(x_*, x_*)$.

Main Theorems

Theorem

GP Posterior Predictive Distribution

Statement

The posterior predictive distribution at a test point $x_*$ is Gaussian:

$$f_* \mid X, \mathbf{y}, x_* \sim \mathcal{N}(\bar{f}_*, \operatorname{var}(f_*))$$

with:

$$\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$

$$\operatorname{var}(f_*) = k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$

Intuition

The posterior mean is a weighted combination of training labels, where the weights come from solving a linear system involving the kernel matrix. Points $x_i$ close to $x_*$ (high $k(x_*, x_i)$) contribute more to the prediction. The posterior variance starts at the prior variance $k_{**}$ and is reduced by the information from training data. Near training points, the variance is small. Far from training points, the variance returns to the prior level.

Proof Sketch

The joint distribution of $[\mathbf{y}, f_*]$ is Gaussian:

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix}\right)$$

Condition $f_*$ on $\mathbf{y}$ using the standard Gaussian conditioning formulas: $\mu_{f_*|\mathbf{y}} = \Sigma_{f_*,\mathbf{y}} \Sigma_{\mathbf{y},\mathbf{y}}^{-1} \mathbf{y}$ and $\Sigma_{f_*|\mathbf{y}} = \Sigma_{f_*,f_*} - \Sigma_{f_*,\mathbf{y}} \Sigma_{\mathbf{y},\mathbf{y}}^{-1} \Sigma_{\mathbf{y},f_*}$.

Why It Matters

The posterior is available in closed form with a single matrix inversion. Unlike neural networks, GPs give calibrated uncertainty estimates without ensembles, dropout, or other approximations. The posterior mean is also the solution to kernel ridge regression, connecting Bayesian and frequentist perspectives.

Failure Mode

Computing $(K + \sigma_n^2 I)^{-1}$ costs $O(n^3)$ time and $O(n^2)$ memory. For $n > 10{,}000$, exact GP regression is infeasible on standard hardware. This cubic scaling is the main practical limitation.
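The closed-form update is short enough to implement directly. A minimal NumPy sketch (the function names are illustrative, not a library API; it assumes an RBF kernel and uses a Cholesky solve rather than an explicit inverse, which is cheaper and numerically stabler):

```python
import numpy as np

def rbf(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xs, ell=1.0, sf2=1.0, sn2=0.1):
    """Closed-form GP posterior mean and variance at test points Xs."""
    K = rbf(X, X, ell, sf2)                              # n x n kernel matrix
    Ks = rbf(X, Xs, ell, sf2)                            # n x m cross-covariances k_*
    L = np.linalg.cholesky(K + sn2 * np.eye(len(X)))     # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sn2 I)^{-1} y
    mean = Ks.T @ alpha                                  # k_*^T (K + sn2 I)^{-1} y
    v = np.linalg.solve(L, Ks)
    # For a stationary kernel, k_** = sf2 at every test point:
    var = sf2 - (v * v).sum(axis=0)                      # k_** - k_*^T (K + sn2 I)^{-1} k_*
    return mean, var
```

One factorization serves both the mean and the variance, which is why implementations cache the Cholesky factor rather than ever forming the inverse.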

Kernel Choice

The kernel encodes prior assumptions about the function ff:

  • Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp(-\|x - x'\|^2 / (2\ell^2))$. Assumes $f$ is infinitely differentiable. The length scale $\ell$ controls smoothness.
  • Matern-$\nu$: $k(x, x') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x-x'\|}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,\|x-x'\|}{\ell}\right)$. Controls differentiability via $\nu$: $\nu = 1/2$ gives Ornstein-Uhlenbeck (rough), $\nu = 3/2$ gives once-differentiable sample paths, and $\nu \to \infty$ recovers the RBF.
  • Periodic: for functions with known periodicity.
  • Sums and products of kernels are valid kernels, enabling compositional modeling.
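For one-dimensional inputs these kernels are a few lines each. A NumPy sketch (helper names are ours; the periodic kernel is the standard exp-sine-squared form):

```python
import numpy as np

def sqdist(a, b):
    """Pairwise squared distances between 1-D input arrays."""
    return (a[:, None] - b[None, :]) ** 2

def k_rbf(a, b, ell=1.0, sf2=1.0):
    return sf2 * np.exp(-0.5 * sqdist(a, b) / ell**2)

def k_matern32(a, b, ell=1.0, sf2=1.0):
    s = np.sqrt(3.0) * np.sqrt(sqdist(a, b)) / ell
    return sf2 * (1.0 + s) * np.exp(-s)       # once-differentiable sample paths

def k_periodic(a, b, ell=1.0, period=1.0, sf2=1.0):
    r = np.abs(a[:, None] - b[None, :])
    return sf2 * np.exp(-2.0 * np.sin(np.pi * r / period) ** 2 / ell**2)

# Sums and products of kernels are themselves valid kernels:
k_composite = lambda a, b: k_rbf(a, b, ell=2.0) + k_periodic(a, b, period=0.5)
```

A quick sanity check for any candidate kernel is symmetry plus positive semi-definiteness of the Gram matrix on a grid of points.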

Marginal Likelihood for Hyperparameter Selection

Proposition

GP Log Marginal Likelihood

Statement

The log marginal likelihood (evidence) is:

$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \mathbf{y}^\top (K_\theta + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K_\theta + \sigma_n^2 I| - \frac{n}{2} \log 2\pi$$

where $\theta$ denotes the kernel hyperparameters (length scale $\ell$, signal variance $\sigma_f^2$, etc.).

Intuition

The three terms have clear interpretations. The first is a data-fit term: how well the model explains the observations. The second is a complexity penalty: models whose covariance matrix has a larger determinant (more flexible kernels) are penalized. The third is a normalization constant. Maximizing the marginal likelihood automatically balances fit against complexity, implementing Occam's razor without explicit regularization.

Proof Sketch

Since $\mathbf{y} \sim \mathcal{N}(0, K_\theta + \sigma_n^2 I)$, the log marginal likelihood is the standard multivariate Gaussian log-density evaluated at $\mathbf{y}$.

Why It Matters

The marginal likelihood provides a principled way to select kernel hyperparameters without cross-validation. Gradient-based optimization of the log marginal likelihood (type II maximum likelihood) is the standard approach. Unlike cross-validation, it uses all data for both fitting and selection.

Failure Mode

The marginal likelihood is non-convex in the hyperparameters and can have multiple local optima, especially with compositional kernels. Different initializations can lead to different solutions. The marginal likelihood can also overfit the hyperparameters when $n$ is small relative to the number of hyperparameters.
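In code, the evidence comes from a single Cholesky factorization, which supplies both the solve and the log-determinant. A minimal sketch (our own helper, not a library API):

```python
import numpy as np

def log_marginal_likelihood(K, y, sn2):
    """GP log evidence. The Cholesky factor L gives both the linear solve
    and the determinant: log|K + sn2 I| = 2 * sum(log diag(L))."""
    n = len(y)
    L = np.linalg.cholesky(K + sn2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                  # data-fit term
            - np.log(np.diag(L)).sum()        # complexity penalty
            - 0.5 * n * np.log(2 * np.pi))    # normalization constant
```

In practice one maximizes this with a gradient-based optimizer (e.g. L-BFGS) over $\log \ell$, $\log \sigma_f$, and $\log \sigma_n$, so the positivity constraints hold automatically.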

Sparse Gaussian Processes

Exact GP regression costs $O(n^3)$. Sparse GPs reduce this by selecting $m \ll n$ inducing points $Z = \{z_1, \ldots, z_m\}$ and approximating the full GP through these points.

The variational free energy (VFE, Titsias 2009) approach optimizes the inducing point locations and a variational distribution $q(\mathbf{u})$ over the function values at the inducing points. The resulting cost is $O(nm^2)$ for training and $O(m^2)$ for prediction.

The FITC (fully independent training conditional) approximation assumes conditional independence of training points given the inducing points. This gives a diagonal correction to the inducing point approximation.

For $m = 500$ inducing points, sparse GPs can handle $n = 100{,}000$ training points. The choice of inducing point locations matters: placing them in regions of high data density gives better approximations.
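To make the cost saving concrete, here is a sketch of the predictive mean shared by the standard inducing-point approximations (SoR/DTC, and also the VFE posterior mean), for 1-D inputs with an RBF kernel; variable names and the jitter constant are our choices. The only dense solves are $m \times m$, so the dominant cost is the $O(nm^2)$ product $K_{uf} K_{uf}^\top$:

```python
import numpy as np

def k_rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def sparse_gp_mean(X, y, Z, Xs, sn2=0.1, ell=1.0, jitter=1e-8):
    """Inducing-point predictive mean: O(n m^2) instead of O(n^3)."""
    Kuu = k_rbf(Z, Z, ell) + jitter * np.eye(len(Z))   # m x m, jitter for stability
    Kuf = k_rbf(Z, X, ell)                             # m x n
    Kus = k_rbf(Z, Xs, ell)                            # m x n_test
    Sigma = Kuu + Kuf @ Kuf.T / sn2                    # m x m system to solve
    return Kus.T @ np.linalg.solve(Sigma, Kuf @ y) / sn2
```

A useful sanity check: with $Z = X$ (inducing points at every training input) the approximation collapses back to the exact GP posterior mean.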

Connection to Bayesian Optimization

Bayesian optimization uses a GP as a surrogate for an expensive black-box function $f$. The GP posterior mean and variance define an acquisition function (e.g., expected improvement, UCB) that balances exploitation (evaluating where the predicted value is good, e.g. low when minimizing) and exploration (evaluating where the variance is high).

The GP's calibrated uncertainty is critical here. If the uncertainty is underestimated, the optimizer exploits too early. If overestimated, it explores too much. GP uncertainty, with kernel hyperparameters fit via the marginal likelihood, is typically well-calibrated for smooth functions.
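As an example, expected improvement for minimization uses only the posterior mean and variance. A sketch (SciPy's normal distribution supplies $\Phi$ and $\varphi$; the `xi` exploration margin is a common optional extra, not part of the basic definition):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, var, f_best, xi=0.0):
    """EI for minimization at candidate points, given the GP posterior
    mean and variance there and the best observed value f_best."""
    std = np.sqrt(np.maximum(var, 1e-12))     # guard against zero variance
    z = (f_best - mean - xi) / std
    return (f_best - mean - xi) * norm.cdf(z) + std * norm.pdf(z)
```

EI is high both where the mean is low (exploitation) and where the variance is high (exploration); the next evaluation goes to its argmax over candidates.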

Common Confusions

Watch Out

GP uncertainty is conditional on the kernel being correct

The posterior variance is exact given the GP prior. If the true function is not well-modeled by the chosen kernel (e.g., using an RBF kernel for a discontinuous function), the uncertainty estimates are miscalibrated. The GP does not "know" when its kernel assumption is wrong.

Watch Out

The posterior mean equals kernel ridge regression, but the variance does not

The GP posterior mean $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ is identical to the kernel ridge regression prediction with regularization $\lambda = \sigma_n^2$. But kernel ridge regression does not produce the posterior variance. The uncertainty quantification is unique to the Bayesian (GP) formulation.

Canonical Examples

Example

GP regression on a 1D function

Sample 20 points from $f(x) = \sin(3x)$ on $[0, 2\pi]$ with noise $\sigma_n = 0.2$. Fit a GP with an RBF kernel, optimizing $\ell$ and $\sigma_f$ via the marginal likelihood. The posterior mean tracks the true sine function. The 95% confidence band ($\pm 2$ standard deviations) is narrow near training points and widens in gaps between points. Outside the training range ($x < 0$ or $x > 2\pi$), the posterior reverts to the prior mean (zero) with the prior variance, correctly reflecting that the GP has no information there.
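A runnable version of this example, with the marginal-likelihood optimization step replaced by fixed hyperparameter values for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
Xtr = rng.uniform(0, 2 * np.pi, 20)
ytr = np.sin(3 * Xtr) + 0.2 * rng.normal(size=20)

# Fixed hyperparameters for illustration (the text optimizes these
# via the marginal likelihood; that step is omitted here).
ell, sf2, sn2 = 0.3, 1.0, 0.2**2
k = lambda a, b: sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

Xs = np.array([Xtr[0], 10.0])         # one point on the data, one far outside
A = k(Xtr, Xtr) + sn2 * np.eye(20)
Ks = k(Xtr, Xs)
mean = Ks.T @ np.linalg.solve(A, ytr)
var = sf2 - np.einsum('ij,ij->j', Ks, np.linalg.solve(A, Ks))

# Narrow band at a training input, prior-width band far away:
print(var[0] < 0.05, abs(var[1] - sf2) < 1e-6)   # → True True
```

At $x = 10$ every cross-covariance is numerically zero, so the mean reverts to the prior mean (zero) and the variance to the prior variance $\sigma_f^2$, exactly as described above.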

Key Takeaways

  • GP regression gives a closed-form Gaussian posterior: $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$
  • Posterior variance is large where data is sparse and small near training points
  • Kernel choice encodes smoothness and structure assumptions about $f$
  • Marginal likelihood selects hyperparameters without cross-validation via an automatic Occam's razor
  • Exact GPs cost $O(n^3)$; sparse GPs with $m$ inducing points cost $O(nm^2)$
  • GPs are the default surrogate model for Bayesian optimization

Exercises

ExerciseCore

Problem

For a GP with an RBF kernel ($\ell = 1$, $\sigma_f^2 = 1$) and noise $\sigma_n^2 = 0.1$, you observe one training point: $x_1 = 0$, $y_1 = 1$. Compute the posterior mean and variance at $x_* = 0$ and at $x_* = 3$.

ExerciseAdvanced

Problem

Show that the GP marginal likelihood penalizes both underfitting and overfitting. Specifically, consider the RBF kernel with length scale $\ell$. Explain what happens to the data-fit term and the complexity term as $\ell \to 0$ and as $\ell \to \infty$.

Connection to Kriging and RBF Interpolation

GP regression is mathematically identical to kriging, the geostatistical interpolation method developed by Matheron (1963) building on Krige's mining work.

In kriging, the covariance structure is usually specified through the variogram (related to the covariance function by $\gamma(h) = C(0) - C(h)$), the posterior mean is the kriging predictor, and the posterior variance is the kriging variance. The equations are the same: $\hat{f}(x_*) = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$. The geostatistical and machine learning communities developed the same theory independently.

Radial basis function (RBF) interpolation is the noise-free special case: set $\sigma_n^2 = 0$, and the GP posterior mean becomes $\hat{f}(x_*) = \mathbf{k}_*^\top K^{-1} \mathbf{y}$, which passes exactly through every training point. This is classical RBF interpolation with the kernel $k(x, x')$ serving as the radial basis function. Common choices include the Gaussian ($\exp(-r^2/2\ell^2)$), the multiquadric ($\sqrt{1 + r^2/\ell^2}$), and the Matern kernels.
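The noise-free case is easy to verify numerically. A sketch with a Gaussian RBF (the data values here are arbitrary illustrative numbers):

```python
import numpy as np

# Noise-free GP mean = RBF interpolant: it reproduces the data exactly.
X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([1.0, -0.5, 0.3, 2.0])
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)  # Gaussian RBF, ell = 1

weights = np.linalg.solve(k(X, X), y)     # K^{-1} y, since sigma_n^2 = 0
interp = lambda xs: k(X, xs).T @ weights

print(np.allclose(interp(X), y))          # passes through every training point → True
```

With $\sigma_n^2 = 0$ the kernel matrix can become badly conditioned when training points are close together; the noise term in the GP formulation doubles as regularization.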

The GP perspective adds two things that classical RBF interpolation lacks: principled uncertainty quantification (the posterior variance) and hyperparameter selection via marginal likelihood.

References

Canonical:

Kriging and RBF:

  • Matheron, "Principles of Geostatistics" (Economic Geology, 1963). Founding paper of kriging theory.
  • Cressie, Statistics for Spatial Data (Rev. ed., 1993), Chapters 3-4
  • Buhmann, Radial Basis Functions: Theory and Implementations (2003), Chapters 1-3
  • Wendland, Scattered Data Approximation (2005). Native spaces and convergence theory for RBF interpolation.

Current:

  • Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes" (AISTATS 2009)
  • Hensman, Fusi, Lawrence, "Gaussian Processes for Big Data" (UAI 2013)

Next Topics

Last reviewed: April 2026
