ML Methods
Gaussian Process Regression
Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.
Why This Matters
Gaussian process regression gives you two things that most regression methods do not: a posterior mean (point prediction) and a posterior variance (uncertainty estimate) at every test point, both in closed form. The uncertainty is calibrated: it is large where training data is sparse and small where data is dense. This is a Bayesian approach: place a prior over functions, condition on data, and read off the posterior.
This built-in uncertainty quantification makes GPs the standard surrogate model in Bayesian optimization, where you need to decide where to evaluate an expensive function next. The acquisition function depends on both the predicted value and the uncertainty, and GPs provide both.
Formal Setup
Assume we observe $n$ training points $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ and $f : \mathbb{R}^d \to \mathbb{R}$ unknown.
Place a GP prior on $f$: $f \sim \mathcal{GP}(0, k)$, where $k(x, x')$ is the kernel (covariance function).
Kernel Matrix
The kernel matrix (Gram matrix) is $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$. For a test point $x_*$, define the vector $k_* = [k(x_*, x_1), \dots, k(x_*, x_n)]^\top$ and scalar $k_{**} = k(x_*, x_*)$.
Main Theorems
GP Posterior Predictive Distribution
Statement
The posterior predictive distribution at a test point $x_*$ is Gaussian:
$$f_* \mid y \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
with:
$$\mu_* = k_*^\top (K + \sigma_n^2 I)^{-1} y, \qquad \sigma_*^2 = k_{**} - k_*^\top (K + \sigma_n^2 I)^{-1} k_*$$
Intuition
The posterior mean is a weighted combination of training labels, where the weights come from solving a linear system involving the kernel matrix. Points close to $x_*$ (high $k(x_*, x_i)$) contribute more to the prediction. The posterior variance starts at the prior variance $k_{**}$ and is reduced by the information from training data. Near training points, the variance is small. Far from training points, the variance returns to the prior level.
Proof Sketch
The joint distribution of $(y, f_*)$ is Gaussian:
$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} K + \sigma_n^2 I & k_* \\ k_*^\top & k_{**} \end{bmatrix}\right)$$
Condition on $y$ using the standard Gaussian conditioning formula: $\mu_* = k_*^\top (K + \sigma_n^2 I)^{-1} y$ and $\sigma_*^2 = k_{**} - k_*^\top (K + \sigma_n^2 I)^{-1} k_*$.
Why It Matters
The posterior is available in closed form with a single matrix inversion. Unlike neural networks, GPs give calibrated uncertainty estimates without ensembles, dropout, or other approximations. The posterior mean is also the solution to kernel ridge regression, connecting Bayesian and frequentist perspectives.
Failure Mode
Computing $(K + \sigma_n^2 I)^{-1}$ costs $O(n^3)$ time and $O(n^2)$ memory. For $n$ beyond a few tens of thousands of points, exact GP regression is infeasible on standard hardware. This cubic scaling is the main practical limitation.
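The closed-form update above can be sketched in a few lines of NumPy. This is a minimal illustration with an RBF kernel and made-up 1-D data; the function names and hyperparameter values are ours, not from any library.

```python
import numpy as np

def rbf(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a and b."""
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(x_train, y_train, x_test, ell=1.0, sf2=1.0, sn2=0.1):
    """Closed-form GP posterior mean and variance via a Cholesky solve."""
    K = rbf(x_train, x_train, ell, sf2) + sn2 * np.eye(len(x_train))
    Ks = rbf(x_train, x_test, ell, sf2)           # n x n_test cross-covariances
    Kss = rbf(x_test, x_test, ell, sf2)
    L = np.linalg.cholesky(K)                     # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = Ks.T @ alpha                             # posterior mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v * v, axis=0)    # posterior variance
    return mu, var

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(x)
mu, var = gp_posterior(x, y, np.array([0.0, 5.0]))
# var at x=0 (a training point) is small; at x=5, far from all data,
# it returns to near the prior level sf2.
```

Note the Cholesky factorization rather than an explicit inverse: it is the standard numerically stable way to solve the $(K + \sigma_n^2 I)^{-1} y$ system.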
Kernel Choice
The kernel $k$ encodes prior assumptions about the function $f$:
- Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Assumes $f$ is infinitely differentiable. The length scale $\ell$ controls smoothness.
- Matern-$\nu$: $k(x, x') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\!\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$ with $r = \|x - x'\|$, where $K_\nu$ is the modified Bessel function of the second kind. Controls differentiability via $\nu$: $\nu = 1/2$ gives the Ornstein-Uhlenbeck kernel (rough), $\nu = 3/2$ gives once-differentiable samples, $\nu \to \infty$ recovers the RBF.
- Periodic: $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2}\right)$ for functions with known period $p$.
- Sums and products of kernels are valid kernels, enabling compositional modeling.
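The stationary kernels above can be sketched as functions of the distance $r = \|x - x'\|$. Function names and default hyperparameters here are illustrative; the Matern-3/2 case uses its well-known closed form rather than Bessel functions.

```python
import numpy as np

def rbf(r, ell=1.0):
    # squared exponential: infinitely differentiable samples
    return np.exp(-0.5 * (r / ell) ** 2)

def matern32(r, ell=1.0):
    # Matern nu=3/2: once-differentiable samples
    a = np.sqrt(3.0) * np.abs(r) / ell
    return (1.0 + a) * np.exp(-a)

def periodic(r, ell=1.0, p=2.0):
    # exactly repeats with period p
    return np.exp(-2.0 * np.sin(np.pi * np.abs(r) / p) ** 2 / ell**2)

# sums and products of kernels are kernels, e.g. a locally periodic kernel:
quasi_periodic = lambda r: periodic(r) * rbf(r, ell=10.0)
```

The product at the end illustrates compositional modeling: periodic structure whose shape is allowed to drift slowly, a common choice for seasonal data.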
Marginal Likelihood for Hyperparameter Selection
GP Log Marginal Likelihood
Statement
The log marginal likelihood (evidence) is:
$$\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top (K_\theta + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log \left|K_\theta + \sigma_n^2 I\right| - \tfrac{n}{2} \log 2\pi$$
where $\theta$ denotes kernel hyperparameters (length scale $\ell$, signal variance $\sigma_f^2$, etc.).
Intuition
The three terms have clear interpretations. The first is a data-fit term: how well the model explains the observations. The second is a complexity penalty: kernels with larger $\log|K_\theta + \sigma_n^2 I|$ (more flexible models) are penalized. The third is a normalization constant. Maximizing the marginal likelihood automatically balances fit and complexity, implementing Occam's razor without explicit regularization.
Proof Sketch
Since $y \sim \mathcal{N}(0, K_\theta + \sigma_n^2 I)$ under the prior, the log density is the standard multivariate Gaussian log-pdf.
Why It Matters
The marginal likelihood provides a principled way to select kernel hyperparameters without cross-validation. Gradient-based optimization of the log marginal likelihood (type II maximum likelihood) is the standard approach. Unlike cross-validation, it uses all data for both fitting and selection.
Failure Mode
The marginal likelihood is non-convex in the hyperparameters and can have multiple local optima, especially with compositional kernels. Different initializations can lead to different solutions. The marginal likelihood can also overfit the hyperparameters when $n$ is small relative to the number of hyperparameters.
Sparse Gaussian Processes
Exact GP regression costs $O(n^3)$. Sparse GPs reduce this by selecting $m \ll n$ inducing points and approximating the full GP through these points.
The variational free energy (VFE, Titsias 2009) approach optimizes the inducing point locations and a variational distribution over the function values at the inducing points. The resulting cost is $O(nm^2)$ for training and $O(m^2)$ per test point for prediction.
The FITC (fully independent training conditional) approximation assumes conditional independence of training points given the inducing points. This gives a diagonal correction to the inducing point approximation.
With a few hundred inducing points, sparse GPs can handle datasets with millions of training points. The choice of inducing point locations matters: placing them in regions of high data density gives better approximations.
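As a rough illustration of the $O(nm^2)$ cost structure, here is the predictive mean of the simpler subset-of-regressors (SoR) approximation, a cruder relative of VFE and FITC. All names, data, and inducing-point choices below are illustrative.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def sor_predict(x, y, z, x_test, sn2=0.01, ell=1.0, jitter=1e-8):
    """Subset-of-regressors predictive mean through m inducing points z.

    mu = k(x_test, z) (Kmn Knm + sn2 Kmm)^{-1} Kmn y
    Only m x m systems are solved; forming Kmn Knm is the O(n m^2) step.
    """
    Kmn = rbf(z, x, ell)                           # m x n
    Kmm = rbf(z, z, ell) + jitter * np.eye(len(z))
    A = Kmn @ Kmn.T + sn2 * Kmm                    # m x m
    alpha = np.linalg.solve(A, Kmn @ y)
    return rbf(x_test, z, ell) @ alpha

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)
z = np.linspace(0, 6, 15)                          # 15 inducing points
mu = sor_predict(x, y, z, x)                       # close to sin(x)
```

With the inducing points spread over the data range, 15 basis functions recover the sine well; SoR's known weakness (overconfident variances far from data) is one motivation for the FITC and VFE corrections discussed above.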
Connection to Bayesian Optimization
Bayesian optimization uses a GP as a surrogate for an expensive black-box function . The GP posterior mean and variance define an acquisition function (e.g., expected improvement, UCB) that balances exploitation (evaluating where the mean is low) and exploration (evaluating where the variance is high).
The GP's calibrated uncertainty is critical here. If the uncertainty is underestimated, the optimizer exploits too early. If overestimated, it explores too much. GP uncertainty, with kernel hyperparameters fit via marginal likelihood, is typically well calibrated for smooth functions.
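The exploitation-exploration trade-off can be made concrete with expected improvement for minimization. This is a self-contained sketch; the closed form below is the standard EI expression, written with `math.erf` to avoid external dependencies.

```python
import numpy as np
from math import erf, sqrt, pi

def expected_improvement(mu, sigma, f_best):
    """EI for minimization: large where mu is low OR sigma is high."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = (f_best - mu) / np.maximum(sigma, 1e-12)
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))   # standard normal CDF
    phi = np.exp(-0.5 * z**2) / sqrt(2.0 * pi)             # standard normal pdf
    return (f_best - mu) * Phi + sigma * phi

# Two candidate points with the same posterior mean but different
# posterior uncertainty: the more uncertain one has higher EI.
ei_low_var = expected_improvement([0.5], [0.1], f_best=0.4)
ei_high_var = expected_improvement([0.5], [1.0], f_best=0.4)
```

Both the mean and the variance enter the formula, which is why a surrogate that supplies only point predictions cannot drive this acquisition function.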
Common Confusions
GP uncertainty is conditional on the kernel being correct
The posterior variance is exact given the GP prior. If the true function is not well-modeled by the chosen kernel (e.g., using an RBF kernel for a discontinuous function), the uncertainty estimates are miscalibrated. The GP does not "know" when its kernel assumption is wrong.
The posterior mean equals kernel ridge regression, but the variance does not
The GP posterior mean is identical to the kernel ridge regression prediction with regularization parameter $\lambda = \sigma_n^2$. But kernel ridge regression does not produce the posterior variance. The uncertainty quantification is unique to the Bayesian (GP) formulation.
Canonical Examples
GP regression on a 1D function
Sample 20 points from $y = \sin(x) + \epsilon$ on $[-5, 5]$ with noise $\sigma_n = 0.1$. Fit a GP with RBF kernel, optimizing $\ell$ and $\sigma_f^2$ via marginal likelihood. The posterior mean tracks the true sine function. The 95% confidence band ($\pm 2$ standard deviations) is narrow near training points and widens in gaps between points. Beyond the edges of the training range ($x < -5$ or $x > 5$), the posterior reverts to the prior mean (zero) with prior variance, correctly reflecting that the GP has no information there.
Key Takeaways
- GP regression gives a closed-form Gaussian posterior: $f_* \sim \mathcal{N}\big(k_*^\top (K + \sigma_n^2 I)^{-1} y,\; k_{**} - k_*^\top (K + \sigma_n^2 I)^{-1} k_*\big)$
- Posterior variance is large where data is sparse and small near training points
- Kernel choice encodes smoothness and structure assumptions about $f$
- Marginal likelihood selects hyperparameters without cross-validation via automatic Occam's razor
- Exact GP costs $O(n^3)$; sparse GPs with $m$ inducing points cost $O(nm^2)$
- GPs are the default surrogate model for Bayesian optimization
Exercises
Problem
For a GP with RBF kernel ($\ell = 1$, $\sigma_f^2 = 1$) and noise $\sigma_n^2 = 0.1$, you observe one training point: $x_1 = 0$, $y_1 = 1$. Compute the posterior mean and variance at $x_* = 0$ and at $x_* = 2$.
Problem
Show that the GP marginal likelihood penalizes both underfitting and overfitting. Specifically, consider the RBF kernel with length scale $\ell$. Explain what happens to the data-fit term and the complexity term as $\ell \to 0$ and as $\ell \to \infty$.
Connection to Kriging and RBF Interpolation
GP regression is mathematically identical to kriging, the geostatistical interpolation method developed by Matheron (1963) building on Krige's mining work.
In kriging, the kernel is called the variogram (or its complement, the covariance function), the posterior mean is the kriging predictor, and the posterior variance is the kriging variance. The equations are the same: the kriging predictor is $k_*^\top (K + \sigma_n^2 I)^{-1} y$, identical to the GP posterior mean. The geostatistical and machine learning communities developed the same theory independently.
Radial basis function (RBF) interpolation is the noise-free special case: set $\sigma_n^2 = 0$, and the GP posterior mean becomes $\mu(x_*) = k_*^\top K^{-1} y$, which passes exactly through every training point. This is classical RBF interpolation with the kernel serving as the radial basis function. Common choices include the Gaussian ($\phi(r) = e^{-r^2/(2\ell^2)}$), multiquadric ($\phi(r) = \sqrt{r^2 + c^2}$), and Matern kernels.
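The exact-interpolation property is easy to check numerically. A small sketch with arbitrary points and a Gaussian basis function; the tiny jitter stands in for $\sigma_n^2 = 0$ only to keep the solve well conditioned.

```python
import numpy as np

# Noise-free GP regression = RBF interpolation: the posterior mean
# passes exactly through the training targets.
x = np.array([0.0, 1.0, 2.5, 4.0])
y = np.cos(x)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)        # Gaussian basis
weights = np.linalg.solve(K + 1e-10 * np.eye(4), y)      # jitter for conditioning

def interpolate(xs):
    return np.exp(-0.5 * (xs[:, None] - x[None, :]) ** 2) @ weights
```

Evaluating `interpolate(x)` reproduces `y` to within the jitter, while evaluating off the training points gives a smooth interpolant.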
The GP perspective adds two things that classical RBF interpolation lacks: principled uncertainty quantification (the posterior variance) and hyperparameter selection via marginal likelihood.
References
Canonical:
- Rasmussen & Williams, Gaussian Processes for Machine Learning (2006), Chapters 2, 5
- Bishop, Pattern Recognition and Machine Learning (2006), Section 6.4
Kriging and RBF:
- Matheron, "Principles of Geostatistics" (Economic Geology, 1963). Founding paper of kriging theory.
- Cressie, Statistics for Spatial Data (Rev. ed., 1993), Chapters 3-4
- Buhmann, Radial Basis Functions: Theory and Implementations (2003), Chapters 1-3
- Wendland, Scattered Data Approximation (2005). Native spaces and convergence theory for RBF interpolation.
Current:
- Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes" (AISTATS 2009)
- Hensman, Fusi, Lawrence, "Gaussian Processes for Big Data" (UAI 2013)
Next Topics
- Neural tangent kernel: the infinite-width neural network as a GP
- Cross-validation theory: alternative model selection when marginal likelihood is unavailable
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Gaussian Processes for Machine Learning (Layer 4)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)