Statistical Estimation
Bayesian Linear Regression
Gaussian prior, Gaussian likelihood, Gaussian posterior. Full posterior derivation by completing the square in the exponent: the posterior mean equals the ridge estimator, the predictive distribution has irreducible plus epistemic variance, and the marginal likelihood gives a closed-form hyperparameter selection criterion. Worked numeric example with three data points carries the algebra end to end.
Why This Matters
Bayesian linear regression is the cleanest place to see every piece of Bayesian inference fall into place at once. The prior, likelihood, and posterior are all Gaussian, so the algebra reduces to one move (completing the square in the exponent) that you have already seen on the conjugate priors and multivariate normal pages. The posterior mean turns out to be the ridge estimator. The posterior covariance gives you uncertainty about the coefficients $\beta$. The predictive distribution gives you uncertainty about future observations $y_*$. The marginal likelihood gives you a closed-form criterion for choosing the prior strength.
And it generalizes: replace the design matrix with a kernel-implied feature map and Bayesian linear regression becomes a Gaussian process. Almost every Bayesian regression page on the site rests on this one derivation.
Mental Model
You have a linear model $y = X\beta + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. Frequentist OLS gives you a point estimate $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y$: one vector of weights, no uncertainty.
Bayesian linear regression treats $\beta$ as a random variable with a prior $p(\beta)$ and updates the prior to a posterior using the data:
$$p(\beta \mid y) \propto p(y \mid \beta)\, p(\beta).$$
The product of two Gaussian factors is a Gaussian, so the posterior is $\mathcal{N}(m_N, S_N)$ for some mean $m_N$ and covariance $S_N$. The job is to derive those two quantities, and that's where the completing-the-square move shows up.
Two payoffs:
- Uncertainty. $m_N$ is your best point estimate; $S_N$ tells you how confident the data has made you about each coefficient.
- Predictive distribution. For a new input $x_*$, the prediction $y_*$ is itself Gaussian with mean $x_*^\top m_N$ and variance $\sigma^2 + x_*^\top S_N x_*$: one term for irreducible noise, one term for epistemic uncertainty about $\beta$. This is the difference between "I think it's about 3" and "I think it's $3 \pm 0.5$."
Formal Setup
The Bayesian linear regression model in canonical form:
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n), \qquad \beta \sim \mathcal{N}(m_0, S_0),$$
with $X \in \mathbb{R}^{n \times p}$ the design matrix, $y \in \mathbb{R}^n$ the response vector, $\beta \in \mathbb{R}^p$ the coefficient vector, $\sigma^2$ the (known) noise variance, and $m_0$, $S_0$ the prior mean and covariance. We treat $\sigma^2$ as known here; the case where $\sigma^2$ is also unknown uses the Normal-Inverse-Gamma conjugate prior and is the next refinement.
The most common prior choice is $m_0 = 0$ and $S_0 = \tau^2 I$ (zero mean, isotropic). The derivation below covers both the isotropic case and the general $(m_0, S_0)$ case; the algebra is identical.
Full Posterior Derivation by Completing the Square
This is the central derivation of the page. Watch the completing-the-square recipe from the multivariate normal page do all the work.
Posterior in Bayesian Linear Regression
Statement
The posterior is Gaussian:
$$\beta \mid y \sim \mathcal{N}(m_N, S_N),$$
with
$$S_N = \left(S_0^{-1} + \sigma^{-2} X^\top X\right)^{-1}, \qquad m_N = S_N\left(S_0^{-1} m_0 + \sigma^{-2} X^\top y\right).$$
With the standard prior $m_0 = 0$, $S_0 = \tau^2 I$:
$$S_N = \left(\tau^{-2} I + \sigma^{-2} X^\top X\right)^{-1}, \qquad m_N = \left(X^\top X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y.$$
So the posterior mean equals the ridge regression estimator with regularization parameter $\lambda = \sigma^2/\tau^2$.
Intuition
The posterior precision is the sum of prior precision and data precision: $S_N^{-1} = S_0^{-1} + \sigma^{-2} X^\top X$. The posterior mean is a precision-weighted compromise between the prior mean $m_0$ and the data's "preferred" direction $X^\top y$. More data (larger $n$) or a tighter prior (smaller $\tau^2$) makes the posterior tighter. The ridge equivalence falls out of this: setting $m_0 = 0$ and $S_0 = \tau^2 I$ makes the posterior mean exactly $(X^\top X + \lambda I)^{-1} X^\top y$ with $\lambda = \sigma^2/\tau^2$, and this is also the MAP under the same prior (since the posterior is Gaussian, mean and mode coincide).
Proof Sketch
The log-posterior up to a $\beta$-independent constant is
$$\log p(\beta \mid y) = -\tfrac{1}{2\sigma^2}\,(y - X\beta)^\top (y - X\beta) - \tfrac{1}{2}\,(\beta - m_0)^\top S_0^{-1} (\beta - m_0) + \mathrm{const}.$$
Expand both terms:
$$-\tfrac{1}{2\sigma^2}\left(\beta^\top X^\top X \beta - 2\beta^\top X^\top y + y^\top y\right) - \tfrac{1}{2}\left(\beta^\top S_0^{-1} \beta - 2\beta^\top S_0^{-1} m_0 + m_0^\top S_0^{-1} m_0\right).$$
Collect the quadratic-in-$\beta$ terms: $-\tfrac{1}{2}\,\beta^\top\!\left(\sigma^{-2} X^\top X + S_0^{-1}\right)\beta$. Define the posterior precision $S_N^{-1} = S_0^{-1} + \sigma^{-2} X^\top X$, so this is $-\tfrac{1}{2}\,\beta^\top S_N^{-1} \beta$.
Collect the linear-in-$\beta$ terms: $\beta^\top\!\left(\sigma^{-2} X^\top y + S_0^{-1} m_0\right)$. Define $b = \sigma^{-2} X^\top y + S_0^{-1} m_0$, so this is $\beta^\top b$.
So $\log p(\beta \mid y) = -\tfrac{1}{2}\,\beta^\top S_N^{-1} \beta + \beta^\top b + \mathrm{const}$. Apply the completing-the-square recipe (see multivariate normal): the posterior is Gaussian with mean $m_N = S_N b$ and covariance $S_N$. Substituting back:
$$m_N = S_N\left(S_0^{-1} m_0 + \sigma^{-2} X^\top y\right), \qquad S_N = \left(S_0^{-1} + \sigma^{-2} X^\top X\right)^{-1}.$$
For the standard prior, $m_0 = 0$ and $S_0^{-1} = \tau^{-2} I$, so $S_N = \left(\tau^{-2} I + \sigma^{-2} X^\top X\right)^{-1}$ and $m_N = \sigma^{-2} S_N X^\top y$. Multiplying through by $\sigma^2$ gives $m_N = \left(X^\top X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y$, the ridge form.
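A minimal numpy sketch of the update (the function name `posterior` and the synthetic data are illustrative assumptions, not from the text): it computes $S_N$ and $m_N$ exactly as derived and checks the ridge equivalence numerically.

```python
import numpy as np

def posterior(X, y, sigma2, m0, S0):
    """Posterior mean and covariance for Bayesian linear regression.

    S_N = (S0^{-1} + X^T X / sigma^2)^{-1}
    m_N = S_N (S0^{-1} m0 + X^T y / sigma^2)
    """
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + X.T @ X / sigma2)   # posterior covariance
    mN = SN @ (S0_inv @ m0 + X.T @ y / sigma2)      # posterior mean
    return mN, SN

# Ridge equivalence check with an isotropic zero-mean prior:
rng = np.random.default_rng(0)
n, p, sigma2, tau2 = 20, 3, 0.5, 2.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

mN, SN = posterior(X, y, sigma2, np.zeros(p), tau2 * np.eye(p))
ridge = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ y)
assert np.allclose(mN, ridge)   # posterior mean == ridge estimator with lambda = sigma^2 / tau^2
```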
Why It Matters
This single derivation is the foundation of every Gaussian Bayesian regression model on the site. Replacing $X$ with a feature-mapped design matrix $\Phi$ gives kernel ridge regression as a special case of Bayesian linear regression; the infinite-feature limit gives the Gaussian process posterior. The ridge equivalence shows that L2-regularized least squares is not a frequentist hack but the mean of the posterior under a Gaussian prior, and the ridge solution paired with the posterior covariance gives a credible interval that ridge alone does not provide.
Failure Mode
The derivation assumes the prior covariance $S_0$ is positive definite (so $S_0^{-1}$ exists). For improper priors ($\tau^2 \to \infty$, i.e. a flat prior), the limit gives $S_N \to \sigma^2 (X^\top X)^{-1}$ and $m_N \to (X^\top X)^{-1} X^\top y$, recovering the OLS solution provided $X^\top X$ is invertible. If $X^\top X$ is singular (e.g. when $p > n$), the OLS limit does not exist, and you must have a proper prior. This is why Bayesian linear regression remains well-defined in the high-dimensional regime while OLS does not.
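A short numpy sketch (synthetic data, purely illustrative) makes the point: with $p > n$ the rank of $X^\top X$ is at most $n$, so the OLS normal equations are singular, while the posterior precision under a proper prior is full rank.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 20                        # p > n: OLS is undefined
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # at most n = 5 < p = 20, so XtX is singular

sigma2, tau2 = 1.0, 1.0             # proper isotropic prior
SN = np.linalg.inv(np.eye(p) / tau2 + XtX / sigma2)   # full rank, always invertible
mN = SN @ (X.T @ y) / sigma2        # posterior mean exists even though OLS does not
```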
Predictive Distribution
A point estimate is a starting point; a predictive distribution is what you want for downstream decisions. Given a new input $x_*$, the prediction $y_* = x_*^\top \beta + \varepsilon_*$ with $\varepsilon_* \sim \mathcal{N}(0, \sigma^2)$ has a closed-form distribution after marginalizing out $\beta$.
Predictive Distribution in Bayesian Linear Regression
Statement
The predictive distribution of $y_*$ at a new input $x_*$ is Gaussian:
$$y_* \mid y \sim \mathcal{N}\!\left(x_*^\top m_N,\; \sigma^2 + x_*^\top S_N x_*\right).$$
The variance decomposes into two parts: $\sigma^2$ (the irreducible noise) and $x_*^\top S_N x_*$ (the epistemic uncertainty about $\beta$ projected onto $x_*$).
Intuition
Even if you knew $\beta$ exactly, you would still face noise variance $\sigma^2$ on any new observation. The extra term $x_*^\top S_N x_*$ captures the fact that you don't know $\beta$ exactly, and the uncertainty depends on the direction of $x_*$: predictions in directions where $S_N$ is large (poorly-constrained directions) carry more uncertainty than predictions in well-constrained directions. The MAP / point estimate alone hides this; it gives the same number regardless of how confident you are.
Proof Sketch
Given the data, $y_* = x_*^\top \beta + \varepsilon_*$ is the sum of two independent Gaussian terms: $x_*^\top \beta$ with mean $x_*^\top m_N$ and variance $x_*^\top S_N x_*$, and $\varepsilon_* \sim \mathcal{N}(0, \sigma^2)$. The sum of independent Gaussians is Gaussian with summed means and summed variances.
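In code the decomposition is one line per term; a minimal sketch assuming the posterior quantities $m_N$, $S_N$ from above (the helper name `predictive` is mine):

```python
import numpy as np

def predictive(x_star, mN, SN, sigma2):
    """Predictive mean and variance at a new input x_star.

    mean = x_star^T m_N
    var  = sigma^2 (aleatoric) + x_star^T S_N x_star (epistemic)
    """
    mean = x_star @ mN
    epistemic = x_star @ SN @ x_star
    return mean, sigma2 + epistemic
```

The epistemic term is a quadratic form in $S_N$, so it is largest along poorly-constrained posterior directions, exactly as the intuition above says.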
Why It Matters
This is the predictive interval most people want when they read the words "regression prediction." Point predictions are calibrated only on average; predictive intervals are calibrated pointwise. The decomposition into aleatoric ($\sigma^2$) and epistemic ($x_*^\top S_N x_*$) uncertainty is what makes Bayesian regression useful for active learning (pick the $x$ that maximally reduces posterior uncertainty), Bayesian optimization (balance mean and uncertainty), and out-of-distribution detection (a large $x_*^\top S_N x_*$ flags inputs far from the training data).
Failure Mode
The decomposition assumes the noise variance $\sigma^2$ is known and correct. If $\sigma^2$ is misspecified, the predictive interval is miscalibrated. The Normal-Inverse-Gamma conjugate prior treats $\sigma^2$ as unknown and gives a Student's $t$ predictive distribution that correctly inflates intervals for small $n$, at the cost of slightly more complex algebra.
Marginal Likelihood (Evidence)
The marginal likelihood $p(y) = \int p(y \mid \beta)\, p(\beta)\, d\beta$ integrates $\beta$ out of the joint $p(y, \beta)$. It serves as the normalizer in Bayes' rule and as a model-comparison score: the "evidence" the data provides in favor of the chosen prior.
Marginal Likelihood for Bayesian Linear Regression
Statement
The log marginal likelihood with the isotropic prior $\beta \sim \mathcal{N}(0, \tau^2 I)$ is
$$\log p(y) = -\tfrac{n}{2}\log 2\pi - \tfrac{1}{2}\log\left|\tau^2 X X^\top + \sigma^2 I\right| - \tfrac{1}{2}\, y^\top \left(\tau^2 X X^\top + \sigma^2 I\right)^{-1} y.$$
Maximizing this jointly over $\tau^2$ and $\sigma^2$ gives the empirical Bayes (or "evidence approximation") choice of hyperparameters, an alternative to cross-validation that is fully Bayesian whenever the Gaussian model is well-specified.
Intuition
The marginal likelihood is the probability the model assigned to the actual data, averaged over the prior. It penalizes models that are too flexible (the prior spreads its predicted-$y$ mass thin over many configurations, so any specific $y$ has low marginal probability) and models that are too rigid (the prior puts mass on configurations that disagree with $y$). The maximum-evidence hyperparameter sits at the "Occam's razor" sweet spot.
Proof Sketch
Write $y = X\beta + \varepsilon$ with $\beta \sim \mathcal{N}(0, \tau^2 I)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ independent, then marginalize over $\beta$. The marginal of $y$ is the affine image of a Gaussian, so $y \sim \mathcal{N}(0, \tau^2 X X^\top + \sigma^2 I)$. The density is the standard $n$-dimensional Gaussian density with covariance $C = \tau^2 X X^\top + \sigma^2 I$, giving the stated log-density.
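A direct transcription of that formula, as a minimal sketch (the function name is mine; it uses `slogdet` and `solve` rather than forming $C^{-1}$ explicitly):

```python
import numpy as np

def log_evidence(X, y, sigma2, tau2):
    """log p(y) = -n/2 log(2 pi) - 1/2 log|C| - 1/2 y^T C^{-1} y,  C = tau^2 X X^T + sigma^2 I."""
    n = len(y)
    C = tau2 * (X @ X.T) + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```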
Why It Matters
The marginal likelihood is what lets you tune the prior strength automatically: maximize jointly over $\tau^2$ and $\sigma^2$ and you get an empirical-Bayes estimate of both. This is Type-II maximum likelihood; in the Gaussian-process literature it is called "evidence maximization" and is the default hyperparameter selection method when cross-validation is too expensive. The same machinery generalizes: replacing $\tau^2 X X^\top$ with a kernel matrix $K$ gives the GP marginal likelihood.
Failure Mode
The marginal likelihood is not a substitute for cross-validation when the Gaussian model is misspecified. Misspecified noise distributions, non-linear true response surfaces, or non-Gaussian outliers can produce a marginal likelihood that picks hyperparameters which overfit the noise model rather than the data. The matrix $\tau^2 X X^\top + \sigma^2 I$ is $n \times n$; for large $n$, the determinant and inverse cost $O(n^3)$, the same scaling problem GPs face. For $p \ll n$, it is cheaper to compute via the Woodbury identity in $p$-dimensional space.
Worked Numeric Example
Three data points, one feature, a tractable prior; enough to follow every step in numbers.
Setup. $n = 3$, $p = 2$ (intercept and slope). Data:
$$X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 4 \end{pmatrix}.$$
The first column is the intercept; the second is the predictor. Prior: $\beta \sim \mathcal{N}(0, I)$ (so $m_0 = 0$, $S_0 = \tau^2 I$ with $\tau^2 = 1$). Noise variance $\sigma^2 = 1$.
Compute $X^\top X$ and $X^\top y$:
$$X^\top X = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}, \qquad X^\top y = \begin{pmatrix} 7 \\ 10 \end{pmatrix}.$$
Posterior precision:
$$S_N^{-1} = \tau^{-2} I + \sigma^{-2} X^\top X = I + X^\top X = \begin{pmatrix} 4 & 3 \\ 3 & 6 \end{pmatrix}.$$
Posterior covariance (invert the $2 \times 2$ matrix; determinant $4 \cdot 6 - 3 \cdot 3 = 15$):
$$S_N = \frac{1}{15}\begin{pmatrix} 6 & -3 \\ -3 & 4 \end{pmatrix} \approx \begin{pmatrix} 0.400 & -0.200 \\ -0.200 & 0.267 \end{pmatrix}.$$
Posterior mean (using $m_N = \sigma^{-2} S_N X^\top y = S_N X^\top y$):
$$m_N = \frac{1}{15}\begin{pmatrix} 6 & -3 \\ -3 & 4 \end{pmatrix}\begin{pmatrix} 7 \\ 10 \end{pmatrix} = \frac{1}{15}\begin{pmatrix} 12 \\ 19 \end{pmatrix} \approx \begin{pmatrix} 0.800 \\ 1.267 \end{pmatrix}.$$
(Direct check: $S_N^{-1} m_N$ should recover $X^\top y$. The product computed step-by-step: $4 \cdot 0.800 + 3 \cdot 1.267 = 7.0$ and $3 \cdot 0.800 + 6 \cdot 1.267 = 10.0$, so $S_N^{-1} m_N = (7, 10)^\top = X^\top y$.)
Compare to OLS. OLS solves $X^\top X\, \hat\beta = X^\top y$, i.e. $\begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}\hat\beta = \begin{pmatrix} 7 \\ 10 \end{pmatrix}$. Determinant $3 \cdot 5 - 3 \cdot 3 = 6$, so $\hat\beta_{\mathrm{OLS}} = \frac{1}{6}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix}\begin{pmatrix} 7 \\ 10 \end{pmatrix} = \frac{1}{6}\begin{pmatrix} 5 \\ 9 \end{pmatrix} \approx \begin{pmatrix} 0.833 \\ 1.500 \end{pmatrix}$.
So the posterior pulls each coefficient toward zero: intercept from 0.833 to 0.800, slope from 1.500 to 1.267. The shrinkage is the ridge effect with $\lambda = \sigma^2/\tau^2 = 1$. The off-diagonal $-0.200$ in $S_N$ says intercept and slope estimates are negatively correlated in the posterior (a small intercept goes with a large slope and vice versa), which makes sense for this dataset where the three points form a fairly tight line.
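A few lines of numpy reproduce these numbers (a minimal check of the setup above; nothing beyond the formulas already derived):

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
sigma2, tau2 = 1.0, 1.0

SN_inv = np.eye(2) / tau2 + X.T @ X / sigma2    # [[4, 3], [3, 6]]
SN = np.linalg.inv(SN_inv)                      # [[0.4, -0.2], [-0.2, 0.2667]]
mN = SN @ (X.T @ y) / sigma2                    # [0.8, 1.2667]

ols = np.linalg.solve(X.T @ X, X.T @ y)         # [0.8333, 1.5]
print(mN, ols)
```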
Predictive distribution at $x_* = (1, 1)^\top$ (predicting at $x = 1$, the midpoint of the training range):
- Mean: $x_*^\top m_N = 0.800 + 1.267 = 2.067$.
- Variance: $\sigma^2 + x_*^\top S_N x_* = 1 + x_*^\top S_N x_*$.
Compute the inner product: $S_N x_* = \frac{1}{15}(6 - 3,\; -3 + 4)^\top = (0.200, 0.067)^\top$, and $x_*^\top S_N x_* = 0.200 + 0.067 = 0.267$. So predictive variance $1 + 0.267 = 1.267$, standard deviation $\approx 1.13$.
So $y_* \sim \mathcal{N}(2.07, 1.27)$, and a 95% predictive interval is roughly $2.07 \pm 2.21$, i.e. $[-0.1, 4.3]$; wide because we have only three data points. With more data, $S_N$ shrinks and the epistemic component shrinks too; the irreducible floor $\sigma^2 = 1$ remains.
Predictive distribution at $x_* = (1, 10)^\top$ (extrapolating well past the training range):
- Mean: $x_*^\top m_N = 0.800 + 1.267 \cdot 10 = 13.47$.
- $S_N x_* = \frac{1}{15}(6 - 30,\; -3 + 40)^\top = (-1.600, 2.467)^\top$, and $x_*^\top S_N x_* = -1.600 + 24.667 = 23.07$.
- Variance: $1 + 23.07 = 24.07$, standard deviation $\approx 4.91$.
The epistemic uncertainty at $x = 10$ is 23× the irreducible noise; this is the Bayesian penalty for extrapolating. A frequentist point estimate would happily report 13.47 with no signal that the model has essentially no confidence in this prediction. Bayesian regression makes the uncertainty explicit: 95% predictive interval roughly $13.47 \pm 9.62$, i.e. $[3.9, 23.1]$. This is what makes Bayesian methods valuable for safety-critical or active-learning applications.
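Both predictive computations in one self-contained sketch (the helper `predict` is mine); the printed pairs should match the hand calculations above:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
sigma2 = 1.0
SN = np.linalg.inv(np.eye(2) + X.T @ X)   # tau^2 = 1, sigma^2 = 1
mN = SN @ X.T @ y

def predict(x_star):
    """Predictive mean and variance: aleatoric sigma^2 plus epistemic x^T S_N x."""
    return x_star @ mN, sigma2 + x_star @ SN @ x_star

print(predict(np.array([1.0, 1.0])))    # ~ (2.067, 1.267): interpolation
print(predict(np.array([1.0, 10.0])))   # ~ (13.467, 24.067): extrapolation
```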
Marginal likelihood. Compute $\log p(y)$ with $C = \tau^2 X X^\top + \sigma^2 I = X X^\top + I$:
$$C = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 3 & 3 \\ 1 & 3 & 6 \end{pmatrix}.$$
Determinant: expand along the first row. $|C| = 2(3 \cdot 6 - 3 \cdot 3) - 1(1 \cdot 6 - 3 \cdot 1) + 1(1 \cdot 3 - 3 \cdot 1) = 18 - 3 + 0 = 15$. So $\log|C| = \log 15 \approx 2.708$.
$y^\top C^{-1} y$: invert the matrix or solve the linear system $Cz = y$. Solving by Gaussian elimination on $(C \mid y)$ gives $z \approx (0.200, -0.067, 0.667)^\top$ (verify: the first entry of $Cz$ is $2 \cdot 0.200 - 0.067 + 0.667 = 1.000$; should be 1; any small discrepancy is rounding from the approximate elimination; exact arithmetic gives $z = (1/5, -1/15, 2/3)^\top$, leading to $y^\top C^{-1} y = y^\top z = 41/15 \approx 2.733$).
Putting it together: $\log p(y) = -\tfrac{3}{2}\log 2\pi - \tfrac{1}{2}\log 15 - \tfrac{1}{2}(2.733) \approx -2.757 - 1.354 - 1.367 = -5.48$.
To pick a better $\tau^2$ empirically, repeat this computation for several values of $\tau^2$ and maximize. This is the Type-II MLE on the prior strength; a grid-search sketch follows below.
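A minimal sketch of that grid search (the grid endpoints and resolution are arbitrary choices, not from the text):

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
sigma2 = 1.0

def log_evidence(tau2):
    """log N(y | 0, tau^2 X X^T + sigma^2 I), evaluated directly."""
    C = tau2 * (X @ X.T) + sigma2 * np.eye(3)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (3 * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

print(log_evidence(1.0))                                   # ~ -5.48, matching the hand computation
taus = np.linspace(0.05, 5.0, 200)
print(taus[np.argmax([log_evidence(t) for t in taus])])    # empirical-Bayes tau^2 on this grid
```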
Connections
- Ridge regression (as MAP). $m_N$ is exactly the ridge estimator with $\lambda = \sigma^2/\tau^2$. The MAP under a Gaussian prior coincides with the posterior mean here because the posterior is symmetric (Gaussian), so MAP estimation and the posterior mean give the same point.
- OLS (flat-prior limit). $\tau^2 \to \infty$ sends $S_N \to \sigma^2 (X^\top X)^{-1}$ and $m_N \to (X^\top X)^{-1} X^\top y$, recovering OLS if $X^\top X$ is invertible. The Bayesian framework gracefully handles the case where OLS does not exist.
- Kernel ridge regression. Replace $X$ with $\Phi$, the feature-mapped design matrix, and the same algebra goes through with a different design. With a positive definite kernel and an isotropic prior in feature space, the posterior mean is exactly the kernel ridge regression solution.
- Gaussian processes. Take the infinite-feature limit (or specify a kernel directly, bypassing the explicit feature map) and you get a Gaussian process posterior. The GP predictive variance has the same epistemic-plus-aleatoric decomposition.
- Empirical Bayes. Maximize the marginal likelihood over $(\tau^2, \sigma^2)$ to pick hyperparameters. This is the "Type-II MLE" or "evidence approximation" route; it generalizes to GPs as kernel hyperparameter selection by marginal-likelihood maximization.
Common Confusions
The posterior mean equals the ridge estimator, but the Bayesian framework is not 'just ridge'
Ridge regression gives a point. Bayesian linear regression gives a distribution: a posterior over $\beta$ with covariance $S_N$, and a predictive distribution over $y_*$ with explicit aleatoric and epistemic variance. The point estimate is the same; the uncertainty quantification is what makes the Bayesian view useful for downstream tasks (active learning, BO, OOD detection). Saying "BLR is ridge with a clever interpretation" misses that ridge alone cannot produce a credible interval without auxiliary frequentist machinery (bootstrap, sandwich estimators).
The predictive variance has two parts, and they don't go to zero together
For a new $x_*$, the predictive variance is $\sigma^2 + x_*^\top S_N x_*$. As $n \to \infty$ and the prior gets dominated, $x_*^\top S_N x_* \to 0$; the epistemic part vanishes. But $\sigma^2$ is the irreducible noise floor: no amount of data lets you predict a new $y_*$ better than $\pm\sigma$. Confusing these is the root of the "I have a million data points, why is my prediction interval so wide?" mistake: the model has nailed $\beta$, but $y_*$ itself is intrinsically noisy.
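A quick simulation illustrates the split (synthetic data; the particular $x_*$ and sample sizes are arbitrary): the epistemic term collapses as $n$ grows while the total predictive variance levels off at $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 1.0
x_star = np.array([1.0, 0.5])

for n in (10, 1_000, 100_000):
    x = rng.uniform(0, 1, size=n)
    X = np.column_stack([np.ones(n), x])
    SN = np.linalg.inv(np.eye(2) / tau2 + X.T @ X / sigma2)
    epistemic = x_star @ SN @ x_star
    # epistemic -> 0 as n grows; total predictive variance -> sigma^2 = 1
    print(n, epistemic, sigma2 + epistemic)
```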
A flat prior is not always safe
The flat-prior limit recovers OLS if $X^\top X$ is invertible. In high dimensions ($p > n$), $X^\top X$ is singular, the limit blows up, and the "least-committal" prior is in fact pathological: the posterior is improper and the marginal likelihood is zero. A proper prior is mathematically required, not just a stylistic choice, in the $p > n$ regime. Conversely, in the $n \gg p$ regime, the prior has negligible effect for moderate $\tau^2$ and the posterior is essentially the OLS sampling distribution.
Marginal likelihood is not the same as predictive accuracy
Marginal likelihood maximization picks hyperparameters under which the prior made the observed data probable. Cross-validation picks hyperparameters that predict held-out data well. These usually agree but not always, particularly when the model is misspecified. For mission-critical predictive performance under misspecification, use cross-validation. For Bayesian model comparison, or when the model is well-specified, marginal likelihood is the right choice.
Summary
- Bayesian linear regression posterior is Gaussian: $\beta \mid y \sim \mathcal{N}(m_N, S_N)$, with $S_N = \left(S_0^{-1} + \sigma^{-2} X^\top X\right)^{-1}$ and $m_N = S_N\left(S_0^{-1} m_0 + \sigma^{-2} X^\top y\right)$.
- Derived in one step by completing the square in the exponent of the log-posterior.
- Posterior mean equals the ridge estimator with $\lambda = \sigma^2/\tau^2$ for the isotropic prior.
- Predictive distribution: $y_* \mid y \sim \mathcal{N}\!\left(x_*^\top m_N,\; \sigma^2 + x_*^\top S_N x_*\right)$. The variance decomposes into irreducible ($\sigma^2$) plus epistemic ($x_*^\top S_N x_*$).
- Marginal likelihood $p(y) = \mathcal{N}\!\left(y \mid 0,\; \tau^2 X X^\top + \sigma^2 I\right)$; maximizing over $(\tau^2, \sigma^2)$ is empirical Bayes.
- Generalizes cleanly: feature maps give kernel ridge regression, kernels in their own right give Gaussian processes, Normal-Inverse-Gamma priors handle unknown $\sigma^2$.
Exercises
Problem
Verify by direct expansion that the posterior precision $S_N^{-1} = S_0^{-1} + \sigma^{-2} X^\top X$ falls out of the log-posterior. Start from the model $y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and the prior $\beta \sim \mathcal{N}(m_0, S_0)$, write the log-posterior up to a $\beta$-independent constant, and collect the quadratic coefficient.
Problem
Using the example dataset above ($X$, $y$, $\tau^2 = 1$, $\sigma^2 = 1$), compute the predictive mean and variance at one of the training inputs. Compare the predictive mean to the observed value of $y$ at that input.
Problem
Show that the posterior covariance satisfies $S_N \preceq S_0$ in the Loewner order (i.e. $S_0 - S_N$ is positive semi-definite). Interpret: observing data never makes the posterior less informative than the prior.
Problem
Suppose the noise variance $\sigma^2$ is unknown and you place a Normal-Inverse-Gamma prior on $(\beta, \sigma^2)$. Derive the marginal posterior of $\beta$ after integrating out $\sigma^2$. Verify that it is a multivariate Student's $t$ distribution and identify its degrees of freedom.
References
Canonical:
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §3.3 (the gold standard for the Bayesian linear regression derivation, predictive distribution, and evidence approximation).
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 14 (regression models).
- Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. Ch. 2 (Bayesian linear regression as the finite-dimensional case of a Gaussian process; the same algebra in different notation).
- Lindley, D.V., & Smith, A.F.M. (1972). "Bayes Estimates for the Linear Model." J. Royal Statistical Society B, 34(1):1–41. (Original treatment of Bayesian regression and hierarchical models.)
Current:
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 11 (linear regression: Bayesian and frequentist views unified).
- Murphy, K.P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Ch. 15 (Bayesian linear models in the high-dimensional and hierarchical settings).
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. §3.4 (ridge regression with the Bayesian connection as a sidebar).
Next Topics
- Gaussian processes for ML: the kernelized, infinite-feature version of this page. The posterior derivation generalizes by replacing $\tau^2 X X^\top$ with the kernel Gram matrix $K$.
- Empirical Bayes vs hierarchical Bayes: how to handle hyperparameters in a fully Bayesian way (place a prior on them) or in an empirical-Bayes way (maximize marginal likelihood).
- Kernel trick: the same algebra in a feature space defined implicitly by a kernel.
Last reviewed: May 10, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Conjugate Priors (layer 0B · tier 1)
- Maximum A Posteriori (MAP) Estimation (layer 0B · tier 1)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (layer 0B · tier 1)
- The Multivariate Normal Distribution (layer 0B · tier 1)
- Linear Regression (layer 1 · tier 1)
Derived topics
- The Kernel Trick (layer 2 · tier 1)
- Empirical Bayes vs Hierarchical Bayes (layer 2 · tier 2)
- Gaussian Process Regression (layer 3 · tier 2)
- Gaussian Processes for Machine Learning (layer 4 · tier 3)