
Statistical Estimation

Bayesian Linear Regression

Gaussian prior, Gaussian likelihood, Gaussian posterior. Full posterior derivation by completing the square in the exponent: the posterior mean equals the ridge estimator, the predictive distribution has irreducible plus epistemic variance, and the marginal likelihood gives a closed-form hyperparameter selection criterion. Worked numeric example with three data points carries the algebra end to end.

Advanced · Tier 1 · Stable · Core spine · ~110 min

Why This Matters

Bayesian linear regression is the cleanest place to see every piece of Bayesian inference fall into place at once. The prior, likelihood, and posterior are all Gaussian, so the algebra reduces to one move (completing the square in the exponent) that you have already seen in conjugate priors and the multivariate normal page. The posterior mean turns out to be the ridge estimator. The posterior covariance gives you uncertainty about $\beta$. The predictive distribution gives you uncertainty about future $y$. The marginal likelihood gives you a closed-form criterion for choosing the prior strength.

And it generalizes: replace the design matrix $X$ with a kernel-implied feature map $\phi(X)$ and Bayesian linear regression becomes a Gaussian process. Almost every Bayesian regression page on the site rests on this one derivation.

Mental Model

You have a linear model $y = X\beta + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal N(0, \sigma^2 I)$. Frequentist OLS gives you a point estimate $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y$: one vector of weights, no uncertainty.

Bayesian linear regression treats $\beta$ as a random variable with a prior $\beta \sim \mathcal N(\mu_0, \Sigma_0)$ and updates the prior to a posterior using the data:

$$\pi(\beta \mid X, y) \propto p(y \mid X, \beta)\, \pi(\beta).$$

The product of two Gaussian factors is a Gaussian, so the posterior is $\mathcal N(\mu_n, \Sigma_n)$ for some $\mu_n$ and $\Sigma_n$. The job is to derive those two quantities, and that's where the completing-the-square move shows up.

Two payoffs:

  • Uncertainty. $\mu_n$ is your best point estimate; $\Sigma_n$ tells you how confident the data has made you about each coefficient.
  • Predictive distribution. For a new input $x_*$, the prediction $y_*$ is itself Gaussian with mean $x_*^\top \mu_n$ and variance $\sigma^2 + x_*^\top \Sigma_n x_*$: one term for irreducible noise, one term for epistemic uncertainty about $\beta$. This is the difference between "I think it's about 3" and "I think it's $3 \pm 0.4$."

Formal Setup

The Bayesian linear regression model in canonical form:

$$\beta \sim \mathcal N(\mu_0, \Sigma_0), \qquad y \mid X, \beta \sim \mathcal N(X\beta, \sigma^2 I_n),$$

with $X \in \mathbb R^{n \times d}$ the design matrix, $y \in \mathbb R^n$ the response vector, $\beta \in \mathbb R^d$ the coefficient vector, $\sigma^2$ the (known) noise variance, and $\mu_0, \Sigma_0$ the prior mean and covariance. We treat $\sigma^2$ as known here; the case where $\sigma^2$ is also unknown uses the Normal-Inverse-Gamma conjugate prior and is the next refinement.

The most common prior choice is $\mu_0 = 0$ and $\Sigma_0 = \tau^2 I_d$ (zero mean, isotropic). The derivation below covers both the isotropic case and the general $\Sigma_0$ case; the algebra is identical.
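For concreteness, here is a minimal numpy sketch of sampling from this generative model with the standard isotropic prior. The sizes, seed, and variance values below are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                       # illustrative sizes
sigma2, tau2 = 0.25, 1.0           # assumed noise and prior variances

X = rng.normal(size=(n, d))                                # design matrix
beta = rng.normal(scale=np.sqrt(tau2), size=d)             # beta ~ N(0, tau^2 I_d)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)   # y | X, beta ~ N(X beta, sigma^2 I_n)
```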

Full Posterior Derivation by Completing the Square

This is the central derivation of the page. Watch the completing-the-square recipe from the multivariate normal page do all the work.

Theorem

Posterior in Bayesian Linear Regression

Statement

The posterior is Gaussian:

$$\beta \mid X, y \;\sim\; \mathcal N(\mu_n, \Sigma_n)$$

with

$$\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X, \qquad \mu_n = \Sigma_n \bigl(\Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y\bigr).$$

With the standard prior $\mu_0 = 0$, $\Sigma_0 = \tau^2 I_d$:

$$\Sigma_n = \bigl(\tau^{-2} I_d + \sigma^{-2} X^\top X\bigr)^{-1}, \quad \mu_n = \sigma^{-2} \Sigma_n X^\top y = \bigl(X^\top X + \tfrac{\sigma^2}{\tau^2} I_d\bigr)^{-1} X^\top y.$$

So the posterior mean equals the ridge regression estimator with regularization parameter $\lambda = \sigma^2 / \tau^2$.

Intuition

The posterior precision is the sum of prior precision and data precision: $\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X$. The posterior mean is a precision-weighted compromise between the prior mean $\mu_0$ and the data's "preferred" $\beta$ direction $\sigma^{-2} X^\top y$. More data (larger $X^\top X$) or a tighter prior (smaller $\Sigma_0$) makes the posterior tighter. The ridge equivalence falls out of this: setting $\mu_0 = 0$ and $\Sigma_0 = \tau^2 I$ makes the posterior mean exactly $(X^\top X + \lambda I)^{-1} X^\top y$ with $\lambda = \sigma^2/\tau^2$, and this is also the MAP under the same prior (since the posterior is Gaussian, mean and mode coincide).

Proof Sketch

The log-posterior up to a $\beta$-independent constant is

$$\log \pi(\beta \mid X, y) = -\tfrac1{2\sigma^2}(y - X\beta)^\top (y - X\beta) - \tfrac12 (\beta - \mu_0)^\top \Sigma_0^{-1}(\beta - \mu_0) + \text{const}.$$

Expand both terms:

$$\underbrace{-\tfrac1{2\sigma^2}\bigl(y^\top y - 2 y^\top X\beta + \beta^\top X^\top X \beta\bigr)}_{\text{likelihood}} \;+\; \underbrace{-\tfrac12\bigl(\mu_0^\top \Sigma_0^{-1}\mu_0 - 2 \mu_0^\top \Sigma_0^{-1} \beta + \beta^\top \Sigma_0^{-1} \beta\bigr)}_{\text{prior}}.$$

Collect the quadratic-in-$\beta$ terms: $-\tfrac1{2\sigma^2}\beta^\top X^\top X \beta - \tfrac12 \beta^\top \Sigma_0^{-1} \beta = -\tfrac12 \beta^\top (\Sigma_0^{-1} + \sigma^{-2} X^\top X) \beta$. Define the posterior precision $P = \Sigma_0^{-1} + \sigma^{-2} X^\top X$, so this is $-\tfrac12 \beta^\top P \beta$.

Collect the linear-in-$\beta$ terms: $\sigma^{-2} y^\top X \beta + \mu_0^\top \Sigma_0^{-1} \beta = (\Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y)^\top \beta$. Define $b = \Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y$, so this is $b^\top \beta$.

So $\log \pi(\beta \mid X, y) = -\tfrac12 \beta^\top P \beta + b^\top \beta + \text{const}$. Apply the completing-the-square recipe (see the multivariate normal page): the posterior is Gaussian with mean $P^{-1} b$ and covariance $P^{-1}$. Substituting back:

$$\Sigma_n = P^{-1} = (\Sigma_0^{-1} + \sigma^{-2} X^\top X)^{-1}, \qquad \mu_n = \Sigma_n b = \Sigma_n(\Sigma_0^{-1}\mu_0 + \sigma^{-2} X^\top y).$$

For the standard prior, $\Sigma_0^{-1} = \tau^{-2} I$ and $\mu_0 = 0$, so $\Sigma_n = (\tau^{-2} I + \sigma^{-2} X^\top X)^{-1}$ and $\mu_n = \sigma^{-2} \Sigma_n X^\top y$. Pulling the factor $\sigma^{-2}$ inside the inverse gives $\mu_n = (X^\top X + \tfrac{\sigma^2}{\tau^2} I)^{-1} X^\top y$, the ridge form.

Why It Matters

This single derivation is the foundation of every Gaussian Bayesian regression model on the site. Applying a feature map $\phi$ to $X$ gives kernel ridge regression as a special case of Bayesian linear regression; the infinite-feature limit gives the Gaussian process posterior. The ridge equivalence shows that L2-regularized least squares is not a frequentist hack but the mean of the posterior under a Gaussian prior, and the ridge solution paired with the posterior covariance gives a credible interval that ridge alone does not provide.

Failure Mode

The derivation assumes the prior covariance $\Sigma_0$ is positive definite (so $\Sigma_0^{-1}$ exists). For improper priors ($\Sigma_0 \to \infty \cdot I$, i.e. a flat prior), the limit gives $\Sigma_n \to (\sigma^{-2} X^\top X)^{-1}$ and $\mu_n \to (X^\top X)^{-1} X^\top y$, recovering the OLS solution provided $X^\top X$ is invertible. If $X^\top X$ is singular (e.g. when $d > n$), the OLS limit does not exist, and you must use a proper prior. This is why Bayesian linear regression remains well-defined in the high-dimensional regime $d > n$ while OLS does not.
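A compact numpy sketch of the general posterior update above. The function and variable names are mine; it mirrors the formulas directly rather than optimizing for numerical stability.

```python
import numpy as np

def blr_posterior(X, y, sigma2, mu0, Sigma0):
    """Return (mu_n, Sigma_n) of beta | X, y for a Gaussian prior and known noise variance."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    P = Sigma0_inv + (X.T @ X) / sigma2                        # posterior precision
    Sigma_n = np.linalg.inv(P)                                 # posterior covariance
    mu_n = Sigma_n @ (Sigma0_inv @ mu0 + X.T @ y / sigma2)     # posterior mean
    return mu_n, Sigma_n

# With mu0 = 0 and Sigma0 = tau2 * I, mu_n coincides with ridge at lambda = sigma2 / tau2:
#   np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(X.shape[1]), X.T @ y)
```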

Predictive Distribution

A point estimate is a starting point; a predictive distribution is what you want for downstream decisions. Given a new input $x_*$, the prediction $y_* = x_*^\top \beta + \varepsilon_*$ with $\varepsilon_* \sim \mathcal N(0, \sigma^2)$ has a closed-form distribution after marginalizing out $\beta$.

Theorem

Predictive Distribution in Bayesian Linear Regression

Statement

The predictive distribution of $y_*$ at a new input $x_*$ is Gaussian:

$$y_* \mid x_*, X, y \;\sim\; \mathcal N\!\left( x_*^\top \mu_n, \;\; \sigma^2 + x_*^\top \Sigma_n x_* \right).$$

The variance decomposes into two parts: $\sigma^2$ (the irreducible noise) and $x_*^\top \Sigma_n x_*$ (the epistemic uncertainty about $\beta$ projected onto $x_*$).

Intuition

Even if you knew $\beta$ exactly, you would still face noise variance $\sigma^2$ on any new observation. The extra term $x_*^\top \Sigma_n x_*$ captures the fact that you don't know $\beta$ exactly, and the uncertainty depends on the direction of $x_*$: predictions in directions where $\Sigma_n$ is large (poorly-constrained directions) carry more uncertainty than predictions in well-constrained directions. The MAP / point estimate $x_*^\top \mu_n$ alone hides this; it gives the same number regardless of how confident you are.

Proof Sketch

$y_* = x_*^\top \beta + \varepsilon_*$ is the sum of two independent Gaussian terms (given the data): $x_*^\top \beta$ with mean $x_*^\top \mu_n$ and variance $x_*^\top \Sigma_n x_*$, and $\varepsilon_* \sim \mathcal N(0, \sigma^2)$. The sum of independent Gaussians is Gaussian with summed means and summed variances.

Why It Matters

This is the predictive interval most people want when they read the words "regression prediction." Point predictions are calibrated only on average; predictive intervals are calibrated pointwise. The decomposition into aleatoric ($\sigma^2$) and epistemic ($x_*^\top \Sigma_n x_*$) uncertainty is what makes Bayesian regression useful for active learning (pick $x_*$ that maximally reduces $\Sigma_n$), Bayesian optimization (balance mean and uncertainty), and out-of-distribution detection (large $x_*^\top \Sigma_n x_*$ flags inputs far from training data).

Failure Mode

The decomposition assumes the noise variance $\sigma^2$ is known and correct. If $\sigma^2$ is misspecified, the predictive interval is miscalibrated. The Normal-Inverse-Gamma conjugate prior treats $\sigma^2$ as unknown and gives a Student's $t$ predictive distribution that correctly inflates intervals for small $n$, at the cost of slightly more complex algebra.
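A small sketch of the predictive computation, reusing a posterior $(\mu_n, \Sigma_n)$ such as the one from the snippet above (names are illustrative):

```python
def blr_predict(x_star, mu_n, Sigma_n, sigma2):
    """Predictive mean and variance of y_* at a new input x_star."""
    mean = x_star @ mu_n
    epistemic = x_star @ Sigma_n @ x_star    # uncertainty about beta, projected onto x_star
    return mean, sigma2 + epistemic          # aleatoric noise floor + epistemic part
```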

Marginal Likelihood (Evidence)

The marginal likelihood $p(y \mid X)$ integrates out $\beta$ from the joint $p(y, \beta \mid X) = p(y \mid X, \beta)\, \pi(\beta)$. It serves as the normalizer in Bayes' rule and as a model-comparison score: the "evidence" the data provides in favor of the chosen prior.

Theorem

Marginal Likelihood for Bayesian Linear Regression

Statement

The log marginal likelihood with the isotropic prior $\beta \sim \mathcal N(0, \tau^2 I)$ is

$$\log p(y \mid X, \sigma^2, \tau^2) = -\tfrac12 y^\top (\sigma^2 I_n + \tau^2 X X^\top)^{-1} y \;-\; \tfrac12 \log \bigl|\sigma^2 I_n + \tau^2 X X^\top\bigr| \;-\; \tfrac n2 \log(2\pi).$$

Maximizing this jointly over $\sigma^2$ and $\tau^2$ gives the empirical Bayes (or "evidence approximation") choice of hyperparameters, an alternative to cross-validation that is fully Bayesian whenever the Gaussian model is well-specified.

Intuition

The marginal likelihood is the probability the model assigned to the actual data, averaged over the prior. It penalizes models that are too flexible (the prior spreads its predicted-$y$ mass thin over many configurations, so any specific $y$ has low marginal probability) and models that are too rigid (the prior puts mass on configurations that disagree with $y$). The maximum-evidence hyperparameter sits at the "Occam's razor" sweet spot.

Proof Sketch

Marginalize $\beta$ out of $y = X\beta + \varepsilon$ with $\beta \sim \mathcal N(0, \tau^2 I)$ and $\varepsilon \sim \mathcal N(0, \sigma^2 I)$ independent. The marginal of $y$ is the affine image of a Gaussian, so $y \sim \mathcal N(0, \tau^2 X X^\top + \sigma^2 I)$. The density is the standard $n$-dimensional Gaussian density with covariance $\sigma^2 I + \tau^2 X X^\top$, giving the stated log-density.

Why It Matters

The marginal likelihood is what lets you tune the prior strength automatically: maximize $\log p(y \mid X, \sigma^2, \tau^2)$ jointly over $\sigma^2$ and $\tau^2$ and you get an empirical-Bayes estimate of both. This is Type-II maximum likelihood; in the Gaussian-process literature it is called "evidence maximization" and is the default hyperparameter selection method when cross-validation is too expensive. The same machinery generalizes: replacing $X X^\top$ with a kernel matrix $K$ gives the GP marginal likelihood.

Failure Mode

The marginal likelihood is not a substitute for cross-validation when the Gaussian model is misspecified. Misspecified noise distributions, non-linear true response surfaces, or non-Gaussian outliers can produce a marginal likelihood that picks hyperparameters which overfit the noise model rather than the data. The matrix $\sigma^2 I + \tau^2 X X^\top$ is $n \times n$; for large $n$, the determinant and inverse cost $O(n^3)$, the same scaling problem GPs face. For $d \ll n$, it is cheaper to compute via the Woodbury identity in $d$-dimensional space.
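A sketch of the log marginal likelihood, written directly from the formula above (for $d \ll n$ the Woodbury route just mentioned is cheaper; this naive version is $O(n^3)$):

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma2, tau2):
    """log p(y | X, sigma^2, tau^2) under the isotropic prior beta ~ N(0, tau^2 I)."""
    n = len(y)
    K = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(K)          # stable log-determinant
    quad = y @ np.linalg.solve(K, y)          # y^T K^{-1} y without forming K^{-1}
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi)
```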

Worked Numeric Example

Three data points, one feature, a tractable prior; enough to follow every step in numbers.

Setup. $n = 3$, $d = 2$ (slope and intercept). Data:

$$X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 4 \end{pmatrix}.$$

The first column is the intercept; the second is the predictor. Prior: $\beta = (\beta_0, \beta_1)^\top \sim \mathcal N(0, I_2)$ (so $\Sigma_0 = I$, $\mu_0 = 0$). Noise variance $\sigma^2 = 1$.

Compute $X^\top X$ and $X^\top y$:

$$X^\top X = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}, \qquad X^\top y = \begin{pmatrix} 7 \\ 10 \end{pmatrix}.$$

Posterior precision:

$$\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X = I + \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} = \begin{pmatrix} 4 & 3 \\ 3 & 6 \end{pmatrix}.$$

Posterior covariance (invert the $2 \times 2$ matrix; determinant $= 4 \cdot 6 - 3 \cdot 3 = 15$):

$$\Sigma_n = \tfrac{1}{15} \begin{pmatrix} 6 & -3 \\ -3 & 4 \end{pmatrix} = \begin{pmatrix} 0.4 & -0.2 \\ -0.2 & 0.267 \end{pmatrix}.$$

Posterior mean (using $b = \Sigma_0^{-1}\mu_0 + \sigma^{-2}X^\top y = X^\top y = (7, 10)^\top$):

$$\mu_n = \Sigma_n \begin{pmatrix} 7 \\ 10 \end{pmatrix} = \begin{pmatrix} 0.4 \cdot 7 + (-0.2) \cdot 10 \\ (-0.2) \cdot 7 + 0.267 \cdot 10 \end{pmatrix} = \begin{pmatrix} 0.800 \\ 1.267 \end{pmatrix}.$$

(Direct check in exact fractions: $\mu_n = \Sigma_n X^\top y = \frac1{15}\begin{pmatrix} 6 & -3 \\ -3 & 4 \end{pmatrix}\begin{pmatrix} 7 \\ 10 \end{pmatrix} = \frac1{15}\begin{pmatrix} 6 \cdot 7 - 3 \cdot 10 \\ -3 \cdot 7 + 4 \cdot 10 \end{pmatrix} = \frac1{15}\begin{pmatrix} 12 \\ 19 \end{pmatrix} = (0.800, 1.267)^\top$.)

Compare to OLS. OLS solves $(X^\top X)\beta = X^\top y$, i.e. $\begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} \beta = \begin{pmatrix} 7 \\ 10 \end{pmatrix}$. The determinant is $6$, so $\hat\beta_{\mathrm{OLS}} = \frac16\begin{pmatrix} 5 \cdot 7 - 3 \cdot 10 \\ -3 \cdot 7 + 3 \cdot 10 \end{pmatrix} = \frac16 \begin{pmatrix} 5 \\ 9 \end{pmatrix} = (0.833, 1.500)^\top$.

So the posterior pulls each coefficient toward zero: the intercept from 0.833 to 0.800, the slope from 1.500 to 1.267. The shrinkage is the ridge effect with $\lambda = \sigma^2/\tau^2 = 1$. The off-diagonal entry $-0.2$ in $\Sigma_n$ says the intercept and slope are negatively correlated in the posterior (a smaller intercept goes with a larger slope and vice versa), which makes sense for this dataset, where the three points form a fairly tight line.
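The hand calculation can be checked in a few lines of numpy (a sketch; the array values are copied from the setup above):

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.]])
y = np.array([1., 2., 4.])
sigma2 = tau2 = 1.0

P = np.eye(2) / tau2 + (X.T @ X) / sigma2        # [[4, 3], [3, 6]]
Sigma_n = np.linalg.inv(P)                       # [[0.4, -0.2], [-0.2, 0.2667]]
mu_n = Sigma_n @ (X.T @ y) / sigma2              # [0.8, 1.2667]
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)     # [0.8333, 1.5]
```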

Predictive distribution at $x_* = (1, 1.5)^\top$ (predicting within the training $x$ range):

  • Mean: $x_*^\top \mu_n = 0.800 + 1.5 \cdot 1.267 = 0.800 + 1.900 = 2.700$.
  • Variance: $\sigma^2 + x_*^\top \Sigma_n x_* = 1 + (1, 1.5) \begin{pmatrix} 0.4 & -0.2 \\ -0.2 & 0.267 \end{pmatrix} \begin{pmatrix} 1 \\ 1.5 \end{pmatrix}$.

Compute the quadratic form: $\Sigma_n x_* = (0.4 \cdot 1 + (-0.2) \cdot 1.5,\ (-0.2) \cdot 1 + 0.267 \cdot 1.5)^\top = (0.100, 0.200)^\top$, and $x_*^\top \Sigma_n x_* = 0.100 + 1.5 \cdot 0.200 = 0.400$. So the predictive variance is $1.000 + 0.400 = 1.400$, standard deviation $\approx 1.183$.

So $y_* \mid x_* \sim \mathcal N(2.700, 1.400)$, and a 95% predictive interval is roughly $[0.38, 5.02]$: wide, because we have only three data points. With more data, $\Sigma_n$ shrinks and the epistemic component shrinks too; the irreducible $\sigma^2 = 1$ floor remains.

Predictive distribution at $x_* = (1, 10)^\top$ (extrapolating well past the training range):

  • Mean: $0.800 + 10 \cdot 1.267 = 13.47$.
  • $\Sigma_n x_* = (0.4 - 2.0,\ -0.2 + 2.67)^\top = (-1.6, 2.47)^\top$, and $x_*^\top \Sigma_n x_* = -1.6 + 10 \cdot 2.47 = 23.1$.
  • Variance: $1 + 23.1 = 24.1$, standard deviation $\approx 4.91$.

The epistemic uncertainty at $x_* = 10$ is roughly $23\times$ the irreducible noise; this is the Bayesian penalty for extrapolating. A frequentist point estimate would happily report 13.47 with no signal of how little the model knows about this region. Bayesian regression makes the uncertainty explicit: the 95% predictive interval is $\approx [3.9, 23.1]$. This is what makes Bayesian methods valuable for safety-critical or active-learning applications.
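Continuing the snippet above, both predictive calculations in code. Exact arithmetic gives a variance of 24.07 rather than 24.1 for the extrapolation point; the small difference is rounding of the $\Sigma_n$ entries in the hand calculation.

```python
for x_star in (np.array([1.0, 1.5]), np.array([1.0, 10.0])):
    mean = x_star @ mu_n
    var = sigma2 + x_star @ Sigma_n @ x_star
    half = 1.96 * np.sqrt(var)
    print(f"x* = {x_star[1]:4.1f}: mean {mean:6.3f}, var {var:6.3f}, "
          f"95% interval [{mean - half:.2f}, {mean + half:.2f}]")
# x* =  1.5: mean  2.700, var  1.400, 95% interval [0.38, 5.02]
# x* = 10.0: mean 13.467, var 24.067, 95% interval [3.85, 23.08]
```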

Marginal likelihood. Compute $\sigma^2 I + \tau^2 X X^\top$ with $\sigma^2 = \tau^2 = 1$:

$$X X^\top = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 3 & 5 \end{pmatrix}, \qquad I + X X^\top = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 3 & 3 \\ 1 & 3 & 6 \end{pmatrix}.$$

Determinant: expand along the first row. $\det = 2(3 \cdot 6 - 3 \cdot 3) - 1(1 \cdot 6 - 3 \cdot 1) + 1(1 \cdot 3 - 3 \cdot 1) = 2 \cdot 9 - 3 + 0 = 15$. So $\log|\sigma^2 I + \tau^2 X X^\top| = \log 15 \approx 2.708$.

$y^\top (I + X X^\top)^{-1} y$: solve the linear system $(I + X X^\top) v = y = (1, 2, 4)^\top$. Subtracting the second equation from the third gives $3 v_3 = 2$, so $v_3 = 2/3$; back-substituting gives $v_2 = -1/15$ and $v_1 = 1/5$, i.e. $v = \tfrac1{15}(3, -1, 10)^\top$ (check: $2 \cdot \tfrac{3}{15} - \tfrac{1}{15} + \tfrac{10}{15} = 1$, as required). Then $y^\top v = \tfrac{3}{15} - \tfrac{2}{15} + \tfrac{40}{15} = \tfrac{41}{15} \approx 2.733$.

Putting it together: $\log p(y \mid X, \sigma^2 = 1, \tau^2 = 1) = -\tfrac12 \cdot \tfrac{41}{15} - \tfrac12 \log 15 - \tfrac32 \log(2\pi) \approx -1.367 - 1.354 - 2.757 = -5.478$.

To pick a better $\tau^2$ empirically, repeat this calculation for several values and maximize (a minimal sweep is sketched below). This is Type-II maximum likelihood on the prior strength.
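A self-contained sketch of that sweep on the three-point dataset; the grid of candidate values is arbitrary.

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.]])
y = np.array([1., 2., 4.])
for tau2 in (0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0):
    K = np.eye(3) + tau2 * (X @ X.T)                  # sigma^2 = 1 held fixed
    _, logdet = np.linalg.slogdet(K)
    lml = -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet - 1.5 * np.log(2 * np.pi)
    print(f"tau^2 = {tau2:5.2f}: log evidence = {lml:7.3f}")
# tau^2 = 1.00 reproduces the -5.478 above; the empirical Bayes choice is the maximizer.
```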

Connections

  • Ridge regression (as MAP). $\mu_n = (X^\top X + \tfrac{\sigma^2}{\tau^2} I)^{-1} X^\top y$ is exactly the ridge estimator with $\lambda = \sigma^2/\tau^2$. The MAP under a Gaussian prior equals the posterior mean here because the posterior is symmetric (Gaussian), so MAP estimation gives the same point as the posterior mean.
  • OLS (flat-prior limit). $\tau^2 \to \infty$ sends $\Sigma_0^{-1} \to 0$ and $\Sigma_n^{-1} \to \sigma^{-2} X^\top X$, recovering OLS if $X^\top X$ is invertible. The Bayesian framework gracefully handles the $d > n$ case where OLS does not exist.
  • Kernel ridge regression. Replace $X$ with $\Phi$, the feature-mapped design matrix, and you get kernel ridge: same algebra, different design matrix. With a positive definite kernel and an isotropic prior in feature space, you arrive at kernel ridge regression.
  • Gaussian processes. Take the infinite-feature limit (or specify a kernel directly, bypassing the explicit feature map) and you get a Gaussian process posterior. The GP predictive variance has the same epistemic-plus-aleatoric decomposition; the weight-space/function-space equivalence is checked numerically after this list.
  • Empirical Bayes. Maximize the marginal likelihood over $(\sigma^2, \tau^2)$ to pick hyperparameters. This is "Type-II MLE" or the "evidence approximation"; it generalizes to GPs as kernel hyperparameter selection by marginal-likelihood maximization.
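A quick numerical check of that weight-space/function-space equivalence: the posterior-mean prediction of this page equals the kernel (GP) prediction with the linear kernel $k(x, x') = \tau^2 x^\top x'$. The data and hyperparameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
sigma2, tau2 = 0.5, 2.0
x_star = rng.normal(size=d)

# Weight space (this page): predictive mean x_*^T mu_n.
Sigma_n = np.linalg.inv(np.eye(d) / tau2 + X.T @ X / sigma2)
weight_space = x_star @ Sigma_n @ (X.T @ y) / sigma2

# Function space (GP view): linear kernel k(x, x') = tau^2 x^T x'.
K = tau2 * X @ X.T
k_star = tau2 * X @ x_star
function_space = k_star @ np.linalg.solve(K + sigma2 * np.eye(n), y)

print(np.isclose(weight_space, function_space))   # True: same predictor, two parametrizations
```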

Common Confusions

Watch Out

The posterior mean equals the ridge estimator, but the Bayesian framework is not 'just ridge'

Ridge regression gives a point. Bayesian linear regression gives a distribution: a posterior over $\beta$ with covariance $\Sigma_n$, and a predictive distribution over $y_*$ with explicit aleatoric and epistemic variance. The point estimate is the same; the uncertainty quantification is what makes the Bayesian view useful for downstream tasks (active learning, BO, OOD detection). Saying "BLR is ridge with a clever interpretation" misses that ridge alone cannot produce a credible interval without auxiliary frequentist machinery (bootstrap, sandwich estimators).

Watch Out

The predictive variance has two parts, and they don't go to zero together

For a new $x_*$, the predictive variance is $\sigma^2 + x_*^\top \Sigma_n x_*$. As $n \to \infty$ and the prior gets dominated, $\Sigma_n \to 0$; the epistemic part vanishes. But $\sigma^2$ is the irreducible noise floor: no amount of data lets you predict a new $y_*$ better than $\sigma^2$. Confusing these is the root of the "I have a million data points, why is my prediction interval so wide?" mistake: the model has nailed $\beta$, but $y$ itself is intrinsically noisy.
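A sketch of this point on synthetic data (the true coefficients, input distribution, and seed are invented for illustration): the total predictive variance at a fixed $x_*$ falls toward $\sigma^2 = 1$ as $n$ grows, but never below it.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, tau2 = 1.0, 1.0
x_star = np.array([1.0, 1.5])
for n in (3, 30, 300, 3000):
    X = np.column_stack([np.ones(n), rng.uniform(0, 2, size=n)])
    y = X @ np.array([1.0, 1.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)
    Sigma_n = np.linalg.inv(np.eye(2) / tau2 + X.T @ X / sigma2)
    total = sigma2 + x_star @ Sigma_n @ x_star
    print(f"n = {n:4d}: predictive variance {total:.4f}")   # approaches 1.0 from above
```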

Watch Out

A flat prior is not always safe

The flat-prior limit $\tau^2 \to \infty$ recovers OLS if $X^\top X$ is invertible. In high dimensions ($d > n$), $X^\top X$ is singular, the limit blows up, and the "least-committal" prior is in fact pathological; the posterior is improper and the marginal likelihood is zero. A proper prior is mathematically required, not just a stylistic choice, in the $d > n$ regime. Conversely, in the $d \ll n$ regime, the prior has negligible effect for moderate $\tau^2$ and the posterior is essentially the OLS sampling distribution.
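A tiny demonstration of the $d > n$ point (dimensions and data are arbitrary): the OLS normal equations are singular, but the posterior precision stays invertible because of the prior term.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20                                   # more coefficients than observations
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
sigma2, tau2 = 1.0, 1.0

print(np.linalg.matrix_rank(X.T @ X))          # 5 < 20: X^T X is singular, OLS ill-posed
P = np.eye(d) / tau2 + (X.T @ X) / sigma2      # posterior precision: positive definite
mu_n = np.linalg.solve(P, X.T @ y / sigma2)    # well-defined posterior mean (ridge)
print(mu_n.shape)                              # (20,)
```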

Watch Out

Marginal likelihood is not the same as predictive accuracy

Marginal likelihood maximization picks hyperparameters under which the observed data was most probable, averaged over the prior. Cross-validation picks hyperparameters that predict held-out data well. These usually agree, but not always, particularly when the model is misspecified. For mission-critical predictive performance under misspecification, use cross-validation. For Bayesian model comparison, or when the model is well-specified, marginal likelihood is the right choice.

Summary

  • The Bayesian linear regression posterior is Gaussian: $\beta \mid y \sim \mathcal N(\mu_n, \Sigma_n)$, with $\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X$ and $\mu_n = \Sigma_n(\Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y)$.
  • Derived in one step by completing the square in the exponent of the log-posterior.
  • The posterior mean $\mu_n$ equals the ridge estimator with $\lambda = \sigma^2/\tau^2$ for the isotropic prior.
  • Predictive distribution: $y_* \mid x_* \sim \mathcal N(x_*^\top \mu_n, \sigma^2 + x_*^\top \Sigma_n x_*)$. The variance decomposes into irreducible ($\sigma^2$) plus epistemic ($x_*^\top \Sigma_n x_*$).
  • Marginal likelihood $p(y \mid X) = \mathcal N(y; 0, \sigma^2 I + \tau^2 X X^\top)$; maximizing over $(\sigma^2, \tau^2)$ is empirical Bayes.
  • Generalizes cleanly: feature maps give kernel ridge regression, kernels in their own right give Gaussian processes, Normal-Inverse-Gamma priors handle unknown $\sigma^2$.

Exercises

ExerciseCore

Problem

Verify by direct expansion that the posterior precision $\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X$ falls out of the log-posterior. Start from the model $\beta \sim \mathcal N(\mu_0, \Sigma_0)$ and $y \mid X, \beta \sim \mathcal N(X\beta, \sigma^2 I)$, write the log-posterior up to a $\beta$-independent constant, and collect the $\beta^\top (\cdot) \beta$ coefficient.

ExerciseCore

Problem

Using the example dataset above ($n = 3$, $X$, $y$, $\sigma^2 = \tau^2 = 1$), compute the predictive mean and variance at $x_* = (1, 0)^\top$. Compare to the value of $y_1$ in the training set.

ExerciseAdvanced

Problem

Show that the posterior covariance satisfies $\Sigma_n \preceq \Sigma_0$ in the Loewner order (i.e. $\Sigma_0 - \Sigma_n$ is positive semi-definite). Interpret: observing data never makes the posterior less informative than the prior.

ExerciseResearch

Problem

Suppose the noise variance $\sigma^2$ is unknown and you place a Normal-Inverse-Gamma prior $(\beta, \sigma^2) \sim \mathrm{NIG}(\mu_0, \Lambda_0, \alpha_0, \beta_0)$. Derive the marginal posterior of $\beta$ after integrating out $\sigma^2$. Verify that it is a multivariate Student's $t$ distribution and identify its degrees of freedom.

References

Canonical:

  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §3.3 (the gold standard for the Bayesian linear regression derivation, predictive distribution, and evidence approximation).
  • Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 14 (regression models).
  • Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. Ch. 2 (Bayesian linear regression as the finite-dimensional case of a Gaussian process; the same algebra in different notation).
  • Lindley, D.V., & Smith, A.F.M. (1972). "Bayes Estimates for the Linear Model." J. Royal Statistical Society B, 34(1):1–41. (Original treatment of Bayesian regression and hierarchical models.)

Current:

  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 11 (linear regression: Bayesian and frequentist views unified).
  • Murphy, K.P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Ch. 15 (Bayesian linear models in the high-dimensional and hierarchical settings).
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. §3.4 (ridge regression with the Bayesian connection as a sidebar).

Next Topics

  • Gaussian processes for ML: the kernelized, infinite-feature version of this page. The posterior derivation generalizes by replacing $X X^\top$ with the kernel Gram matrix.
  • Empirical Bayes vs hierarchical Bayes: how to handle the hyperparameters $\sigma^2, \tau^2$ in a fully Bayesian way (place a prior on them) or in an empirical-Bayes way (maximize the marginal likelihood).
  • Kernel trick: the same algebra in a feature space defined implicitly by a kernel.

Last reviewed: May 10, 2026
