
Statistical Estimation

Bayesian Linear Regression

Gaussian prior, Gaussian likelihood, Gaussian posterior. Full posterior derivation by completing the square in the exponent: the posterior mean equals the ridge estimator, the predictive distribution has irreducible plus epistemic variance, and the marginal likelihood gives a closed-form hyperparameter selection criterion. Worked numeric example with three data points carries the algebra end to end.

Advanced · Tier 1 · Stable · Core spine · ~110 min

Why This Matters

Bayesian linear regression is the cleanest place to see every piece of Bayesian inference fall into place at once. The prior, likelihood, and posterior are all Gaussian, so the algebra reduces to one move (completing the square in the exponent) that you have already seen in conjugate priors and the multivariate normal page. The posterior mean turns out to be the ridge estimator. The posterior covariance gives you uncertainty about $\beta$. The predictive distribution gives you uncertainty about future $y$. The marginal likelihood gives you a closed-form criterion for choosing the prior strength.

And it generalizes: replace the design matrix $X$ with a kernel-implied feature map $\phi(X)$ and Bayesian linear regression becomes a Gaussian process. Almost every Bayesian regression page on the site rests on this one derivation.

Mental Model

You have a linear model $y = X\beta + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal N(0, \sigma^2 I)$. Frequentist OLS gives you a point estimate $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y$: one vector of weights, no uncertainty.

Bayesian linear regression treats $\beta$ as a random variable with a prior $\beta \sim \mathcal N(\mu_0, \Sigma_0)$ and updates the prior to a posterior using the data:

$$\pi(\beta \mid X, y) \propto p(y \mid X, \beta)\, \pi(\beta).$$

The product of two Gaussian factors is a Gaussian, so the posterior is $\mathcal N(\mu_n, \Sigma_n)$ for some $\mu_n$ and $\Sigma_n$. The job is to derive those two quantities, and that's where the completing-the-square move shows up.

Two payoffs:

  • Uncertainty. $\mu_n$ is your best point estimate; $\Sigma_n$ tells you how confident the data has made you about each coefficient.
  • Predictive distribution. For a new input $x_*$, the prediction $y_*$ is itself Gaussian with mean $x_*^\top \mu_n$ and variance $\sigma^2 + x_*^\top \Sigma_n x_*$: one term for irreducible noise, one term for epistemic uncertainty about $\beta$. This is the difference between "I think it's about 3" and "I think it's $3 \pm 0.4$."

Formal Setup

The Bayesian linear regression model in canonical form:

$$\beta \sim \mathcal N(\mu_0, \Sigma_0), \qquad y \mid X, \beta \sim \mathcal N(X\beta, \sigma^2 I_n),$$

with $X \in \mathbb R^{n \times d}$ the design matrix, $y \in \mathbb R^n$ the response vector, $\beta \in \mathbb R^d$ the coefficient vector, $\sigma^2$ the (known) noise variance, and $\mu_0, \Sigma_0$ the prior mean and covariance. We treat $\sigma^2$ as known here; the case where $\sigma^2$ is also unknown uses the Normal-Inverse-Gamma conjugate prior and is the next refinement.

The most common prior choice is $\mu_0 = 0$ and $\Sigma_0 = \tau^2 I_d$ (zero mean, isotropic). The derivation below covers both the isotropic case and the general $\Sigma_0$ case; the algebra is identical.
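For concreteness, here is a minimal numpy sketch of sampling from this generative model with the standard isotropic prior. The sizes, seed, and variance values below are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                       # illustrative sizes
sigma2, tau2 = 0.25, 1.0           # assumed noise and prior variances

X = rng.normal(size=(n, d))                                # design matrix
beta = rng.normal(scale=np.sqrt(tau2), size=d)             # beta ~ N(0, tau^2 I_d)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)   # y | X, beta ~ N(X beta, sigma^2 I_n)
```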

Full Posterior Derivation by Completing the Square

This is the central derivation of the page. Watch the completing-the-square recipe from the multivariate normal page do all the work.

Theorem

Posterior in Bayesian Linear Regression

Statement

The posterior is Gaussian:

$$\beta \mid X, y \;\sim\; \mathcal N(\mu_n, \Sigma_n)$$

with

$$\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X, \qquad \mu_n = \Sigma_n \bigl(\Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y\bigr).$$

With the standard prior $\mu_0 = 0$, $\Sigma_0 = \tau^2 I_d$:

$$\Sigma_n = \bigl(\tau^{-2} I_d + \sigma^{-2} X^\top X\bigr)^{-1}, \quad \mu_n = \sigma^{-2} \Sigma_n X^\top y = \bigl(X^\top X + \tfrac{\sigma^2}{\tau^2} I_d\bigr)^{-1} X^\top y.$$

So the posterior mean equals the ridge regression estimator with regularization parameter $\lambda = \sigma^2 / \tau^2$.

Intuition

The posterior precision is the sum of prior precision and data precision: $\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X$. The posterior mean is a precision-weighted compromise between the prior mean $\mu_0$ and the data's "preferred" $\beta$ direction $\sigma^{-2} X^\top y$. More data (larger $X^\top X$) or a tighter prior (smaller $\Sigma_0$) makes the posterior tighter. The ridge equivalence falls out of this: setting $\mu_0 = 0$ and $\Sigma_0 = \tau^2 I$ makes the posterior mean exactly $(X^\top X + \lambda I)^{-1} X^\top y$ with $\lambda = \sigma^2/\tau^2$, and this is also the MAP under the same prior (since the posterior is Gaussian, mean and mode coincide).

Proof Sketch

The log-posterior up to a $\beta$-independent constant is

$$\log \pi(\beta \mid X, y) = -\tfrac1{2\sigma^2}(y - X\beta)^\top (y - X\beta) - \tfrac12 (\beta - \mu_0)^\top \Sigma_0^{-1}(\beta - \mu_0) + \text{const}.$$

Expand both terms:

$$\underbrace{-\tfrac1{2\sigma^2}\bigl(y^\top y - 2 y^\top X\beta + \beta^\top X^\top X \beta\bigr)}_{\text{likelihood}} \;+\; \underbrace{-\tfrac12\bigl(\mu_0^\top \Sigma_0^{-1}\mu_0 - 2 \mu_0^\top \Sigma_0^{-1} \beta + \beta^\top \Sigma_0^{-1} \beta\bigr)}_{\text{prior}}.$$

Collect the quadratic-in-$\beta$ terms: $-\tfrac1{2\sigma^2}\beta^\top X^\top X \beta - \tfrac12 \beta^\top \Sigma_0^{-1} \beta = -\tfrac12 \beta^\top (\Sigma_0^{-1} + \sigma^{-2} X^\top X) \beta$. Define the posterior precision $P = \Sigma_0^{-1} + \sigma^{-2} X^\top X$, so this is $-\tfrac12 \beta^\top P \beta$.

Collect the linear-in-$\beta$ terms: $\sigma^{-2} y^\top X \beta + \mu_0^\top \Sigma_0^{-1} \beta = (\Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y)^\top \beta$. Define $b = \Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y$, so this is $b^\top \beta$.

So $\log \pi(\beta \mid X, y) = -\tfrac12 \beta^\top P \beta + b^\top \beta + \text{const}$. Apply the completing-the-square recipe (see the multivariate normal page): the posterior is Gaussian with mean $P^{-1} b$ and covariance $P^{-1}$. Substituting back:

$$\Sigma_n = P^{-1} = (\Sigma_0^{-1} + \sigma^{-2} X^\top X)^{-1}, \qquad \mu_n = \Sigma_n b = \Sigma_n(\Sigma_0^{-1}\mu_0 + \sigma^{-2} X^\top y).$$

For the standard prior, $\Sigma_0^{-1} = \tau^{-2} I$ and $\mu_0 = 0$, so $\Sigma_n = (\tau^{-2} I + \sigma^{-2} X^\top X)^{-1}$ and $\mu_n = \sigma^{-2} \Sigma_n X^\top y$. Pulling the factor $\sigma^{-2}$ inside the inverse gives $\mu_n = (X^\top X + \tfrac{\sigma^2}{\tau^2} I)^{-1} X^\top y$, the ridge form.

Why It Matters

This single derivation is the foundation of every Gaussian Bayesian regression model on the site. Applying a feature map $\phi$ to $X$ gives kernel ridge regression as a special case of Bayesian linear regression; the infinite-feature limit gives the Gaussian process posterior. The ridge equivalence shows that L2-regularized least squares is not a frequentist hack but the mean of the posterior under a Gaussian prior, and the ridge solution paired with the posterior covariance gives a credible interval that ridge alone does not provide.

Failure Mode

The derivation assumes the prior covariance $\Sigma_0$ is positive definite (so $\Sigma_0^{-1}$ exists). For improper priors ($\Sigma_0 \to \infty \cdot I$, i.e. a flat prior), the limit gives $\Sigma_n \to (\sigma^{-2} X^\top X)^{-1}$ and $\mu_n \to (X^\top X)^{-1} X^\top y$, recovering the OLS solution provided $X^\top X$ is invertible. If $X^\top X$ is singular (e.g. when $d > n$), the OLS limit does not exist, and you must use a proper prior. This is why Bayesian linear regression remains well-defined in the high-dimensional regime $d > n$ while OLS does not.
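A compact numpy sketch of the general posterior update above. The function and variable names are mine; it mirrors the formulas directly rather than optimizing for numerical stability.

```python
import numpy as np

def blr_posterior(X, y, sigma2, mu0, Sigma0):
    """Return (mu_n, Sigma_n) of beta | X, y for a Gaussian prior and known noise variance."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    P = Sigma0_inv + (X.T @ X) / sigma2                        # posterior precision
    Sigma_n = np.linalg.inv(P)                                 # posterior covariance
    mu_n = Sigma_n @ (Sigma0_inv @ mu0 + X.T @ y / sigma2)     # posterior mean
    return mu_n, Sigma_n

# With mu0 = 0 and Sigma0 = tau2 * I, mu_n coincides with ridge at lambda = sigma2 / tau2:
#   np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(X.shape[1]), X.T @ y)
```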

Predictive Distribution

A point estimate is a starting point; a predictive distribution is what you want for downstream decisions. Given a new input $x_*$, the prediction $y_* = x_*^\top \beta + \varepsilon_*$ with $\varepsilon_* \sim \mathcal N(0, \sigma^2)$ has a closed-form distribution after marginalizing out $\beta$.

Theorem

Predictive Distribution in Bayesian Linear Regression

Statement

The predictive distribution of $y_*$ at a new input $x_*$ is Gaussian:

$$y_* \mid x_*, X, y \;\sim\; \mathcal N\!\left( x_*^\top \mu_n, \;\; \sigma^2 + x_*^\top \Sigma_n x_* \right).$$

The variance decomposes into two parts: $\sigma^2$ (the irreducible noise) and $x_*^\top \Sigma_n x_*$ (the epistemic uncertainty about $\beta$ projected onto $x_*$).

Intuition

Even if you knew $\beta$ exactly, you would still face noise variance $\sigma^2$ on any new observation. The extra term $x_*^\top \Sigma_n x_*$ captures the fact that you don't know $\beta$ exactly, and the uncertainty depends on the direction of $x_*$: predictions in directions where $\Sigma_n$ is large (poorly-constrained directions) carry more uncertainty than predictions in well-constrained directions. The MAP / point estimate $x_*^\top \mu_n$ alone hides this; it gives the same number regardless of how confident you are.

Proof Sketch

$y_* = x_*^\top \beta + \varepsilon_*$ is the sum of two independent Gaussian terms (given the data): $x_*^\top \beta$ with mean $x_*^\top \mu_n$ and variance $x_*^\top \Sigma_n x_*$, and $\varepsilon_* \sim \mathcal N(0, \sigma^2)$. The sum of independent Gaussians is Gaussian with summed means and summed variances.

Why It Matters

This is the predictive interval most people want when they read the words "regression prediction." Point predictions are calibrated only on average; predictive intervals are calibrated pointwise. The decomposition into aleatoric ($\sigma^2$) and epistemic ($x_*^\top \Sigma_n x_*$) uncertainty is what makes Bayesian regression useful for active learning (pick $x_*$ that maximally reduces $\Sigma_n$), Bayesian optimization (balance mean and uncertainty), and out-of-distribution detection (large $x_*^\top \Sigma_n x_*$ flags inputs far from training data).

Failure Mode

The decomposition assumes the noise variance $\sigma^2$ is known and correct. If $\sigma^2$ is misspecified, the predictive interval is miscalibrated. The Normal-Inverse-Gamma conjugate prior treats $\sigma^2$ as unknown and gives a Student's $t$ predictive distribution that correctly inflates intervals for small $n$, at the cost of slightly more complex algebra.
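A small sketch of the predictive computation, reusing a posterior $(\mu_n, \Sigma_n)$ such as the one from the snippet above (names are illustrative):

```python
def blr_predict(x_star, mu_n, Sigma_n, sigma2):
    """Predictive mean and variance of y_* at a new input x_star."""
    mean = x_star @ mu_n
    epistemic = x_star @ Sigma_n @ x_star    # uncertainty about beta, projected onto x_star
    return mean, sigma2 + epistemic          # aleatoric noise floor + epistemic part
```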

Marginal Likelihood (Evidence)

The marginal likelihood $p(y \mid X)$ integrates out $\beta$ from the joint $p(y, \beta \mid X) = p(y \mid X, \beta)\, \pi(\beta)$. It serves as the normalizer in Bayes' rule and as a model-comparison score: the "evidence" the data provides in favor of the chosen prior.

Theorem

Marginal Likelihood for Bayesian Linear Regression

Statement

The log marginal likelihood with the isotropic prior $\beta \sim \mathcal N(0, \tau^2 I)$ is

$$\log p(y \mid X, \sigma^2, \tau^2) = -\tfrac12 y^\top (\sigma^2 I_n + \tau^2 X X^\top)^{-1} y \;-\; \tfrac12 \log \bigl|\sigma^2 I_n + \tau^2 X X^\top\bigr| \;-\; \tfrac n2 \log(2\pi).$$

Maximizing this jointly over $\sigma^2$ and $\tau^2$ gives the empirical Bayes (or "evidence approximation") choice of hyperparameters, an alternative to cross-validation that is fully Bayesian whenever the Gaussian model is well-specified.

Intuition

The marginal likelihood is the probability the model assigned to the actual data, averaged over the prior. It penalizes models that are too flexible (the prior spreads its predicted-$y$ mass thin over many configurations, so any specific $y$ has low marginal probability) and models that are too rigid (the prior puts mass on configurations that disagree with $y$). The maximum-evidence hyperparameter sits at the "Occam's razor" sweet spot.

Proof Sketch

Marginalize $\beta$ out of $y = X\beta + \varepsilon$ with $\beta \sim \mathcal N(0, \tau^2 I)$ and $\varepsilon \sim \mathcal N(0, \sigma^2 I)$ independent. The marginal of $y$ is the affine image of a Gaussian, so $y \sim \mathcal N(0, \tau^2 X X^\top + \sigma^2 I)$. The density is the standard $n$-dimensional Gaussian density with covariance $\sigma^2 I + \tau^2 X X^\top$, giving the stated log-density.

Why It Matters

The marginal likelihood is what lets you tune the prior strength automatically: maximize $\log p(y \mid X, \sigma^2, \tau^2)$ jointly over $\sigma^2$ and $\tau^2$ and you get an empirical-Bayes estimate of both. This is Type-II maximum likelihood; in the Gaussian-process literature it is called "evidence maximization" and is the default hyperparameter selection method when cross-validation is too expensive. The same machinery generalizes: replacing $X X^\top$ with a kernel matrix $K$ gives the GP marginal likelihood.

Failure Mode

The marginal likelihood is not a substitute for cross-validation when the Gaussian model is misspecified. Misspecified noise distributions, non-linear true response surfaces, or non-Gaussian outliers can produce a marginal likelihood that picks hyperparameters which overfit the noise model rather than the data. The matrix $\sigma^2 I + \tau^2 X X^\top$ is $n \times n$; for large $n$, the determinant and inverse cost $O(n^3)$, the same scaling problem GPs face. For $d \ll n$, it is cheaper to compute via the Woodbury identity in $d$-dimensional space.
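A sketch of the log marginal likelihood, written directly from the formula above (for $d \ll n$ the Woodbury route just mentioned is cheaper; this naive version is $O(n^3)$):

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma2, tau2):
    """log p(y | X, sigma^2, tau^2) under the isotropic prior beta ~ N(0, tau^2 I)."""
    n = len(y)
    K = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(K)          # stable log-determinant
    quad = y @ np.linalg.solve(K, y)          # y^T K^{-1} y without forming K^{-1}
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi)
```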

Worked Numeric Example

Three data points, one feature, a tractable prior; enough to follow every step in numbers.

Setup. $n = 3$, $d = 2$ (slope and intercept). Data:

$$X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 4 \end{pmatrix}.$$

The first column is the intercept; the second is the predictor. Prior: $\beta = (\beta_0, \beta_1)^\top \sim \mathcal N(0, I_2)$ (so $\Sigma_0 = I$, $\mu_0 = 0$). Noise variance $\sigma^2 = 1$.

Compute $X^\top X$ and $X^\top y$:

$$X^\top X = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}, \qquad X^\top y = \begin{pmatrix} 7 \\ 10 \end{pmatrix}.$$

Posterior precision:

$$\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X = I + \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} = \begin{pmatrix} 4 & 3 \\ 3 & 6 \end{pmatrix}.$$

Posterior covariance (invert the $2 \times 2$ matrix; determinant $= 4 \cdot 6 - 3 \cdot 3 = 15$):

$$\Sigma_n = \tfrac{1}{15} \begin{pmatrix} 6 & -3 \\ -3 & 4 \end{pmatrix} = \begin{pmatrix} 0.4 & -0.2 \\ -0.2 & 0.267 \end{pmatrix}.$$

Posterior mean (using $b = \Sigma_0^{-1}\mu_0 + \sigma^{-2}X^\top y = X^\top y = (7, 10)^\top$):

$$\mu_n = \Sigma_n \begin{pmatrix} 7 \\ 10 \end{pmatrix} = \begin{pmatrix} 0.4 \cdot 7 + (-0.2) \cdot 10 \\ (-0.2) \cdot 7 + 0.267 \cdot 10 \end{pmatrix} = \begin{pmatrix} 0.800 \\ 1.267 \end{pmatrix}.$$

(Direct check in exact fractions: $\mu_n = \Sigma_n X^\top y = \frac1{15}\begin{pmatrix} 6 & -3 \\ -3 & 4 \end{pmatrix}\begin{pmatrix} 7 \\ 10 \end{pmatrix} = \frac1{15}\begin{pmatrix} 6 \cdot 7 - 3 \cdot 10 \\ -3 \cdot 7 + 4 \cdot 10 \end{pmatrix} = \frac1{15}\begin{pmatrix} 12 \\ 19 \end{pmatrix} = (0.800, 1.267)^\top$.)

Compare to OLS. OLS solves $(X^\top X)\beta = X^\top y$, i.e. $\begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} \beta = \begin{pmatrix} 7 \\ 10 \end{pmatrix}$. The determinant is $6$, so $\hat\beta_{\mathrm{OLS}} = \frac16\begin{pmatrix} 5 \cdot 7 - 3 \cdot 10 \\ -3 \cdot 7 + 3 \cdot 10 \end{pmatrix} = \frac16 \begin{pmatrix} 5 \\ 9 \end{pmatrix} = (0.833, 1.500)^\top$.

So the posterior pulls each coefficient toward zero: the intercept from 0.833 to 0.800, the slope from 1.500 to 1.267. The shrinkage is the ridge effect with $\lambda = \sigma^2/\tau^2 = 1$. The off-diagonal entry $-0.2$ in $\Sigma_n$ says the intercept and slope are negatively correlated in the posterior (a smaller intercept goes with a larger slope and vice versa), which makes sense for this dataset, where the three points form a fairly tight line.
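The hand calculation can be checked in a few lines of numpy (a sketch; the array values are copied from the setup above):

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.]])
y = np.array([1., 2., 4.])
sigma2 = tau2 = 1.0

P = np.eye(2) / tau2 + (X.T @ X) / sigma2        # [[4, 3], [3, 6]]
Sigma_n = np.linalg.inv(P)                       # [[0.4, -0.2], [-0.2, 0.2667]]
mu_n = Sigma_n @ (X.T @ y) / sigma2              # [0.8, 1.2667]
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)     # [0.8333, 1.5]
```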

Predictive distribution at $x_* = (1, 1.5)^\top$ (predicting within the training $x$ range):

  • Mean: $x_*^\top \mu_n = 0.800 + 1.5 \cdot 1.267 = 0.800 + 1.900 = 2.700$.
  • Variance: $\sigma^2 + x_*^\top \Sigma_n x_* = 1 + (1, 1.5) \begin{pmatrix} 0.4 & -0.2 \\ -0.2 & 0.267 \end{pmatrix} \begin{pmatrix} 1 \\ 1.5 \end{pmatrix}$.

Compute the quadratic form: $\Sigma_n x_* = (0.4 \cdot 1 + (-0.2) \cdot 1.5,\ (-0.2) \cdot 1 + 0.267 \cdot 1.5)^\top = (0.100, 0.200)^\top$, and $x_*^\top \Sigma_n x_* = 0.100 + 1.5 \cdot 0.200 = 0.400$. So the predictive variance is $1.000 + 0.400 = 1.400$, standard deviation $\approx 1.183$.

So $y_* \mid x_* \sim \mathcal N(2.700, 1.400)$, and a 95% predictive interval is roughly $[0.38, 5.02]$: wide, because we have only three data points. With more data, $\Sigma_n$ shrinks and the epistemic component shrinks too; the irreducible $\sigma^2 = 1$ floor remains.

Predictive distribution at $x_* = (1, 10)^\top$ (extrapolating well past the training range):

  • Mean: $0.800 + 10 \cdot 1.267 = 13.47$.
  • $\Sigma_n x_* = (0.4 - 2.0,\ -0.2 + 2.67)^\top = (-1.6, 2.47)^\top$, and $x_*^\top \Sigma_n x_* = -1.6 + 10 \cdot 2.47 = 23.1$.
  • Variance: $1 + 23.1 = 24.1$, standard deviation $\approx 4.91$.

The epistemic uncertainty at $x_* = 10$ is roughly $23\times$ the irreducible noise; this is the Bayesian penalty for extrapolating. A frequentist point estimate would happily report 13.47 with no signal of how little the model knows about this region. Bayesian regression makes the uncertainty explicit: the 95% predictive interval is $\approx [3.9, 23.1]$. This is what makes Bayesian methods valuable for safety-critical or active-learning applications.
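Continuing the snippet above, both predictive calculations in code. Exact arithmetic gives a variance of 24.07 rather than 24.1 for the extrapolation point; the small difference is rounding of the $\Sigma_n$ entries in the hand calculation.

```python
for x_star in (np.array([1.0, 1.5]), np.array([1.0, 10.0])):
    mean = x_star @ mu_n
    var = sigma2 + x_star @ Sigma_n @ x_star
    half = 1.96 * np.sqrt(var)
    print(f"x* = {x_star[1]:4.1f}: mean {mean:6.3f}, var {var:6.3f}, "
          f"95% interval [{mean - half:.2f}, {mean + half:.2f}]")
# x* =  1.5: mean  2.700, var  1.400, 95% interval [0.38, 5.02]
# x* = 10.0: mean 13.467, var 24.067, 95% interval [3.85, 23.08]
```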

Marginal likelihood. Compute $\sigma^2 I + \tau^2 X X^\top$ with $\sigma^2 = \tau^2 = 1$:

$$X X^\top = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 3 & 5 \end{pmatrix}, \qquad I + X X^\top = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 3 & 3 \\ 1 & 3 & 6 \end{pmatrix}.$$

Determinant: expand along the first row. $\det = 2(3 \cdot 6 - 3 \cdot 3) - 1(1 \cdot 6 - 3 \cdot 1) + 1(1 \cdot 3 - 3 \cdot 1) = 2 \cdot 9 - 3 + 0 = 15$. So $\log|\sigma^2 I + \tau^2 X X^\top| = \log 15 \approx 2.708$.

$y^\top (I + X X^\top)^{-1} y$: solve the linear system $(I + X X^\top) v = y = (1, 2, 4)^\top$. Subtracting the second equation from the third gives $3 v_3 = 2$, so $v_3 = 2/3$; back-substituting gives $v_2 = -1/15$ and $v_1 = 1/5$, i.e. $v = \tfrac1{15}(3, -1, 10)^\top$ (check: $2 \cdot \tfrac{3}{15} - \tfrac{1}{15} + \tfrac{10}{15} = 1$, as required). Then $y^\top v = \tfrac{3}{15} - \tfrac{2}{15} + \tfrac{40}{15} = \tfrac{41}{15} \approx 2.733$.

Putting it together: $\log p(y \mid X, \sigma^2 = 1, \tau^2 = 1) = -\tfrac12 \cdot \tfrac{41}{15} - \tfrac12 \log 15 - \tfrac32 \log(2\pi) \approx -1.367 - 1.354 - 2.757 = -5.478$.

To pick a better $\tau^2$ empirically, repeat this calculation for several values and maximize (a minimal sweep is sketched below). This is Type-II maximum likelihood on the prior strength.
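A self-contained sketch of that sweep on the three-point dataset; the grid of candidate values is arbitrary.

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.]])
y = np.array([1., 2., 4.])
for tau2 in (0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0):
    K = np.eye(3) + tau2 * (X @ X.T)                  # sigma^2 = 1 held fixed
    _, logdet = np.linalg.slogdet(K)
    lml = -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet - 1.5 * np.log(2 * np.pi)
    print(f"tau^2 = {tau2:5.2f}: log evidence = {lml:7.3f}")
# tau^2 = 1.00 reproduces the -5.478 above; the empirical Bayes choice is the maximizer.
```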

Connections

  • Ridge regression (as MAP). $\mu_n = (X^\top X + \tfrac{\sigma^2}{\tau^2} I)^{-1} X^\top y$ is exactly the ridge estimator with $\lambda = \sigma^2/\tau^2$. The MAP under a Gaussian prior equals the posterior mean here because the posterior is symmetric (Gaussian), so MAP estimation gives the same point as the posterior mean.
  • OLS (flat-prior limit). $\tau^2 \to \infty$ sends $\Sigma_0^{-1} \to 0$ and $\Sigma_n^{-1} \to \sigma^{-2} X^\top X$, recovering OLS if $X^\top X$ is invertible. The Bayesian framework gracefully handles the $d > n$ case where OLS does not exist.
  • Kernel ridge regression. Replace $X$ with $\Phi$, the feature-mapped design matrix, and you get kernel ridge: same algebra, different design matrix. With a positive definite kernel and an isotropic prior in feature space, you arrive at kernel ridge regression.
  • Gaussian processes. Take the infinite-feature limit (or specify a kernel directly, bypassing the explicit feature map) and you get a Gaussian process posterior. The GP predictive variance has the same epistemic-plus-aleatoric decomposition; the weight-space/function-space equivalence is checked numerically after this list.
  • Empirical Bayes. Maximize the marginal likelihood over $(\sigma^2, \tau^2)$ to pick hyperparameters. This is "Type-II MLE" or the "evidence approximation"; it generalizes to GPs as kernel hyperparameter selection by marginal-likelihood maximization.
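A quick numerical check of that weight-space/function-space equivalence: the posterior-mean prediction of this page equals the kernel (GP) prediction with the linear kernel $k(x, x') = \tau^2 x^\top x'$. The data and hyperparameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
sigma2, tau2 = 0.5, 2.0
x_star = rng.normal(size=d)

# Weight space (this page): predictive mean x_*^T mu_n.
Sigma_n = np.linalg.inv(np.eye(d) / tau2 + X.T @ X / sigma2)
weight_space = x_star @ Sigma_n @ (X.T @ y) / sigma2

# Function space (GP view): linear kernel k(x, x') = tau^2 x^T x'.
K = tau2 * X @ X.T
k_star = tau2 * X @ x_star
function_space = k_star @ np.linalg.solve(K + sigma2 * np.eye(n), y)

print(np.isclose(weight_space, function_space))   # True: same predictor, two parametrizations
```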

Common Confusions

Watch Out

The posterior mean equals the ridge estimator, but the Bayesian framework is not 'just ridge'

Ridge regression gives a point. Bayesian linear regression gives a distribution: a posterior over $\beta$ with covariance $\Sigma_n$, and a predictive distribution over $y_*$ with explicit aleatoric and epistemic variance. The point estimate is the same; the uncertainty quantification is what makes the Bayesian view useful for downstream tasks (active learning, BO, OOD detection). Saying "BLR is ridge with a clever interpretation" misses that ridge alone cannot produce a credible interval without auxiliary frequentist machinery (bootstrap, sandwich estimators).

Watch Out

The predictive variance has two parts, and they don't go to zero together

For a new $x_*$, the predictive variance is $\sigma^2 + x_*^\top \Sigma_n x_*$. As $n \to \infty$ and the prior gets dominated, $\Sigma_n \to 0$; the epistemic part vanishes. But $\sigma^2$ is the irreducible noise floor: no amount of data lets you predict a new $y_*$ better than $\sigma^2$. Confusing these is the root of the "I have a million data points, why is my prediction interval so wide?" mistake: the model has nailed $\beta$, but $y$ itself is intrinsically noisy.
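A sketch of this point on synthetic data (the true coefficients, input distribution, and seed are invented for illustration): the total predictive variance at a fixed $x_*$ falls toward $\sigma^2 = 1$ as $n$ grows, but never below it.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, tau2 = 1.0, 1.0
x_star = np.array([1.0, 1.5])
for n in (3, 30, 300, 3000):
    X = np.column_stack([np.ones(n), rng.uniform(0, 2, size=n)])
    y = X @ np.array([1.0, 1.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)
    Sigma_n = np.linalg.inv(np.eye(2) / tau2 + X.T @ X / sigma2)
    total = sigma2 + x_star @ Sigma_n @ x_star
    print(f"n = {n:4d}: predictive variance {total:.4f}")   # approaches 1.0 from above
```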

Watch Out

A flat prior is not always safe

The flat-prior limit $\tau^2 \to \infty$ recovers OLS if $X^\top X$ is invertible. In high dimensions ($d > n$), $X^\top X$ is singular, the limit blows up, and the "least-committal" prior is in fact pathological; the posterior is improper and the marginal likelihood is zero. A proper prior is mathematically required, not just a stylistic choice, in the $d > n$ regime. Conversely, in the $d \ll n$ regime, the prior has negligible effect for moderate $\tau^2$ and the posterior is essentially the OLS sampling distribution.
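A tiny demonstration of the $d > n$ point (dimensions and data are arbitrary): the OLS normal equations are singular, but the posterior precision stays invertible because of the prior term.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20                                   # more coefficients than observations
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
sigma2, tau2 = 1.0, 1.0

print(np.linalg.matrix_rank(X.T @ X))          # 5 < 20: X^T X is singular, OLS ill-posed
P = np.eye(d) / tau2 + (X.T @ X) / sigma2      # posterior precision: positive definite
mu_n = np.linalg.solve(P, X.T @ y / sigma2)    # well-defined posterior mean (ridge)
print(mu_n.shape)                              # (20,)
```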

Watch Out

Marginal likelihood is not the same as predictive accuracy

Marginal likelihood maximization picks hyperparameters under which the observed data was most probable, averaged over the prior. Cross-validation picks hyperparameters that predict held-out data well. These usually agree, but not always, particularly when the model is misspecified. For mission-critical predictive performance under misspecification, use cross-validation. For Bayesian model comparison, or when the model is well-specified, marginal likelihood is the right choice.

Summary

  • The Bayesian linear regression posterior is Gaussian: $\beta \mid y \sim \mathcal N(\mu_n, \Sigma_n)$, with $\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X$ and $\mu_n = \Sigma_n(\Sigma_0^{-1} \mu_0 + \sigma^{-2} X^\top y)$.
  • Derived in one step by completing the square in the exponent of the log-posterior.
  • The posterior mean $\mu_n$ equals the ridge estimator with $\lambda = \sigma^2/\tau^2$ for the isotropic prior.
  • Predictive distribution: $y_* \mid x_* \sim \mathcal N(x_*^\top \mu_n, \sigma^2 + x_*^\top \Sigma_n x_*)$. The variance decomposes into irreducible ($\sigma^2$) plus epistemic ($x_*^\top \Sigma_n x_*$).
  • Marginal likelihood $p(y \mid X) = \mathcal N(y; 0, \sigma^2 I + \tau^2 X X^\top)$; maximizing over $(\sigma^2, \tau^2)$ is empirical Bayes.
  • Generalizes cleanly: feature maps give kernel ridge regression, kernels in their own right give Gaussian processes, Normal-Inverse-Gamma priors handle unknown $\sigma^2$.

Exercises

ExerciseCore

Problem

Verify by direct expansion that the posterior precision $\Sigma_n^{-1} = \Sigma_0^{-1} + \sigma^{-2} X^\top X$ falls out of the log-posterior. Start from the model $\beta \sim \mathcal N(\mu_0, \Sigma_0)$ and $y \mid X, \beta \sim \mathcal N(X\beta, \sigma^2 I)$, write the log-posterior up to a $\beta$-independent constant, and collect the $\beta^\top (\cdot) \beta$ coefficient.

ExerciseCore

Problem

Using the example dataset above ($n = 3$, $X$, $y$, $\sigma^2 = \tau^2 = 1$), compute the predictive mean and variance at $x_* = (1, 0)^\top$. Compare to the value of $y_1$ in the training set.

ExerciseAdvanced

Problem

Show that the posterior covariance satisfies $\Sigma_n \preceq \Sigma_0$ in the Loewner order (i.e. $\Sigma_0 - \Sigma_n$ is positive semi-definite). Interpret: observing data never makes the posterior less informative than the prior.

ExerciseResearch

Problem

Suppose the noise variance $\sigma^2$ is unknown and you place a Normal-Inverse-Gamma prior $(\beta, \sigma^2) \sim \mathrm{NIG}(\mu_0, \Lambda_0, \alpha_0, \beta_0)$. Derive the marginal posterior of $\beta$ after integrating out $\sigma^2$. Verify that it is a multivariate Student's $t$ distribution and identify its degrees of freedom.

References

Canonical:

  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §3.3 (the gold standard for the Bayesian linear regression derivation, predictive distribution, and evidence approximation).
  • Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 14 (regression models).
  • Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. Ch. 2 (Bayesian linear regression as the finite-dimensional case of a Gaussian process; the same algebra in different notation).
  • Lindley, D.V., & Smith, A.F.M. (1972). "Bayes Estimates for the Linear Model." J. Royal Statistical Society B, 34(1):1–41. (Original treatment of Bayesian regression and hierarchical models.)

Current:

  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 11 (linear regression: Bayesian and frequentist views unified).
  • Murphy, K.P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Ch. 15 (Bayesian linear models in the high-dimensional and hierarchical settings).
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. §3.4 (ridge regression with the Bayesian connection as a sidebar).

Next Topics

  • Gaussian processes for ML: the kernelized, infinite-feature version of this page. The posterior derivation generalizes by replacing $X X^\top$ with the kernel Gram matrix.
  • Empirical Bayes vs hierarchical Bayes: how to handle the hyperparameters $\sigma^2, \tau^2$ in a fully Bayesian way (place a prior on them) or in an empirical-Bayes way (maximize the marginal likelihood).
  • Kernel trick: the same algebra in a feature space defined implicitly by a kernel.

Last reviewed: May 10, 2026
