
Regression Methods

Ridge Regression

OLS with L2 regularization: closed-form shrinkage, bias-variance tradeoff, SVD interpretation, and the Bayesian connection to Gaussian priors.


Why This Matters

OLS is unbiased, but unbiasedness is not always desirable. When features are correlated, when $d$ is large relative to $n$, or when $X^\top X$ is ill-conditioned, OLS has high variance and often poor predictive performance. Ridge regression trades a small amount of bias for a large reduction in variance, and in terms of mean squared error it can dominate OLS.

Ridge is the simplest regularization method, the one where everything has a closed form, and the entry point for understanding why regularization works.

Figure: constraint-region view of ridge vs. lasso. The L2 ball $\|w\|_2^2 \leq t$ (a circle) yields a shrunk solution; the L1 ball $\|w\|_1 \leq t$ (a diamond) yields a sparse solution (e.g., $w_2 = 0$). The L1 diamond has corners on the axes, so the constrained optimum often lands on an axis (sparsity); the L2 circle has no corners.

Mental Model

OLS fits $y \approx Xw$ by minimizing squared error alone. Ridge adds a penalty that pulls $w$ toward zero: among all $w$ that explain the data reasonably well, pick the one with small $\|w\|_2$. The penalty strength $\lambda$ controls the tradeoff: more penalty means smaller weights (less variance) but a worse fit to the training data (more bias).

Geometrically: OLS projects $y$ onto the column space of $X$. Ridge "shrinks" this projection, pulling $\hat{y}$ slightly toward the origin.

Formal Setup

We have the linear model $y = Xw^* + \varepsilon$ where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, and $\varepsilon \sim (0, \sigma^2 I)$.

Definition

Ridge Regression

The ridge estimator with regularization parameter $\lambda > 0$ minimizes:

$$\hat{w}_{\text{ridge}} = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

The objective is strictly convex (for $\lambda > 0$), so the solution is unique. Setting the gradient to zero:

$$-2X^\top(y - Xw) + 2\lambda w = 0 \implies (X^\top X + \lambda I)w = X^\top y$$

The closed-form solution is:

$$\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

The matrix $X^\top X + \lambda I$ is always invertible for $\lambda > 0$, even when $X^\top X$ is singular (e.g., when $n < d$).
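The closed form can be sanity-checked numerically. Below is a minimal NumPy sketch (the data and the `ridge` helper are illustrative, not part of the formal development): as $\lambda \to 0$ the ridge solution recovers OLS, and increasing $\lambda$ shrinks the estimate's norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^{-1} X'y, via a linear solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# lam -> 0 recovers the OLS (least-squares) solution
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(ridge(X, y, 1e-12), w_ols, atol=1e-6)

# larger lam shrinks the norm of the estimate
assert np.linalg.norm(ridge(X, y, 10.0)) < np.linalg.norm(ridge(X, y, 1.0))
```

Using `np.linalg.solve` rather than forming an explicit inverse is the standard numerically stable way to evaluate this formula.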

SVD Interpretation

The SVD perspective is the cleanest way to understand what ridge does. Write the (thin) SVD $X = U \Sigma V^\top$ where $U \in \mathbb{R}^{n \times r}$, $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$, $V \in \mathbb{R}^{d \times r}$, and $r = \text{rank}(X)$.

Definition

Ridge Shrinkage Factors

In the SVD basis, the OLS fitted values are $\hat{y}_{\text{OLS}} = \sum_{j=1}^r u_j u_j^\top y$, a full projection onto each left singular direction. Ridge modifies this to:

$$\hat{y}_{\text{ridge}} = \sum_{j=1}^r \frac{\sigma_j^2}{\sigma_j^2 + \lambda}\, u_j u_j^\top y$$

The shrinkage factor for direction $j$ is:

$$d_j(\lambda) = \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$

This satisfies $0 < d_j(\lambda) < 1$. Directions with large singular values ($\sigma_j \gg \sqrt{\lambda}$) are barely shrunk: $d_j \approx 1$. Directions with small singular values ($\sigma_j \ll \sqrt{\lambda}$) are heavily shrunk: $d_j \approx 0$.

In coefficient space: the ridge estimate in the $V$-basis is:

$$V^\top \hat{w}_{\text{ridge}} = \text{diag}\!\left(\frac{\sigma_j}{\sigma_j^2 + \lambda}\right) U^\top y$$

Compare to OLS: $V^\top \hat{w}_{\text{OLS}} = \text{diag}(1/\sigma_j)\, U^\top y$. Ridge replaces $1/\sigma_j$ with $\sigma_j/(\sigma_j^2 + \lambda)$, which is bounded even when $\sigma_j$ is small. This is why ridge stabilizes ill-conditioned problems.
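The SVD route and the closed form are the same estimator, which is easy to verify numerically. A minimal sketch (random data purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 40, 4, 2.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Closed form: (X'X + lam*I)^{-1} X'y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# SVD route: V diag(sigma_j / (sigma_j^2 + lam)) U'y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
assert np.allclose(w_ridge, w_svd)

# Fitted values shrink each singular direction by sigma_j^2 / (sigma_j^2 + lam)
shrink = s**2 / (s**2 + lam)
y_hat = U @ (shrink * (U.T @ y))
assert np.allclose(X @ w_ridge, y_hat)
```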

Bias-Variance Tradeoff

Definition

Ridge Bias and Variance

Under the model $y = Xw^* + \varepsilon$ with $\varepsilon \sim (0, \sigma^2 I)$:

Bias: $\mathbb{E}[\hat{w}_{\text{ridge}}] - w^* = -\lambda(X^\top X + \lambda I)^{-1} w^*$

This is nonzero for $\lambda > 0$: ridge is biased, and the bias increases with $\lambda$.

Variance: $\text{Var}(\hat{w}_{\text{ridge}}) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1}$

Compare to the OLS variance $\sigma^2 (X^\top X)^{-1}$: the ridge variance is smaller in the positive semidefinite sense.

MSE: $\text{MSE}(\hat{w}_{\text{ridge}}) = \|\text{Bias}\|_2^2 + \text{tr}(\text{Var})$. For some $\lambda > 0$, this is strictly less than $\text{MSE}(\hat{w}_{\text{OLS}})$.

Main Theorems

Theorem

Ridge Dominates OLS in MSE

Statement

Under the linear model $y = Xw^* + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma^2 I$, there always exists a $\lambda > 0$ such that:

$$\text{MSE}(\hat{w}_{\text{ridge}}) < \text{MSE}(\hat{w}_{\text{OLS}})$$

where $\text{MSE}(\hat{w}) = \mathbb{E}[\|\hat{w} - w^*\|_2^2]$.

More precisely, using the SVD: if $\sigma_1 \geq \cdots \geq \sigma_d > 0$ are the singular values of $X$ and $\theta_j = v_j^\top w^*$ are the true coefficients in the SVD basis, then:

$$\text{MSE}(\hat{w}_{\text{ridge}}) = \sum_{j=1}^d \left[\frac{\lambda^2 \theta_j^2}{(\sigma_j^2 + \lambda)^2} + \frac{\sigma^2 \sigma_j^2}{(\sigma_j^2 + \lambda)^2}\right]$$

The first term is the squared bias for direction $j$; the second is the variance.
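The per-direction formula can be evaluated directly to see the theorem in action. A minimal NumPy sketch (the specific data, noise level, and grid are illustrative assumptions): the MSE as a function of $\lambda$ dips below its OLS value at $\lambda = 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma2 = 30, 5, 1.0
X = rng.standard_normal((n, d))
_, s, Vt = np.linalg.svd(X, full_matrices=False)
theta = Vt @ rng.standard_normal(d)   # true coefficients in the SVD basis

def mse(lam):
    """Per-direction MSE formula: squared bias plus variance, summed over j."""
    bias2 = np.sum(lam**2 * theta**2 / (s**2 + lam)**2)
    var = np.sum(sigma2 * s**2 / (s**2 + lam)**2)
    return bias2 + var

mse_ols = mse(0.0)   # = sigma^2 * sum_j 1/sigma_j^2
grid = np.logspace(-3, 2, 200)

# some lambda > 0 strictly beats OLS, as the theorem guarantees
assert min(mse(l) for l in grid) < mse_ols
# even a tiny lambda already helps (negative derivative at 0)
assert mse(1e-4) < mse_ols
```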

Intuition

OLS is BLUE (Gauss-Markov), but "best unbiased" does not mean "best overall." By allowing a small bias, ridge dramatically reduces variance. The MSE tradeoff is favorable because the bias grows linearly in $\lambda$ from zero (so the squared bias contributes nothing to the derivative of the MSE at $\lambda = 0$) while the variance decreases at a strictly negative rate: at $\lambda = 0$, the derivative of the MSE with respect to $\lambda$ is strictly negative, so a small positive $\lambda$ always helps.

Proof Sketch

Write $\hat{w}_{\text{ridge}} = W_\lambda y$ where $W_\lambda = (X^\top X + \lambda I)^{-1} X^\top$. The MSE decomposes as:

$$\text{MSE} = \|\text{Bias}\|_2^2 + \text{tr}(\text{Var}) = \lambda^2 w^{*\top}(X^\top X + \lambda I)^{-2} w^* + \sigma^2\,\text{tr}\!\left[(X^\top X + \lambda I)^{-2} X^\top X\right]$$

Diagonalize using the SVD to get the per-component formula. Differentiate with respect to $\lambda$ at $\lambda = 0$:

$$\frac{d}{d\lambda}\text{MSE}\bigg|_{\lambda = 0} = \sum_j \frac{-2\sigma^2}{\sigma_j^4}$$

(the bias terms contribute nothing at $\lambda = 0$, since the squared bias is $O(\lambda^2)$).

This derivative is strictly negative, so the MSE decreases for small $\lambda > 0$.

Why It Matters

This is the theoretical justification for regularization. Gauss-Markov says OLS is optimal among unbiased linear estimators, but the class of all linear estimators (including biased ones like ridge) contains estimators with strictly lower MSE. Regularization is not just a heuristic: in this setting it is provably beneficial.

Failure Mode

The theorem guarantees the existence of a good $\lambda$ but does not tell you its value. In practice, $\lambda$ must be chosen by cross-validation or marginal likelihood maximization. The optimal $\lambda$ depends on the unknown $w^*$ and $\sigma^2$, so no fixed formula works universally.

Bayesian Interpretation

Ridge regression is equivalent to MAP estimation with a Gaussian prior.

Under the model:

$$w \sim \mathcal{N}(0, \tau^2 I), \quad y \mid X, w \sim \mathcal{N}(Xw, \sigma^2 I)$$

The posterior is:

$$w \mid y, X \sim \mathcal{N}\!\left(\left(X^\top X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y,\; \sigma^2\left(X^\top X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1}\right)$$

The posterior mean (which is also the MAP estimate, since the posterior is Gaussian) equals $\hat{w}_{\text{ridge}}$ with $\lambda = \sigma^2/\tau^2$.

Interpretation: large $\tau^2$ means a weak prior (small $\lambda$, close to OLS); small $\tau^2$ means a strong prior pulling $w$ toward zero (large $\lambda$, heavy shrinkage).
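The equivalence is easy to confirm numerically: the Gaussian posterior mean (computed in precision form) matches the ridge closed form with $\lambda = \sigma^2/\tau^2$. A minimal sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 25, 3
sigma2, tau2 = 0.5, 2.0          # noise variance, prior variance (assumed values)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

lam = sigma2 / tau2              # ridge lambda implied by the prior
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Posterior mean in precision form: (X'X/sigma2 + I/tau2)^{-1} X'y / sigma2
post_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(d) / tau2,
                            X.T @ y / sigma2)
assert np.allclose(w_ridge, post_mean)
```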

Choosing Lambda

Definition

Cross-Validation for Ridge

The standard approach: for each candidate $\lambda$ in a grid, compute the leave-one-out (LOO) or $K$-fold cross-validation error.

LOO shortcut: For ridge regression, the LOO error has a closed form:

$$\text{CV}_{\text{LOO}} = \frac{1}{n} \sum_{i=1}^n \left(\frac{y_i - \hat{y}_i(\lambda)}{1 - h_{ii}(\lambda)}\right)^2$$

where $h_{ii}(\lambda)$ is the $i$-th diagonal entry of the ridge hat matrix $H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top$. This avoids refitting $n$ times.
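The shortcut can be checked against brute-force leave-one-out refitting. A minimal NumPy sketch (dimensions and data are illustrative): both routes give the same LOO error, but the shortcut needs only one fit.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 20, 3, 1.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# LOO via the hat-matrix shortcut: one fit, then reweight residuals
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
y_hat = H @ y
loo_shortcut = np.mean(((y - y_hat) / (1 - np.diag(H)))**2)

# LOO the slow way: refit n times, each time holding out one point
errs = []
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    w_i = np.linalg.solve(Xi.T @ Xi + lam * np.eye(d), Xi.T @ yi)
    errs.append((y[i] - X[i] @ w_i)**2)

assert np.allclose(loo_shortcut, np.mean(errs))
```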

Canonical Examples

Example

Ridge in the orthonormal design case

When $X^\top X = I$ (orthonormal columns), OLS gives $\hat{w}_{\text{OLS}} = X^\top y$ and ridge gives:

$$\hat{w}_{\text{ridge}} = \frac{1}{1 + \lambda} X^\top y = \frac{1}{1 + \lambda} \hat{w}_{\text{OLS}}$$

Each coefficient is multiplied by the same shrinkage factor $1/(1 + \lambda) < 1$. Ridge uniformly shrinks all coefficients toward zero. Compare this to lasso, which can set some coefficients exactly to zero.
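A quick numerical check of the orthonormal case, using a QR factorization to manufacture a design with $Q^\top Q = I$ (a sketch; the data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((30, 4)))   # orthonormal columns: Q'Q = I
y = rng.standard_normal(30)
lam = 3.0

w_ols = Q.T @ y                                     # OLS when Q'Q = I
w_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(4), Q.T @ y)

# ridge = OLS scaled by the uniform factor 1/(1 + lam)
assert np.allclose(w_ridge, w_ols / (1 + lam))
```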

Example

Ridge fixes multicollinearity

Suppose two features are nearly identical: $x_1 \approx x_2$. Then $X^\top X$ has a near-zero eigenvalue, and OLS assigns wildly different (large, opposite-signed) coefficients to $x_1$ and $x_2$.

Ridge adds $\lambda$ to every eigenvalue of $X^\top X$, stabilizing the inversion. The ridge coefficients for $x_1$ and $x_2$ are moderate and similar in magnitude, reflecting the fact that the data cannot distinguish between them.
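This is easy to see on synthetic data. A minimal sketch (the collinearity scale, noise level, and $\lambda$ are illustrative assumptions): OLS splits the shared signal erratically between the twin features, while ridge assigns them nearly equal coefficients.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 1e-4 * rng.standard_normal(n)   # nearly identical copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.standard_normal(n)     # true signal uses the shared direction

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# OLS: large gap between the twin coefficients; ridge: nearly equal split
assert abs(w_ols[0] - w_ols[1]) > abs(w_ridge[0] - w_ridge[1])
assert abs(w_ridge[0] - w_ridge[1]) < 0.1
```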

Common Confusions

Watch Out

Ridge does NOT do variable selection

Ridge shrinks all coefficients toward zero but never sets any exactly to zero. If you need sparse models (some coefficients exactly zero), you need the lasso (L1 penalty), not ridge. Ridge is appropriate when you believe all features contribute, but some contribute less.

Watch Out

Lambda is not the prior variance

The Bayesian connection gives $\lambda = \sigma^2/\tau^2$, not $\lambda = \tau^2$. Large $\tau^2$ (diffuse prior, less regularization) corresponds to small $\lambda$. Students often confuse the direction.

Watch Out

Gauss-Markov does not contradict ridge being better

Gauss-Markov says OLS is the best linear unbiased estimator. Ridge is biased, so it falls outside the Gauss-Markov class. There is no contradiction in ridge having lower MSE than OLS. The theorem simply does not apply to biased estimators.

Summary

  • Ridge objective: $\min_w \|y - Xw\|_2^2 + \lambda\|w\|_2^2$
  • Closed form: $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$
  • SVD: ridge shrinks direction $j$ by the factor $\sigma_j^2/(\sigma_j^2 + \lambda)$
  • There always exists $\lambda > 0$ with lower MSE than OLS
  • Bayesian: ridge = MAP with a Gaussian prior on $w$, with $\lambda = \sigma^2/\tau^2$
  • Does not do variable selection; all coefficients remain nonzero
  • Choose $\lambda$ by cross-validation (LOO has a closed-form shortcut)

Exercises

ExerciseCore

Problem

Show that for $\lambda > 0$, the matrix $X^\top X + \lambda I$ is always invertible, even when $X$ does not have full column rank.

ExerciseCore

Problem

Derive the bias of the ridge estimator. Show that $\mathbb{E}[\hat{w}_{\text{ridge}}] = (X^\top X + \lambda I)^{-1} X^\top X\, w^*$ and therefore the bias is $-\lambda(X^\top X + \lambda I)^{-1} w^*$.

ExerciseAdvanced

Problem

Show that the derivative of $\text{MSE}(\hat{w}_{\text{ridge}})$ with respect to $\lambda$ is strictly negative at $\lambda = 0$ (assuming $w^* \neq 0$ and $\sigma^2 > 0$). This proves that a small amount of ridge regularization always helps.


References

Canonical:

  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4
  • Hoerl & Kennard, "Ridge Regression: Biased Estimation for Nonorthogonal Problems" (1970)

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 11.3
  • Wainwright, High-Dimensional Statistics (2019), Chapter 7


Last reviewed: April 2026
