Regression Methods
Ridge Regression
OLS with L2 regularization: closed-form shrinkage, bias-variance tradeoff, SVD interpretation, and the Bayesian connection to Gaussian priors.
Why This Matters
OLS is unbiased, but unbiasedness is not always desirable. When features are correlated, when $p$ is large relative to $n$, or when $X^\top X$ is ill-conditioned, OLS has high variance and often poor predictive performance. Ridge regression trades a small amount of bias for a large reduction in variance, and in terms of mean squared error it can dominate OLS.
Ridge is the simplest regularization method, the one where everything has a closed form, and the entry point for understanding why regularization works.
Mental Model
OLS finds the minimum-norm solution to the least-squares problem $\min_\beta \|y - X\beta\|_2^2$. Ridge adds a penalty that pulls $\beta$ toward zero: among all $\beta$ that explain the data reasonably well, pick the one with small $\|\beta\|_2$. The penalty strength $\lambda$ controls the tradeoff: more penalty means smaller weights (less variance) but worse fit to training data (more bias).
Geometrically: OLS projects $y$ onto the column space of $X$. Ridge "shrinks" this projection, pulling $\hat{y}$ slightly toward the origin.
Formal Setup
We have the linear model $y = X\beta + \varepsilon$, where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.
Ridge Regression
The ridge estimator with regularization parameter $\lambda > 0$ minimizes:

$$J(\beta) = \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
The objective is strictly convex (for $\lambda > 0$), so the solution is unique. Setting the gradient to zero:

$$\nabla J(\beta) = -2 X^\top (y - X\beta) + 2\lambda \beta = 0$$
The closed-form solution is:

$$\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$$
The matrix $X^\top X + \lambda I$ is always invertible for $\lambda > 0$, even when $X^\top X$ is singular (e.g., when $p > n$).
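As a quick numerical sanity check (a minimal NumPy sketch; the dimensions, seed, and $\lambda$ are arbitrary choices), the ridge system is solvable even when $p > n$ and $X^\top X$ is rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # p > n: X^T X is singular
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 1.0

# X^T X has rank at most n < p, so the OLS normal equations are singular,
# but X^T X + lam * I is positive definite and safely invertible.
XtX = X.T @ X
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)

print(np.linalg.matrix_rank(XtX))  # at most n = 20
print(beta_ridge.shape)            # (50,)
```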
SVD Interpretation
The SVD perspective is the cleanest way to understand what ridge does. Write the thin SVD of $X$ as $X = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times p}$ has orthonormal columns, $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_p)$, $V \in \mathbb{R}^{p \times p}$ is orthogonal, and $\sigma_1 \ge \cdots \ge \sigma_p \ge 0$.
Ridge Shrinkage Factors
In the SVD basis, the OLS fitted values are $\hat{y}_{\text{OLS}} = U U^\top y$, a full projection onto each left singular direction. Ridge modifies this to:

$$\hat{y}_\lambda = \sum_{j=1}^{p} u_j \, \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \, u_j^\top y$$
The shrinkage factor for direction $j$ is:

$$s_j(\lambda) = \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$

This satisfies $0 \le s_j(\lambda) < 1$. Directions with large singular values ($\sigma_j^2 \gg \lambda$) are barely shrunk: $s_j \approx 1$. Directions with small singular values ($\sigma_j^2 \ll \lambda$) are heavily shrunk: $s_j \approx 0$.
In the coefficient space: The ridge estimate in the $V$-basis is:

$$v_j^\top \hat{\beta}_\lambda = \frac{\sigma_j}{\sigma_j^2 + \lambda} \, u_j^\top y$$

Compare to OLS: $v_j^\top \hat{\beta}_{\text{OLS}} = \frac{1}{\sigma_j} u_j^\top y$. Ridge replaces $1/\sigma_j$ with $\sigma_j/(\sigma_j^2 + \lambda)$, which is bounded even when $\sigma_j$ is small. This is why ridge stabilizes ill-conditioned problems.
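The equivalence between the closed form and the SVD expression can be verified directly (a NumPy sketch; dimensions, seed, and $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 2.0

# Closed form: (X^T X + lam I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Thin SVD: X = U diag(s) V^T; ridge rescales each SVD component
# u_j^T y by s_j / (s_j^2 + lam) instead of the OLS factor 1 / s_j.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

print(np.allclose(beta_closed, beta_svd))  # True
```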
Bias-Variance Tradeoff
Ridge Bias and Variance
Under the model $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$:
Bias:

$$\operatorname{Bias}(\hat{\beta}_\lambda) = \mathbb{E}[\hat{\beta}_\lambda] - \beta = -\lambda (X^\top X + \lambda I)^{-1} \beta$$

This is nonzero for $\lambda > 0$: ridge is biased. The bias increases with $\lambda$.
Variance:

$$\operatorname{Var}(\hat{\beta}_\lambda) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X \, (X^\top X + \lambda I)^{-1}$$

Compare to OLS variance: $\operatorname{Var}(\hat{\beta}_{\text{OLS}}) = \sigma^2 (X^\top X)^{-1}$. The ridge variance is smaller in the positive semidefinite sense.
MSE: $\operatorname{MSE}(\lambda) = \|\operatorname{Bias}(\hat{\beta}_\lambda)\|^2 + \operatorname{tr}\operatorname{Var}(\hat{\beta}_\lambda)$. For some $\lambda > 0$, this is strictly less than $\operatorname{MSE}(0)$, the OLS mean squared error.
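A small simulation (NumPy; the dimensions, seed, and $\lambda$ are arbitrary) can confirm the bias and variance formulas by averaging the ridge estimator over many noise draws:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma, lam = 50, 5, 2.0, 5.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

A = np.linalg.inv(X.T @ X + lam * np.eye(p))   # symmetric
bias_theory = -lam * A @ beta                  # E[beta_hat] - beta
var_theory = sigma**2 * A @ X.T @ X @ A        # Cov(beta_hat)

# Monte Carlo: each row of Y is one fresh dataset y = X beta + noise.
# beta_hat = A X^T y = (X A)^T y, using that A is symmetric.
Y = X @ beta + sigma * rng.standard_normal((20000, n))
ests = Y @ (X @ A)                             # (20000, p) ridge estimates

print("empirical bias  :", np.round(ests.mean(axis=0) - beta, 3))
print("theoretical bias:", np.round(bias_theory, 3))
print("empirical var   :", np.round(ests.var(axis=0), 3))
print("theoretical var :", np.round(np.diag(var_theory), 3))
```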
Main Theorems
Ridge Dominates OLS in MSE
Statement
Under the linear model with $\sigma^2 > 0$ and $X$ of full column rank, there always exists a $\lambda > 0$ such that:

$$\operatorname{MSE}(\hat{\beta}_\lambda) < \operatorname{MSE}(\hat{\beta}_{\text{OLS}})$$

where $\operatorname{MSE}(\hat{\beta}) = \mathbb{E}\|\hat{\beta} - \beta\|_2^2$.
More precisely, using the SVD, if $\sigma_1, \dots, \sigma_p$ are the singular values of $X$ and $\alpha = V^\top \beta$ are the true coefficients in the SVD basis, then:

$$\operatorname{MSE}(\lambda) = \sum_{j=1}^{p} \frac{\lambda^2 \alpha_j^2}{(\sigma_j^2 + \lambda)^2} + \sigma^2 \sum_{j=1}^{p} \frac{\sigma_j^2}{(\sigma_j^2 + \lambda)^2}$$

The first sum collects the squared bias for each direction $j$; the second collects the variance.
Intuition
OLS is BLUE (Gauss-Markov), but "best unbiased" does not mean "best overall." By allowing a small bias, ridge dramatically reduces variance. The MSE tradeoff is favorable near $\lambda = 0$ because the bias grows linearly in $\lambda$ from zero (so the squared bias is quadratic, with vanishing derivative at zero) while the variance decreases at a strictly negative rate: at $\lambda = 0$, the derivative of the MSE with respect to $\lambda$ is strictly negative, so a small positive $\lambda$ always helps.
Proof Sketch
Write $\hat{\beta}_\lambda = A_\lambda y$ where $A_\lambda = (X^\top X + \lambda I)^{-1} X^\top$. The MSE decomposes as:

$$\operatorname{MSE}(\lambda) = \|\mathbb{E}[\hat{\beta}_\lambda] - \beta\|^2 + \operatorname{tr}\operatorname{Var}(\hat{\beta}_\lambda)$$

Diagonalize using the SVD to get the per-component formula. Differentiate with respect to $\lambda$ at $\lambda = 0$:

$$\frac{d}{d\lambda} \operatorname{MSE}(\lambda) \Big|_{\lambda = 0} = -2 \sigma^2 \sum_{j=1}^{p} \sigma_j^{-4}$$

This derivative is strictly negative, so the MSE decreases for small $\lambda > 0$.
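The sign of the derivative at $\lambda = 0$ can be checked numerically against the closed-form value $-2\sigma^2 \sum_j \sigma_j^{-4}$ (a NumPy sketch; data and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 5, 1.5
X = rng.standard_normal((n, p))
alpha = rng.standard_normal(p)            # true coefficients in the SVD basis
s = np.linalg.svd(X, compute_uv=False)    # singular values of X

def mse(lam):
    """Per-component MSE: squared bias plus variance."""
    bias2 = np.sum(lam**2 * alpha**2 / (s**2 + lam)**2)
    var = sigma**2 * np.sum(s**2 / (s**2 + lam)**2)
    return bias2 + var

eps = 1e-6
fd = (mse(eps) - mse(0.0)) / eps            # forward difference at lam = 0
exact = -2 * sigma**2 * np.sum(s**-4.0)     # closed-form derivative
print(fd, exact)                            # both negative, nearly equal
```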
Why It Matters
This is the theoretical justification for regularization. Gauss-Markov says OLS is optimal among unbiased linear estimators. But the class of all linear estimators (including biased ones like ridge) contains estimators with strictly lower MSE. This is why regularization is not just a heuristic. It is provably beneficial.
Failure Mode
The theorem guarantees existence of a good $\lambda$ but does not tell you its value. In practice, $\lambda$ must be chosen by cross-validation or marginal likelihood maximization. The optimal $\lambda$ depends on the unknown $\beta$ and $\sigma^2$, so no fixed formula works universally.
Bayesian Interpretation
Ridge regression is equivalent to MAP estimation with a Gaussian prior.
Under the model:

$$\beta \sim \mathcal{N}(0, \tau^2 I), \qquad y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$$

The posterior is:

$$p(\beta \mid y) \propto \exp\!\left( -\frac{1}{2\sigma^2} \|y - X\beta\|_2^2 - \frac{1}{2\tau^2} \|\beta\|_2^2 \right)$$

The posterior mean (= MAP estimate, since the posterior is Gaussian) equals $\hat{\beta}_\lambda$ with $\lambda = \sigma^2 / \tau^2$.
Interpretation: Large $\tau^2$ means a weak prior (small $\lambda$, close to OLS). Small $\tau^2$ means a strong prior pulling toward zero (large $\lambda$, heavy shrinkage).
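The ridge–MAP equivalence can be confirmed numerically: the Gaussian posterior mean coincides with the ridge solution at $\lambda = \sigma^2/\tau^2$ (a NumPy sketch; all values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
sigma2, tau2 = 1.5, 0.5

# Gaussian posterior mean: (X^T X / sigma2 + I / tau2)^{-1} X^T y / sigma2
post_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2,
                            X.T @ y / sigma2)

# Ridge solution with lam = sigma2 / tau2
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(post_mean, beta_ridge))  # True
```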
Choosing Lambda
Cross-Validation for Ridge
The standard approach: for each candidate $\lambda$ in a grid, compute the leave-one-out (LOO) or $k$-fold cross-validation error.
LOO shortcut: For ridge regression, the LOO error has a closed form:

$$\operatorname{LOO}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i(\lambda)}{1 - H_{ii}(\lambda)} \right)^2$$

where $H_{ii}(\lambda)$ is the $i$-th diagonal of the ridge hat matrix $H(\lambda) = X (X^\top X + \lambda I)^{-1} X^\top$. This avoids refitting $n$ times.
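A sketch verifying the shortcut against brute-force refitting (NumPy; the small problem size and $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 4
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)
lam = 1.0

# Shortcut: one fit, then rescale residuals by 1 - H_ii.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - H @ y
loo_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one observation out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    b = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
    errs.append((y[i] - X[i] @ b) ** 2)
loo_brute = np.mean(errs)

print(np.isclose(loo_shortcut, loo_brute))  # True
```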
Canonical Examples
Ridge in the orthonormal design case
When $X^\top X = I_p$ (orthonormal columns), OLS gives $\hat{\beta}_{\text{OLS}} = X^\top y$ and ridge gives:

$$\hat{\beta}_\lambda = \frac{1}{1 + \lambda} X^\top y$$

Each coefficient is multiplied by the same shrinkage factor $1/(1 + \lambda)$: ridge uniformly shrinks all coefficients toward zero. Compare this to lasso, which can set some coefficients exactly to zero.
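With orthonormal columns (built here via a QR factorization), the uniform shrinkage factor $1/(1+\lambda)$ appears exactly (a NumPy sketch with arbitrary dimensions and $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 50, 4, 3.0
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))   # Q^T Q = I_p
y = rng.standard_normal(n)

beta_ols = Q.T @ y
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y)

# Every coefficient shrinks by the same factor 1 / (1 + lam).
print(np.allclose(beta_ridge, beta_ols / (1 + lam)))  # True
```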
Ridge fixes multicollinearity
Suppose two features are nearly identical: $x_2 \approx x_1$. Then $X^\top X$ has a near-zero eigenvalue, and OLS assigns wildly different (large, opposite-signed) coefficients to $x_1$ and $x_2$.
Ridge adds $\lambda$ to every eigenvalue of $X^\top X$, stabilizing the inversion. The ridge coefficients for $x_1$ and $x_2$ are moderate and similar in magnitude, reflecting the fact that the data cannot distinguish between them.
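A small demonstration of how ridge tames collinear coefficients (a NumPy sketch; the feature-duplication scale, noise levels, and $\lambda = 1$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 1e-4 * rng.standard_normal(n)    # nearly identical to x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.standard_normal(n)      # signal carried by x1 (equivalently x2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(2), X.T @ y)

print("OLS  :", beta_ols)    # typically large, opposite-signed
print("ridge:", beta_ridge)  # moderate, nearly equal
```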
Common Confusions
Ridge does NOT do variable selection
Ridge shrinks all coefficients toward zero but never sets any exactly to zero. If you need sparse models (some coefficients exactly zero), you need the lasso (L1 penalty), not ridge. Ridge is appropriate when you believe all features contribute, but some contribute less.
Lambda is not the prior variance
The Bayesian connection gives $\lambda = \sigma^2/\tau^2$, not $\lambda = \tau^2$. Large $\tau^2$ (diffuse prior, less regularization) corresponds to small $\lambda$. Students often confuse the direction.
Gauss-Markov does not contradict ridge being better
Gauss-Markov says OLS is the best linear unbiased estimator. Ridge is biased, so it falls outside the Gauss-Markov class. There is no contradiction in ridge having lower MSE than OLS. The theorem simply does not apply to biased estimators.
Summary
- Ridge objective: $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
- Closed form: $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$
- SVD: ridge shrinks direction $j$ by factor $\sigma_j^2/(\sigma_j^2 + \lambda)$
- There always exists a $\lambda > 0$ with lower MSE than OLS
- Bayesian: ridge = Gaussian prior on $\beta$ with $\lambda = \sigma^2/\tau^2$
- Does not do variable selection: all coefficients remain nonzero
- Choose $\lambda$ by cross-validation (LOO has a closed-form shortcut)
Exercises
Problem
Show that for $\lambda > 0$, the matrix $X^\top X + \lambda I$ is always invertible, even when $X$ does not have full column rank.
Problem
Derive the bias of the ridge estimator. Show that $\mathbb{E}[\hat{\beta}_\lambda] = (X^\top X + \lambda I)^{-1} X^\top X \beta$ and therefore the bias is $-\lambda (X^\top X + \lambda I)^{-1} \beta$.
Problem
Show that the derivative of $\operatorname{MSE}(\lambda)$ with respect to $\lambda$ is strictly negative at $\lambda = 0$ (assuming $\sigma^2 > 0$ and $X$ has full column rank). This proves that a small amount of ridge regularization always helps.
References
Canonical:
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4
- Hoerl & Kennard, "Ridge Regression: Biased Estimation for Nonorthogonal Problems" (1970)
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 11.3
- Wainwright, High-Dimensional Statistics (2019), Chapter 7
Next Topics
Building on ridge regression:
- Lasso regression: L1 penalty for sparse variable selection
- Elastic net: combining L1 and L2 penalties
- Bias-variance tradeoff: the general principle at work
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in R^n (Layer 0A)
- Convex Optimization Basics (Layer 1)
Builds on This
- Elastic Net (Layer 2)