

Ridge vs. Lasso Regression

L2 penalty shrinks all coefficients toward zero; L1 penalty drives some exactly to zero. Ridge has a closed-form solution and handles multicollinearity; Lasso performs variable selection but requires iterative solvers.

What Each Measures

Both Ridge and Lasso are regularized linear regression methods that add a penalty to the ordinary least squares (OLS) objective to prevent overfitting. They differ in the shape of the penalty, and that geometric difference produces different behavior.

Ridge adds the squared L2 norm of the coefficient vector:

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\}$$

Lasso adds the L1 norm of the coefficient vector:

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}$$

Both control model complexity through $\lambda \geq 0$. The question is how they control it.
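To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and NumPy; the synthetic data and penalty values are illustrative) that fits both estimators on the same data and counts surviving coefficients:

```python
# Minimal sketch: fit Ridge and Lasso on the same synthetic data and compare
# how many coefficients remain nonzero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic problem: 100 samples, 20 features, only 5 truly informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("Lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # typically far fewer
```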

Side-by-Side Statement

Definition

Ridge Regression (Tikhonov Regularization)

The Ridge estimator has a closed-form solution:

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$

Adding $\lambda I$ to $X^T X$ makes the matrix invertible even when $X$ is rank-deficient. Every coefficient is shrunk toward zero, but none is set exactly to zero.
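A plain NumPy sketch of this closed form (the function name ridge_closed_form and the test data are illustrative):

```python
# Closed-form Ridge estimator via a linear solve.
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (X^T X + lam * I)^{-1} X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)   # solving the system is more stable than inverting

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = rng.normal(size=10)
y = X @ beta_true + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=1.0))
```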

Definition

Lasso Regression (Least Absolute Shrinkage and Selection Operator)

The Lasso objective (here with the loss scaled by $\frac{1}{2n}$, a common convention in software implementations) is:

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

There is no closed-form solution (the L1 norm is not differentiable at zero). Lasso must be solved iteratively, typically via coordinate descent or proximal gradient methods. Crucially, Lasso can set coefficients exactly to zero, performing automatic variable selection.
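Because each coordinate update reduces to a soft-thresholding step, a bare-bones coordinate descent fits in a few lines. The sketch below follows the $\frac{1}{2n}$-scaled objective above; it is illustrative only and omits the convergence checks and optimizations of production solvers such as scikit-learn's Lasso:

```python
# Coordinate descent for the Lasso: cycle over features, soft-thresholding each one.
import numpy as np

def soft_threshold(a, thresh):
    """S(a, t) = sign(a) * max(|a| - t, 0): the proximal operator of the L1 norm."""
    return np.sign(a) * np.maximum(np.abs(a) - thresh, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n          # correlation of feature j with the residual
            z = X[:, j] @ X[:, j] / n        # scaling term for feature j
            beta[j] = soft_threshold(rho, lam) / z
    return beta
```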

The Geometry: Why the Shape Matters

The key insight is geometric. In the constrained form, regularization restricts $\beta$ to a ball around the origin: $\|\beta\|_2^2 \leq t$ for Ridge, $\|\beta\|_1 \leq t$ for Lasso. The OLS loss defines elliptical contours centered at the unconstrained solution, and the constrained optimum is the first point where an expanding contour touches the constraint region. The L1 ball is a diamond whose corners lie on the coordinate axes, so this first contact often happens at a corner, where some coordinates of $\beta$ are exactly zero.

This is the geometric explanation for Lasso sparsity: corners of the L1 ball are attractors for the constrained optimum. The L2 ball has no corners, so Ridge almost never produces exact zeros.

Where Each Is Stronger

Ridge wins when all features contribute

If the true model uses all $p$ features with small, roughly equal coefficients, Ridge is the natural choice. It shrinks everything proportionally without discarding any feature. Ridge also handles multicollinearity gracefully: correlated features get roughly equal, dampened coefficients rather than wildly oscillating ones.

Lasso wins when the true model is sparse

If only $s \ll p$ features are truly relevant, Lasso can identify them by zeroing out the rest. In high-dimensional settings ($p \gg n$), Lasso's sparsity is not just convenient; it is statistically necessary. The Lasso achieves the rate $\|\hat{\beta} - \beta^*\|_2 = O(\sqrt{s \log p / n})$ under restricted isometry or compatibility conditions.

Ridge wins on computational simplicity

Ridge has a closed-form solution requiring only a matrix inversion (or an SVD). Lasso requires iterative algorithms: coordinate descent, ISTA/FISTA, or ADMM. For very large $p$, the iterative cost of Lasso can be substantial.
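For example, a single SVD $X = U\Sigma V^T$ yields Ridge solutions for an entire grid of $\lambda$ values, since substituting it into the closed form gives $\hat{\beta}(\lambda) = V\,\mathrm{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^T y$. A NumPy sketch (function name illustrative):

```python
# One SVD serves a whole grid of lambda values for Ridge.
import numpy as np

def ridge_path_svd(X, y, lambdas):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    # One column of coefficients per lambda value.
    return np.column_stack([Vt.T @ ((s / (s**2 + lam)) * Uty) for lam in lambdas])
```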

Lasso wins on interpretability

A Lasso model with 8 nonzero features out of 500 is immediately interpretable. A Ridge model with 500 small but nonzero coefficients is not. In scientific applications where you want to know which features matter, Lasso is preferred.

Where Each Fails

Ridge fails at variable selection

Ridge never sets coefficients to zero. If the true model is sparse and you need to identify relevant features, Ridge cannot do it. You can threshold small Ridge coefficients post-hoc, but this is ad hoc and lacks the theoretical guarantees of Lasso.

Lasso fails with groups of correlated features

When features are highly correlated, Lasso tends to select one feature from the group and zero out the rest. Which feature it selects can be unstable across samples. This is the "Lasso instability" problem. Ridge, by contrast, gives roughly equal weight to correlated features.
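A small simulation makes this concrete (assuming scikit-learn; the data and penalty strengths are illustrative, and the exact split depends on the sample):

```python
# Two nearly identical features carry the same signal: Ridge splits the weight,
# Lasso tends to concentrate it on one feature and zero the other.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # almost a duplicate of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # roughly equal weights
print("Lasso:", Lasso(alpha=0.5).fit(X, y).coef_)   # typically one near zero
```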

Lasso fails when $p$ is large and all features matter

If the true coefficient vector is dense (many small nonzero entries), Lasso's bias from L1 shrinkage can be severe. The L1 penalty shrinks large coefficients less than small ones (in relative terms), but it shrinks all coefficients, and the bias does not vanish as fast as Ridge's in the dense regime.

Key Assumptions That Differ

| | Ridge | Lasso |
| --- | --- | --- |
| Penalty | $\lambda\lVert\beta\rVert_2^2$ | $\lambda\lVert\beta\rVert_1$ |
| Geometry | Smooth sphere | Diamond with corners |
| Sparsity | No exact zeros | Exact zeros |
| Closed form | Yes: $(X^TX + \lambda I)^{-1}X^Ty$ | No, iterative solver |
| Multicollinearity | Handles well (groups get equal weight) | Handles poorly (picks one from group) |
| Assumption on $\beta^*$ | Small norm (dense) | Sparse (few nonzero) |
| Rate (high-dim) | $O(\sqrt{p/n})$ | $O(\sqrt{s\log p/n})$ |
| Solver | Matrix inverse / SVD | Coordinate descent / proximal gradient |

The Compromise: Elastic Net

Definition

Elastic Net

The elastic net combines both penalties:

$$\hat{\beta}_{\text{EN}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}$$

Equivalently, with mixing parameter $\alpha \in [0, 1]$:

$$\hat{\beta}_{\text{EN}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + (1 - \alpha)\|\beta\|_2^2\right) \right\}$$

When $\alpha = 1$, this is Lasso. When $\alpha = 0$, this is Ridge.

Elastic net inherits Lasso's variable selection while also handling correlated features like Ridge. When features come in groups, elastic net tends to select the whole group rather than one representative. The cost is an additional hyperparameter to tune.
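A sketch using scikit-learn's ElasticNet, whose l1_ratio parameter plays the role of the mixing parameter $\alpha$ above (scikit-learn's internal penalty scaling differs by a constant factor on the L2 term, so the correspondence is up to reparameterization; the data here are synthetic and illustrative):

```python
# Elastic net: L1 and L2 penalties combined, mixing controlled by l1_ratio.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # l1_ratio=1 -> Lasso, 0 -> Ridge
print("Nonzero coefficients:", (enet.coef_ != 0).sum())
```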

What to Memorize

  1. Ridge = L2 penalty = smooth ball = shrinks all, zeros none = closed form
  2. Lasso = L1 penalty = diamond = shrinks and zeros = iterative solver
  3. Geometry: L1 ball has corners on axes; loss contours hit corners, producing sparsity
  4. Ridge rate: $O(\sqrt{p/n})$, all features matter
  5. Lasso rate: $O(\sqrt{s \log p / n})$, only $s$ features matter
  6. Elastic net: both penalties, gets grouping + sparsity

When a Researcher Would Use Each

Example

Genomics: finding relevant genes

You have $n = 200$ patients and $p = 20{,}000$ gene expression features. The true signal likely involves a small number of genes. Use Lasso (or elastic net) to select the relevant genes. Ridge would give you 20,000 small coefficients with no guidance on which genes matter.
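A simulated stand-in for this setup (not real expression data; the dimensions and signal placement are illustrative) using scikit-learn's LassoCV to choose $\lambda$ by cross-validation:

```python
# p >> n: 200 samples, 20,000 features, only a handful truly informative.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s = 200, 20_000, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 3.0                       # only the first 10 features carry signal
y = X @ beta + rng.normal(size=n)

model = LassoCV(cv=5, n_alphas=50).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("Selected features:", selected)
```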

Example

Time series forecasting with many lags

You include 50 lagged features in a regression. All lags contribute some predictive power, and successive lags are highly correlated. Use Ridge to stabilize the estimates without discarding any lag. Lasso would arbitrarily drop some lags from correlated groups.
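A sketch of this setup with a simulated series (assuming scikit-learn and NumPy; the series, lag count, and penalty strength are illustrative):

```python
# Build 50 lag columns from one series and fit Ridge so that correlated
# successive lags share the weight instead of being dropped.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=1_000))          # a random-walk series, for illustration
n_lags = 50

# Design matrix whose j-th column is the series lagged by j+1 steps.
X = np.column_stack([series[n_lags - j - 1 : -j - 1] for j in range(n_lags)])
y = series[n_lags:]

model = Ridge(alpha=10.0).fit(X, y)
print("All 50 lag coefficients kept:", (model.coef_ != 0).all())
```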

Example

Mixed relevance with correlated groups

You have gene expression data where genes come in co-regulated clusters, but only some clusters are relevant. Use elastic net with moderate $\alpha$ (say $0.5$) to select relevant clusters while keeping within-cluster stability.

Common Confusions

Watch Out

Lasso does not always produce sparse solutions

Lasso produces sparse solutions only when $\lambda$ is large enough. As $\lambda \to 0$, the Lasso solution approaches OLS and all coefficients are nonzero. The regularization path from large to small $\lambda$ is what produces the variable selection: features enter the model one by one as $\lambda$ decreases. Understanding the regularization path (via LARS or coordinate descent) is essential for using Lasso in practice.
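The path is easy to inspect directly. The sketch below uses scikit-learn's lasso_path on synthetic data to show the nonzero count growing as $\lambda$ (called alpha in scikit-learn) decreases:

```python
# Trace the Lasso regularization path: features enter as the penalty shrinks.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
for alpha, coef in zip(alphas, coefs.T):            # alphas are returned largest first
    print(f"alpha={alpha:8.2f}  nonzero={np.sum(coef != 0)}")
```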

Watch Out

The Bayesian interpretation differs

Ridge corresponds to a Gaussian prior on $\beta$: $\beta_j \sim \mathcal{N}(0, \tau^2)$. Lasso corresponds to a Laplace prior: $\beta_j \sim \text{Laplace}(0, b)$. The Laplace prior has heavier tails and a sharper peak at zero, which is why it encourages sparsity. But the equivalence holds only for the MAP point estimate: the full Bayesian posterior under a Laplace prior is not sparse.
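As a quick check of the correspondence, here is the MAP calculation under a Gaussian likelihood with noise variance $\sigma^2$ and independent Laplace priors with scale $b$ (the symbols $\sigma^2$ and $b$ are the standard noise and prior parameters, not defined elsewhere in this article; constants are dropped):

$$
\begin{aligned}
\hat{\beta}_{\text{MAP}} &= \arg\max_{\beta} \; p(y \mid X, \beta)\, p(\beta)
= \arg\min_{\beta} \left\{ -\log p(y \mid X, \beta) - \log p(\beta) \right\} \\
&= \arg\min_{\beta} \left\{ \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1 \right\}
\end{aligned}
$$

which matches the Lasso objective above with $\lambda = 2\sigma^2 / b$ after multiplying through by $2\sigma^2$.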

Watch Out

L1 regularization does not mean L1 loss

Lasso uses L1 penalty on the coefficients but L2 (squared) loss on the residuals. Do not confuse this with least absolute deviations (LAD) regression, which uses L1 loss on residuals but no penalty on coefficients.