

Ridge vs. Lasso Regression

L2 penalty shrinks all coefficients toward zero; L1 penalty drives some exactly to zero. Ridge has a closed-form solution and handles multicollinearity; Lasso performs variable selection but requires iterative solvers.

What Each Measures

Both Ridge and Lasso are regularized linear regression methods that add a penalty to the ordinary least squares (OLS) objective to prevent overfitting. They differ in the shape of the penalty, and that geometric difference produces different behavior.

Ridge adds the squared L2 norm of the coefficient vector:

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\}$$

Lasso adds the L1 norm of the coefficient vector:

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}$$

Both control model complexity through $\lambda \geq 0$. The question is how they control it.
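To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and NumPy; the synthetic data and penalty values are illustrative) that fits both estimators on the same data and counts surviving coefficients:

```python
# Minimal sketch: fit Ridge and Lasso on the same synthetic data and compare
# how many coefficients remain nonzero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic problem: 100 samples, 20 features, only 5 truly informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("Lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # typically far fewer
```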

Side-by-Side Statement

Definition

Ridge Regression (Tikhonov Regularization)

The Ridge estimator has a closed-form solution:

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$

Adding $\lambda I$ to $X^T X$ makes the matrix invertible even when $X$ is rank-deficient. Every coefficient is shrunk toward zero, but none is set exactly to zero.
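A plain NumPy sketch of this closed form (the function name ridge_closed_form and the test data are illustrative):

```python
# Closed-form Ridge estimator via a linear solve.
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (X^T X + lam * I)^{-1} X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)   # solving the system is more stable than inverting

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = rng.normal(size=10)
y = X @ beta_true + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=1.0))
```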

Definition

Lasso Regression (Least Absolute Shrinkage and Selection Operator)

The Lasso objective (here with the loss scaled by $\frac{1}{2n}$, a common convention in software implementations) is:

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

There is no closed-form solution (the L1 norm is not differentiable at zero). Lasso must be solved iteratively, typically via coordinate descent or proximal gradient methods. Crucially, Lasso can set coefficients exactly to zero, performing automatic variable selection.
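Because each coordinate update reduces to a soft-thresholding step, a bare-bones coordinate descent fits in a few lines. The sketch below follows the $\frac{1}{2n}$-scaled objective above; it is illustrative only and omits the convergence checks and optimizations of production solvers such as scikit-learn's Lasso:

```python
# Coordinate descent for the Lasso: cycle over features, soft-thresholding each one.
import numpy as np

def soft_threshold(a, thresh):
    """S(a, t) = sign(a) * max(|a| - t, 0): the proximal operator of the L1 norm."""
    return np.sign(a) * np.maximum(np.abs(a) - thresh, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n          # correlation of feature j with the residual
            z = X[:, j] @ X[:, j] / n        # scaling term for feature j
            beta[j] = soft_threshold(rho, lam) / z
    return beta
```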

The Geometry: Why the Shape Matters

The key insight is geometric. In the constrained form, regularization restricts $\beta$ to a ball around the origin: $\|\beta\|_2^2 \leq t$ for Ridge, $\|\beta\|_1 \leq t$ for Lasso. The OLS loss defines elliptical contours centered at the unconstrained solution, and the constrained optimum is the first point where an expanding contour touches the constraint region. The L1 ball is a diamond whose corners lie on the coordinate axes, so this first contact often happens at a corner, where some coordinates of $\beta$ are exactly zero.

This is the geometric explanation for Lasso sparsity: corners of the L1 ball are attractors for the constrained optimum. The L2 ball has no corners, so Ridge almost never produces exact zeros.

Where Each Is Stronger

Ridge wins when all features contribute

If the true model uses all $p$ features with small, roughly equal coefficients, Ridge is the natural choice. It shrinks everything proportionally without discarding any feature. Ridge also handles multicollinearity gracefully: correlated features get roughly equal, dampened coefficients rather than wildly oscillating ones.

Lasso wins when the true model is sparse

If only $s \ll p$ features are truly relevant, Lasso can identify them by zeroing out the rest. In high-dimensional settings ($p \gg n$), Lasso's sparsity is not just convenient; it is statistically necessary. The Lasso achieves the rate $\|\hat{\beta} - \beta^*\|_2 = O(\sqrt{s \log p / n})$ under restricted isometry or compatibility conditions.

Ridge wins on computational simplicity

Ridge has a closed-form solution requiring only a matrix inversion (or an SVD). Lasso requires iterative algorithms: coordinate descent, ISTA/FISTA, or ADMM. For very large $p$, the iterative cost of Lasso can be substantial.
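For example, a single SVD $X = U\Sigma V^T$ yields Ridge solutions for an entire grid of $\lambda$ values, since substituting it into the closed form gives $\hat{\beta}(\lambda) = V\,\mathrm{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^T y$. A NumPy sketch (function name illustrative):

```python
# One SVD serves a whole grid of lambda values for Ridge.
import numpy as np

def ridge_path_svd(X, y, lambdas):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    # One column of coefficients per lambda value.
    return np.column_stack([Vt.T @ ((s / (s**2 + lam)) * Uty) for lam in lambdas])
```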

Lasso wins on interpretability

A Lasso model with 8 nonzero features out of 500 is immediately interpretable. A Ridge model with 500 small but nonzero coefficients is not. In scientific applications where you want to know which features matter, Lasso is preferred.

Where Each Fails

Ridge fails at variable selection

Ridge never sets coefficients to zero. If the true model is sparse and you need to identify relevant features, Ridge cannot do it. You can threshold small Ridge coefficients post-hoc, but this is ad hoc and lacks the theoretical guarantees of Lasso.

Lasso fails with groups of correlated features

When features are highly correlated, Lasso tends to select one feature from the group and zero out the rest. Which feature it selects can be unstable across samples. This is the "Lasso instability" problem. Ridge, by contrast, gives roughly equal weight to correlated features.
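A small simulation makes this concrete (assuming scikit-learn; the data and penalty strengths are illustrative, and the exact split depends on the sample):

```python
# Two nearly identical features carry the same signal: Ridge splits the weight,
# Lasso tends to concentrate it on one feature and zero the other.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # almost a duplicate of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # roughly equal weights
print("Lasso:", Lasso(alpha=0.5).fit(X, y).coef_)   # typically one near zero
```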

Lasso fails when $p$ is large and all features matter

If the true coefficient vector is dense (many small nonzero entries), Lasso's bias from L1 shrinkage can be severe. The L1 penalty shrinks large coefficients less than small ones (in relative terms), but it shrinks all coefficients, and the bias does not vanish as fast as Ridge's in the dense regime.

Key Assumptions That Differ

| | Ridge | Lasso |
| --- | --- | --- |
| Penalty | $\lambda\lVert\beta\rVert_2^2$ | $\lambda\lVert\beta\rVert_1$ |
| Geometry | Smooth sphere | Diamond with corners |
| Sparsity | No exact zeros | Exact zeros |
| Closed form | Yes: $(X^TX + \lambda I)^{-1}X^Ty$ | No, iterative solver |
| Multicollinearity | Handles well (groups get equal weight) | Handles poorly (picks one from group) |
| Assumption on $\beta^*$ | Small norm (dense) | Sparse (few nonzero) |
| Rate (high-dim) | $O(\sqrt{p/n})$ | $O(\sqrt{s\log p/n})$ |
| Solver | Matrix inverse / SVD | Coordinate descent / proximal gradient |

The Compromise: Elastic Net

Definition

Elastic Net

The elastic net combines both penalties:

$$\hat{\beta}_{\text{EN}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}$$

Equivalently, with mixing parameter $\alpha \in [0, 1]$:

$$\hat{\beta}_{\text{EN}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + (1 - \alpha)\|\beta\|_2^2\right) \right\}$$

When $\alpha = 1$, this is Lasso. When $\alpha = 0$, this is Ridge.

Elastic net inherits Lasso's variable selection while also handling correlated features like Ridge. When features come in groups, elastic net tends to select the whole group rather than one representative. The cost is an additional hyperparameter to tune.
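A sketch using scikit-learn's ElasticNet, whose l1_ratio parameter plays the role of the mixing parameter $\alpha$ above (scikit-learn's internal penalty scaling differs by a constant factor on the L2 term, so the correspondence is up to reparameterization; the data here are synthetic and illustrative):

```python
# Elastic net: L1 and L2 penalties combined, mixing controlled by l1_ratio.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # l1_ratio=1 -> Lasso, 0 -> Ridge
print("Nonzero coefficients:", (enet.coef_ != 0).sum())
```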

What to Memorize

  1. Ridge = L2 penalty = smooth ball = shrinks all, zeros none = closed form
  2. Lasso = L1 penalty = diamond = shrinks and zeros = iterative solver
  3. Geometry: L1 ball has corners on axes; loss contours hit corners, producing sparsity
  4. Ridge rate: $O(\sqrt{p/n})$, all features matter
  5. Lasso rate: $O(\sqrt{s \log p / n})$, only $s$ features matter
  6. Elastic net: both penalties, gets grouping + sparsity

When a Researcher Would Use Each

Example

Genomics: finding relevant genes

You have $n = 200$ patients and $p = 20{,}000$ gene expression features. The true signal likely involves a small number of genes. Use Lasso (or elastic net) to select the relevant genes. Ridge would give you 20,000 small coefficients with no guidance on which genes matter.
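A simulated stand-in for this setup (not real expression data; the dimensions and signal placement are illustrative) using scikit-learn's LassoCV to choose $\lambda$ by cross-validation:

```python
# p >> n: 200 samples, 20,000 features, only a handful truly informative.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s = 200, 20_000, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 3.0                       # only the first 10 features carry signal
y = X @ beta + rng.normal(size=n)

model = LassoCV(cv=5, n_alphas=50).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("Selected features:", selected)
```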

Example

Time series forecasting with many lags

You include 50 lagged features in a regression. All lags contribute some predictive power, and successive lags are highly correlated. Use Ridge to stabilize the estimates without discarding any lag. Lasso would arbitrarily drop some lags from correlated groups.
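A sketch of this setup with a simulated series (assuming scikit-learn and NumPy; the series, lag count, and penalty strength are illustrative):

```python
# Build 50 lag columns from one series and fit Ridge so that correlated
# successive lags share the weight instead of being dropped.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=1_000))          # a random-walk series, for illustration
n_lags = 50

# Design matrix whose j-th column is the series lagged by j+1 steps.
X = np.column_stack([series[n_lags - j - 1 : -j - 1] for j in range(n_lags)])
y = series[n_lags:]

model = Ridge(alpha=10.0).fit(X, y)
print("All 50 lag coefficients kept:", (model.coef_ != 0).all())
```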

Example

Mixed relevance with correlated groups

You have gene expression data where genes come in co-regulated clusters, but only some clusters are relevant. Use elastic net with moderate $\alpha$ (say $0.5$) to select relevant clusters while keeping within-cluster stability.

Common Confusions

Watch Out

Lasso does not always produce sparse solutions

Lasso produces sparse solutions only when $\lambda$ is large enough. As $\lambda \to 0$, the Lasso solution approaches OLS and all coefficients are nonzero. The regularization path from large to small $\lambda$ is what produces the variable selection: features enter the model one by one as $\lambda$ decreases. Understanding the regularization path (via LARS or coordinate descent) is essential for using Lasso in practice.
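The path is easy to inspect directly. The sketch below uses scikit-learn's lasso_path on synthetic data to show the nonzero count growing as $\lambda$ (called alpha in scikit-learn) decreases:

```python
# Trace the Lasso regularization path: features enter as the penalty shrinks.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
for alpha, coef in zip(alphas, coefs.T):            # alphas are returned largest first
    print(f"alpha={alpha:8.2f}  nonzero={np.sum(coef != 0)}")
```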

Watch Out

The Bayesian interpretation differs

Ridge corresponds to a Gaussian prior on $\beta$: $\beta_j \sim \mathcal{N}(0, \tau^2)$. Lasso corresponds to a Laplace prior: $\beta_j \sim \text{Laplace}(0, b)$. The Laplace prior has heavier tails and a sharper peak at zero, which is why it encourages sparsity. But the equivalence holds only for the MAP point estimate: the full Bayesian posterior under a Laplace prior is not sparse.
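As a quick check of the correspondence, here is the MAP calculation under a Gaussian likelihood with noise variance $\sigma^2$ and independent Laplace priors with scale $b$ (the symbols $\sigma^2$ and $b$ are the standard noise and prior parameters, not defined elsewhere in this article; constants are dropped):

$$
\begin{aligned}
\hat{\beta}_{\text{MAP}} &= \arg\max_{\beta} \; p(y \mid X, \beta)\, p(\beta)
= \arg\min_{\beta} \left\{ -\log p(y \mid X, \beta) - \log p(\beta) \right\} \\
&= \arg\min_{\beta} \left\{ \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1 \right\}
\end{aligned}
$$

which matches the Lasso objective above with $\lambda = 2\sigma^2 / b$ after multiplying through by $2\sigma^2$.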

Watch Out

L1 regularization does not mean L1 loss

Lasso uses L1 penalty on the coefficients but L2 (squared) loss on the residuals. Do not confuse this with least absolute deviations (LAD) regression, which uses L1 loss on residuals but no penalty on coefficients.