What Each Measures
Both Ridge and Lasso are regularized linear regression methods that add a penalty to the ordinary least squares (OLS) objective to prevent overfitting. They differ in the shape of the penalty, and that geometric difference produces different behavior.
Ridge adds the squared L2 norm of the coefficient vector:

$$\lambda \|\beta\|_2^2 = \lambda \sum_{j=1}^{p} \beta_j^2$$

Lasso adds the L1 norm of the coefficient vector:

$$\lambda \|\beta\|_1 = \lambda \sum_{j=1}^{p} |\beta_j|$$
Both control model complexity through the tuning parameter $\lambda \ge 0$. The question is how they control it.
Side-by-Side Statement
Ridge Regression (Tikhonov Regularization)
The Ridge estimator minimizes $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$ and has a closed-form solution:

$$\hat\beta^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

Adding $\lambda I$ to $X^\top X$ makes the matrix invertible even when $X^\top X$ is rank-deficient. Every coefficient is shrunk toward zero, but none is set exactly to zero.
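The closed form takes only a few lines of NumPy. This is a minimal sketch on synthetic data; the function name `ridge_closed_form` and the problem sizes are made up for the example:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator via the closed form (X^T X + lam*I)^{-1} X^T y."""
    p = X.shape[1]
    # lam * I makes X^T X + lam*I invertible even when X^T X is rank-deficient.
    A = X.T @ X + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y)  # solve() is more stable than inv()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(50)

beta_ridge = ridge_closed_form(X, y, lam=1.0)
# All coefficients are shrunk toward zero, but none is exactly zero.
print(np.count_nonzero(beta_ridge))  # 10
```

Using `np.linalg.solve` rather than explicitly inverting the matrix is the standard numerically stable choice.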
Lasso Regression (Least Absolute Shrinkage and Selection Operator)
The Lasso objective is:

$$\hat\beta^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
There is no closed-form solution (the L1 norm is not differentiable at zero). Lasso must be solved iteratively, typically via coordinate descent or proximal gradient methods. Crucially, Lasso can set coefficients exactly to zero, performing automatic variable selection.
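The exact zeros are easy to see in practice. A sketch using scikit-learn's `Lasso` on synthetic data (the dimensions and the penalty value `alpha=0.1` are arbitrary choices for the example):

```python
# Lasso zeroing coefficients: only 3 of 20 features carry signal.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]          # only these features matter
y = X @ beta_true + 0.1 * rng.standard_normal(100)

# sklearn's `alpha` plays the role of lambda here.
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.count_nonzero(lasso.coef_))      # far fewer than 20 nonzero
```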
The Geometry: Why the Shape Matters
The key insight is geometric. In constrained form, regularization requires $\beta$ to lie within a ball around the origin: $\|\beta\|_2 \le t$ for Ridge, $\|\beta\|_1 \le t$ for Lasso. The squared-error loss has elliptical contours centered at the OLS solution.
- The L2 ball is a smooth sphere. The loss contours will generically touch it at a point where no coordinate is zero.
- The L1 ball is a diamond (cross-polytope) with corners on the coordinate axes. The loss contours are much more likely to touch the diamond at a corner, which means one or more coordinates are exactly zero.
This is the geometric explanation for Lasso sparsity: corners of the L1 ball are attractors for the constrained optimum. The L2 ball has no corners, so Ridge almost never produces exact zeros.
Where Each Is Stronger
Ridge wins when all features contribute
If the true model uses all features with small, roughly equal coefficients, Ridge is the natural choice. It shrinks everything proportionally without discarding any feature. Ridge also handles multicollinearity gracefully: correlated features get roughly equal, dampened coefficients rather than wildly oscillating ones.
Lasso wins when the true model is sparse
If only $s \ll p$ features are truly relevant, Lasso can identify them by zeroing out the rest. In high-dimensional settings ($p \gg n$), Lasso's sparsity is not just convenient; it is statistically necessary. The Lasso achieves the rate $O(s \log p / n)$ under restricted isometry or compatibility conditions.
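A sketch of the sparse high-dimensional regime with $p > n$, again assuming scikit-learn; the sizes $n = 80$, $p = 200$, $s = 5$ and the penalty are illustrative:

```python
# p > n: 200 features, 80 samples, only s = 5 truly relevant.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s = 80, 200, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 3.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

fit = Lasso(alpha=0.2).fit(X, y)
support = np.flatnonzero(fit.coef_)
# The true support {0, ..., 4} is recovered (possibly with a few extras).
print(support)
```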
Ridge wins on computational simplicity
Ridge has a closed-form solution requiring only a matrix inversion (or an SVD). Lasso requires iterative algorithms: coordinate descent, ISTA/FISTA, or ADMM. For very large $p$, the iterative cost of Lasso can be substantial.
Lasso wins on interpretability
A Lasso model with 8 nonzero features out of 500 is immediately interpretable. A Ridge model with 500 small but nonzero coefficients is not. In scientific applications where you want to know which features matter, Lasso is preferred.
Where Each Fails
Ridge fails at variable selection
Ridge never sets coefficients to zero. If the true model is sparse and you need to identify relevant features, Ridge cannot do it. You can threshold small Ridge coefficients post-hoc, but this is ad hoc and lacks the theoretical guarantees of Lasso.
Lasso fails with groups of correlated features
When features are highly correlated, Lasso tends to select one feature from the group and zero out the rest. Which feature it selects can be unstable across samples. This is the "Lasso instability" problem. Ridge, by contrast, gives roughly equal weight to correlated features.
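The contrast shows up clearly on two nearly duplicate features (a sketch assuming scikit-learn; the noise scale and penalty values are arbitrary):

```python
# Two nearly identical features. Lasso concentrates weight on one of them;
# Ridge splits the weight almost evenly.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)   # correlation ~ 1 with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print(lasso.coef_)   # one coefficient large, the other at or near zero
print(ridge.coef_)   # both roughly equal
```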
Lasso fails when $p$ is large and all features matter
If the true coefficient vector is dense (many small nonzero entries), Lasso's bias from L1 shrinkage can be severe. The L1 penalty shrinks large coefficients less than small ones (in relative terms), but it shrinks all coefficients, and the bias does not vanish as fast as Ridge's in the dense regime.
Key Assumptions That Differ
| | Ridge | Lasso |
|---|---|---|
| Penalty | $\lambda\lVert\beta\rVert_2^2$ | $\lambda\lVert\beta\rVert_1$ |
| Geometry | Smooth sphere | Diamond with corners |
| Sparsity | No exact zeros | Exact zeros |
| Closed form | Yes: $(X^\top X + \lambda I)^{-1}X^\top y$ | No, iterative solver |
| Multicollinearity | Handles well (groups get roughly equal weight) | Handles poorly (picks one from group) |
| Assumption on $\beta^*$ | Small norm (dense) | Sparse (few nonzero entries) |
| Rate (high-dim) | $O(p/n)$ | $O(s \log p / n)$ |
| Solver | Matrix inverse / SVD | Coordinate descent / proximal gradient |
The Compromise: Elastic Net
Elastic Net
The elastic net combines both penalties:

$$\hat\beta = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$
Equivalently, with mixing parameter $\alpha \in [0, 1]$:

$$\hat\beta = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right)$$

When $\alpha = 1$, this is Lasso. When $\alpha = 0$, this is Ridge.
Elastic net inherits Lasso's variable selection while also handling correlated features like Ridge. When features come in groups, elastic net tends to select the whole group rather than one representative. The cost is an additional hyperparameter to tune.
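In scikit-learn the mixing parameter is called `l1_ratio` (playing the role of $\alpha$ above). A sketch of the grouping effect on a correlated trio of features, with arbitrary synthetic data:

```python
# Grouping effect: three highly correlated features share one signal.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 100
base = rng.standard_normal(n)
group = [base + 0.05 * rng.standard_normal(n) for _ in range(3)]
noise_feats = [rng.standard_normal(n) for _ in range(7)]
X = np.column_stack(group + noise_feats)
y = 3 * base + 0.1 * rng.standard_normal(n)

# l1_ratio is sklearn's name for the mixing parameter.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_[:3])   # the whole correlated group stays in the model
```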
What to Memorize
- Ridge = L2 penalty = smooth ball = shrinks all, zeros none = closed form
- Lasso = L1 penalty = diamond = shrinks and zeros = iterative solver
- Geometry: L1 ball has corners on axes; loss contours hit corners, producing sparsity
- Ridge rate: $O(p/n)$; all features matter
- Lasso rate: $O(s \log p / n)$; only $s$ features matter
- Elastic net: both penalties, gets grouping + sparsity
When a Researcher Would Use Each
Genomics: finding relevant genes
You have $n$ patients and $p \approx 20{,}000$ gene expression features, with $p \gg n$. The true signal likely involves a small number of genes. Use Lasso (or elastic net) to select the relevant genes. Ridge would give you 20,000 small coefficients with no guidance on which genes matter.
Time series forecasting with many lags
You include 50 lagged features in a regression. All lags contribute some predictive power, and successive lags are highly correlated. Use Ridge to stabilize the estimates without discarding any lag. Lasso would arbitrarily drop some lags from correlated groups.
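A sketch of this setup on synthetic data (assuming scikit-learn; the random-walk series and `alpha=10` are illustrative):

```python
# 50 lagged copies of a slowly varying series: successive lags are
# highly correlated, yet Ridge keeps and stabilizes all of them.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_normal(600))   # random walk: smooth-ish
n_lags, n = 50, 500
X = np.column_stack([series[i : i + n] for i in range(n_lags)])
y = series[n_lags : n_lags + n]                # one step past the lags

ridge = Ridge(alpha=10.0).fit(X, y)
print(np.count_nonzero(ridge.coef_))           # 50: no lag is discarded
```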
Mixed relevance with correlated groups
You have gene expression data where genes come in co-regulated clusters, but only some clusters are relevant. Use elastic net with a moderate mixing parameter $\alpha$ to select relevant clusters while keeping within-cluster stability.
Common Confusions
Lasso does not always produce sparse solutions
Lasso produces sparse solutions only when $\lambda$ is large enough. As $\lambda \to 0$, the Lasso solution approaches OLS and all coefficients are nonzero. The regularization path from large $\lambda$ to small is what produces the variable selection: features enter the model one by one as $\lambda$ decreases. Understanding the regularization path (via LARS or coordinate descent) is essential for using Lasso in practice.
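The path can be computed directly with scikit-learn's `lasso_path`. This sketch counts active features at each penalty value (the alpha grid and synthetic data are arbitrary):

```python
# Count active features along the regularization path, from strong
# penalty (few features) to weak penalty (approaching OLS).
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, 2.0, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(100)

grid = np.logspace(0.5, -2, 30)            # descending penalty grid
alphas, coefs, _ = lasso_path(X, y, alphas=grid)
n_active = [int(np.count_nonzero(coefs[:, k])) for k in range(len(alphas))]
print(n_active)   # grows as the penalty weakens; features enter one by one
```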
The Bayesian interpretation differs
Ridge corresponds to a Gaussian prior on $\beta$: $\beta_j \sim \mathcal{N}(0, \tau^2)$. Lasso corresponds to a Laplace prior: $p(\beta_j) \propto \exp(-|\beta_j|/b)$. The Laplace prior has heavier tails and a sharper peak at zero, which is why it encourages sparsity. But the MAP estimate under a Laplace prior coincides with Lasso only as a point estimate: the full Bayesian posterior is not sparse.
L1 regularization does not mean L1 loss
Lasso uses L1 penalty on the coefficients but L2 (squared) loss on the residuals. Do not confuse this with least absolute deviations (LAD) regression, which uses L1 loss on residuals but no penalty on coefficients.