What Each Does
Both L1 and L2 regularization add a penalty to the loss function to prevent overfitting. Given a loss $L(\beta)$ and parameter vector $\beta \in \mathbb{R}^p$, the regularized objectives are:

L2 regularization (weight decay, Tikhonov, Ridge): $\min_\beta \; L(\beta) + \lambda \sum_j \beta_j^2 = L(\beta) + \lambda \lVert\beta\rVert_2^2$

L1 regularization (Lasso): $\min_\beta \; L(\beta) + \lambda \sum_j \lvert\beta_j\rvert = L(\beta) + \lambda \lVert\beta\rVert_1$
The difference is a single exponent: squaring versus absolute value. That difference controls whether the solution is sparse.
Why L1 Produces Zeros: The Geometry
The L1 constraint set $\{\beta : \lVert\beta\rVert_1 \le t\}$ is a diamond (cross-polytope) with corners on the coordinate axes. The L2 constraint set $\{\beta : \lVert\beta\rVert_2 \le t\}$ is a smooth sphere. The loss function defines elliptical contours of constant value. The regularized optimum is where these contours first touch the constraint set.
A smooth ellipse generically touches a smooth sphere at a point where no coordinate is zero. But a smooth ellipse touches a diamond at a corner, where one or more coordinates are exactly zero. This is the geometric explanation for L1 sparsity: the corners of the L1 ball are attractors for the constrained optimum.
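The geometric claim can be checked numerically. Below is a minimal sketch (plain NumPy; the synthetic data and all names are illustrative) that fits the Lasso by proximal gradient descent (ISTA, built from the soft thresholding operator derived in the next section) and Ridge by its closed form, then counts exact zeros in each solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 20, 3
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = [3.0, -2.0, 1.5]        # sparse ground truth: 3 of 20 features
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam = 10.0

def soft_threshold(z, t):
    """Soft thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# ISTA for (1/2)||y - Xb||^2 + lam * ||b||_1
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
b = np.zeros(p)
for _ in range(2000):
    b = soft_threshold(b - step * X.T @ (X @ b - y), step * lam)

# Ridge closed form: (X^T X + lam I)^{-1} X^T y
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("Lasso exact zeros:", int(np.sum(b == 0)))        # many coordinates are exactly 0
print("Ridge exact zeros:", int(np.sum(b_ridge == 0)))  # small coefficients, but none is 0
```

The Lasso solution lands on a corner of the diamond (most coordinates exactly zero); the Ridge solution sits on the smooth sphere, so every coordinate is shrunk but nonzero.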
The KKT Condition and Soft Thresholding
For the L1-regularized least squares problem $\min_\beta \tfrac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$ with design matrix $X$ and response $y$, the optimality condition at coordinate $j$ involves the subgradient $\partial\lvert\beta_j\rvert$ because $\lvert\beta_j\rvert$ is not differentiable at zero. With unit-norm columns, the KKT condition gives:

$$\hat\beta_j = S_\lambda(z_j) = \operatorname{sign}(z_j)\,\max\big(\lvert z_j\rvert - \lambda,\; 0\big)$$

where $z_j = x_j^\top\big(y - X_{-j}\hat\beta_{-j}\big)$ is the partial residual (the OLS solution for coordinate $j$ holding all others fixed). This is the soft thresholding operator $S_\lambda$. If $\lvert z_j\rvert \le \lambda$, the coefficient is set exactly to zero. If $\lvert z_j\rvert > \lambda$, it is shrunk toward zero by $\lambda$.
For L2 (penalty $\tfrac{\lambda}{2}\lVert\beta\rVert_2^2$), the analogous update is:

$$\hat\beta_j = \frac{z_j}{1 + \lambda}$$

L2 shrinks by a multiplicative factor. It never reaches zero (dividing by $1 + \lambda$ gives a smaller number, never zero). L1 shrinks by subtraction: it can subtract enough to reach zero.
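The two updates are easy to compare side by side. A small sketch (illustrative values):

```python
import numpy as np

def soft_threshold(z, lam):
    # L1 proximal update: subtract lam from the magnitude, clamp at zero
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_update(z, lam):
    # L2 coordinate update: multiplicative shrinkage, never exactly zero
    return z / (1.0 + lam)

z = np.array([3.0, 0.4, -0.2, -5.0])
print(soft_threshold(z, 1.0))  # entries with |z| <= 1 become exactly 0
print(ridge_update(z, 1.0))    # every entry halves; none becomes 0
```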
Side-by-Side Comparison
| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | $\lambda \lVert\beta\rVert_1$ | $\lambda \lVert\beta\rVert_2^2$ |
| Constraint geometry | Diamond (corners on axes) | Sphere (smooth, no corners) |
| Produces exact zeros | Yes, when $\lvert z_j\rvert \le \lambda$ | No, coefficients shrink but stay nonzero |
| Shrinkage type | Subtractive (soft thresholding) | Multiplicative (proportional shrinkage) |
| Closed-form (linear) | No, requires iterative solver | Yes: $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$ |
| Solver | Coordinate descent, ISTA, ADMM | Matrix inverse, SVD, gradient descent |
| Bayesian prior | Laplace: $p(\beta_j) \propto e^{-\lambda \lvert\beta_j\rvert}$ | Gaussian: $p(\beta_j) \propto e^{-\lambda \beta_j^2}$ |
| Best when | True model is sparse ($s \ll p$) | True model is dense (all features contribute) |
| Multicollinearity | Picks one from correlated group | Distributes weight across correlated group |
| Convergence rate (estimation error) | $O\big(\sqrt{s \log p / n}\big)$ | $O\big(\sqrt{p / n}\big)$ |
| Differentiable | No (not at $\beta_j = 0$) | Yes, everywhere |
When Each Wins
L1 wins: sparse high-dimensional problems
When $p \gg n$ and only $s \ll p$ features are relevant, L1 is statistically optimal. The Lasso achieves the minimax estimation rate $O\big(\sqrt{s \log p / n}\big)$ under restricted eigenvalue conditions. L1 also provides automatic variable selection: the zero coefficients tell you which features are irrelevant.
L2 wins: dense models with multicollinearity
When all features contribute with small, roughly equal coefficients, Ridge is preferred. L2 distributes weight across correlated features rather than arbitrarily selecting one. Ridge also has a closed-form solution, making it computationally cheaper. The Ridge rate $O\big(\sqrt{p / n}\big)$ is better than the Lasso rate $O\big(\sqrt{s \log p / n}\big)$ when $s$ is close to $p$.
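A quick check of the weight-distribution behavior, as a sketch (synthetic data; the near-duplicate feature construction is illustrative): Ridge's closed form spreads the signal across two almost perfectly correlated columns rather than putting it all on one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.001 * rng.standard_normal(n)   # almost perfectly correlated with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * rng.standard_normal(n)

lam = 1.0
# Closed form: (X^T X + lam I)^{-1} X^T y
beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta)   # the two coefficients are nearly equal, summing to roughly 2
```

Without the $\lambda I$ term the system is near-singular and the split between the two columns would be essentially arbitrary; the L2 penalty pins down the symmetric solution.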
L2 wins: deep learning weight decay
In neural networks, L2 regularization (weight decay) is standard. Sparsity at the individual weight level is less meaningful when the model has millions of parameters and features are learned, not given. AdamW decouples weight decay from the adaptive gradient, and this L2-style penalty helps control the magnitude of weights without forcing exact zeros.
L1 wins: interpretability requirements
A model with 12 nonzero features out of 5,000 is immediately interpretable. A model with 5,000 small nonzero coefficients is not. In genomics, clinical, and scientific applications, knowing which features matter is often the primary goal.
Solution Uniqueness
The L2-regularized objective is strictly convex (the Hessian is $X^\top X + \lambda I$, which is positive definite for any $\lambda > 0$). The solution is always unique.
The L1-regularized objective is convex but not strictly convex. When features are highly correlated, multiple solutions can achieve the same objective value. The Lasso solution is unique only when $X$ satisfies certain conditions (e.g., the columns of $X$ are in general position). This non-uniqueness with correlated features is a practical limitation: which feature Lasso selects from a correlated group can be unstable across bootstrap samples.
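This instability is easy to reproduce. The sketch below (a toy cyclic coordinate descent, not a production solver; data and names are illustrative) duplicates a column exactly. Depending on the coordinate update order, the solver puts all the weight on one copy or the other, yet both solutions attain the same objective value:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.standard_normal(n)
X = np.column_stack([x, x])                 # exactly duplicated feature
y = 3.0 * x + 0.1 * rng.standard_normal(n)
lam = 5.0

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam, order, iters=200):
    """Cyclic coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in order:
            r = y - X @ b + X[:, j] * b[j]   # partial residual excluding coordinate j
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return b

def objective(b):
    return 0.5 * np.sum((y - X @ b) ** 2) + lam * np.abs(b).sum()

b_a = cd_lasso(X, y, lam, order=[0, 1])     # updates coordinate 0 first
b_b = cd_lasso(X, y, lam, order=[1, 0])     # updates coordinate 1 first
print(b_a, b_b)                              # different supports
print(objective(b_a), objective(b_b))        # same objective value
```

Whichever coordinate is updated first absorbs the whole signal, leaving zero partial correlation for its twin; both are global minimizers, so the "selected" feature is an artifact of solver order.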
The Elastic Net Bridge
The elastic net combines both penalties with mixing parameter $\alpha \in [0, 1]$:

$$\min_\beta \; L(\beta) + \lambda\left(\alpha \lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\right)$$

When $\alpha = 1$, this is Lasso. When $\alpha = 0$, this is Ridge. The L2 component makes the objective strictly convex (unique solution) while the L1 component maintains sparsity. Elastic net selects groups of correlated features together, combining the advantages of both.
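Rerunning the duplicated-column experiment with an elastic net coordinate update (a toy solver sketch; the standard update divides the soft-thresholded correlation by $\lVert x_j\rVert^2 + \lambda(1-\alpha)$) shows the L2 term pulling the two copies toward an equal split instead of an arbitrary one:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.standard_normal(n)
X = np.column_stack([x, x])                  # exactly duplicated feature
y = 3.0 * x + 0.1 * rng.standard_normal(n)
lam, alpha = 20.0, 0.5

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Cyclic coordinate descent for
# (1/2)||y - Xb||^2 + lam * (alpha*||b||_1 + (1-alpha)/2 * ||b||_2^2)
b = np.zeros(2)
col_sq = (X ** 2).sum(axis=0)
for _ in range(500):
    for j in range(2):
        r = y - X @ b + X[:, j] * b[j]       # partial residual excluding j
        b[j] = soft_threshold(X[:, j] @ r, lam * alpha) / (
            col_sq[j] + lam * (1 - alpha))
print(b)   # both coefficients nonzero and (at convergence) equal
```

Unlike the pure-Lasso run, the equal split is the unique minimizer here: the strictly convex L2 term breaks the tie between the duplicated columns.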
Common Confusions
L1 regularization is not the same as L1 loss
L1 regularization penalizes the coefficients: $\lambda \sum_j \lvert\beta_j\rvert$. L1 loss penalizes the residuals: $\sum_i \lvert y_i - x_i^\top\beta\rvert$. Lasso uses L1 regularization with L2 (squared) loss. Least absolute deviations (LAD) regression uses L1 loss with no regularization. These are completely different methods.
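Toy numbers make the distinction concrete (all values here are illustrative):

```python
import numpy as np

beta = np.array([2.0, 0.0, -1.0])            # coefficients
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 2.0]])
y = np.array([0.5, -1.5])

l1_penalty = np.abs(beta).sum()              # L1 *regularization*: sum over coefficients
residuals = y - X @ beta
l1_loss = np.abs(residuals).sum()            # L1 *loss*: sum over residuals

print(l1_penalty)   # 3.0  (|2| + |0| + |-1|)
print(l1_loss)      # 1.0  (|-0.5| + |0.5|)
```

Same norm, applied to two entirely different vectors: one lives in parameter space, the other in data space.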
L2 regularization is not identical to weight decay in all optimizers
For SGD, L2 regularization and weight decay are equivalent (up to a rescaling of $\lambda$). For adaptive optimizers like Adam, they differ. L2 regularization adds $\lambda w$ to the gradient before the adaptive scaling. Weight decay subtracts $\eta\lambda w$ from the parameters after the update. AdamW implements decoupled weight decay, which is the correct formulation for adaptive methods.
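A minimal one-step sketch in plain NumPy (no framework; hyperparameter names follow the usual Adam conventions, and the values are toy) makes the difference visible: the L2 term added to the gradient gets rescaled by the adaptive denominator, while decoupled decay does not.

```python
import numpy as np

lr, beta1, beta2, eps, wd = 0.1, 0.9, 0.999, 1e-8, 0.01

def adam_step(w, g, m, v, t):
    """One bias-corrected Adam update."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w0 = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])                     # raw loss gradient

# (a) L2 regularization: decay enters the gradient, then is rescaled
#     by the adaptive denominator along with everything else.
w_l2, _, _ = adam_step(w0, g + wd * w0, 0.0, 0.0, t=1)

# (b) Decoupled weight decay (AdamW-style): step on the raw gradient,
#     then apply the decay directly to the parameters.
w_adamw, _, _ = adam_step(w0, g, 0.0, 0.0, t=1)
w_adamw = w_adamw - lr * wd * w0

print(w_l2)      # the adaptive rescaling has almost erased the decay
print(w_adamw)   # the decay survives, shrinking each weight toward zero
```

At the first step Adam's update is essentially $\operatorname{sign}$-normalized, so in variant (a) the decay term barely changes the update at all, while in variant (b) every weight is shrunk in proportion to its magnitude.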
Sparsity requires sufficient regularization strength
L1 does not always produce sparse solutions. When $\lambda$ is very small, the Lasso solution approaches OLS and all coefficients are nonzero. Sparsity increases with $\lambda$. The regularization path (from large to small $\lambda$) reveals features entering one at a time, which is the basis of the LARS algorithm.
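A sketch of this path (ISTA over a decreasing $\lambda$ grid on synthetic data; the grid and names are illustrative, and $\lambda_{\max} = \max_j \lvert x_j^\top y\rvert$ is the smallest $\lambda$ with an all-zero solution):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam, iters=1000):
    """Proximal gradient for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft_threshold(b - step * X.T @ (X @ b - y), step * lam)
    return b

lam_max = np.max(np.abs(X.T @ y))   # smallest lambda giving the all-zero solution
counts = []
for lam in [lam_max, lam_max / 10, lam_max / 100, 1e-3]:
    b = ista_lasso(X, y, lam)
    counts.append(int(np.sum(b != 0)))
    print(f"lambda={lam:10.3f}  nonzeros={counts[-1]}")
```

As $\lambda$ shrinks, the support grows from empty toward all $p$ features, which is exactly the path-following picture behind LARS.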
The Bayesian MAP interpretation has limits
Ridge corresponds to a Gaussian prior and Lasso to a Laplace prior, but only for the MAP point estimate. The full Bayesian posterior under a Laplace prior is not sparse. If you want truly sparse Bayesian inference, use spike-and-slab priors or the horseshoe prior, not just a Laplace prior.
References
- Tibshirani, R. (1996). "Regression shrinkage and selection via the Lasso." Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
- Hoerl, A. E. and Kennard, R. W. (1970). "Ridge regression: biased estimation for nonorthogonal problems." Technometrics, 12(1), 55-67.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3.4 (Ridge, Lasso) and 3.8 (Elastic Net).
- Zou, H. and Hastie, T. (2005). "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 7 (Lasso bounds under restricted eigenvalue conditions).
- Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (AdamW and the distinction between L2 and weight decay.)
- Bach, F. (2012). "Optimization with sparsity-inducing penalties." Foundations and Trends in Machine Learning, 4(1), 1-106.