
Comparison

L1 vs. L2 Regularization

L1 (Lasso) penalizes the absolute value of weights, producing sparse solutions via the diamond geometry of the L1 ball. L2 (Ridge) penalizes squared weights, shrinking all coefficients toward zero without eliminating any. The choice depends on whether the true model is sparse or dense.

What Each Does

Both L1 and L2 regularization add a penalty to the loss function to prevent overfitting. Given a loss $\mathcal{L}(\theta)$ and parameter vector $\theta \in \mathbb{R}^p$, the regularized objectives are:

L2 regularization (weight decay, Tikhonov, Ridge):

$$\min_{\theta}\ \mathcal{L}(\theta) + \lambda \|\theta\|_2^2 = \min_{\theta}\ \mathcal{L}(\theta) + \lambda \sum_{j=1}^p \theta_j^2$$

L1 regularization (Lasso):

$$\min_{\theta}\ \mathcal{L}(\theta) + \lambda \|\theta\|_1 = \min_{\theta}\ \mathcal{L}(\theta) + \lambda \sum_{j=1}^p |\theta_j|$$

The difference is a single exponent: squaring versus absolute value. That difference controls whether the solution is sparse.
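To make the two objectives concrete, here is a minimal NumPy sketch; the data, coefficients, and $\lambda$ value are arbitrary illustrations, not recommendations:

```python
import numpy as np

# Synthetic least-squares problem, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

def squared_loss(theta):
    return np.sum((X @ theta - y) ** 2)

def l1_objective(theta, lam):
    """Lasso objective: loss + lam * ||theta||_1 (sum of absolute values)."""
    return squared_loss(theta) + lam * np.sum(np.abs(theta))

def l2_objective(theta, lam):
    """Ridge objective: loss + lam * ||theta||_2^2 (sum of squares)."""
    return squared_loss(theta) + lam * np.sum(theta ** 2)

theta = np.array([0.5, -1.0, 0.0])
print(l1_objective(theta, lam=0.1))
print(l2_objective(theta, lam=0.1))
```

The only difference between the two functions is the single penalty line, mirroring the single-exponent difference in the formulas.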

Why L1 Produces Zeros: The Geometry

The constraint set $\{\theta : \|\theta\|_1 \leq t\}$ is a diamond (cross-polytope) with corners on the coordinate axes. The constraint set $\{\theta : \|\theta\|_2^2 \leq t\}$ is a smooth sphere. The loss function defines elliptical contours of constant value. The regularized optimum is where these contours first touch the constraint set.

A smooth ellipse generically touches a smooth sphere at a point where no coordinate is zero. But a smooth ellipse touches a diamond at a corner, where one or more coordinates are exactly zero. This is the geometric explanation for L1 sparsity: the corners of the L1 ball are attractors for the constrained optimum.

The KKT Condition and Soft Thresholding

For the L1-regularized least squares problem with design matrix $X$ and response $y$, the optimality condition at coordinate $j$ involves the subgradient, because $|\theta_j|$ is not differentiable at zero. Solving the KKT condition coordinate-wise gives:

$$\hat{\theta}_j = \operatorname{sign}(z_j)\,\max(|z_j| - \lambda,\ 0)$$

where $z_j$ is the partial residual (the OLS solution for coordinate $j$ holding all others fixed). This is the soft thresholding operator $S_\lambda(z_j)$. If $|z_j| \leq \lambda$, the coefficient is set exactly to zero. If $|z_j| > \lambda$, it is shrunk toward zero by $\lambda$.

For L2, the analogous update is:

$$\hat{\theta}_j = \frac{z_j}{1 + 2\lambda}$$

L2 shrinks by a multiplicative factor: dividing by $1 + 2\lambda$ gives a smaller number, but never zero. L1 shrinks by subtraction: it can subtract enough to reach zero.
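The two updates can be written directly; a minimal NumPy sketch (the test values are arbitrary):

```python
import numpy as np

def soft_threshold(z, lam):
    """L1 update S_lambda(z): subtract lam from |z|, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """L2 update: multiplicative shrinkage, never exactly zero for z != 0."""
    return z / (1.0 + 2.0 * lam)

z = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(soft_threshold(z, 1.0))  # entries with |z| <= 1 map exactly to 0
print(ridge_shrink(z, 1.0))    # every entry scaled by 1/3, none zero
```

With $\lambda = 1$, soft thresholding zeroes out the three middle entries, while the Ridge update leaves all five nonzero.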

Side-by-Side Comparison

| Property | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty | $\lambda \sum_j \lvert\theta_j\rvert$ | $\lambda \sum_j \theta_j^2$ |
| Constraint geometry | Diamond (corners on axes) | Sphere (smooth, no corners) |
| Produces exact zeros | Yes, when $\lvert z_j \rvert \leq \lambda$ | No, coefficients shrink but stay nonzero |
| Shrinkage type | Subtractive (soft thresholding) | Multiplicative (proportional shrinkage) |
| Closed-form (linear) | No, requires iterative solver | Yes: $(X^T X + \lambda I)^{-1} X^T y$ |
| Solver | Coordinate descent, ISTA, ADMM | Matrix inverse, SVD, gradient descent |
| Bayesian prior | Laplace: $p(\theta_j) \propto e^{-\lvert\theta_j\rvert/b}$ | Gaussian: $p(\theta_j) \propto e^{-\theta_j^2/2\tau^2}$ |
| Best when | True model is sparse ($s \ll p$) | True model is dense (all features contribute) |
| Multicollinearity | Picks one from correlated group | Distributes weight across correlated group |
| Convergence rate | $O(\sqrt{s \log p / n})$ | $O(\sqrt{p / n})$ |
| Differentiable | No (not at $\theta_j = 0$) | Yes, everywhere |

When Each Wins

L1 wins: sparse high-dimensional problems

When $p \gg n$ and only $s \ll p$ features are relevant, L1 is statistically optimal. The Lasso achieves the minimax rate $\|\hat{\theta} - \theta^*\|_2 = O(\sqrt{s \log p / n})$ under restricted eigenvalue conditions. L1 also provides automatic variable selection: the zero coefficients tell you which features are irrelevant.
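The variable-selection behavior can be sketched with a bare-bones cyclic coordinate descent for the objective $\tfrac{1}{2}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$; the dimensions, $\lambda$, and iteration count below are illustrative choices, not tuned recommendations:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal cyclic coordinate descent for
    (1/2) * ||y - X theta||^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r_j = y - X @ theta + X[:, j] * theta[j]
            z_j = X[:, j] @ r_j
            # Soft-threshold the univariate solution for coordinate j.
            theta[j] = np.sign(z_j) * max(abs(z_j) - lam, 0.0) / col_sq[j]
    return theta

# Synthetic sparse problem: 100 samples, 50 features, only 3 truly active.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
true_theta = np.zeros(50)
true_theta[[3, 17, 41]] = [2.0, -1.5, 1.0]
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta_hat = lasso_cd(X, y, lam=10.0)
print(np.count_nonzero(theta_hat))  # far fewer than 50 nonzero coefficients
```

The fitted vector is sparse, and the surviving coordinates are (up to shrinkage) the truly active ones.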

L2 wins: dense models with multicollinearity

When all features contribute with small, roughly equal coefficients, Ridge is preferred. L2 distributes weight across correlated features rather than arbitrarily selecting one. Ridge also has a closed-form solution, making it computationally cheaper. The rate $O(\sqrt{p/n})$ is better than Lasso's rate when $s$ is close to $p$.
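Both properties are easy to see numerically: the closed form $(X^T X + \lambda I)^{-1} X^T y$ stays solvable even with exactly duplicated columns (where plain OLS fails, since $X^T X$ is singular), and it splits the weight evenly across them. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
# Columns 0 and 1 are exact duplicates; column 2 is independent.
X = np.hstack([x, x, rng.normal(size=(200, 1))])
y = 3.0 * x[:, 0] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)

lam = 1.0
p = X.shape[1]
# Closed-form ridge solution; the lam * I term makes the system invertible
# despite the duplicated (perfectly collinear) columns.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(theta_ridge)  # the duplicated columns share the true weight of 3.0
```

By symmetry, the two duplicated columns receive identical coefficients summing to roughly 3, instead of one column absorbing all the weight.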

L2 wins: deep learning weight decay

In neural networks, L2 regularization (weight decay) is standard. Sparsity at the individual weight level is less meaningful when the model has millions of parameters and features are learned, not given. AdamW decouples weight decay from the adaptive gradient, and this L2-style penalty helps control the magnitude of weights without forcing exact zeros.

L1 wins: interpretability requirements

A model with 12 nonzero features out of 5,000 is immediately interpretable. A model with 5,000 small nonzero coefficients is not. In genomics, clinical, and scientific applications, knowing which features matter is often the primary goal.

Solution Uniqueness

The L2-regularized objective is strictly convex (the Hessian is $X^T X + 2\lambda I$, which is positive definite for $\lambda > 0$). The solution is always unique.

The L1-regularized objective is convex but not strictly convex. When features are highly correlated, multiple solutions can achieve the same objective value. The Lasso solution is unique only when $X$ satisfies certain conditions (e.g., the columns are in general position). This non-uniqueness with correlated features is a practical limitation: which feature Lasso selects from a correlated group can be unstable across bootstrap samples.

The Elastic Net Bridge

Definition

Elastic Net

The elastic net combines both penalties with mixing parameter $\alpha \in [0, 1]$:

$$\min_{\theta}\ \mathcal{L}(\theta) + \lambda\left(\alpha \|\theta\|_1 + (1 - \alpha) \|\theta\|_2^2\right)$$

When $\alpha = 1$, this is Lasso. When $\alpha = 0$, this is Ridge. The L2 component makes the objective strictly convex (unique solution) while the L1 component maintains sparsity. Elastic net selects groups of correlated features together, combining the advantages of both.
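The endpoint behavior is easy to verify numerically; a minimal sketch of the combined penalty (the values are illustrative):

```python
import numpy as np

def elastic_net_penalty(theta, lam, alpha):
    """lam * (alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2)."""
    return lam * (alpha * np.sum(np.abs(theta))
                  + (1.0 - alpha) * np.sum(theta ** 2))

theta = np.array([1.0, -2.0, 0.0])  # ||theta||_1 = 3, ||theta||_2^2 = 5
print(elastic_net_penalty(theta, lam=0.5, alpha=1.0))  # pure L1: 0.5 * 3 = 1.5
print(elastic_net_penalty(theta, lam=0.5, alpha=0.0))  # pure L2: 0.5 * 5 = 2.5
```

Intermediate $\alpha$ values interpolate linearly between the two penalties.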

Common Confusions

Watch Out

L1 regularization is not the same as L1 loss

L1 regularization penalizes the coefficients: $\lambda \|\theta\|_1$. L1 loss penalizes the residuals: $\sum_i |y_i - \hat{y}_i|$. Lasso uses L1 regularization with L2 (squared) loss. Least absolute deviations (LAD) regression uses L1 loss with no regularization. These are completely different methods.
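A tiny numerical illustration of the distinction: the two quantities are computed from different objects (coefficients vs. residuals) and generally differ. The data below are made up:

```python
import numpy as np

theta = np.array([0.5, -1.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.0])
resid = y - X @ theta  # [0.5, 0.0, 0.5]

l1_regularization = np.sum(np.abs(theta))  # penalizes the coefficients
l1_loss = np.sum(np.abs(resid))            # penalizes the residuals
print(l1_regularization, l1_loss)          # 1.5 vs 1.0
```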

Watch Out

L2 regularization is not identical to weight decay in all optimizers

For SGD, L2 regularization and weight decay are equivalent. For adaptive optimizers like Adam, they differ. L2 regularization adds $\lambda \theta$ to the gradient before the adaptive scaling. Weight decay subtracts $\lambda \theta$ from the parameters after the update. AdamW implements decoupled weight decay, which is the correct formulation for adaptive methods.
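The difference can be sketched with a deliberately simplified single step of an Adam-like update (momentum and bias correction omitted; this only illustrates where the decay term enters, it is not a full optimizer):

```python
import numpy as np

def adam_like_step_l2(theta, grad, lr, lam, eps=1e-8):
    """L2-in-the-gradient: lam * theta is added to the gradient, so the
    penalty passes through the adaptive scaling (simplified one-step Adam)."""
    g = grad + lam * theta
    v = g ** 2  # crude second-moment estimate after a single step
    return theta - lr * g / (np.sqrt(v) + eps)

def adam_like_step_decoupled(theta, grad, lr, lam, eps=1e-8):
    """Decoupled weight decay (AdamW style): decay acts on the parameters
    directly, outside the adaptive scaling."""
    v = grad ** 2
    return theta - lr * grad / (np.sqrt(v) + eps) - lr * lam * theta

theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
print(adam_like_step_l2(theta, grad, lr=0.1, lam=0.01))
print(adam_like_step_decoupled(theta, grad, lr=0.1, lam=0.01))
```

The two updates coincide when $\lambda = 0$ but diverge for $\lambda > 0$: in the first variant the adaptive denominator normalizes away much of the penalty's magnitude, which is exactly the pathology AdamW avoids.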

Watch Out

Sparsity requires sufficient regularization strength

L1 does not always produce sparse solutions. When $\lambda$ is very small, the Lasso solution approaches OLS and all coefficients are nonzero. Sparsity increases with $\lambda$. The regularization path (from large to small $\lambda$) reveals features entering one at a time, which is the basis of the LARS algorithm.

Watch Out

The Bayesian MAP interpretation has limits

Ridge corresponds to a Gaussian prior and Lasso to a Laplace prior, but only for the MAP point estimate. The full Bayesian posterior under a Laplace prior is not sparse. If you want truly sparse Bayesian inference, use spike-and-slab priors or the horseshoe prior, not just a Laplace prior.

References

  1. Tibshirani, R. (1996). "Regression shrinkage and selection via the Lasso." Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
  2. Hoerl, A. E. and Kennard, R. W. (1970). "Ridge regression: biased estimation for nonorthogonal problems." Technometrics, 12(1), 55-67.
  3. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3.4 (Ridge, Lasso) and 3.8 (Elastic Net).
  4. Zou, H. and Hastie, T. (2005). "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
  5. Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 7 (Lasso bounds under restricted eigenvalue conditions).
  6. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (AdamW and the distinction between L2 and weight decay.)
  7. Bach, F. (2012). "Optimization with sparsity-inducing penalties." Foundations and Trends in Machine Learning, 4(1), 1-106.