
Comparison

L1 vs. L2 Regularization

L1 (Lasso) penalizes the absolute value of weights, producing sparse solutions via the diamond geometry of the L1 ball. L2 (Ridge) penalizes squared weights, shrinking all coefficients toward zero without eliminating any. The choice depends on whether the true model is sparse or dense.

What Each Does

Both L1 and L2 regularization add a penalty to the loss function to prevent overfitting. Given a loss $\mathcal{L}(\theta)$ and parameter vector $\theta \in \mathbb{R}^p$, the regularized objectives are:

L2 regularization (weight decay, Tikhonov, Ridge):

$$\min_{\theta}\ \mathcal{L}(\theta) + \lambda \|\theta\|_2^2 = \min_{\theta}\ \mathcal{L}(\theta) + \lambda \sum_{j=1}^p \theta_j^2$$

L1 regularization (Lasso):

$$\min_{\theta}\ \mathcal{L}(\theta) + \lambda \|\theta\|_1 = \min_{\theta}\ \mathcal{L}(\theta) + \lambda \sum_{j=1}^p |\theta_j|$$

The difference is a single exponent: squaring versus absolute value. That difference controls whether the solution is sparse.
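To make the two objectives concrete, here is a minimal NumPy sketch; the data, coefficients, and $\lambda$ value are arbitrary illustrations, not recommendations:

```python
import numpy as np

# Synthetic least-squares problem, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

def squared_loss(theta):
    return np.sum((X @ theta - y) ** 2)

def l1_objective(theta, lam):
    """Lasso objective: loss + lam * ||theta||_1 (sum of absolute values)."""
    return squared_loss(theta) + lam * np.sum(np.abs(theta))

def l2_objective(theta, lam):
    """Ridge objective: loss + lam * ||theta||_2^2 (sum of squares)."""
    return squared_loss(theta) + lam * np.sum(theta ** 2)

theta = np.array([0.5, -1.0, 0.0])
print(l1_objective(theta, lam=0.1))
print(l2_objective(theta, lam=0.1))
```

The only difference between the two functions is the single penalty line, mirroring the single-exponent difference in the formulas.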

Why L1 Produces Zeros: The Geometry

The constraint set $\{\theta : \|\theta\|_1 \leq t\}$ is a diamond (cross-polytope) with corners on the coordinate axes. The constraint set $\{\theta : \|\theta\|_2^2 \leq t\}$ is a smooth sphere. The loss function defines elliptical contours of constant value. The regularized optimum is where these contours first touch the constraint set.

A smooth ellipse generically touches a smooth sphere at a point where no coordinate is zero. But a smooth ellipse touches a diamond at a corner, where one or more coordinates are exactly zero. This is the geometric explanation for L1 sparsity: the corners of the L1 ball are attractors for the constrained optimum.

The KKT Condition and Soft Thresholding

For the L1-regularized least squares problem with design matrix $X$ and response $y$, the optimality condition at coordinate $j$ involves the subgradient, because $|\theta_j|$ is not differentiable at zero. Solving the KKT condition coordinate-wise gives:

$$\hat{\theta}_j = \operatorname{sign}(z_j)\,\max(|z_j| - \lambda,\ 0)$$

where $z_j$ is the partial residual (the OLS solution for coordinate $j$ holding all others fixed). This is the soft thresholding operator $S_\lambda(z_j)$. If $|z_j| \leq \lambda$, the coefficient is set exactly to zero. If $|z_j| > \lambda$, it is shrunk toward zero by $\lambda$.

For L2, the analogous update is:

$$\hat{\theta}_j = \frac{z_j}{1 + 2\lambda}$$

L2 shrinks by a multiplicative factor: dividing by $1 + 2\lambda$ gives a smaller number, but never zero. L1 shrinks by subtraction: it can subtract enough to reach zero.
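The two updates can be written directly; a minimal NumPy sketch (the test values are arbitrary):

```python
import numpy as np

def soft_threshold(z, lam):
    """L1 update S_lambda(z): subtract lam from |z|, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """L2 update: multiplicative shrinkage, never exactly zero for z != 0."""
    return z / (1.0 + 2.0 * lam)

z = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(soft_threshold(z, 1.0))  # entries with |z| <= 1 map exactly to 0
print(ridge_shrink(z, 1.0))    # every entry scaled by 1/3, none zero
```

With $\lambda = 1$, soft thresholding zeroes out the three middle entries, while the Ridge update leaves all five nonzero.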

Side-by-Side Comparison

| Property | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty | $\lambda \sum_j \lvert\theta_j\rvert$ | $\lambda \sum_j \theta_j^2$ |
| Constraint geometry | Diamond (corners on axes) | Sphere (smooth, no corners) |
| Produces exact zeros | Yes, when $\lvert z_j \rvert \leq \lambda$ | No, coefficients shrink but stay nonzero |
| Shrinkage type | Subtractive (soft thresholding) | Multiplicative (proportional shrinkage) |
| Closed-form (linear) | No, requires iterative solver | Yes: $(X^T X + \lambda I)^{-1} X^T y$ |
| Solver | Coordinate descent, ISTA, ADMM | Matrix inverse, SVD, gradient descent |
| Bayesian prior | Laplace: $p(\theta_j) \propto e^{-\lvert\theta_j\rvert/b}$ | Gaussian: $p(\theta_j) \propto e^{-\theta_j^2/2\tau^2}$ |
| Best when | True model is sparse ($s \ll p$) | True model is dense (all features contribute) |
| Multicollinearity | Picks one from correlated group | Distributes weight across correlated group |
| Convergence rate | $O(\sqrt{s \log p / n})$ | $O(\sqrt{p / n})$ |
| Differentiable | No (not at $\theta_j = 0$) | Yes, everywhere |

When Each Wins

L1 wins: sparse high-dimensional problems

When $p \gg n$ and only $s \ll p$ features are relevant, L1 is statistically optimal. The Lasso achieves the minimax rate $\|\hat{\theta} - \theta^*\|_2 = O(\sqrt{s \log p / n})$ under restricted eigenvalue conditions. L1 also provides automatic variable selection: the zero coefficients tell you which features are irrelevant.
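The variable-selection behavior can be sketched with a bare-bones cyclic coordinate descent for the objective $\tfrac{1}{2}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$; the dimensions, $\lambda$, and iteration count below are illustrative choices, not tuned recommendations:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal cyclic coordinate descent for
    (1/2) * ||y - X theta||^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r_j = y - X @ theta + X[:, j] * theta[j]
            z_j = X[:, j] @ r_j
            # Soft-threshold the univariate solution for coordinate j.
            theta[j] = np.sign(z_j) * max(abs(z_j) - lam, 0.0) / col_sq[j]
    return theta

# Synthetic sparse problem: 100 samples, 50 features, only 3 truly active.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
true_theta = np.zeros(50)
true_theta[[3, 17, 41]] = [2.0, -1.5, 1.0]
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta_hat = lasso_cd(X, y, lam=10.0)
print(np.count_nonzero(theta_hat))  # far fewer than 50 nonzero coefficients
```

The fitted vector is sparse, and the surviving coordinates are (up to shrinkage) the truly active ones.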

L2 wins: dense models with multicollinearity

When all features contribute with small, roughly equal coefficients, Ridge is preferred. L2 distributes weight across correlated features rather than arbitrarily selecting one. Ridge also has a closed-form solution, making it computationally cheaper. The rate $O(\sqrt{p/n})$ is better than Lasso's rate when $s$ is close to $p$.
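Both properties are easy to see numerically: the closed form $(X^T X + \lambda I)^{-1} X^T y$ stays solvable even with exactly duplicated columns (where plain OLS fails, since $X^T X$ is singular), and it splits the weight evenly across them. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
# Columns 0 and 1 are exact duplicates; column 2 is independent.
X = np.hstack([x, x, rng.normal(size=(200, 1))])
y = 3.0 * x[:, 0] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)

lam = 1.0
p = X.shape[1]
# Closed-form ridge solution; the lam * I term makes the system invertible
# despite the duplicated (perfectly collinear) columns.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(theta_ridge)  # the duplicated columns share the true weight of 3.0
```

By symmetry, the two duplicated columns receive identical coefficients summing to roughly 3, instead of one column absorbing all the weight.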

L2 wins: deep learning weight decay

In neural networks, L2 regularization (weight decay) is standard. Sparsity at the individual weight level is less meaningful when the model has millions of parameters and features are learned, not given. AdamW decouples weight decay from the adaptive gradient, and this L2-style penalty helps control the magnitude of weights without forcing exact zeros.

L1 wins: interpretability requirements

A model with 12 nonzero features out of 5,000 is immediately interpretable. A model with 5,000 small nonzero coefficients is not. In genomics, clinical, and scientific applications, knowing which features matter is often the primary goal.

Solution Uniqueness

The L2-regularized objective is strictly convex (the Hessian is $X^T X + 2\lambda I$, which is positive definite for $\lambda > 0$). The solution is always unique.

The L1-regularized objective is convex but not strictly convex. When features are highly correlated, multiple solutions can achieve the same objective value. The Lasso solution is unique only when $X$ satisfies certain conditions (e.g., the columns are in general position). This non-uniqueness with correlated features is a practical limitation: which feature Lasso selects from a correlated group can be unstable across bootstrap samples.

The Elastic Net Bridge

Definition

Elastic Net

The elastic net combines both penalties with mixing parameter $\alpha \in [0, 1]$:

$$\min_{\theta}\ \mathcal{L}(\theta) + \lambda\left(\alpha \|\theta\|_1 + (1 - \alpha) \|\theta\|_2^2\right)$$

When $\alpha = 1$, this is Lasso. When $\alpha = 0$, this is Ridge. The L2 component makes the objective strictly convex (unique solution) while the L1 component maintains sparsity. Elastic net selects groups of correlated features together, combining the advantages of both.
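The endpoint behavior is easy to verify numerically; a minimal sketch of the combined penalty (the values are illustrative):

```python
import numpy as np

def elastic_net_penalty(theta, lam, alpha):
    """lam * (alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2)."""
    return lam * (alpha * np.sum(np.abs(theta))
                  + (1.0 - alpha) * np.sum(theta ** 2))

theta = np.array([1.0, -2.0, 0.0])  # ||theta||_1 = 3, ||theta||_2^2 = 5
print(elastic_net_penalty(theta, lam=0.5, alpha=1.0))  # pure L1: 0.5 * 3 = 1.5
print(elastic_net_penalty(theta, lam=0.5, alpha=0.0))  # pure L2: 0.5 * 5 = 2.5
```

Intermediate $\alpha$ values interpolate linearly between the two penalties.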

Common Confusions

Watch Out

L1 regularization is not the same as L1 loss

L1 regularization penalizes the coefficients: $\lambda \|\theta\|_1$. L1 loss penalizes the residuals: $\sum_i |y_i - \hat{y}_i|$. Lasso uses L1 regularization with L2 (squared) loss. Least absolute deviations (LAD) regression uses L1 loss with no regularization. These are completely different methods.
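A tiny numerical illustration of the distinction: the two quantities are computed from different objects (coefficients vs. residuals) and generally differ. The data below are made up:

```python
import numpy as np

theta = np.array([0.5, -1.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.0])
resid = y - X @ theta  # [0.5, 0.0, 0.5]

l1_regularization = np.sum(np.abs(theta))  # penalizes the coefficients
l1_loss = np.sum(np.abs(resid))            # penalizes the residuals
print(l1_regularization, l1_loss)          # 1.5 vs 1.0
```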

Watch Out

L2 regularization is not identical to weight decay in all optimizers

For SGD, L2 regularization and weight decay are equivalent. For adaptive optimizers like Adam, they differ. L2 regularization adds $\lambda \theta$ to the gradient before the adaptive scaling. Weight decay subtracts $\lambda \theta$ from the parameters after the update. AdamW implements decoupled weight decay, which is the correct formulation for adaptive methods.
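The difference can be sketched with a deliberately simplified single step of an Adam-like update (momentum and bias correction omitted; this only illustrates where the decay term enters, it is not a full optimizer):

```python
import numpy as np

def adam_like_step_l2(theta, grad, lr, lam, eps=1e-8):
    """L2-in-the-gradient: lam * theta is added to the gradient, so the
    penalty passes through the adaptive scaling (simplified one-step Adam)."""
    g = grad + lam * theta
    v = g ** 2  # crude second-moment estimate after a single step
    return theta - lr * g / (np.sqrt(v) + eps)

def adam_like_step_decoupled(theta, grad, lr, lam, eps=1e-8):
    """Decoupled weight decay (AdamW style): decay acts on the parameters
    directly, outside the adaptive scaling."""
    v = grad ** 2
    return theta - lr * grad / (np.sqrt(v) + eps) - lr * lam * theta

theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
print(adam_like_step_l2(theta, grad, lr=0.1, lam=0.01))
print(adam_like_step_decoupled(theta, grad, lr=0.1, lam=0.01))
```

The two updates coincide when $\lambda = 0$ but diverge for $\lambda > 0$: in the first variant the adaptive denominator normalizes away much of the penalty's magnitude, which is exactly the pathology AdamW avoids.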

Watch Out

Sparsity requires sufficient regularization strength

L1 does not always produce sparse solutions. When $\lambda$ is very small, the Lasso solution approaches OLS and all coefficients are nonzero. Sparsity increases with $\lambda$. The regularization path (from large to small $\lambda$) reveals features entering one at a time, which is the basis of the LARS algorithm.

Watch Out

The Bayesian MAP interpretation has limits

Ridge corresponds to a Gaussian prior and Lasso to a Laplace prior, but only for the MAP point estimate. The full Bayesian posterior under a Laplace prior is not sparse. If you want truly sparse Bayesian inference, use spike-and-slab priors or the horseshoe prior, not just a Laplace prior.

References

  1. Tibshirani, R. (1996). "Regression shrinkage and selection via the Lasso." Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
  2. Hoerl, A. E. and Kennard, R. W. (1970). "Ridge regression: biased estimation for nonorthogonal problems." Technometrics, 12(1), 55-67.
  3. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3.4 (Ridge, Lasso) and 3.8 (Elastic Net).
  4. Zou, H. and Hastie, T. (2005). "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
  5. Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 7 (Lasso bounds under restricted eigenvalue conditions).
  6. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. (AdamW and the distinction between L2 and weight decay.)
  7. Bach, F. (2012). "Optimization with sparsity-inducing penalties." Foundations and Trends in Machine Learning, 4(1), 1-106.