Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Regression Methods

Lasso Regression

OLS with L1 regularization: sparsity, the geometry of why L1 selects variables, proximal gradient descent, LARS, and elastic net.

Core · Tier 1 · Stable · ~65 min

Why This Matters

In high-dimensional settings ($d \gg n$, or $d$ comparable to $n$), most features are irrelevant. You need a method that automatically selects which features matter. Ridge regression shrinks all coefficients but never sets any to exactly zero. It cannot do variable selection.

The lasso (Least Absolute Shrinkage and Selection Operator) uses an L1 penalty that drives coefficients exactly to zero, producing sparse models. This is the foundational method for high-dimensional statistics and the starting point for compressed sensing, sparse recovery, and modern variable selection theory.

[Figure: the ridge constraint $\|w\|_2^2 \leq t$ (a circle, shrunk solution) versus the lasso constraint $\|w\|_1 \leq t$ (a diamond, sparse solution with $w_2 = 0$). The L1 diamond has corners on the axes, so the constrained optimum often lands on an axis (sparsity). The L2 circle has no corners.]

Mental Model

The L1 penalty $\lambda\|w\|_1$ adds a "diamond-shaped" constraint to the OLS problem. The corners of the L1 ball lie on the coordinate axes, where one or more coordinates are exactly zero. The OLS solution, when projected onto this diamond, tends to hit a corner, producing a sparse solution. The L2 ball (ridge) is round with no corners, so projections onto it generically land at a point where all coordinates are nonzero.

This geometric difference is the entire reason lasso selects variables and ridge does not.
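This geometric difference is easy to see numerically. The sketch below (assuming scikit-learn is available; the data and penalty values are made up for illustration) fits lasso and ridge on two features, only one of which carries signal:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 1000
X = rng.standard_normal((n, 2))
y = 1.5 * X[:, 0] + 0.1 * rng.standard_normal(n)  # feature 1 is pure noise

# Both objectives are convex; only the L1 penalty produces an exact zero.
w_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
w_ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y).coef_

print(w_lasso)  # second entry is exactly 0.0
print(w_ridge)  # second entry is small but nonzero
```

The lasso coefficient for the noise feature is not merely small; it is stored as an exact floating-point zero, which is what makes the model interpretable as a selection.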

Formal Setup

Definition

Lasso Regression

The lasso estimator with regularization parameter $\lambda > 0$ minimizes:

$$\hat{w}_{\text{lasso}} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2n}\|y - Xw\|_2^2 + \lambda \|w\|_1$$

where $\|w\|_1 = \sum_{j=1}^d |w_j|$ is the L1 norm.

The factor $1/(2n)$ is a convention that makes $\lambda$ scale-free with respect to sample size. Some sources use $1/2$ instead of $1/(2n)$.

Key difference from ridge: The L1 norm is not differentiable at zero. This non-smoothness is precisely what enables sparsity.

Why L1 Produces Sparsity

Geometric Intuition

Rewrite the lasso as a constrained problem (by Lagrangian duality):

$$\min_w \frac{1}{2n}\|y - Xw\|_2^2 \quad \text{subject to} \quad \|w\|_1 \leq t$$

for some $t$ depending on $\lambda$. The feasible set $\{w : \|w\|_1 \leq t\}$ is a diamond (cross-polytope) in $\mathbb{R}^d$.

The level sets of the quadratic loss $\|y - Xw\|_2^2$ are ellipsoids centered at $\hat{w}_{\text{OLS}}$. The constrained minimum is the first point where an expanding ellipsoid touches the diamond. Because the diamond has sharp corners aligned with the coordinate axes, the contact point is often at a corner, where one or more coordinates are zero.

In contrast, the ridge constraint $\|w\|_2 \leq t$ is a ball. Ellipsoids generically touch a ball at a point where all coordinates are nonzero.

Subdifferential Argument

Definition

Subdifferential of the L1 Norm

The L1 norm $\|w\|_1 = \sum_j |w_j|$ has subdifferential:

$$\partial |w_j| = \begin{cases} \{+1\} & \text{if } w_j > 0 \\ \{-1\} & \text{if } w_j < 0 \\ [-1, 1] & \text{if } w_j = 0 \end{cases}$$

The key point: when $w_j = 0$, the subdifferential is the entire interval $[-1, 1]$. This gives the optimizer "room" to set $w_j = 0$ and still satisfy the optimality conditions.

Main Theorems

Theorem

Lasso KKT Conditions

Statement

A vector $\hat{w}$ is a solution to the lasso if and only if for every coordinate $j = 1, \ldots, d$:

$$\frac{1}{n} x_j^\top (y - X\hat{w}) = \lambda s_j$$

where $s_j \in \partial |\hat{w}_j|$, i.e.:

  • If $\hat{w}_j > 0$: $\frac{1}{n} x_j^\top (y - X\hat{w}) = \lambda$
  • If $\hat{w}_j < 0$: $\frac{1}{n} x_j^\top (y - X\hat{w}) = -\lambda$
  • If $\hat{w}_j = 0$: $\left|\frac{1}{n} x_j^\top (y - X\hat{w})\right| \leq \lambda$

Here $x_j$ is the $j$-th column of $X$.

Intuition

The quantity $\frac{1}{n} x_j^\top (y - X\hat{w})$ is the correlation between feature $j$ and the current residual. The KKT conditions say: a coefficient is nonzero only if its feature has correlation with the residual equal to $\pm\lambda$. If the correlation is below $\lambda$ in absolute value, the coefficient is set to zero. This is the mechanism of variable selection: features with weak signal (low correlation with residual) are dropped.
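These conditions can be checked numerically on a fitted model. The sketch below assumes scikit-learn, whose `Lasso` minimizes exactly this $\frac{1}{2n}$-scaled objective, and uses synthetic data to verify both branches of the KKT system:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ w_true + 0.5 * rng.standard_normal(n)

lam = 0.1
w_hat = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_

corr = X.T @ (y - X @ w_hat) / n   # (1/n) x_j^T (y - X w_hat) for every j
active = w_hat != 0

# Active coordinates: residual correlation equals lambda * sign(w_j).
print(np.allclose(corr[active], lam * np.sign(w_hat[active]), atol=1e-6))
# Inactive coordinates: residual correlation is at most lambda in magnitude.
print(np.all(np.abs(corr[~active]) <= lam + 1e-6))
```

Every zero coefficient sits strictly inside the $[-\lambda, \lambda]$ band, and every active one sits exactly on its boundary.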

Proof Sketch

The lasso objective is $L(w) = \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$. This is convex but not smooth. The optimality condition is $0 \in \partial L(\hat{w})$, where $\partial$ is the subdifferential.

The subdifferential of the smooth part is $\{-\frac{1}{n}X^\top(y - X\hat{w})\}$. The subdifferential of $\lambda\|w\|_1$ is the product of the coordinate subdifferentials $\lambda \partial |w_j|$. Summing and requiring the $j$-th component to contain zero gives the stated conditions.

Why It Matters

The KKT conditions explain exactly when and why coefficients are zero. They show that the lasso performs a form of "soft thresholding": features must have sufficient correlation with the residual to earn a nonzero coefficient. This is the mathematical foundation of variable selection via L1 regularization.

Failure Mode

When features are highly correlated, the lasso tends to select one and set the others to zero, even if all are relevant. The choice of which correlated feature is selected can be unstable. This is the main motivation for the elastic net, which handles grouped variables better.

Proposition

Lasso Produces Sparse Solutions

Statement

For $\lambda > 0$, the lasso solution $\hat{w}_{\text{lasso}}$ has at most $\min(n, d)$ nonzero coefficients. In particular, if $\lambda$ is sufficiently large, $\hat{w}_{\text{lasso}} = 0$.

More precisely, $\hat{w}_j = 0$ whenever:

$$\left|\frac{1}{n} x_j^\top y\right| \leq \lambda$$

(considering the case where all other coefficients are also zero). The smallest $\lambda$ for which the entire solution is zero is $\lambda_{\max} = \max_j \left|\frac{1}{n} x_j^\top y\right|$.
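The formula for $\lambda_{\max}$ is one line of NumPy. A sketch (scikit-learn assumed; the tiny multiplier on $\lambda_{\max}$ guards against floating-point ties at the boundary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d = 100, 20
X = rng.standard_normal((n, d))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# Smallest lambda for which the lasso solution is identically zero.
lam_max = np.max(np.abs(X.T @ y)) / n

w_at = Lasso(alpha=1.000001 * lam_max, fit_intercept=False).fit(X, y).coef_
w_below = Lasso(alpha=0.5 * lam_max, fit_intercept=False).fit(X, y).coef_

print(np.count_nonzero(w_at))     # 0: everything thresholded away
print(np.count_nonzero(w_below))  # > 0: features start entering the model
```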

Intuition

As $\lambda$ increases, the L1 penalty dominates and more coefficients are driven to zero. The "lasso path" traces the solution as $\lambda$ varies from $\lambda_{\max}$ (all zeros) down to $0$ (approaching OLS). Features enter the model one at a time as $\lambda$ decreases, in order of their correlation with the evolving residual.

Proof Sketch

From the KKT conditions: $\hat{w}_j = 0$ requires $|x_j^\top(y - X\hat{w})/n| \leq \lambda$. When all coefficients are zero, the residual is $y$ itself, so the condition becomes $|x_j^\top y/n| \leq \lambda$. This holds for all $j$ when $\lambda \geq \max_j |x_j^\top y/n|$.

The bound of $\min(n, d)$ nonzero coefficients follows from the fact that the lasso (with the $1/(2n)$ scaling) at optimality satisfies a system of linear equations in the active variables, and the rank of $X_S$ (the submatrix of active columns) is at most $\min(n, |S|)$.

Why It Matters

This is why the lasso is useful in high-dimensional settings. Even when $d \gg n$ (far more features than samples), the lasso returns a model with at most $n$ nonzero coefficients. Combined with appropriate statistical assumptions (e.g., restricted eigenvalue conditions), the lasso can consistently recover the true sparse support.

Failure Mode

Sparsity of the solution does not guarantee that the correct variables are selected. The lasso requires conditions on $X$ (incoherence, RIP, or restricted eigenvalue) to guarantee correct support recovery. Without these, it may select the wrong sparse set.

Soft Thresholding and Algorithms

Soft Thresholding Operator

In the orthonormal design case ($X^\top X/n = I$), the lasso has a closed-form solution via the soft thresholding operator:

$$\hat{w}_j = S_\lambda(\hat{w}_j^{\text{OLS}}) = \text{sign}(\hat{w}_j^{\text{OLS}})\,(|\hat{w}_j^{\text{OLS}}| - \lambda)_+$$

where $(x)_+ = \max(x, 0)$. Coefficients with $|\hat{w}_j^{\text{OLS}}| \leq \lambda$ are set exactly to zero. Others are shrunk toward zero by $\lambda$.

Compare to ridge in the orthonormal case: $\hat{w}_j^{\text{ridge}} = \hat{w}_j^{\text{OLS}} / (1 + \lambda)$, a multiplicative shrinkage that never reaches zero.
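The operator itself is one line of NumPy. A sketch contrasting the two shrinkage rules on a few illustrative OLS coefficients:

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * (|z| - lam)_+, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

w_ols = np.array([2.3, 0.7, -1.5, -0.2])
lam = 1.0

w_lasso = soft_threshold(w_ols, lam)  # subtracts lam, kills anything inside [-1, 1]
w_ridge = w_ols / (1.0 + lam)         # halves everything, kills nothing

print(w_lasso)  # [1.3, 0.0, -0.5, 0.0]
print(w_ridge)  # [1.15, 0.35, -0.75, -0.1]
```

Two of the four lasso coefficients are exactly zero; all four ridge coefficients survive.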

ISTA (Iterative Shrinkage-Thresholding Algorithm)

For general $X$, the lasso has no closed form. The standard first-order algorithm is proximal gradient descent (ISTA):

$$w^{(k+1)} = S_{\lambda/L}\!\left(w^{(k)} + \frac{1}{L} X^\top(y - Xw^{(k)})/n\right)$$

where $L = \|X^\top X\|_{\text{op}}/n$ is the Lipschitz constant of the gradient. Each step takes a gradient step on the smooth part, then applies soft thresholding for the L1 penalty. Convergence rate: $O(1/k)$.

FISTA (fast ISTA) adds Nesterov momentum to achieve $O(1/k^2)$ convergence.
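The ISTA update fits in a short loop. A sketch on synthetic data, with scikit-learn's coordinate-descent `Lasso` used only as an independent reference to check against:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d = 150, 30
w_true = np.concatenate([np.ones(3), np.zeros(d - 3)])
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 0.05
L = np.linalg.norm(X.T @ X, 2) / n  # operator norm: Lipschitz constant of the gradient

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

w = np.zeros(d)
for _ in range(2000):
    # gradient step on the smooth part, then the prox step (soft thresholding)
    w = soft_threshold(w + X.T @ (y - X @ w) / (n * L), lam / L)

ref = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_
print(np.max(np.abs(w - ref)))  # the two solvers agree to high accuracy
```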

LARS Algorithm

The least angle regression (LARS) algorithm computes the entire lasso path (solutions for all $\lambda$) at roughly the cost of a single OLS fit. It exploits the fact that the lasso path is piecewise linear: as $\lambda$ decreases, coefficients enter or leave the active set at discrete "kink" points, and between kinks the solution varies linearly.

Elastic Net

Definition

Elastic Net

The elastic net combines L1 and L2 penalties:

$$\hat{w}_{\text{EN}} = \arg\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

Why both? The L1 penalty provides sparsity. The L2 penalty handles correlated features by grouping them: when features are correlated, the elastic net tends to include all of them (with shrunken coefficients) rather than arbitrarily selecting one.

Elastic net = lasso when $\lambda_2 = 0$, ridge when $\lambda_1 = 0$.
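The grouping effect is easy to demonstrate. A sketch assuming scikit-learn, whose `ElasticNet` parameterizes the two penalties through `alpha` and `l1_ratio` rather than $\lambda_1, \lambda_2$; the near-duplicate feature below is constructed for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(5)
n = 200
x = rng.standard_normal(n)
x_dup = x + 1e-6 * rng.standard_normal(n)  # near-duplicate of the first feature
X = np.column_stack([x, x_dup, rng.standard_normal(n)])
y = 2.0 * x + 0.1 * rng.standard_normal(n)

w_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
w_enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y).coef_

print(w_lasso[:2])  # one coefficient carries the signal, the other is ~0
print(w_enet[:2])   # the L2 term splits the signal across both copies
```

The lasso arbitrarily loads one of the two copies; the elastic net assigns them nearly equal coefficients.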

Irrepresentable Condition for Support Recovery

The lasso produces sparse solutions, but sparsity alone does not guarantee that the correct variables are selected. The irrepresentable condition (also called the mutual incoherence condition in some formulations) is essentially necessary and sufficient for the lasso to recover the true support $S = \{j : w_j^* \neq 0\}$ as $n \to \infty$.

Definition

Irrepresentable Condition

Let $S$ be the true support and $S^c$ its complement. Partition $X = [X_S \; X_{S^c}]$. The irrepresentable condition requires:

$$\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1} \text{sign}(w_S^*)\|_\infty < 1$$

where $\text{sign}(w_S^*)$ is the vector of signs of the true nonzero coefficients.

The matrix $X_{S^c}^\top X_S (X_S^\top X_S)^{-1}$ measures how well the irrelevant features $X_{S^c}$ can be represented as linear combinations of the relevant features $X_S$. The condition fails when irrelevant features are too correlated with the relevant ones. In that case, the lasso cannot distinguish signal from noise.

Zhao and Yu (2006) proved this is necessary for sign consistency: if the condition fails, there exists a sequence of true parameters for which the lasso selects the wrong support with probability approaching 1. This is a fundamental limitation, not a fixable algorithmic issue.
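The left-hand side of the condition is directly computable for any design. A sketch for an independent Gaussian design, where the support and signs are assumed known for illustration and the condition holds comfortably:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, s = 500, 10, 3
X = rng.standard_normal((n, d))  # independent columns
S = np.arange(s)                 # assume the first s features form the true support
Sc = np.arange(s, d)
sign_wS = np.ones(s)             # assumed signs of the true nonzero coefficients

XS, XSc = X[:, S], X[:, Sc]
# || X_Sc^T X_S (X_S^T X_S)^{-1} sign(w_S) ||_inf
irr = np.max(np.abs(XSc.T @ XS @ np.linalg.solve(XS.T @ XS, sign_wS)))
print(irr)  # well below 1 for independent features at this sample size
```

For correlated designs, recomputing `irr` with columns of $X_{S^c}$ built as noisy copies of columns of $X_S$ pushes the value toward (and past) 1.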

Comparison with Ridge

The distinction between lasso and ridge regression is geometric. Ridge uses the L2 penalty $\lambda\|w\|_2^2$, whose constraint set is a ball. The lasso uses L1 with a diamond constraint. This difference has three consequences.

Variable selection. Ridge shrinks all coefficients toward zero but never sets any exactly to zero. Lasso sets coefficients to zero. For interpretation and feature selection, lasso wins. For prediction when all features are mildly relevant, ridge often wins.

Uniqueness. Ridge always has a unique solution: $(X^\top X + \lambda I)$ is invertible for $\lambda > 0$. The lasso solution can be non-unique when $n < d$ or when features are perfectly correlated.

Bias-variance tradeoff. Lasso introduces more bias on large coefficients (soft thresholding subtracts a constant) but can have lower variance by eliminating irrelevant features. Ridge introduces less bias on large coefficients (multiplicative shrinkage) but retains variance from irrelevant features. See Ridge vs. Lasso for a detailed comparison.

Practical Tuning

Choosing Lambda via Cross-Validation

The regularization parameter $\lambda$ controls the sparsity level. The standard approach:

  1. Define a grid of $\lambda$ values from $\lambda_{\max}$ (all zeros) down to $\lambda_{\max}/1000$ on a log scale. Typically 100 values.
  2. For each $\lambda$, run $K$-fold cross-validation ($K = 5$ or $10$). Compute mean squared prediction error on held-out folds.
  3. Select the $\lambda$ that minimizes CV error ($\lambda_{\min}$), or use the "one-standard-error rule": select the largest $\lambda$ whose CV error is within one standard error of the minimum. The one-SE rule favors sparser models and often generalizes better.

Warm starts. Compute solutions along the $\lambda$ path from large to small. Use the solution at $\lambda_k$ as the starting point for $\lambda_{k+1}$. Because neighboring values of $\lambda$ have nearby solutions, each fit starts close to its optimum, making the full path computation efficient.

Standardization. Always standardize features to have mean zero and unit variance before fitting the lasso. Otherwise, $\lambda$ penalizes each coefficient on a different scale, and features measured in larger units are penalized more heavily.
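In scikit-learn (assumed available), this whole recipe (a log-spaced grid from $\lambda_{\max}$ down by a factor of 1000, $K$-fold CV) is what `LassoCV` does by default. A sketch with synthetic data on deliberately mismatched scales:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, d = 200, 50
Z = rng.standard_normal((n, d))
X = Z * np.linspace(0.1, 10.0, d)  # features on wildly different scales
y = 3.0 * Z[:, 0] - 2.0 * Z[:, 1] + rng.standard_normal(n)

Xs = StandardScaler().fit_transform(X)  # standardize before fitting

# default grid: 100 alphas from alpha_max down to alpha_max/1000, log-spaced
cv = LassoCV(n_alphas=100, cv=5).fit(Xs, y)

print(cv.alpha_)                   # lambda selected by 5-fold CV
print(np.count_nonzero(cv.coef_))  # sparse: most of the 50 features dropped
```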

The Lasso Path in Detail

The lasso path is the function $\hat{w}(\lambda)$ mapping each penalty value to the corresponding solution. It has three properties that make it computationally and statistically useful.

First, the path is piecewise linear. As $\lambda$ decreases from $\lambda_{\max}$, coefficients enter (become nonzero) or exit (return to zero) at discrete breakpoints. Between breakpoints, each active coefficient changes linearly in $\lambda$. This means the entire path can be computed by tracking the breakpoints, which is what the LARS algorithm does.

Second, features enter the path in order of their correlation with the current residual. The first feature to become nonzero at $\lambda_{\max}$ is $\arg\max_j |x_j^\top y / n|$, the feature most correlated with $y$. As $\lambda$ decreases, the next feature to enter is the one most correlated with the current residual. This gives a natural ranking of feature importance.

Third, the path provides a stability diagnostic. If a feature enters the model at a large $\lambda$ (early), it is strongly associated with the response. If it enters only at a very small $\lambda$ (late), the association is weak. Features that enter and exit the path multiple times are unstable and should be interpreted cautiously.
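The entry-order property can be checked with scikit-learn's `lasso_path` (assumed available), which returns coefficients over a decreasing grid of penalties:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
n, d = 100, 8
X = rng.standard_normal((n, d))
y = 4.0 * X[:, 2] + rng.standard_normal(n)

# alphas is decreasing from alpha_max; coefs has shape (d, n_alphas)
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)

most_corr = int(np.argmax(np.abs(X.T @ y) / n))  # feature most correlated with y
entry = [np.flatnonzero(coefs[j] != 0)[0] if np.any(coefs[j] != 0) else len(alphas)
         for j in range(d)]                      # path index where each feature enters

print(most_corr)              # 2: the feature driving y
print(int(np.argmin(entry)))  # the same feature enters the path first
```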

Example

Lasso path for diabetes data

The classic diabetes dataset has 10 features and 442 observations. Running the lasso path from $\lambda_{\max}$ to $\lambda_{\max}/1000$:

At $\lambda/\lambda_{\max} = 1.0$: all coefficients zero. At $\lambda/\lambda_{\max} = 0.5$: BMI enters first (strongest marginal correlation). At $\lambda/\lambda_{\max} = 0.3$: blood pressure and one blood serum marker enter. At $\lambda/\lambda_{\max} = 0.05$: 8 of 10 features are active. At $\lambda/\lambda_{\max} = 0.001$: all 10 features are active, and the coefficients approach OLS.

Cross-validation selects $\lambda$ near $0.05 \cdot \lambda_{\max}$, keeping 8 features and dropping the 2 with the weakest signal. The one-SE rule selects a slightly larger $\lambda$ with 6 features.

Coordinate Descent

In practice, the dominant algorithm for the lasso is coordinate descent, not ISTA. Each coordinate update has a closed form via soft thresholding of a partial residual. Cycling through coordinates until convergence is faster than proximal gradient methods for moderate-dimensional problems and is the algorithm used in the glmnet package.

The algorithm works as follows. Fix all coefficients except $w_j$. The partial residual is $r_j = y - X_{-j} w_{-j}$, where $X_{-j}$ is $X$ with column $j$ removed. The lasso objective in $w_j$ alone reduces to $\min_{w_j} \frac{1}{2n}\|r_j - x_j w_j\|_2^2 + \lambda |w_j|$, which has the closed-form solution $\hat{w}_j = S_\lambda(x_j^\top r_j / n) / (x_j^\top x_j / n)$. One full cycle through all $d$ coordinates is one iteration. Convergence is typically fast because the active set stabilizes quickly: after a few cycles, most coordinates that will be zero are already zero, and the algorithm only updates the active set.
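A minimal implementation of this cycle (a sketch without active-set tricks or convergence checks; scikit-learn's `Lasso` is assumed available only as a reference to validate against):

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_cycles=500):
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n  # x_j^T x_j / n for each column
    r = y.copy()                       # residual y - X w, maintained incrementally
    for _ in range(n_cycles):
        for j in range(d):
            r += X[:, j] * w[j]        # partial residual r_j = y - X_{-j} w_{-j}
            # closed-form single-coordinate update: threshold, then rescale
            w[j] = soft_threshold(X[:, j] @ r / n, lam) / col_sq[j]
            r -= X[:, j] * w[j]
    return w

rng = np.random.default_rng(9)
X = rng.standard_normal((120, 15))
y = X[:, :4] @ np.array([2.0, -1.5, 1.0, 0.5]) + 0.2 * rng.standard_normal(120)

w_cd = lasso_cd(X, y, lam=0.05)
w_ref = Lasso(alpha=0.05, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_
print(np.max(np.abs(w_cd - w_ref)))  # agreement with the reference solver
```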

When Lasso Fails

  1. Highly correlated features: If $x_j \approx x_k$, the lasso picks one arbitrarily and zeros the other. The selected feature is unstable across subsamples. Use the elastic net instead.

  2. $n < d$ with more than $n$ relevant features: The lasso can select at most $n$ features. If the true model has more than $n$ nonzero coefficients, the lasso cannot recover all of them.

  3. Non-sparse truth: If the true $w^*$ has many small nonzero coefficients (not truly sparse), the lasso may have higher MSE than ridge. Ridge is better for "dense" truth.

  4. Irrepresentable condition violated: When irrelevant features are too correlated with relevant ones, the lasso cannot recover the correct support, regardless of sample size.

Canonical Examples

Example

Lasso vs ridge on a sparse problem

Suppose $d = 100$ features but only $5$ have nonzero coefficients. With $n = 50$ samples:

  • Lasso sets most of the 95 irrelevant coefficients exactly to zero, identifying the important features. Prediction error is good.
  • Ridge shrinks all 100 coefficients, keeping all nonzero. It "spreads" the signal across correlated features instead of selecting. Higher prediction error in the sparse setting.

The advantage reverses if all 100 features are mildly relevant.

Example

Soft thresholding visualization

With orthonormal design, OLS coefficient $\hat{w}_j^{\text{OLS}} = 2.3$ and $\lambda = 1.0$:

  • Lasso: $S_1(2.3) = \text{sign}(2.3) \cdot (|2.3| - 1)_+ = 1.3$
  • Ridge: $2.3 / (1 + 1) = 1.15$

For $\hat{w}_j^{\text{OLS}} = 0.7$ and $\lambda = 1.0$:

  • Lasso: $S_1(0.7) = 0$ (set exactly to zero)
  • Ridge: $0.7 / 2 = 0.35$ (still nonzero)

The lasso kills small coefficients; ridge only dampens them.

Common Confusions

Watch Out

Lasso does variable selection, ridge does not

This is the single most important distinction between lasso and ridge. The L1 penalty has non-smooth corners at zero, creating a region where the optimality conditions are satisfied with $w_j = 0$. The L2 penalty is smooth everywhere, so the ridge optimality conditions yield $w_j = 0$ only when the data have exactly zero correlation with feature $j$, which generically does not happen.

Watch Out

Lasso has no closed form in general

Unlike ridge ($\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$), the lasso has a closed form only when $X^\top X$ is diagonal (orthonormal design). In general, you must use iterative algorithms (ISTA, coordinate descent, LARS). This makes the lasso computationally harder than ridge, though still convex and efficiently solvable.
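The ridge closed form is easy to verify directly. A sketch assuming scikit-learn, whose `Ridge` penalizes $\lambda\|w\|_2^2$ against an unscaled squared error, so its solution matches $(X^\top X + \lambda I)^{-1} X^\top y$ exactly:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 2.0

# One linear solve: no iteration needed, unlike the lasso.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.max(np.abs(w_closed - w_sklearn)))  # agreement to machine precision
```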

Watch Out

The 1/(2n) scaling convention matters

Some sources define the lasso with $\frac{1}{2}\|y - Xw\|_2^2$ instead of $\frac{1}{2n}\|y - Xw\|_2^2$. This changes the scale of $\lambda$ by a factor of $n$. When comparing results across sources or software packages, always check the scaling convention.

Summary

  • Lasso objective: $\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$
  • L1 penalty drives coefficients exactly to zero (variable selection)
  • Geometric reason: the L1 ball has corners on the axes; ellipsoids hit corners
  • KKT: $w_j = 0$ when feature $j$ has correlation $< \lambda$ with the residual
  • No closed form in general; use ISTA, FISTA, or coordinate descent
  • Orthonormal case: soft thresholding $S_\lambda(z) = \text{sign}(z)(|z| - \lambda)_+$
  • Elastic net = L1 + L2, which fixes issues with correlated features
  • At most $\min(n, d)$ nonzero coefficients

Exercises

ExerciseCore

Problem

In the orthonormal design case ($X^\top X/n = I$), derive the lasso solution $\hat{w}_j = S_\lambda(\hat{w}_j^{\text{OLS}})$ by solving the lasso optimization problem coordinate-by-coordinate.

ExerciseCore

Problem

Find $\lambda_{\max}$, the smallest value of $\lambda$ for which the lasso solution is $\hat{w} = 0$. Express it in terms of $X$ and $y$.

ExerciseAdvanced

Problem

Show that when two features are identical ($x_1 = x_2$), any lasso solution satisfies $\hat{w}_1 + \hat{w}_2 = c$ for some constant $c$ (the individual values are not uniquely determined). How does the elastic net resolve this?


References

Canonical:

  • Tibshirani, "Regression Shrinkage and Selection via the Lasso," JRSS-B (1996)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4
  • Efron et al., "Least Angle Regression," Annals of Statistics (2004)

Support recovery and theory:

  • Zhao & Yu, "On Model Selection Consistency of Lasso," JMLR (2006). Irrepresentable condition.
  • Wainwright, High-Dimensional Statistics (2019), Chapters 7-8

Current:

  • Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015), Chapters 2-3, 5
  • Friedman, Hastie, Tibshirani, "Regularization Paths for GLMs via Coordinate Descent," J. Stat. Software (2010). The glmnet algorithm.

Next Topics

Building on lasso regression:

  • Elastic net: combining L1 and L2 for correlated features
  • Compressed sensing: sparse recovery beyond regression
  • High-dimensional statistics: theory of estimation when $d \gg n$

Last reviewed: April 2026
