Lasso Regression
OLS with L1 regularization: sparsity, the geometry of why L1 selects variables, proximal gradient descent, LARS, and elastic net.
Why This Matters
In high-dimensional settings ($p \gg n$ or $p$ comparable to $n$), most features are irrelevant. You need a method that automatically selects which features matter. Ridge regression shrinks all coefficients but never sets any to exactly zero. It cannot do variable selection.
The lasso (Least Absolute Shrinkage and Selection Operator) uses an L1 penalty that drives coefficients exactly to zero, producing sparse models. This is the foundational method for high-dimensional statistics and the starting point for compressed sensing, sparse recovery, and modern variable selection theory.
Mental Model
The L1 penalty adds a "diamond-shaped" constraint to the OLS problem. The corners of the L1 ball lie on the coordinate axes, where one or more coordinates are exactly zero. The OLS solution, when projected onto this diamond, tends to hit a corner, producing a sparse solution. The L2 ball (ridge) is round with no corners, so projections onto it generically land at a point where all coordinates are nonzero.
This geometric difference is the entire reason lasso selects variables and ridge does not.
Formal Setup
Lasso Regression
The lasso estimator with regularization parameter $\lambda \ge 0$ minimizes:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm.

The factor $\frac{1}{2n}$ is a convention that makes $\lambda$ scale-free with respect to sample size. Some sources use $\frac{1}{2}$ instead of $\frac{1}{2n}$.
Key difference from ridge: The L1 norm is not differentiable at zero. This non-smoothness is precisely what enables sparsity.
Why L1 Produces Sparsity
Geometric Intuition
Rewrite the lasso as a constrained problem (by Lagrangian duality):

$$\min_{\beta} \frac{1}{2n} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \le t$$

for some $t$ depending on $\lambda$. The feasible set is a diamond (cross-polytope) in $\mathbb{R}^p$.

The level sets of the quadratic loss are ellipsoids centered at the OLS solution $\hat{\beta}^{\text{OLS}}$. The constrained minimum is the first point where an expanding ellipsoid touches the diamond. Because the diamond has sharp corners aligned with the coordinate axes, the contact point is almost always at a corner, where one or more coordinates are zero.

In contrast, the ridge constraint $\|\beta\|_2^2 \le t$ is a ball. Ellipsoids generically touch a ball at a point where all coordinates are nonzero.
Subdifferential Argument
Subdifferential of the L1 Norm
The L1 norm has coordinate-wise subdifferential:

$$\partial |\beta_j| = \begin{cases} \{\operatorname{sign}(\beta_j)\} & \beta_j \neq 0 \\ [-1, 1] & \beta_j = 0 \end{cases}$$

The key point: when $\beta_j = 0$, the subdifferential is the entire interval $[-1, 1]$. This gives the optimizer "room" to set $\beta_j = 0$ and still satisfy the optimality conditions.
Main Theorems
Lasso KKT Conditions
Statement
A vector $\hat{\beta}$ is a solution to the lasso if and only if for every coordinate $j$:

$$\frac{1}{n} x_j^\top (y - X\hat{\beta}) = \lambda s_j$$

where $s_j \in \partial |\hat{\beta}_j|$, i.e.:
- If $\hat{\beta}_j > 0$: $\frac{1}{n} x_j^\top (y - X\hat{\beta}) = \lambda$
- If $\hat{\beta}_j < 0$: $\frac{1}{n} x_j^\top (y - X\hat{\beta}) = -\lambda$
- If $\hat{\beta}_j = 0$: $\left| \frac{1}{n} x_j^\top (y - X\hat{\beta}) \right| \le \lambda$

Here $x_j$ is the $j$-th column of $X$.
Intuition
The quantity $\frac{1}{n} x_j^\top (y - X\hat{\beta})$ is the correlation between feature $j$ and the current residual. The KKT conditions say: a coefficient is nonzero only if its feature has correlation with the residual equal to $\pm\lambda$. If the correlation is below $\lambda$ in absolute value, the coefficient is set to zero. This is the mechanism of variable selection: features with weak signal (low correlation with residual) are dropped.
Proof Sketch
The lasso objective is $f(\beta) = \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$. This is convex but not smooth. The optimality condition is $0 \in \partial f(\hat{\beta})$, where $\partial f$ is the subdifferential.

The gradient of the smooth part is $-\frac{1}{n} X^\top (y - X\beta)$. The subdifferential of $\lambda \|\beta\|_1$ is the product of the coordinate subdifferentials $\lambda \, \partial |\beta_j|$. Summing and setting the $j$-th component to contain zero gives the stated conditions.
Why It Matters
The KKT conditions explain exactly when and why coefficients are zero. They show that lasso performs a form of "soft thresholding": features must have sufficient correlation with the residual to earn a nonzero coefficient. This is the mathematical foundation of variable selection via L1 regularization.
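As a concrete check, the KKT conditions can be verified numerically. The sketch below (plain numpy; the orthonormal construction, seed, and coefficient values are illustrative) builds a design with $\frac{1}{n} X^\top X = I$, computes the lasso solution by soft thresholding, and confirms every coordinate satisfies the stated conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.1

# Build an orthonormal design with (1/n) X^T X = I via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q                      # now X^T X = n * I
beta_true = np.array([1.0, -0.5, 0.05, 0.0, 0.0])
y = X @ beta_true + 0.01 * rng.standard_normal(n)

# Orthonormal case: lasso = soft thresholding of the OLS coefficients.
beta_ols = X.T @ y / n
beta_hat = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Check the KKT conditions coordinate by coordinate.
corr = X.T @ (y - X @ beta_hat) / n     # residual correlations
for j in range(p):
    if beta_hat[j] != 0:
        assert np.isclose(corr[j], lam * np.sign(beta_hat[j]))
    else:
        assert abs(corr[j]) <= lam + 1e-10
```

Active coordinates sit exactly at correlation $\pm\lambda$; zeroed coordinates (including the weak true coefficient $0.05 < \lambda$) sit strictly inside the interval.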
Failure Mode
When features are highly correlated, the lasso tends to select one and set the others to zero, even if all are relevant. The choice of which correlated feature is selected can be unstable. This is the main motivation for the elastic net, which handles grouped variables better.
Lasso Produces Sparse Solutions
Statement
For $\lambda > 0$, the lasso solution has at most $\min(n, p)$ nonzero coefficients. In particular, if $\lambda$ is sufficiently large, $\hat{\beta} = 0$.

More precisely, $\hat{\beta}_j = 0$ whenever:

$$\left| \frac{1}{n} x_j^\top y \right| \le \lambda$$

(considering the case where all other coefficients are also zero). The smallest $\lambda$ for which the entire solution is zero is $\lambda_{\max} = \frac{1}{n} \|X^\top y\|_\infty$.
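The formula for $\lambda_{\max}$ is easy to verify numerically; a minimal sketch with synthetic data (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Smallest lambda at which the lasso solution is identically zero.
lam_max = np.max(np.abs(X.T @ y)) / n

# At lam >= lam_max, beta = 0 satisfies the KKT conditions:
# the residual is y itself, and every feature's correlation with it
# is within lambda in absolute value.
corr = np.abs(X.T @ y) / n
assert np.all(corr <= lam_max)
```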
Intuition
As $\lambda$ increases, the L1 penalty dominates and more coefficients are driven to zero. The "lasso path" traces the solution as $\lambda$ varies from $\lambda_{\max}$ (all zeros) down to $0$ (approaching OLS). Features enter the model one at a time as $\lambda$ decreases, in order of their marginal correlation with the response.
Proof Sketch
From the KKT conditions: $\hat{\beta}_j = 0$ requires $\left| \frac{1}{n} x_j^\top (y - X\hat{\beta}) \right| \le \lambda$. When all coefficients are zero, the residual is $y$ itself, so the condition becomes $\left| \frac{1}{n} x_j^\top y \right| \le \lambda$. This holds for all $j$ when $\lambda \ge \frac{1}{n} \|X^\top y\|_\infty = \lambda_{\max}$.

The bound of $\min(n, p)$ nonzero coefficients follows from the fact that the lasso at optimality satisfies a system of linear equations in the active variables, and the rank of $X_A$ (the submatrix of active columns) is at most $n$.
Why It Matters
This is why the lasso is useful in high-dimensional settings. Even when $p \gg n$ (far more features than samples), the lasso returns a model with at most $n$ nonzero coefficients. Combined with appropriate statistical assumptions (e.g., restricted eigenvalue conditions), the lasso can consistently recover the true sparse support.
Failure Mode
Sparsity of the solution does not guarantee that the correct variables are selected. The lasso requires conditions on $X$ (incoherence, RIP, or restricted eigenvalue) to guarantee correct support recovery. Without these, it may select the wrong sparse set.
Soft Thresholding and Algorithms
Soft Thresholding Operator
In the orthonormal design case ($\frac{1}{n} X^\top X = I$), the lasso has a closed-form solution via the soft thresholding operator:

$$\hat{\beta}_j^{\text{lasso}} = S_\lambda(\hat{\beta}_j^{\text{OLS}}), \qquad S_\lambda(z) = \operatorname{sign}(z) \max(|z| - \lambda, 0)$$

where $\hat{\beta}^{\text{OLS}} = \frac{1}{n} X^\top y$. Coefficients with $|\hat{\beta}_j^{\text{OLS}}| \le \lambda$ are set exactly to zero. Others are shrunk toward zero by $\lambda$.

Compare to ridge in the orthonormal case: $\hat{\beta}_j^{\text{ridge}} = \hat{\beta}_j^{\text{OLS}} / (1 + \lambda)$, a multiplicative shrinkage that never reaches zero.
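The qualitative difference between the two shrinkage rules is easy to see in a few lines of numpy (the input values are arbitrary illustrations):

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0): shrink by lam, clip to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Orthonormal-design ridge: multiplicative shrinkage, never exactly zero."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])
lam = 0.5
print(soft_threshold(z, lam))  # small entries become exactly zero
print(ridge_shrink(z, lam))    # all entries merely dampened
```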
ISTA (Iterative Shrinkage-Thresholding Algorithm)
For general $X$, the lasso has no closed form. The standard algorithm is proximal gradient descent (ISTA):

$$\beta^{(k+1)} = S_{\lambda / L}\left( \beta^{(k)} + \frac{1}{Ln} X^\top (y - X\beta^{(k)}) \right)$$

where $L$, the largest eigenvalue of $\frac{1}{n} X^\top X$, is the Lipschitz constant of the gradient. Each step takes a gradient step on the smooth part, then applies soft thresholding for the L1 penalty. Convergence rate: $O(1/k)$.

FISTA (Fast ISTA) adds Nesterov momentum to achieve $O(1/k^2)$ convergence.
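A minimal numpy implementation of ISTA under the $\frac{1}{2n}$ convention (the synthetic data, seed, and iteration count are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n        # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = ista(X, y, lam=0.1)
print(np.flatnonzero(beta_hat))  # indices of the selected (nonzero) coefficients
```

At convergence the iterate satisfies the KKT conditions: every residual correlation is within $\lambda$ in magnitude.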
LARS Algorithm
The Least Angle Regression (LARS) algorithm computes the entire lasso path (solutions for all $\lambda$) at the cost of a single OLS fit. It exploits the fact that the lasso path is piecewise linear: as $\lambda$ decreases, coefficients enter or leave the active set at discrete "kink" points, and between kinks the solution varies linearly.
Elastic Net
Elastic Net
The elastic net combines L1 and L2 penalties:

$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \left[ \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \|\beta\|_2^2 \right]$$

Why both? The L1 penalty provides sparsity. The L2 penalty handles correlated features by grouping them: when features are correlated, elastic net tends to include all of them (with shrunken coefficients) rather than arbitrarily selecting one.

Elastic net = lasso when $\alpha = 1$, ridge when $\alpha = 0$.
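Under a glmnet-style mixing parametrization $\lambda[\alpha \|\beta\|_1 + \tfrac{1-\alpha}{2}\|\beta\|_2^2]$ (an assumption here; other sources use two separate penalties $\lambda_1, \lambda_2$), the one-dimensional coordinate update for standardized columns is soft thresholding followed by extra multiplicative shrinkage. A minimal sketch:

```python
import numpy as np

def enet_update(z, lam, alpha):
    """One elastic-net coordinate update (glmnet-style mixing, standardized
    columns): soft-threshold at lam*alpha, then shrink by 1 + lam*(1-alpha)."""
    st = np.sign(z) * np.maximum(np.abs(z) - lam * alpha, 0.0)
    return st / (1.0 + lam * (1.0 - alpha))

z = 1.0
# alpha = 1 recovers the lasso update; alpha = 0 recovers the ridge update.
assert np.isclose(enet_update(z, 0.3, 1.0), 0.7)        # pure soft thresholding
assert np.isclose(enet_update(z, 0.3, 0.0), 1.0 / 1.3)  # pure ridge shrinkage
```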
Irrepresentable Condition for Support Recovery
The lasso produces sparse solutions, but sparsity alone does not guarantee that the correct variables are selected. The irrepresentable condition (also called the mutual incoherence condition in some formulations) is a necessary and sufficient condition for the lasso to recover the true support as $n \to \infty$.
Irrepresentable Condition
Let $S = \{ j : \beta_j^* \neq 0 \}$ be the true support and $S^c$ its complement. Partition $X = [X_S, X_{S^c}]$. The irrepresentable condition requires:

$$\left\| X_{S^c}^\top X_S (X_S^\top X_S)^{-1} \operatorname{sign}(\beta_S^*) \right\|_\infty < 1$$

where $\operatorname{sign}(\beta_S^*)$ is the vector of signs of the true nonzero coefficients.

The matrix $X_{S^c}^\top X_S (X_S^\top X_S)^{-1}$ measures how well the irrelevant features $X_{S^c}$ can be represented as linear combinations of the relevant features $X_S$. The condition fails when irrelevant features are too correlated with the relevant ones. In that case, the lasso cannot distinguish signal from noise.
Zhao and Yu (2006) proved this is necessary for sign consistency: if the condition fails, there exists a sequence of true parameters for which the lasso selects the wrong support with probability approaching 1. This is a fundamental limitation, not a fixable algorithmic issue.
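The irrepresentable quantity is cheap to compute for a given design, which makes the failure mode easy to demonstrate. The construction below (plain numpy; sizes, seed, and the near-duplicate feature are illustrative) contrasts an independent design with one where an irrelevant feature nearly copies a relevant one:

```python
import numpy as np

def irrepresentable_value(X, support, signs):
    """||X_Sc^T X_S (X_S^T X_S)^{-1} sign(beta_S)||_inf; below 1 means the
    irrepresentable condition holds for this support and sign pattern."""
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    X_S, X_Sc = X[:, S], X[:, Sc]
    v = X_Sc.T @ X_S @ np.linalg.solve(X_S.T @ X_S, signs)
    return np.max(np.abs(v)) if Sc.size else 0.0

rng = np.random.default_rng(0)
n, p = 200, 10
# Independent features: condition easily satisfied.
X_good = rng.standard_normal((n, p))
# Make irrelevant feature 5 nearly a copy of relevant feature 0.
X_bad = X_good.copy()
X_bad[:, 5] = X_bad[:, 0] + 0.01 * rng.standard_normal(n)

signs = np.array([1.0, 1.0])  # signs of the two true nonzero coefficients
print(irrepresentable_value(X_good, [0, 1], signs))  # below 1: condition holds
print(irrepresentable_value(X_bad, [0, 1], signs))   # near 1: condition fails
```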
Comparison with Ridge
The distinction between lasso and ridge regression is geometric. Ridge uses the L2 penalty $\|\beta\|_2^2$, whose constraint set is a ball. The lasso uses the L1 penalty $\|\beta\|_1$, with a diamond constraint. This difference has three consequences.
Variable selection. Ridge shrinks all coefficients toward zero but never sets any exactly to zero. Lasso sets coefficients to zero. For interpretation and feature selection, lasso wins. For prediction when all features are mildly relevant, ridge often wins.
Uniqueness. Ridge always has a unique solution: $X^\top X + \lambda I$ is invertible for $\lambda > 0$. The lasso solution can be non-unique when $p > n$ or when features are perfectly correlated.
Bias-variance tradeoff. Lasso introduces more bias on large coefficients (soft thresholding subtracts a constant) but can have lower variance by eliminating irrelevant features. Ridge introduces less bias on large coefficients (multiplicative shrinkage) but retains variance from irrelevant features. See Ridge vs. Lasso for a detailed comparison.
Practical Tuning
Choosing Lambda via Cross-Validation
The regularization parameter $\lambda$ controls the sparsity level. The standard approach:
- Define a grid of $\lambda$ values from $\lambda_{\max}$ (all zeros) down to a small fraction of $\lambda_{\max}$ on a log scale. Typically 100 values.
- For each $\lambda$, run $K$-fold cross-validation ($K = 5$ or $10$). Compute mean squared prediction error on held-out folds.
- Select the $\lambda$ that minimizes CV error ($\lambda_{\min}$), or use the "one-standard-error rule": select the largest $\lambda$ whose CV error is within one standard error of the minimum. The one-SE rule favors sparser models and often generalizes better.
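The one-SE selection step can be sketched in plain numpy. The CV numbers below are made up for illustration; the grid is sorted from large $\lambda$ to small, as in the pathwise computation:

```python
import numpy as np

# Hypothetical CV results over a decreasing lambda grid (illustrative numbers).
lams = np.geomspace(1.0, 1e-3, num=8)
cv_mean = np.array([2.5, 1.8, 1.2, 0.9, 0.85, 0.88, 0.95, 1.1])
cv_se = np.full(8, 0.08)

i_min = int(np.argmin(cv_mean))
lam_min = lams[i_min]

# One-SE rule: largest lambda whose CV error is within one SE of the minimum.
threshold = cv_mean[i_min] + cv_se[i_min]
i_1se = int(np.flatnonzero(cv_mean <= threshold)[0])  # grid sorted large-to-small
lam_1se = lams[i_1se]
print(lam_min, lam_1se)  # lam_1se >= lam_min, i.e. a sparser model
```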
Warm starts. Compute solutions along the path from large $\lambda$ to small. Use the solution at $\lambda_k$ as the starting point for $\lambda_{k+1}$. Solutions at nearby $\lambda$ values are close, so each solve converges in a few iterations, making the full path computation efficient.
Standardization. Always standardize features to have mean zero and unit variance before fitting the lasso. Otherwise the L1 penalty acts on coefficients whose scales depend on the units of their features, so the effective amount of shrinkage differs arbitrarily across features.
The Lasso Path in Detail
The lasso path is the function $\lambda \mapsto \hat{\beta}(\lambda)$ mapping each penalty value to the corresponding solution. It has three properties that make it computationally and statistically useful.

First, the path is piecewise linear. As $\lambda$ decreases from $\lambda_{\max}$, coefficients enter (become nonzero) or exit (return to zero) at discrete breakpoints. Between breakpoints, each active coefficient changes linearly in $\lambda$. This means the entire path can be computed by tracking the breakpoints, which is what the LARS algorithm does.

Second, features enter the path in order of their marginal correlation with the response. The first feature to become nonzero, at $\lambda = \lambda_{\max}$, is $j^\star = \arg\max_j \left| \frac{1}{n} x_j^\top y \right|$, the feature most correlated with $y$. As $\lambda$ decreases, the next feature to enter is the one most correlated with the current residual. This gives a natural ranking of feature importance.

Third, the path provides a stability diagnostic. If a feature enters the model at a large $\lambda$ (early), it is strongly associated with the response. If it enters only at a very small $\lambda$ (late), the association is weak. Features that enter and exit the path multiple times are unstable and should be interpreted cautiously.
Lasso path for diabetes data
The classic diabetes dataset has 10 features and 442 observations. Running the lasso path from $\lambda_{\max}$ down to $0$:

At $\lambda_{\max}$: all coefficients zero. As $\lambda$ decreases: BMI enters first (strongest marginal correlation), then blood pressure and one blood serum marker. Further down the path, 8 of 10 features are active. Near $\lambda = 0$: all 10 features are active, and coefficients approach OLS.

Cross-validation selects a $\lambda$ that keeps 8 features and drops the 2 with weakest signal. The one-SE rule selects a slightly larger $\lambda$ with 6 features.
Coordinate Descent
In practice, the dominant algorithm for lasso is coordinate descent, not ISTA. Each coordinate update has the closed form $\beta_j \leftarrow S_\lambda\!\left( \frac{1}{n} x_j^\top r^{(j)} \right)$ (for standardized columns with $\frac{1}{n} x_j^\top x_j = 1$), where $r^{(j)}$ is the partial residual. Cycling through coordinates until convergence is faster than proximal gradient methods for moderate-dimensional problems and is the algorithm used in the glmnet package.

The algorithm works as follows. Fix all coefficients except $\beta_j$. The partial residual is $r^{(j)} = y - X_{-j} \beta_{-j}$, where $X_{-j}$ is $X$ with column $j$ removed. The lasso objective in $\beta_j$ alone reduces to a one-dimensional lasso problem, which has the closed-form solution $\beta_j = S_\lambda\!\left( \frac{1}{n} x_j^\top r^{(j)} \right)$. One full cycle through all coordinates is one iteration. Convergence is typically fast because the active set stabilizes quickly: after a few cycles, most coordinates that will be zero are already zero, and the algorithm only updates the active set.
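The coordinate cycle, combined with a warm-started $\lambda$ grid, fits in a short numpy sketch. This is a naive teaching version, not glmnet's actual implementation; the data, grid, and cycle counts are illustrative:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, beta0, n_cycles=100):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1,
    assuming standardized columns with (1/n) x_j^T x_j = 1."""
    n = X.shape[0]
    beta = beta0.copy()
    r = y - X @ beta                        # full residual, updated incrementally
    for _ in range(n_cycles):
        for j in range(X.shape[1]):
            r += X[:, j] * beta[j]          # partial residual r^(j)
            beta[j] = soft_threshold(X[:, j] @ r / n, lam)
            r -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize features
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Warm-started path: solve from lam_max down, reusing the previous solution.
lam_max = np.max(np.abs(X.T @ y)) / n
lams = np.geomspace(lam_max, 1e-3 * lam_max, num=30)
path, beta = [], np.zeros(p)
for lam in lams:
    beta = lasso_cd(X, y, lam, beta)
    path.append(beta.copy())

print([int(np.sum(b != 0)) for b in path[::6]])  # active set grows down the path
```

At $\lambda_{\max}$ the solution is all zeros; at the smallest $\lambda$ it is close to the OLS fit, matching the path description above.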
When Lasso Fails
- Highly correlated features: If $x_j \approx x_k$, lasso picks one arbitrarily and zeros the other. The selected feature is unstable across subsamples. Use elastic net instead.
- $p \gg n$ with more than $n$ relevant features: Lasso can select at most $n$ features. If the true model has more than $n$ nonzero coefficients, lasso cannot recover all of them.
- Non-sparse truth: If the true $\beta^*$ has many small nonzero coefficients (not truly sparse), lasso may have higher MSE than ridge. Ridge is better for "dense" truth.
- Irrepresentable condition violated: When irrelevant features are too correlated with relevant ones, lasso cannot recover the correct support, regardless of sample size.
Canonical Examples
Lasso vs ridge on a sparse problem
Suppose $p = 100$ features but only a handful have nonzero coefficients. With $n < p$ samples:
- Lasso sets roughly 45-95 coefficients to zero, identifying the important features. Prediction error is good.
- Ridge shrinks all 100 coefficients, keeping all nonzero. It "spreads" the signal across correlated features instead of selecting. Higher prediction error in the sparse setting.
The advantage reverses if all 100 features are mildly relevant.
Soft thresholding visualization
With orthonormal design, OLS coefficient $z = \hat{\beta}_j^{\text{OLS}}$, and penalty $\lambda$:
- Lasso: $S_\lambda(z) = \operatorname{sign}(z) \max(|z| - \lambda, 0)$
- Ridge: $z / (1 + \lambda)$

For example, with $z = 0.5$ and $\lambda = 1$:
- Lasso: $S_1(0.5) = 0$ (set exactly to zero)
- Ridge: $0.5 / 2 = 0.25$ (still nonzero)
The lasso kills small coefficients; ridge only dampens them.
Common Confusions
Lasso does variable selection, ridge does not
This is the single most important distinction between lasso and ridge. The L1 penalty has non-smooth corners at zero, creating a region where the optimality conditions are satisfied with $\hat{\beta}_j = 0$. The L2 penalty is smooth everywhere, so the ridge optimality conditions yield $\hat{\beta}_j = 0$ only if feature $j$ has exactly zero correlation with the residual, which generically does not happen.
Lasso has no closed form in general
Unlike ridge ($\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$), the lasso has a closed form only when $X^\top X$ is diagonal (orthonormal design). In general, you must use iterative algorithms (ISTA, coordinate descent, LARS). This makes the lasso computationally harder than ridge, though still convex and efficiently solvable.
The 1/(2n) scaling convention matters
Some sources define the lasso with $\frac{1}{2} \|y - X\beta\|_2^2$ instead of $\frac{1}{2n} \|y - X\beta\|_2^2$. This changes the scale of $\lambda$ by a factor of $n$. When comparing results across sources or software packages, always check the scaling convention.
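The conversion is easy to check on a one-feature problem, where both conventions have closed forms (a numpy sketch with synthetic data):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)
y = 1.5 * x + 0.1 * rng.standard_normal(n)

lam = 0.2
# One-feature lasso under the (1/2n) convention:
b_scaled = soft_threshold(x @ y / n, lam) / (x @ x / n)
# Same problem under the (1/2) convention needs lambda' = n * lambda:
b_unscaled = soft_threshold(x @ y, n * lam) / (x @ x)
assert np.isclose(b_scaled, b_unscaled)
```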
Summary
- Lasso objective: $\frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
- L1 penalty drives coefficients exactly to zero (variable selection)
- Geometric reason: L1 ball has corners on axes; ellipsoids hit corners
- KKT: $\hat{\beta}_j \neq 0$ only when feature $j$'s correlation with the residual equals $\lambda$ in magnitude
- No closed form in general; use ISTA, FISTA, or coordinate descent
- Orthonormal case: soft thresholding $\hat{\beta}_j^{\text{lasso}} = S_\lambda(\hat{\beta}_j^{\text{OLS}})$
- Elastic net = L1 + L2 fixes issues with correlated features
- At most $\min(n, p)$ nonzero coefficients
Exercises
Problem
In the orthonormal design case ($\frac{1}{n} X^\top X = I$), derive the soft thresholding solution $\hat{\beta}_j = S_\lambda(\hat{\beta}_j^{\text{OLS}})$ by solving the lasso optimization problem coordinate-by-coordinate.
Problem
Find $\lambda_{\max}$, the smallest value of $\lambda$ for which the lasso solution is $\hat{\beta} = 0$. Express it in terms of $X$ and $y$.
Problem
Show that when two features are identical ($x_j = x_k$), any lasso solution satisfies $\hat{\beta}_j + \hat{\beta}_k = c$ for some constant $c$ (the individual values are not uniquely determined). How does the elastic net resolve this?
Related Comparisons
References
Canonical:
- Tibshirani, "Regression Shrinkage and Selection via the Lasso," JRSS-B (1996)
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4
- Efron et al., "Least Angle Regression," Annals of Statistics (2004)
Support recovery and theory:
- Zhao & Yu, "On Model Selection Consistency of Lasso," JMLR (2006). Irrepresentable condition.
- Wainwright, High-Dimensional Statistics (2019), Chapters 7-8
Current:
- Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015), Chapters 2-3, 5
- Friedman, Hastie, Tibshirani, "Regularization Paths for GLMs via Coordinate Descent," J. Stat. Software (2010). The glmnet algorithm.
Next Topics
Building on lasso regression:
- Elastic net: combining L1 and L2 for correlated features
- Compressed sensing: sparse recovery beyond regression
- High-dimensional statistics: theory of estimation when $p \gg n$
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Optimization Basics (Layer 1)
Builds on This
- Elastic Net (Layer 2)
- Sparse Recovery and Compressed Sensing (Layer 4)