Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Regression Methods

Lasso Regression

OLS with L1 regularization: sparsity, the geometry of why L1 selects variables, proximal gradient descent, LARS, and elastic net.

Core · Tier 1 · Stable · ~65 min

Why This Matters

In high-dimensional settings ($d \gg n$, or $d$ comparable to $n$), most features are irrelevant. You need a method that automatically selects which features matter. Ridge regression shrinks all coefficients but never sets any to exactly zero. It cannot do variable selection.

The lasso (Least Absolute Shrinkage and Selection Operator) uses an L1 penalty that drives coefficients exactly to zero, producing sparse models. This is the foundational method for high-dimensional statistics and the starting point for compressed sensing, sparse recovery, and modern variable selection theory.

[Figure: the ridge constraint $\|w\|_2^2 \leq t$ (a circle, shrunk solution) versus the lasso constraint $\|w\|_1 \leq t$ (a diamond, sparse solution with $w_2 = 0$). The L1 diamond has corners on the axes, so the constrained optimum often lands on an axis (sparsity). The L2 circle has no corners.]

Mental Model

The L1 penalty $\lambda\|w\|_1$ adds a "diamond-shaped" constraint to the OLS problem. The corners of the L1 ball lie on the coordinate axes, where one or more coordinates are exactly zero. The OLS solution, when projected onto this diamond, tends to hit a corner, producing a sparse solution. The L2 ball (ridge) is round with no corners, so projections onto it generically land at a point where all coordinates are nonzero.

This geometric difference is the entire reason lasso selects variables and ridge does not.
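This geometric difference is easy to see numerically. The sketch below (assuming scikit-learn is available; the data and penalty values are made up for illustration) fits lasso and ridge on two features, only one of which carries signal:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 1000
X = rng.standard_normal((n, 2))
y = 1.5 * X[:, 0] + 0.1 * rng.standard_normal(n)  # feature 1 is pure noise

# Both objectives are convex; only the L1 penalty produces an exact zero.
w_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
w_ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y).coef_

print(w_lasso)  # second entry is exactly 0.0
print(w_ridge)  # second entry is small but nonzero
```

The lasso coefficient for the noise feature is not merely small; it is stored as an exact floating-point zero, which is what makes the model interpretable as a selection.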

Formal Setup

Definition

Lasso Regression

The lasso estimator with regularization parameter $\lambda > 0$ minimizes:

$$\hat{w}_{\text{lasso}} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2n}\|y - Xw\|_2^2 + \lambda \|w\|_1$$

where $\|w\|_1 = \sum_{j=1}^d |w_j|$ is the L1 norm.

The factor $1/(2n)$ is a convention that makes $\lambda$ scale-free with respect to sample size. Some sources use $1/2$ instead of $1/(2n)$.

Key difference from ridge: The L1 norm is not differentiable at zero. This non-smoothness is precisely what enables sparsity.

Why L1 Produces Sparsity

Geometric Intuition

Rewrite the lasso as a constrained problem (by Lagrangian duality):

$$\min_w \frac{1}{2n}\|y - Xw\|_2^2 \quad \text{subject to} \quad \|w\|_1 \leq t$$

for some $t$ depending on $\lambda$. The feasible set $\{w : \|w\|_1 \leq t\}$ is a diamond (cross-polytope) in $\mathbb{R}^d$.

The level sets of the quadratic loss $\|y - Xw\|_2^2$ are ellipsoids centered at $\hat{w}_{\text{OLS}}$. The constrained minimum is the first point where an expanding ellipsoid touches the diamond. Because the diamond has sharp corners aligned with the coordinate axes, the contact point is often at a corner, where one or more coordinates are zero.

In contrast, the ridge constraint $\|w\|_2 \leq t$ is a ball. Ellipsoids generically touch a ball at a point where all coordinates are nonzero.

Subdifferential Argument

Definition

Subdifferential of the L1 Norm

The L1 norm $\|w\|_1 = \sum_j |w_j|$ has subdifferential:

$$\partial |w_j| = \begin{cases} \{+1\} & \text{if } w_j > 0 \\ \{-1\} & \text{if } w_j < 0 \\ [-1, 1] & \text{if } w_j = 0 \end{cases}$$

The key point: when $w_j = 0$, the subdifferential is the entire interval $[-1, 1]$. This gives the optimizer "room" to set $w_j = 0$ and still satisfy the optimality conditions.

Main Theorems

Theorem

Lasso KKT Conditions

Statement

A vector $\hat{w}$ is a solution to the lasso if and only if for every coordinate $j = 1, \ldots, d$:

$$\frac{1}{n} x_j^\top (y - X\hat{w}) = \lambda s_j$$

where $s_j \in \partial |\hat{w}_j|$, i.e.:

  • If $\hat{w}_j > 0$: $\frac{1}{n} x_j^\top (y - X\hat{w}) = \lambda$
  • If $\hat{w}_j < 0$: $\frac{1}{n} x_j^\top (y - X\hat{w}) = -\lambda$
  • If $\hat{w}_j = 0$: $\left|\frac{1}{n} x_j^\top (y - X\hat{w})\right| \leq \lambda$

Here $x_j$ is the $j$-th column of $X$.

Intuition

The quantity $\frac{1}{n} x_j^\top (y - X\hat{w})$ is the correlation between feature $j$ and the current residual. The KKT conditions say: a coefficient is nonzero only if its feature has correlation with the residual equal to $\pm\lambda$. If the correlation is below $\lambda$ in absolute value, the coefficient is set to zero. This is the mechanism of variable selection: features with weak signal (low correlation with residual) are dropped.
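These conditions can be checked numerically on a fitted model. The sketch below assumes scikit-learn, whose `Lasso` minimizes exactly this $\frac{1}{2n}$-scaled objective, and uses synthetic data to verify both branches of the KKT system:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ w_true + 0.5 * rng.standard_normal(n)

lam = 0.1
w_hat = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_

corr = X.T @ (y - X @ w_hat) / n   # (1/n) x_j^T (y - X w_hat) for every j
active = w_hat != 0

# Active coordinates: residual correlation equals lambda * sign(w_j).
print(np.allclose(corr[active], lam * np.sign(w_hat[active]), atol=1e-6))
# Inactive coordinates: residual correlation is at most lambda in magnitude.
print(np.all(np.abs(corr[~active]) <= lam + 1e-6))
```

Every zero coefficient sits strictly inside the $[-\lambda, \lambda]$ band, and every active one sits exactly on its boundary.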

Proof Sketch

The lasso objective is $L(w) = \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$. This is convex but not smooth. The optimality condition is $0 \in \partial L(\hat{w})$, where $\partial$ is the subdifferential.

The subdifferential of the smooth part is $\{-\frac{1}{n}X^\top(y - X\hat{w})\}$. The subdifferential of $\lambda\|w\|_1$ is the product of the coordinate subdifferentials $\lambda \partial |w_j|$. Summing and requiring the $j$-th component to contain zero gives the stated conditions.

Why It Matters

The KKT conditions explain exactly when and why coefficients are zero. They show that the lasso performs a form of "soft thresholding": features must have sufficient correlation with the residual to earn a nonzero coefficient. This is the mathematical foundation of variable selection via L1 regularization.

Failure Mode

When features are highly correlated, the lasso tends to select one and set the others to zero, even if all are relevant. The choice of which correlated feature is selected can be unstable. This is the main motivation for the elastic net, which handles grouped variables better.

Proposition

Lasso Produces Sparse Solutions

Statement

For $\lambda > 0$, the lasso solution $\hat{w}_{\text{lasso}}$ has at most $\min(n, d)$ nonzero coefficients. In particular, if $\lambda$ is sufficiently large, $\hat{w}_{\text{lasso}} = 0$.

More precisely, $\hat{w}_j = 0$ whenever:

$$\left|\frac{1}{n} x_j^\top y\right| \leq \lambda$$

(considering the case where all other coefficients are also zero). The smallest $\lambda$ for which the entire solution is zero is $\lambda_{\max} = \max_j \left|\frac{1}{n} x_j^\top y\right|$.
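The formula for $\lambda_{\max}$ is one line of NumPy. A sketch (scikit-learn assumed; the tiny multiplier on $\lambda_{\max}$ guards against floating-point ties at the boundary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d = 100, 20
X = rng.standard_normal((n, d))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# Smallest lambda for which the lasso solution is identically zero.
lam_max = np.max(np.abs(X.T @ y)) / n

w_at = Lasso(alpha=1.000001 * lam_max, fit_intercept=False).fit(X, y).coef_
w_below = Lasso(alpha=0.5 * lam_max, fit_intercept=False).fit(X, y).coef_

print(np.count_nonzero(w_at))     # 0: everything thresholded away
print(np.count_nonzero(w_below))  # > 0: features start entering the model
```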

Intuition

As $\lambda$ increases, the L1 penalty dominates and more coefficients are driven to zero. The "lasso path" traces the solution as $\lambda$ varies from $\lambda_{\max}$ (all zeros) down to $0$ (approaching OLS). Features enter the model one at a time as $\lambda$ decreases, in order of their correlation with the evolving residual.

Proof Sketch

From the KKT conditions: $\hat{w}_j = 0$ requires $|x_j^\top(y - X\hat{w})/n| \leq \lambda$. When all coefficients are zero, the residual is $y$ itself, so the condition becomes $|x_j^\top y/n| \leq \lambda$. This holds for all $j$ when $\lambda \geq \max_j |x_j^\top y/n|$.

The bound of $\min(n, d)$ nonzero coefficients follows from the fact that the lasso (with the $1/(2n)$ scaling) at optimality satisfies a system of linear equations in the active variables, and the rank of $X_S$ (the submatrix of active columns) is at most $\min(n, |S|)$.

Why It Matters

This is why the lasso is useful in high-dimensional settings. Even when $d \gg n$ (far more features than samples), the lasso returns a model with at most $n$ nonzero coefficients. Combined with appropriate statistical assumptions (e.g., restricted eigenvalue conditions), the lasso can consistently recover the true sparse support.

Failure Mode

Sparsity of the solution does not guarantee that the correct variables are selected. The lasso requires conditions on $X$ (incoherence, RIP, or restricted eigenvalue) to guarantee correct support recovery. Without these, it may select the wrong sparse set.

Soft Thresholding and Algorithms

Soft Thresholding Operator

In the orthonormal design case ($X^\top X/n = I$), the lasso has a closed-form solution via the soft thresholding operator:

$$\hat{w}_j = S_\lambda(\hat{w}_j^{\text{OLS}}) = \text{sign}(\hat{w}_j^{\text{OLS}})\,(|\hat{w}_j^{\text{OLS}}| - \lambda)_+$$

where $(x)_+ = \max(x, 0)$. Coefficients with $|\hat{w}_j^{\text{OLS}}| \leq \lambda$ are set exactly to zero. Others are shrunk toward zero by $\lambda$.

Compare to ridge in the orthonormal case: $\hat{w}_j^{\text{ridge}} = \hat{w}_j^{\text{OLS}} / (1 + \lambda)$, a multiplicative shrinkage that never reaches zero.
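The operator itself is one line of NumPy. A sketch contrasting the two shrinkage rules on a few illustrative OLS coefficients:

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * (|z| - lam)_+, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

w_ols = np.array([2.3, 0.7, -1.5, -0.2])
lam = 1.0

w_lasso = soft_threshold(w_ols, lam)  # subtracts lam, kills anything inside [-1, 1]
w_ridge = w_ols / (1.0 + lam)         # halves everything, kills nothing

print(w_lasso)  # [1.3, 0.0, -0.5, 0.0]
print(w_ridge)  # [1.15, 0.35, -0.75, -0.1]
```

Two of the four lasso coefficients are exactly zero; all four ridge coefficients survive.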

ISTA (Iterative Shrinkage-Thresholding Algorithm)

For general $X$, the lasso has no closed form. The standard first-order algorithm is proximal gradient descent (ISTA):

$$w^{(k+1)} = S_{\lambda/L}\!\left(w^{(k)} + \frac{1}{L} X^\top(y - Xw^{(k)})/n\right)$$

where $L = \|X^\top X\|_{\text{op}}/n$ is the Lipschitz constant of the gradient. Each step takes a gradient step on the smooth part, then applies soft thresholding for the L1 penalty. Convergence rate: $O(1/k)$.

FISTA (fast ISTA) adds Nesterov momentum to achieve $O(1/k^2)$ convergence.
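The ISTA update fits in a short loop. A sketch on synthetic data, with scikit-learn's coordinate-descent `Lasso` used only as an independent reference to check against:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d = 150, 30
w_true = np.concatenate([np.ones(3), np.zeros(d - 3)])
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 0.05
L = np.linalg.norm(X.T @ X, 2) / n  # operator norm: Lipschitz constant of the gradient

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

w = np.zeros(d)
for _ in range(2000):
    # gradient step on the smooth part, then the prox step (soft thresholding)
    w = soft_threshold(w + X.T @ (y - X @ w) / (n * L), lam / L)

ref = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_
print(np.max(np.abs(w - ref)))  # the two solvers agree to high accuracy
```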

LARS Algorithm

The least angle regression (LARS) algorithm computes the entire lasso path (solutions for all $\lambda$) at roughly the cost of a single OLS fit. It exploits the fact that the lasso path is piecewise linear: as $\lambda$ decreases, coefficients enter or leave the active set at discrete "kink" points, and between kinks the solution varies linearly.

Elastic Net

Definition

Elastic Net

The elastic net combines L1 and L2 penalties:

$$\hat{w}_{\text{EN}} = \arg\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

Why both? The L1 penalty provides sparsity. The L2 penalty handles correlated features by grouping them: when features are correlated, the elastic net tends to include all of them (with shrunken coefficients) rather than arbitrarily selecting one.

Elastic net = lasso when $\lambda_2 = 0$, ridge when $\lambda_1 = 0$.
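The grouping effect is easy to demonstrate. A sketch assuming scikit-learn, whose `ElasticNet` parameterizes the two penalties through `alpha` and `l1_ratio` rather than $\lambda_1, \lambda_2$; the near-duplicate feature below is constructed for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(5)
n = 200
x = rng.standard_normal(n)
x_dup = x + 1e-6 * rng.standard_normal(n)  # near-duplicate of the first feature
X = np.column_stack([x, x_dup, rng.standard_normal(n)])
y = 2.0 * x + 0.1 * rng.standard_normal(n)

w_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
w_enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y).coef_

print(w_lasso[:2])  # one coefficient carries the signal, the other is ~0
print(w_enet[:2])   # the L2 term splits the signal across both copies
```

The lasso arbitrarily loads one of the two copies; the elastic net assigns them nearly equal coefficients.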

Irrepresentable Condition for Support Recovery

The lasso produces sparse solutions, but sparsity alone does not guarantee that the correct variables are selected. The irrepresentable condition (also called the mutual incoherence condition in some formulations) is essentially necessary and sufficient for the lasso to recover the true support $S = \{j : w_j^* \neq 0\}$ as $n \to \infty$.

Definition

Irrepresentable Condition

Let $S$ be the true support and $S^c$ its complement. Partition $X = [X_S \; X_{S^c}]$. The irrepresentable condition requires:

$$\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1} \text{sign}(w_S^*)\|_\infty < 1$$

where $\text{sign}(w_S^*)$ is the vector of signs of the true nonzero coefficients.

The matrix $X_{S^c}^\top X_S (X_S^\top X_S)^{-1}$ measures how well the irrelevant features $X_{S^c}$ can be represented as linear combinations of the relevant features $X_S$. The condition fails when irrelevant features are too correlated with the relevant ones. In that case, the lasso cannot distinguish signal from noise.

Zhao and Yu (2006) proved this is necessary for sign consistency: if the condition fails, there exists a sequence of true parameters for which the lasso selects the wrong support with probability approaching 1. This is a fundamental limitation, not a fixable algorithmic issue.
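The left-hand side of the condition is directly computable for any design. A sketch for an independent Gaussian design, where the support and signs are assumed known for illustration and the condition holds comfortably:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, s = 500, 10, 3
X = rng.standard_normal((n, d))  # independent columns
S = np.arange(s)                 # assume the first s features form the true support
Sc = np.arange(s, d)
sign_wS = np.ones(s)             # assumed signs of the true nonzero coefficients

XS, XSc = X[:, S], X[:, Sc]
# || X_Sc^T X_S (X_S^T X_S)^{-1} sign(w_S) ||_inf
irr = np.max(np.abs(XSc.T @ XS @ np.linalg.solve(XS.T @ XS, sign_wS)))
print(irr)  # well below 1 for independent features at this sample size
```

For correlated designs, recomputing `irr` with columns of $X_{S^c}$ built as noisy copies of columns of $X_S$ pushes the value toward (and past) 1.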

Comparison with Ridge

The distinction between lasso and ridge regression is geometric. Ridge uses the L2 penalty $\lambda\|w\|_2^2$, whose constraint set is a ball. The lasso uses L1 with a diamond constraint. This difference has three consequences.

Variable selection. Ridge shrinks all coefficients toward zero but never sets any exactly to zero. Lasso sets coefficients to zero. For interpretation and feature selection, lasso wins. For prediction when all features are mildly relevant, ridge often wins.

Uniqueness. Ridge always has a unique solution: $(X^\top X + \lambda I)$ is invertible for $\lambda > 0$. The lasso solution can be non-unique when $n < d$ or when features are perfectly correlated.

Bias-variance tradeoff. Lasso introduces more bias on large coefficients (soft thresholding subtracts a constant) but can have lower variance by eliminating irrelevant features. Ridge introduces less bias on large coefficients (multiplicative shrinkage) but retains variance from irrelevant features. See Ridge vs. Lasso for a detailed comparison.

Practical Tuning

Choosing Lambda via Cross-Validation

The regularization parameter $\lambda$ controls the sparsity level. The standard approach:

  1. Define a grid of $\lambda$ values from $\lambda_{\max}$ (all zeros) down to $\lambda_{\max}/1000$ on a log scale. Typically 100 values.
  2. For each $\lambda$, run $K$-fold cross-validation ($K = 5$ or $10$). Compute mean squared prediction error on held-out folds.
  3. Select the $\lambda$ that minimizes CV error ($\lambda_{\min}$), or use the "one-standard-error rule": select the largest $\lambda$ whose CV error is within one standard error of the minimum. The one-SE rule favors sparser models and often generalizes better.

Warm starts. Compute solutions along the $\lambda$ path from large to small. Use the solution at $\lambda_k$ as the starting point for $\lambda_{k+1}$. Because neighboring values of $\lambda$ have nearby solutions, each fit starts close to its optimum, making the full path computation efficient.

Standardization. Always standardize features to have mean zero and unit variance before fitting the lasso. Otherwise, $\lambda$ penalizes each coefficient on a different scale, and features measured in larger units are penalized more heavily.
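In scikit-learn (assumed available), this whole recipe (a log-spaced grid from $\lambda_{\max}$ down by a factor of 1000, $K$-fold CV) is what `LassoCV` does by default. A sketch with synthetic data on deliberately mismatched scales:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, d = 200, 50
Z = rng.standard_normal((n, d))
X = Z * np.linspace(0.1, 10.0, d)  # features on wildly different scales
y = 3.0 * Z[:, 0] - 2.0 * Z[:, 1] + rng.standard_normal(n)

Xs = StandardScaler().fit_transform(X)  # standardize before fitting

# default grid: 100 alphas from alpha_max down to alpha_max/1000, log-spaced
cv = LassoCV(n_alphas=100, cv=5).fit(Xs, y)

print(cv.alpha_)                   # lambda selected by 5-fold CV
print(np.count_nonzero(cv.coef_))  # sparse: most of the 50 features dropped
```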

The Lasso Path in Detail

The lasso path is the function $\hat{w}(\lambda)$ mapping each penalty value to the corresponding solution. It has three properties that make it computationally and statistically useful.

First, the path is piecewise linear. As $\lambda$ decreases from $\lambda_{\max}$, coefficients enter (become nonzero) or exit (return to zero) at discrete breakpoints. Between breakpoints, each active coefficient changes linearly in $\lambda$. This means the entire path can be computed by tracking the breakpoints, which is what the LARS algorithm does.

Second, features enter the path in order of their correlation with the current residual. The first feature to become nonzero at $\lambda_{\max}$ is $\arg\max_j |x_j^\top y / n|$, the feature most correlated with $y$. As $\lambda$ decreases, the next feature to enter is the one most correlated with the current residual. This gives a natural ranking of feature importance.

Third, the path provides a stability diagnostic. If a feature enters the model at a large $\lambda$ (early), it is strongly associated with the response. If it enters only at a very small $\lambda$ (late), the association is weak. Features that enter and exit the path multiple times are unstable and should be interpreted cautiously.
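The entry-order property can be checked with scikit-learn's `lasso_path` (assumed available), which returns coefficients over a decreasing grid of penalties:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
n, d = 100, 8
X = rng.standard_normal((n, d))
y = 4.0 * X[:, 2] + rng.standard_normal(n)

# alphas is decreasing from alpha_max; coefs has shape (d, n_alphas)
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)

most_corr = int(np.argmax(np.abs(X.T @ y) / n))  # feature most correlated with y
entry = [np.flatnonzero(coefs[j] != 0)[0] if np.any(coefs[j] != 0) else len(alphas)
         for j in range(d)]                      # path index where each feature enters

print(most_corr)              # 2: the feature driving y
print(int(np.argmin(entry)))  # the same feature enters the path first
```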

Example

Lasso path for diabetes data

The classic diabetes dataset has 10 features and 442 observations. Running the lasso path from $\lambda_{\max}$ to $\lambda_{\max}/1000$:

At $\lambda/\lambda_{\max} = 1.0$: all coefficients zero. At $\lambda/\lambda_{\max} = 0.5$: BMI enters first (strongest marginal correlation). At $\lambda/\lambda_{\max} = 0.3$: blood pressure and one blood serum marker enter. At $\lambda/\lambda_{\max} = 0.05$: 8 of 10 features are active. At $\lambda/\lambda_{\max} = 0.001$: all 10 features are active, and the coefficients approach OLS.

Cross-validation selects $\lambda$ near $0.05 \cdot \lambda_{\max}$, keeping 8 features and dropping the 2 with the weakest signal. The one-SE rule selects a slightly larger $\lambda$ with 6 features.

Coordinate Descent

In practice, the dominant algorithm for the lasso is coordinate descent, not ISTA. Each coordinate update has a closed form via soft thresholding of a partial residual. Cycling through coordinates until convergence is faster than proximal gradient methods for moderate-dimensional problems and is the algorithm used in the glmnet package.

The algorithm works as follows. Fix all coefficients except $w_j$. The partial residual is $r_j = y - X_{-j} w_{-j}$, where $X_{-j}$ is $X$ with column $j$ removed. The lasso objective in $w_j$ alone reduces to $\min_{w_j} \frac{1}{2n}\|r_j - x_j w_j\|_2^2 + \lambda |w_j|$, which has the closed-form solution $\hat{w}_j = S_\lambda(x_j^\top r_j / n) / (x_j^\top x_j / n)$. One full cycle through all $d$ coordinates is one iteration. Convergence is typically fast because the active set stabilizes quickly: after a few cycles, most coordinates that will be zero are already zero, and the algorithm only updates the active set.
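A minimal implementation of this cycle (a sketch without active-set tricks or convergence checks; scikit-learn's `Lasso` is assumed available only as a reference to validate against):

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_cycles=500):
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n  # x_j^T x_j / n for each column
    r = y.copy()                       # residual y - X w, maintained incrementally
    for _ in range(n_cycles):
        for j in range(d):
            r += X[:, j] * w[j]        # partial residual r_j = y - X_{-j} w_{-j}
            # closed-form single-coordinate update: threshold, then rescale
            w[j] = soft_threshold(X[:, j] @ r / n, lam) / col_sq[j]
            r -= X[:, j] * w[j]
    return w

rng = np.random.default_rng(9)
X = rng.standard_normal((120, 15))
y = X[:, :4] @ np.array([2.0, -1.5, 1.0, 0.5]) + 0.2 * rng.standard_normal(120)

w_cd = lasso_cd(X, y, lam=0.05)
w_ref = Lasso(alpha=0.05, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_
print(np.max(np.abs(w_cd - w_ref)))  # agreement with the reference solver
```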

When Lasso Fails

  1. Highly correlated features: If $x_j \approx x_k$, the lasso picks one arbitrarily and zeros the other. The selected feature is unstable across subsamples. Use the elastic net instead.

  2. $n < d$ with more than $n$ relevant features: The lasso can select at most $n$ features. If the true model has more than $n$ nonzero coefficients, the lasso cannot recover all of them.

  3. Non-sparse truth: If the true $w^*$ has many small nonzero coefficients (not truly sparse), the lasso may have higher MSE than ridge. Ridge is better for "dense" truth.

  4. Irrepresentable condition violated: When irrelevant features are too correlated with relevant ones, the lasso cannot recover the correct support, regardless of sample size.

Canonical Examples

Example

Lasso vs ridge on a sparse problem

Suppose $d = 100$ features but only $5$ have nonzero coefficients. With $n = 50$ samples:

  • Lasso sets most of the 95 irrelevant coefficients exactly to zero, identifying the important features. Prediction error is good.
  • Ridge shrinks all 100 coefficients, keeping all nonzero. It "spreads" the signal across correlated features instead of selecting. Higher prediction error in the sparse setting.

The advantage reverses if all 100 features are mildly relevant.

Example

Soft thresholding visualization

With orthonormal design, OLS coefficient $\hat{w}_j^{\text{OLS}} = 2.3$ and $\lambda = 1.0$:

  • Lasso: $S_1(2.3) = \text{sign}(2.3) \cdot (|2.3| - 1)_+ = 1.3$
  • Ridge: $2.3 / (1 + 1) = 1.15$

For $\hat{w}_j^{\text{OLS}} = 0.7$ and $\lambda = 1.0$:

  • Lasso: $S_1(0.7) = 0$ (set exactly to zero)
  • Ridge: $0.7 / 2 = 0.35$ (still nonzero)

The lasso kills small coefficients; ridge only dampens them.

Common Confusions

Watch Out

Lasso does variable selection, ridge does not

This is the single most important distinction between lasso and ridge. The L1 penalty has non-smooth corners at zero, creating a region where the optimality conditions are satisfied with $w_j = 0$. The L2 penalty is smooth everywhere, so the ridge optimality conditions yield $w_j = 0$ only when the data have exactly zero correlation with feature $j$, which generically does not happen.

Watch Out

Lasso has no closed form in general

Unlike ridge ($\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$), the lasso has a closed form only when $X^\top X$ is diagonal (orthonormal design). In general, you must use iterative algorithms (ISTA, coordinate descent, LARS). This makes the lasso computationally harder than ridge, though still convex and efficiently solvable.
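The ridge closed form is easy to verify directly. A sketch assuming scikit-learn, whose `Ridge` penalizes $\lambda\|w\|_2^2$ against an unscaled squared error, so its solution matches $(X^\top X + \lambda I)^{-1} X^\top y$ exactly:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 2.0

# One linear solve: no iteration needed, unlike the lasso.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.max(np.abs(w_closed - w_sklearn)))  # agreement to machine precision
```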

Watch Out

The 1/(2n) scaling convention matters

Some sources define the lasso with $\frac{1}{2}\|y - Xw\|_2^2$ instead of $\frac{1}{2n}\|y - Xw\|_2^2$. This changes the scale of $\lambda$ by a factor of $n$. When comparing results across sources or software packages, always check the scaling convention.

Summary

  • Lasso objective: $\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$
  • L1 penalty drives coefficients exactly to zero (variable selection)
  • Geometric reason: the L1 ball has corners on the axes; ellipsoids hit corners
  • KKT: $w_j = 0$ when feature $j$ has correlation $< \lambda$ with the residual
  • No closed form in general; use ISTA, FISTA, or coordinate descent
  • Orthonormal case: soft thresholding $S_\lambda(z) = \text{sign}(z)(|z| - \lambda)_+$
  • Elastic net = L1 + L2, which fixes issues with correlated features
  • At most $\min(n, d)$ nonzero coefficients

Exercises

ExerciseCore

Problem

In the orthonormal design case ($X^\top X/n = I$), derive the lasso solution $\hat{w}_j = S_\lambda(\hat{w}_j^{\text{OLS}})$ by solving the lasso optimization problem coordinate-by-coordinate.

ExerciseCore

Problem

Find $\lambda_{\max}$, the smallest value of $\lambda$ for which the lasso solution is $\hat{w} = 0$. Express it in terms of $X$ and $y$.

ExerciseAdvanced

Problem

Show that when two features are identical ($x_1 = x_2$), any lasso solution satisfies $\hat{w}_1 + \hat{w}_2 = c$ for some constant $c$ (the individual values are not uniquely determined). How does the elastic net resolve this?


References

Canonical:

  • Tibshirani, "Regression Shrinkage and Selection via the Lasso," JRSS-B (1996)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4
  • Efron et al., "Least Angle Regression," Annals of Statistics (2004)

Support recovery and theory:

  • Zhao & Yu, "On Model Selection Consistency of Lasso," JMLR (2006). Irrepresentable condition.
  • Wainwright, High-Dimensional Statistics (2019), Chapters 7-8

Current:

  • Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015), Chapters 2-3, 5
  • Friedman, Hastie, Tibshirani, "Regularization Paths for GLMs via Coordinate Descent," J. Stat. Software (2010). The glmnet algorithm.

Next Topics

Building on lasso regression:

  • Elastic net: combining L1 and L2 for correlated features
  • Compressed sensing: sparse recovery beyond regression
  • High-dimensional statistics: theory of estimation when $d \gg n$

Last reviewed: April 2026
