
Regression Methods

Elastic Net

Combining L1 and L2 penalties: elastic net gets sparsity from lasso and grouping stability from ridge, solving the failure mode where lasso arbitrarily selects among correlated features.


Why This Matters

Ridge regression and lasso each solve half the regularization problem. Ridge shrinks all coefficients but never produces sparsity --- you keep every feature. Lasso produces sparsity but behaves erratically with correlated features: if two features are highly correlated, lasso arbitrarily picks one and ignores the other, and the choice is unstable (small data perturbations can switch which feature is selected).

Elastic net combines both penalties and gets both benefits: sparsity from the L1 term and grouping from the L2 term. When features are correlated, elastic net includes or excludes them as a group, giving stable and interpretable models.

Mental Model

Think of elastic net as a compromise between ridge and lasso. The L1 penalty creates a diamond-shaped constraint region (with corners that produce exact zeros), while the L2 penalty creates a circular constraint region (smooth, no corners). Elastic net's constraint region is a "rounded diamond" --- it has the corners of the diamond (enabling sparsity) but the curvature of the circle (enabling grouping and stability).

When two features are identical, lasso can put all the weight on either one (the solution is not unique). Elastic net splits the weight evenly between them, which is both more stable and more interpretable.

Formal Setup

Definition

Elastic Net

The elastic net estimator minimizes:

$$\hat{w}_{\text{enet}} = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

where $\lambda_1 > 0$ controls sparsity (L1 penalty) and $\lambda_2 > 0$ controls shrinkage (L2 penalty).

An equivalent parameterization uses a mixing parameter $\alpha \in [0, 1]$:

$$\hat{w}_{\text{enet}} = \arg\min_{w} \|y - Xw\|_2^2 + \lambda\left[\alpha \|w\|_1 + (1 - \alpha)\|w\|_2^2\right]$$

Here $\alpha = 1$ gives lasso, $\alpha = 0$ gives ridge, and $\alpha \in (0, 1)$ gives elastic net. This parameterization is used by most software (scikit-learn, glmnet).
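Matching the penalty terms of the two objectives coefficient by coefficient shows how the parameterizations correspond:

$$\lambda_1 = \lambda\alpha, \qquad \lambda_2 = \lambda(1 - \alpha) \quad\Longleftrightarrow\quad \lambda = \lambda_1 + \lambda_2, \qquad \alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2}$$

So converting from one form to the other is a matter of summing the two penalties and taking the L1 share.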

Definition

Naive Elastic Net vs. Corrected Elastic Net

The solution above is sometimes called the naive elastic net. Because the L2 penalty introduces extra shrinkage on top of the L1 selection, the coefficients are biased toward zero more than necessary.

The corrected elastic net rescales the coefficients:

$$\hat{w}_{\text{corrected}} = (1 + \lambda_2)\,\hat{w}_{\text{enet}}$$

This de-biasing step undoes the extra shrinkage from the L2 term, so the coefficient magnitudes are comparable to what lasso would produce for the selected features. In practice, the correction often improves prediction accuracy.

Why Not Just Lasso? The Grouping Problem

Consider two features $x_j$ and $x_k$ with correlation $\rho = \operatorname{corr}(x_j, x_k) \approx 1$. Lasso's behavior:

  • If $|\rho| = 1$ (perfect correlation), lasso selects one and sets the other to exactly zero. The choice is arbitrary and unstable.
  • If $|\rho|$ is close to 1, lasso tends to select one and heavily shrink the other. Small perturbations of the data can flip which feature is selected.
  • With $p > n$ (more features than samples), lasso selects at most $n$ features, even if more are relevant.

These are not bugs but inherent limitations of the L1 geometry. The L1 ball has sharp corners, and the intersection of the constraint boundary with a flat (degenerate) direction in the loss surface can happen at many different corners.

Main Theorems

Theorem

Elastic Net Grouping Effect

Statement

Let $X$ be the design matrix with columns standardized to unit $\ell_2$ norm. Let $\hat{w}$ be the elastic net solution with $\lambda_1, \lambda_2 > 0$. For any two features $j$ and $k$, define $\rho = x_j^\top x_k$ (their sample correlation). If $\hat{w}_j \cdot \hat{w}_k > 0$ (same sign), then:

$$|\hat{w}_j - \hat{w}_k| \leq \frac{\|y\|_1 \sqrt{2(1 - \rho)}}{\lambda_2}$$

As the correlation $\rho \to 1$, the right side goes to zero: highly correlated features receive nearly identical coefficients.

Intuition

The L2 penalty penalizes solutions where correlated features get very different coefficients. If $x_j \approx x_k$, using $w_j = c, w_k = 0$ costs more in L2 penalty than $w_j = w_k = c/2$, because $c^2 > 2(c/2)^2 = c^2/2$. The L2 term actively encourages spreading weight among correlated features.

Lasso alone has no such incentive: its L1 penalty is $|c|$ for the first option and $2|c/2| = |c|$ for the second, identical. This is why lasso does not group and elastic net does.

Proof Sketch

The KKT stationarity conditions for the elastic net objective (after dividing through by 2) are:

$$-x_j^\top(y - X\hat{w}) + \lambda_2 \hat{w}_j + \tfrac{\lambda_1}{2} s_j = 0$$

where $s_j \in \partial|\hat{w}_j|$ is a subgradient of the L1 norm ($s_j = \operatorname{sign}(\hat{w}_j)$ if $\hat{w}_j \neq 0$).

If $\hat{w}_j$ and $\hat{w}_k$ have the same sign, then $s_j = s_k$, and subtracting the condition for $k$ from the condition for $j$ cancels the L1 terms:

$$x_j^\top(y - X\hat{w}) - x_k^\top(y - X\hat{w}) = \lambda_2(\hat{w}_j - \hat{w}_k)$$

$$(x_j - x_k)^\top(y - X\hat{w}) = \lambda_2(\hat{w}_j - \hat{w}_k)$$

By Cauchy-Schwarz, and because the minimizer satisfies $\|y - X\hat{w}\|_2 \leq \|y\|_2$ (the objective at $\hat{w}$ is at most its value at $w = 0$):

$$\lambda_2 |\hat{w}_j - \hat{w}_k| \leq \|x_j - x_k\|_2 \, \|y - X\hat{w}\|_2 \leq \|x_j - x_k\|_2 \, \|y\|_2$$

Since $\|x_j - x_k\|_2^2 = 2 - 2\rho$ for unit-norm columns:

$$|\hat{w}_j - \hat{w}_k| \leq \frac{\|y\|_2 \sqrt{2(1 - \rho)}}{\lambda_2}$$

(The theorem statement uses $\|y\|_1$, which is a looser bound since $\|y\|_2 \leq \|y\|_1$; $\|y\|_2$ is tighter.)

Why It Matters

The grouping effect is the main theoretical advantage of elastic net over lasso. It means that elastic net does not arbitrarily choose among correlated features --- it selects them as a group and assigns them similar coefficients. This is both more stable (the model does not change drastically with small data perturbations) and more interpretable (you see which groups of features matter, not a random subset).

In genomics, where genes in the same pathway are often correlated, the grouping effect is particularly valuable: elastic net identifies entire pathways rather than individual genes.

Failure Mode

The grouping effect requires $\lambda_2 > 0$. As $\lambda_2 \to 0$, elastic net approaches lasso and the grouping effect vanishes. The bound also depends on $\|y\|$, so for large response values, the grouping guarantee weakens. In practice, the grouping effect is most pronounced when $\lambda_2$ is chosen to be comparable to $\lambda_1$.

Fitting Elastic Net: Coordinate Descent

Definition

Coordinate Descent for Elastic Net

Elastic net is typically fit using coordinate descent: optimize over one coefficient $w_j$ at a time while holding all others fixed.

For coordinate $j$ (with unit-norm columns), the update is:

$$\hat{w}_j \leftarrow \frac{S(\tilde{w}_j, \lambda_1/2)}{1 + \lambda_2}$$

where $\tilde{w}_j = x_j^\top(y - X_{-j}\hat{w}_{-j})$ is the "partial residual" projected onto feature $j$, and $S$ is the soft-thresholding operator:

$$S(z, \lambda) = \operatorname{sign}(z) \max(|z| - \lambda, 0) = \begin{cases} z - \lambda & \text{if } z > \lambda \\ 0 & \text{if } |z| \leq \lambda \\ z + \lambda & \text{if } z < -\lambda \end{cases}$$

The soft-thresholding comes from the L1 term (lasso), and the $1/(1 + \lambda_2)$ denominator comes from the L2 term (ridge); the threshold is $\lambda_1/2$ rather than $\lambda_1$ because the objective above has no $\tfrac{1}{2}$ factor on the squared loss. Cycling through all coordinates until convergence gives the elastic net solution.

Coordinate descent is efficient because:

  • Each update is $O(n)$ (compute the partial residual and apply soft-thresholding)
  • The entire path of solutions over a grid of $\lambda_1$ values can be computed efficiently using warm starts (start from the previous solution)
  • The glmnet package implements this with highly optimized Fortran code
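As an unofficial sketch (not glmnet's implementation), the coordinate update can be written in a few lines of NumPy. The function names are ours, the columns of `X` are assumed to have unit L2 norm, and the threshold is `lam1 / 2` because the objective as stated carries no ½ on the squared loss:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iters=500, tol=1e-10):
    """Coordinate descent for min ||y - Xw||^2 + lam1*||w||_1 + lam2*||w||_2^2.

    Assumes the columns of X are standardized to unit L2 norm.
    """
    n, d = X.shape
    w = np.zeros(d)
    r = y.astype(float).copy()  # residual y - Xw, maintained incrementally
    for _ in range(n_iters):
        max_delta = 0.0
        for j in range(d):
            # Partial-residual correlation x_j^T (y - X_{-j} w_{-j});
            # adding w[j] back is valid because ||x_j||_2 = 1.
            w_tilde = X[:, j] @ r + w[j]
            w_new = soft_threshold(w_tilde, lam1 / 2) / (1 + lam2)
            if w_new != w[j]:
                r -= (w_new - w[j]) * X[:, j]  # keep residual in sync
                max_delta = max(max_delta, abs(w_new - w[j]))
                w[j] = w_new
        if max_delta < tol:  # no coordinate moved: converged
            break
    return w
```

Setting `lam2=0` reduces each update to the lasso coordinate step, and `lam1=0` gives pure ridge-style shrinkage, matching the limiting cases above.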

Choosing Hyperparameters

Elastic net has two hyperparameters ($\lambda_1, \lambda_2$, or equivalently $\lambda, \alpha$). The standard approach:

  1. Fix $\alpha$ (the mixing parameter) at a few values, e.g., $\alpha \in \{0.1, 0.25, 0.5, 0.75, 0.9\}$
  2. For each $\alpha$, use cross-validation over $\lambda$ (the overall penalty strength) to find the best $\lambda$
  3. Select the $(\alpha, \lambda)$ pair with lowest CV error

In practice, $\alpha = 0.5$ is a common default that balances sparsity and grouping equally. Smaller $\alpha$ (more ridge) is appropriate when many features are relevant; larger $\alpha$ (more lasso) when you expect a sparse model.
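This workflow is what scikit-learn's `ElasticNetCV` automates. A minimal sketch on synthetic data (the data-generating choices are illustrative only); note scikit-learn's naming, where `l1_ratio` plays the role of $\alpha$ and its `alpha` plays the role of the overall strength $\lambda$:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = 2.0                           # sparse ground truth
y = X @ w_true + 0.5 * rng.normal(size=n)  # noisy linear response

# Steps 1-3: grid over the mixing parameter, 5-fold CV over penalty strength.
model = ElasticNetCV(l1_ratio=[0.1, 0.25, 0.5, 0.75, 0.9], cv=5)
model.fit(X, y)

print(model.l1_ratio_)  # selected mixing parameter (alpha in our notation)
print(model.alpha_)     # selected penalty strength (lambda in our notation)
```

The fitted `model.coef_` then holds the elastic net coefficients at the selected $(\alpha, \lambda)$ pair.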

Canonical Examples

Example

Elastic net in the orthonormal design case

When $X^\top X = I$ (orthonormal columns), elastic net gives:

$$\hat{w}_j = \frac{S(\hat{w}_j^{\text{OLS}}, \lambda_1/2)}{1 + \lambda_2}$$

Compare: lasso gives $S(\hat{w}_j^{\text{OLS}}, \lambda_1/2)$ (soft threshold, no denominator). Ridge gives $\hat{w}_j^{\text{OLS}}/(1 + \lambda_2)$ (shrinkage, no thresholding).

Elastic net first soft-thresholds (setting small coefficients to zero), then shrinks the survivors (pulling them toward zero). It combines variable selection and coefficient shrinkage in a single step.
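The three closed forms are easy to compare side by side; a quick sketch where the OLS values and penalty levels are made up for illustration:

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

w_ols = np.array([3.0, 0.4, -1.0])  # hypothetical OLS estimates
lam1, lam2 = 1.0, 1.0

lasso = soft(w_ols, lam1 / 2)              # selection, no extra shrinkage
ridge = w_ols / (1 + lam2)                 # shrinkage, no selection
enet = soft(w_ols, lam1 / 2) / (1 + lam2)  # selection, then shrinkage

# lasso -> values 2.5, 0.0, -0.5
# ridge -> values 1.5, 0.2, -0.5
# enet  -> values 1.25, 0.0, -0.25
```

Only lasso and elastic net zero out the small middle coefficient; only ridge and elastic net shrink the survivors.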

Example

Correlated features: lasso vs. elastic net

Suppose $x_1 = x_2 + \epsilon$ where $\epsilon$ is small noise, and $y = x_1 + x_2 + \text{noise}$. The true coefficients are $w_1 = w_2 = 1$.

Lasso: might give $\hat{w}_1 = 2, \hat{w}_2 = 0$ or $\hat{w}_1 = 0, \hat{w}_2 = 2$, depending on the noise realization. Both solutions have the same L1 penalty.

Elastic net: gives $\hat{w}_1 \approx \hat{w}_2 \approx 1$ (after correction). The L2 penalty breaks the symmetry by preferring the equal-weight solution ($1^2 + 1^2 = 2 < 4 = 2^2 + 0^2$).
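A small simulation makes the contrast visible. This is a sketch: the noise levels and penalty settings are arbitrary choices, and scikit-learn's `alpha` and `l1_ratio` are the overall strength and L1 mix respectively:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x1 = x2 + 0.01 * rng.normal(size=n)     # nearly identical features
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)  # true weights are (1, 1)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(lasso.coef_)  # weight tends to pile onto one of the two features
print(enet.coef_)   # weight is split roughly evenly between them
```

The gap $|\hat{w}_1 - \hat{w}_2|$ is large for lasso and small for elastic net, exactly as the grouping theorem predicts.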

Common Confusions

Watch Out

Elastic net is not just lasso plus ridge applied separately

Running lasso and ridge separately and averaging the results is not elastic net. Elastic net optimizes the L1 and L2 penalties jointly in a single objective. The interaction between the two penalties produces the grouping effect, which does not arise from averaging separate solutions.

Watch Out

The corrected elastic net matters for coefficient magnitude

The naive elastic net applies extra shrinkage from the L2 term, so its coefficients are systematically smaller in magnitude than lasso coefficients. The $(1 + \lambda_2)$ rescaling corrects this, making the coefficient magnitudes comparable. Note that common implementations such as glmnet and scikit-learn solve the naive objective without rescaling, so it is important to know the correction exists.

Watch Out

alpha = 0 is ridge, alpha = 1 is lasso

In the $(\lambda, \alpha)$ parameterization, $\alpha$ controls the type of regularization (L1 vs. L2 mix) and $\lambda$ controls the amount. Students sometimes confuse the roles. Setting $\alpha = 0.5$ does not mean "half as much regularization"; it means "equal balance between L1 and L2."

Summary

  • Elastic net: $\min \|y - Xw\|^2 + \lambda_1\|w\|_1 + \lambda_2\|w\|_2^2$
  • Combines lasso (sparsity) and ridge (grouping, stability)
  • Grouping effect: correlated features get similar coefficients (with bound $\propto \sqrt{1 - \rho}/\lambda_2$)
  • Lasso fails with correlated features (picks one arbitrarily); elastic net includes them as a group
  • Coordinate descent with soft-thresholding is the standard fitting algorithm
  • Corrected elastic net rescales by $(1 + \lambda_2)$ to undo extra shrinkage
  • In $(\lambda, \alpha)$ form: $\alpha = 1$ is lasso, $\alpha = 0$ is ridge
  • Choose $\alpha$ by cross-validation, typically from a small grid

Exercises

ExerciseCore

Problem

In the orthonormal design case ($X^\top X = I$), write down the elastic net solution for each coefficient $w_j$ in terms of the OLS estimate $\hat{w}_j^{\text{OLS}}$, $\lambda_1$, and $\lambda_2$. Verify that setting $\lambda_2 = 0$ recovers lasso and $\lambda_1 = 0$ recovers ridge.

ExerciseAdvanced

Problem

Prove the grouping effect: if $x_j$ and $x_k$ are standardized to unit norm with correlation $\rho = x_j^\top x_k$, and both $\hat{w}_j$ and $\hat{w}_k$ are nonzero with the same sign, then $|\hat{w}_j - \hat{w}_k| \leq \frac{\|y\|_2 \sqrt{2(1-\rho)}}{\lambda_2}$.

ExerciseAdvanced

Problem

In the $(\lambda, \alpha)$ parameterization, suppose you run 5-fold cross-validation over a grid of 50 $\lambda$ values for each of $\alpha \in \{0.1, 0.5, 0.9\}$. How many models must be fit in total? If the design matrix is $1000 \times 200$ and coordinate descent takes $k$ passes over all features per $\lambda$ value, what is the total computational cost?

References

Canonical:

  • Zou & Hastie, "Regularization and Variable Selection via the Elastic Net" (JRSS-B, 2005)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4

Current:

  • Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015), Chapter 4
  • Friedman, Hastie, Tibshirani, "Regularization Paths for Generalized Linear Models via Coordinate Descent" (JSS, 2010) --- the glmnet paper

Last reviewed: April 2026
