
Regression Methods

Elastic Net

Combining L1 and L2 penalties: elastic net gets sparsity from lasso and grouping stability from ridge, solving the failure mode where lasso arbitrarily selects among correlated features.


Why This Matters

Ridge regression and lasso each solve half the regularization problem. Ridge shrinks all coefficients but never produces sparsity --- you keep every feature. Lasso produces sparsity but behaves erratically with correlated features: if two features are highly correlated, lasso arbitrarily picks one and ignores the other, and the choice is unstable (small data perturbations can switch which feature is selected).

Elastic net combines both penalties and gets both benefits: sparsity from the L1 term and grouping from the L2 term. When features are correlated, elastic net includes or excludes them as a group, giving stable and interpretable models.

Mental Model

Think of elastic net as a compromise between ridge and lasso. The L1 penalty creates a diamond-shaped constraint region (with corners that produce exact zeros), while the L2 penalty creates a circular constraint region (smooth, no corners). Elastic net's constraint region is a "rounded diamond" --- it has the corners of the diamond (enabling sparsity) but the curvature of the circle (enabling grouping and stability).

When two features are identical, lasso can put all the weight on either one (the solution is not unique). Elastic net splits the weight evenly between them, which is both more stable and more interpretable.

Formal Setup

Definition

Elastic Net

The elastic net estimator minimizes:

$$\hat{w}_{\text{enet}} = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

where $\lambda_1 > 0$ controls sparsity (L1 penalty) and $\lambda_2 > 0$ controls shrinkage (L2 penalty).

An equivalent parameterization uses a mixing parameter $\alpha \in [0, 1]$:

$$\hat{w}_{\text{enet}} = \arg\min_{w} \|y - Xw\|_2^2 + \lambda\left[\alpha \|w\|_1 + (1 - \alpha)\|w\|_2^2\right]$$

Here $\alpha = 1$ gives lasso, $\alpha = 0$ gives ridge, and $\alpha \in (0, 1)$ gives elastic net. This parameterization is used by most software (scikit-learn, glmnet).
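Matching the penalty terms of the two objectives coefficient by coefficient shows how the parameterizations correspond:

$$\lambda_1 = \lambda\alpha, \qquad \lambda_2 = \lambda(1 - \alpha) \quad\Longleftrightarrow\quad \lambda = \lambda_1 + \lambda_2, \qquad \alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2}$$

So converting from one form to the other is a matter of summing the two penalties and taking the L1 share.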

Definition

Naive Elastic Net vs. Corrected Elastic Net

The solution above is sometimes called the naive elastic net. Because the L2 penalty introduces extra shrinkage on top of the L1 selection, the coefficients are biased toward zero more than necessary.

The corrected elastic net rescales the coefficients:

$$\hat{w}_{\text{corrected}} = (1 + \lambda_2)\,\hat{w}_{\text{enet}}$$

This de-biasing step undoes the extra shrinkage from the L2 term, so the coefficient magnitudes are comparable to what lasso would produce for the selected features. In practice, the correction often improves prediction accuracy.

Why Not Just Lasso? The Grouping Problem

Consider two features $x_j$ and $x_k$ with correlation $\rho = \operatorname{corr}(x_j, x_k) \approx 1$. Lasso's behavior:

  • If $|\rho| = 1$ (perfect correlation), lasso selects one and sets the other to exactly zero. The choice is arbitrary and unstable.
  • If $|\rho|$ is close to 1, lasso tends to select one and heavily shrink the other. Small perturbations of the data can flip which feature is selected.
  • With $p > n$ (more features than samples), lasso selects at most $n$ features, even if more are relevant.

These are not bugs but inherent limitations of the L1 geometry. The L1 ball has sharp corners, and the intersection of the constraint boundary with a flat (degenerate) direction in the loss surface can happen at many different corners.

Main Theorems

Theorem

Elastic Net Grouping Effect

Statement

Let $X$ be the design matrix with columns standardized to unit $\ell_2$ norm. Let $\hat{w}$ be the elastic net solution with $\lambda_1, \lambda_2 > 0$. For any two features $j$ and $k$, define $\rho = x_j^\top x_k$ (their sample correlation). If $\hat{w}_j \cdot \hat{w}_k > 0$ (same sign), then:

$$|\hat{w}_j - \hat{w}_k| \leq \frac{\|y\|_1 \sqrt{2(1 - \rho)}}{\lambda_2}$$

As the correlation $\rho \to 1$, the right side goes to zero: highly correlated features receive nearly identical coefficients.

Intuition

The L2 penalty penalizes solutions where correlated features get very different coefficients. If $x_j \approx x_k$, using $w_j = c, w_k = 0$ costs more in L2 penalty than $w_j = w_k = c/2$, because $c^2 > 2(c/2)^2 = c^2/2$. The L2 term actively encourages spreading weight among correlated features.

Lasso alone has no such incentive: its L1 penalty is $|c|$ for the first option and $2|c/2| = |c|$ for the second, identical. This is why lasso does not group and elastic net does.

Proof Sketch

The KKT stationarity conditions for the elastic net objective (after dividing through by 2) are:

$$-x_j^\top(y - X\hat{w}) + \lambda_2 \hat{w}_j + \tfrac{\lambda_1}{2} s_j = 0$$

where $s_j \in \partial|\hat{w}_j|$ is a subgradient of the L1 norm ($s_j = \operatorname{sign}(\hat{w}_j)$ if $\hat{w}_j \neq 0$).

If $\hat{w}_j$ and $\hat{w}_k$ have the same sign, then $s_j = s_k$, and subtracting the condition for $k$ from the condition for $j$ cancels the L1 terms:

$$x_j^\top(y - X\hat{w}) - x_k^\top(y - X\hat{w}) = \lambda_2(\hat{w}_j - \hat{w}_k)$$

$$(x_j - x_k)^\top(y - X\hat{w}) = \lambda_2(\hat{w}_j - \hat{w}_k)$$

By Cauchy-Schwarz, and because the minimizer satisfies $\|y - X\hat{w}\|_2 \leq \|y\|_2$ (the objective at $\hat{w}$ is at most its value at $w = 0$):

$$\lambda_2 |\hat{w}_j - \hat{w}_k| \leq \|x_j - x_k\|_2 \, \|y - X\hat{w}\|_2 \leq \|x_j - x_k\|_2 \, \|y\|_2$$

Since $\|x_j - x_k\|_2^2 = 2 - 2\rho$ for unit-norm columns:

$$|\hat{w}_j - \hat{w}_k| \leq \frac{\|y\|_2 \sqrt{2(1 - \rho)}}{\lambda_2}$$

(The theorem statement uses $\|y\|_1$, which is a looser bound since $\|y\|_2 \leq \|y\|_1$; $\|y\|_2$ is tighter.)

Why It Matters

The grouping effect is the main theoretical advantage of elastic net over lasso. It means that elastic net does not arbitrarily choose among correlated features --- it selects them as a group and assigns them similar coefficients. This is both more stable (the model does not change drastically with small data perturbations) and more interpretable (you see which groups of features matter, not a random subset).

In genomics, where genes in the same pathway are often correlated, the grouping effect is particularly valuable: elastic net identifies entire pathways rather than individual genes.

Failure Mode

The grouping effect requires $\lambda_2 > 0$. As $\lambda_2 \to 0$, elastic net approaches lasso and the grouping effect vanishes. The bound also depends on $\|y\|$, so for large response values, the grouping guarantee weakens. In practice, the grouping effect is most pronounced when $\lambda_2$ is chosen to be comparable to $\lambda_1$.

Fitting Elastic Net: Coordinate Descent

Definition

Coordinate Descent for Elastic Net

Elastic net is typically fit using coordinate descent: optimize over one coefficient $w_j$ at a time while holding all others fixed.

For coordinate $j$ (with unit-norm columns), the update is:

$$\hat{w}_j \leftarrow \frac{S(\tilde{w}_j, \lambda_1/2)}{1 + \lambda_2}$$

where $\tilde{w}_j = x_j^\top(y - X_{-j}\hat{w}_{-j})$ is the "partial residual" projected onto feature $j$, and $S$ is the soft-thresholding operator:

$$S(z, \lambda) = \operatorname{sign}(z) \max(|z| - \lambda, 0) = \begin{cases} z - \lambda & \text{if } z > \lambda \\ 0 & \text{if } |z| \leq \lambda \\ z + \lambda & \text{if } z < -\lambda \end{cases}$$

The soft-thresholding comes from the L1 term (lasso), and the $1/(1 + \lambda_2)$ denominator comes from the L2 term (ridge); the threshold is $\lambda_1/2$ rather than $\lambda_1$ because the objective above has no $\tfrac{1}{2}$ factor on the squared loss. Cycling through all coordinates until convergence gives the elastic net solution.

Coordinate descent is efficient because:

  • Each update is $O(n)$ (compute the partial residual and apply soft-thresholding)
  • The entire path of solutions over a grid of $\lambda_1$ values can be computed efficiently using warm starts (start from the previous solution)
  • The glmnet package implements this with highly optimized Fortran code
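As an unofficial sketch (not glmnet's implementation), the coordinate update can be written in a few lines of NumPy. The function names are ours, the columns of `X` are assumed to have unit L2 norm, and the threshold is `lam1 / 2` because the objective as stated carries no ½ on the squared loss:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iters=500, tol=1e-10):
    """Coordinate descent for min ||y - Xw||^2 + lam1*||w||_1 + lam2*||w||_2^2.

    Assumes the columns of X are standardized to unit L2 norm.
    """
    n, d = X.shape
    w = np.zeros(d)
    r = y.astype(float).copy()  # residual y - Xw, maintained incrementally
    for _ in range(n_iters):
        max_delta = 0.0
        for j in range(d):
            # Partial-residual correlation x_j^T (y - X_{-j} w_{-j});
            # adding w[j] back is valid because ||x_j||_2 = 1.
            w_tilde = X[:, j] @ r + w[j]
            w_new = soft_threshold(w_tilde, lam1 / 2) / (1 + lam2)
            if w_new != w[j]:
                r -= (w_new - w[j]) * X[:, j]  # keep residual in sync
                max_delta = max(max_delta, abs(w_new - w[j]))
                w[j] = w_new
        if max_delta < tol:  # no coordinate moved: converged
            break
    return w
```

Setting `lam2=0` reduces each update to the lasso coordinate step, and `lam1=0` gives pure ridge-style shrinkage, matching the limiting cases above.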

Choosing Hyperparameters

Elastic net has two hyperparameters ($\lambda_1, \lambda_2$, or equivalently $\lambda, \alpha$). The standard approach:

  1. Fix $\alpha$ (the mixing parameter) at a few values, e.g., $\alpha \in \{0.1, 0.25, 0.5, 0.75, 0.9\}$
  2. For each $\alpha$, use cross-validation over $\lambda$ (the overall penalty strength) to find the best $\lambda$
  3. Select the $(\alpha, \lambda)$ pair with lowest CV error

In practice, $\alpha = 0.5$ is a common default that balances sparsity and grouping equally. Smaller $\alpha$ (more ridge) is appropriate when many features are relevant; larger $\alpha$ (more lasso) when you expect a sparse model.
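This workflow is what scikit-learn's `ElasticNetCV` automates. A minimal sketch on synthetic data (the data-generating choices are illustrative only); note scikit-learn's naming, where `l1_ratio` plays the role of $\alpha$ and its `alpha` plays the role of the overall strength $\lambda$:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = 2.0                           # sparse ground truth
y = X @ w_true + 0.5 * rng.normal(size=n)  # noisy linear response

# Steps 1-3: grid over the mixing parameter, 5-fold CV over penalty strength.
model = ElasticNetCV(l1_ratio=[0.1, 0.25, 0.5, 0.75, 0.9], cv=5)
model.fit(X, y)

print(model.l1_ratio_)  # selected mixing parameter (alpha in our notation)
print(model.alpha_)     # selected penalty strength (lambda in our notation)
```

The fitted `model.coef_` then holds the elastic net coefficients at the selected $(\alpha, \lambda)$ pair.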

Canonical Examples

Example

Elastic net in the orthonormal design case

When $X^\top X = I$ (orthonormal columns), elastic net gives:

$$\hat{w}_j = \frac{S(\hat{w}_j^{\text{OLS}}, \lambda_1/2)}{1 + \lambda_2}$$

Compare: lasso gives $S(\hat{w}_j^{\text{OLS}}, \lambda_1/2)$ (soft threshold, no denominator). Ridge gives $\hat{w}_j^{\text{OLS}}/(1 + \lambda_2)$ (shrinkage, no thresholding).

Elastic net first soft-thresholds (setting small coefficients to zero), then shrinks the survivors (pulling them toward zero). It combines variable selection and coefficient shrinkage in a single step.
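The three closed forms are easy to compare side by side; a quick sketch where the OLS values and penalty levels are made up for illustration:

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

w_ols = np.array([3.0, 0.4, -1.0])  # hypothetical OLS estimates
lam1, lam2 = 1.0, 1.0

lasso = soft(w_ols, lam1 / 2)              # selection, no extra shrinkage
ridge = w_ols / (1 + lam2)                 # shrinkage, no selection
enet = soft(w_ols, lam1 / 2) / (1 + lam2)  # selection, then shrinkage

# lasso -> values 2.5, 0.0, -0.5
# ridge -> values 1.5, 0.2, -0.5
# enet  -> values 1.25, 0.0, -0.25
```

Only lasso and elastic net zero out the small middle coefficient; only ridge and elastic net shrink the survivors.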

Example

Correlated features: lasso vs. elastic net

Suppose $x_1 = x_2 + \epsilon$ where $\epsilon$ is small noise, and $y = x_1 + x_2 + \text{noise}$. The true coefficients are $w_1 = w_2 = 1$.

Lasso: might give $\hat{w}_1 = 2, \hat{w}_2 = 0$ or $\hat{w}_1 = 0, \hat{w}_2 = 2$, depending on the noise realization. Both solutions have the same L1 penalty.

Elastic net: gives $\hat{w}_1 \approx \hat{w}_2 \approx 1$ (after correction). The L2 penalty breaks the symmetry by preferring the equal-weight solution ($1^2 + 1^2 = 2 < 4 = 2^2 + 0^2$).
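A small simulation makes the contrast visible. This is a sketch: the noise levels and penalty settings are arbitrary choices, and scikit-learn's `alpha` and `l1_ratio` are the overall strength and L1 mix respectively:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x1 = x2 + 0.01 * rng.normal(size=n)     # nearly identical features
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)  # true weights are (1, 1)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(lasso.coef_)  # weight tends to pile onto one of the two features
print(enet.coef_)   # weight is split roughly evenly between them
```

The gap $|\hat{w}_1 - \hat{w}_2|$ is large for lasso and small for elastic net, exactly as the grouping theorem predicts.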

Common Confusions

Watch Out

Elastic net is not just lasso plus ridge applied separately

Running lasso and ridge separately and averaging the results is not elastic net. Elastic net optimizes the L1 and L2 penalties jointly in a single objective. The interaction between the two penalties produces the grouping effect, which does not arise from averaging separate solutions.

Watch Out

The corrected elastic net matters for coefficient magnitude

The naive elastic net applies extra shrinkage from the L2 term, so its coefficients are systematically smaller in magnitude than lasso coefficients. The $(1 + \lambda_2)$ rescaling corrects this, making the coefficient magnitudes comparable. Note that common implementations such as glmnet and scikit-learn solve the naive objective without rescaling, so it is important to know the correction exists.

Watch Out

alpha = 0 is ridge, alpha = 1 is lasso

In the $(\lambda, \alpha)$ parameterization, $\alpha$ controls the type of regularization (L1 vs. L2 mix) and $\lambda$ controls the amount. Students sometimes confuse the roles. Setting $\alpha = 0.5$ does not mean "half as much regularization"; it means "equal balance between L1 and L2."

Summary

  • Elastic net: $\min \|y - Xw\|^2 + \lambda_1\|w\|_1 + \lambda_2\|w\|_2^2$
  • Combines lasso (sparsity) and ridge (grouping, stability)
  • Grouping effect: correlated features get similar coefficients (with bound $\propto \sqrt{1 - \rho}/\lambda_2$)
  • Lasso fails with correlated features (picks one arbitrarily); elastic net includes them as a group
  • Coordinate descent with soft-thresholding is the standard fitting algorithm
  • Corrected elastic net rescales by $(1 + \lambda_2)$ to undo extra shrinkage
  • In $(\lambda, \alpha)$ form: $\alpha = 1$ is lasso, $\alpha = 0$ is ridge
  • Choose $\alpha$ by cross-validation, typically from a small grid

Exercises

ExerciseCore

Problem

In the orthonormal design case ($X^\top X = I$), write down the elastic net solution for each coefficient $w_j$ in terms of the OLS estimate $\hat{w}_j^{\text{OLS}}$, $\lambda_1$, and $\lambda_2$. Verify that setting $\lambda_2 = 0$ recovers lasso and $\lambda_1 = 0$ recovers ridge.

ExerciseAdvanced

Problem

Prove the grouping effect: if $x_j$ and $x_k$ are standardized to unit norm with correlation $\rho = x_j^\top x_k$, and both $\hat{w}_j$ and $\hat{w}_k$ are nonzero with the same sign, then $|\hat{w}_j - \hat{w}_k| \leq \frac{\|y\|_2 \sqrt{2(1-\rho)}}{\lambda_2}$.

ExerciseAdvanced

Problem

In the $(\lambda, \alpha)$ parameterization, suppose you run 5-fold cross-validation over a grid of 50 $\lambda$ values for each of $\alpha \in \{0.1, 0.5, 0.9\}$. How many models must be fit in total? If the design matrix is $1000 \times 200$ and coordinate descent takes $k$ passes over all features per $\lambda$ value, what is the total computational cost?

References

Canonical:

  • Zou & Hastie, "Regularization and Variable Selection via the Elastic Net" (JRSS-B, 2005)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4

Current:

  • Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015), Chapter 4
  • Friedman, Hastie, Tibshirani, "Regularization Paths for Generalized Linear Models via Coordinate Descent" (JSS, 2010) --- the glmnet paper

Last reviewed: April 2026
