Regression Methods
Elastic Net
Combining L1 and L2 penalties: elastic net gets sparsity from lasso and grouping stability from ridge, solving the failure mode where lasso arbitrarily selects among correlated features.
Why This Matters
Ridge regression and lasso each solve half the regularization problem. Ridge shrinks all coefficients but never produces sparsity --- you keep every feature. Lasso produces sparsity but behaves erratically with correlated features: if two features are highly correlated, lasso arbitrarily picks one and ignores the other, and the choice is unstable (small data perturbations can switch which feature is selected).
Elastic net combines both penalties and gets both benefits: sparsity from the L1 term and grouping from the L2 term. When features are correlated, elastic net includes or excludes them as a group, giving stable and interpretable models.
Mental Model
Think of elastic net as a compromise between ridge and lasso. The L1 penalty creates a diamond-shaped constraint region (with corners that produce exact zeros), while the L2 penalty creates a circular constraint region (smooth, no corners). Elastic net's constraint region is a "rounded diamond" --- it has the corners of the diamond (enabling sparsity) but the curvature of the circle (enabling grouping and stability).
When two features are identical, lasso can put all the weight on either one (the solution is not unique). Elastic net splits the weight evenly between them, which is both more stable and more interpretable.
Formal Setup
Elastic Net
The elastic net estimator minimizes:

$$\hat{\beta} = \arg\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \tfrac{\lambda_2}{2} \|\beta\|_2^2,$$

where $\lambda_1 \ge 0$ controls sparsity (L1 penalty) and $\lambda_2 \ge 0$ controls shrinkage (L2 penalty).
An equivalent parameterization uses a mixing parameter $\alpha \in [0, 1]$:

$$\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \left( \alpha \|\beta\|_1 + \tfrac{1-\alpha}{2} \|\beta\|_2^2 \right), \qquad \lambda_1 = \lambda\alpha, \quad \lambda_2 = \lambda(1-\alpha).$$

Here $\alpha = 1$ gives lasso, $\alpha = 0$ gives ridge, and $0 < \alpha < 1$ gives elastic net. This parameterization is used by most software (scikit-learn, glmnet).
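The two parameterizations describe the same penalty, just indexed differently. A quick numerical check (a sketch; the function names are illustrative):

```python
import numpy as np

def penalty_two_param(beta, lam1, lam2):
    """Elastic net penalty in the (lambda1, lambda2) parameterization."""
    return lam1 * np.abs(beta).sum() + 0.5 * lam2 * np.square(beta).sum()

def penalty_mixed(beta, lam, alpha):
    """Same penalty in the (lambda, alpha) mixing parameterization,
    with lambda1 = lambda * alpha and lambda2 = lambda * (1 - alpha)."""
    return lam * (alpha * np.abs(beta).sum()
                  + 0.5 * (1 - alpha) * np.square(beta).sum())

beta = np.array([1.5, -0.5, 0.0, 2.0])
lam, alpha = 0.8, 0.25
lam1, lam2 = lam * alpha, lam * (1 - alpha)
print(penalty_two_param(beta, lam1, lam2))  # same value both ways
print(penalty_mixed(beta, lam, alpha))
```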
Naive Elastic Net vs. Corrected Elastic Net
The solution above is sometimes called the naive elastic net. Because the L2 penalty introduces extra shrinkage on top of the L1 selection, the coefficients are biased toward zero more than necessary.
The corrected elastic net rescales the coefficients:

$$\hat{\beta}_{\text{corrected}} = (1 + \lambda_2)\, \hat{\beta}_{\text{naive}}.$$
This de-biasing step undoes the extra shrinkage from the L2 term, so the coefficient magnitudes are comparable to what lasso would produce for the selected features. In practice, the correction often improves prediction accuracy.
Why Not Just Lasso? The Grouping Problem
Consider two features $x_1$ and $x_2$ with correlation $\rho$. Lasso's behavior:
- If $\rho = 1$ (perfect correlation), lasso selects one and sets the other to exactly zero. The choice is arbitrary and unstable.
- If $\rho$ is close to 1, lasso tends to select one and heavily shrink the other. Small perturbations of the data can flip which feature is selected.
- With $p > n$ (more features than samples), lasso selects at most $n$ features, even if more are relevant.
These are not bugs but inherent limitations of the L1 geometry. The L1 ball has sharp corners, and the intersection of the constraint boundary with a flat (degenerate) direction in the loss surface can happen at many different corners.
Main Theorems
Elastic Net Grouping Effect
Statement
Let $X$ be the design matrix with columns standardized to unit norm. Let $\hat{\beta}$ be the elastic net solution with $\lambda_2 > 0$. For any two features $i$ and $j$, define $\rho = x_i^\top x_j$ (their sample correlation). If $\hat{\beta}_i \hat{\beta}_j > 0$ (same sign), then:

$$|\hat{\beta}_i - \hat{\beta}_j| \le \frac{\|y\|_1}{\lambda_2} \sqrt{2(1 - \rho)}.$$

As the correlation $\rho \to 1$, the right side goes to zero: highly correlated features receive nearly identical coefficients.
Intuition
The L2 penalty penalizes solutions where correlated features get very different coefficients. If $x_i \approx x_j$, using $(\beta, 0)$ costs more in L2 penalty than $(\beta/2, \beta/2)$, because $\beta^2 > \beta^2/4 + \beta^2/4 = \beta^2/2$. The L2 term actively encourages spreading weight among correlated features.

Lasso alone has no such incentive --- its L1 penalty is $|\beta|$ for the first option and $|\beta/2| + |\beta/2| = |\beta|$ for the second, identical. This is why lasso does not group and elastic net does.
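The asymmetry between the two penalties can be verified with a few lines of arithmetic (an illustrative sketch with $\beta = 2$):

```python
import numpy as np

beta = 2.0
concentrated = np.array([beta, 0.0])      # all weight on one of two twins
split = np.array([beta / 2, beta / 2])    # weight shared equally

l1_conc, l1_split = np.abs(concentrated).sum(), np.abs(split).sum()
l2_conc, l2_split = np.square(concentrated).sum(), np.square(split).sum()

print(l1_conc, l1_split)  # identical: L1 is indifferent to the split
print(l2_conc, l2_split)  # split is strictly cheaper: L2 prefers grouping
```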
Proof Sketch
The KKT conditions for elastic net are:

$$x_j^\top (y - X\hat{\beta}) = \lambda_1 s_j + \lambda_2 \hat{\beta}_j \quad \text{for each } j,$$

where $s_j$ is a subgradient of the L1 norm ($s_j = \operatorname{sign}(\hat{\beta}_j)$ if $\hat{\beta}_j \neq 0$).

If $\hat{\beta}_i$ and $\hat{\beta}_j$ have the same sign, then $s_i = s_j$ and:

$$\lambda_2 (\hat{\beta}_i - \hat{\beta}_j) = (x_i - x_j)^\top (y - X\hat{\beta}).$$

By Cauchy-Schwarz:

$$|\hat{\beta}_i - \hat{\beta}_j| \le \frac{1}{\lambda_2} \|x_i - x_j\|_2 \, \|y - X\hat{\beta}\|_2.$$

Since $\|x_i - x_j\|_2^2 = 2(1 - \rho)$ (for unit-norm columns) and $\|y - X\hat{\beta}\|_2 \le \|y\|_2$ (the objective at $\hat{\beta}$ is at most the objective at $\beta = 0$):

$$|\hat{\beta}_i - \hat{\beta}_j| \le \frac{\|y\|_2}{\lambda_2} \sqrt{2(1 - \rho)}.$$

(The statement uses $\|y\|_1$ as a looser bound; $\|y\|_2$ is tighter, since $\|y\|_2 \le \|y\|_1$.)
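The geometric fact in the last step, $\|x_i - x_j\|_2^2 = 2(1 - \rho)$ for unit-norm columns, is easy to confirm numerically (a sketch; the construction of the correlated pair is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95

# Build two unit-norm vectors whose inner product is exactly rho:
u = rng.normal(size=200)
u /= np.linalg.norm(u)
v = rng.normal(size=200)
v -= (u @ v) * u                          # make v orthogonal to u
v /= np.linalg.norm(v)
x_i = u
x_j = rho * u + np.sqrt(1 - rho**2) * v   # unit norm, correlation rho with x_i

print(x_i @ x_j)                          # ~ 0.95
print(np.linalg.norm(x_i - x_j) ** 2)     # ~ 2 * (1 - 0.95) = 0.1
```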
Why It Matters
The grouping effect is the main theoretical advantage of elastic net over lasso. It means that elastic net does not arbitrarily choose among correlated features --- it selects them as a group and assigns them similar coefficients. This is both more stable (the model does not change drastically with small data perturbations) and more interpretable (you see which groups of features matter, not a random subset).
In genomics, where genes in the same pathway are often correlated, the grouping effect is particularly valuable: elastic net identifies entire pathways rather than individual genes.
Failure Mode
The grouping effect requires $\lambda_2 > 0$. As $\lambda_2 \to 0$, elastic net approaches lasso and the grouping effect vanishes. The bound also depends on $\|y\|_1$, so for large response values, the grouping guarantee weakens. In practice, the grouping effect is most pronounced when $\lambda_2$ is chosen to be comparable to $\lambda_1$.
Fitting Elastic Net: Coordinate Descent
Coordinate Descent for Elastic Net
Elastic net is typically fit using coordinate descent: optimize over one coefficient at a time while holding all others fixed.
For coordinate $j$, the update is:

$$\hat{\beta}_j \leftarrow \frac{S(z_j, \lambda_1)}{1 + \lambda_2},$$

where $z_j = x_j^\top \big( y - \sum_{k \neq j} x_k \hat{\beta}_k \big)$ is the "partial residual" projected onto feature $j$, and $S$ is the soft-thresholding operator:

$$S(z, \gamma) = \operatorname{sign}(z) \max(|z| - \gamma, 0).$$

The soft-thresholding comes from the L1 term (lasso), and the denominator $1 + \lambda_2$ comes from the L2 term (ridge). Cycling through all coordinates until convergence gives the elastic net solution.
Coordinate descent is efficient because:
- Each update is $O(n)$ (compute the partial residual and apply soft-thresholding)
- The entire path of solutions over a grid of $\lambda$ values can be computed efficiently using warm starts (start from the previous solution)
- The glmnet package implements this with highly optimized Fortran code
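The update rule above can be sketched in NumPy (a minimal illustration assuming unit-norm columns and the $(\lambda_1, \lambda_2)$ objective used in this section, without the convergence checks or warm starts a real implementation would have):

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iter=100):
    """Naive elastic net via cyclic coordinate descent.

    Minimizes 0.5*||y - X b||^2 + lam1*||b||_1 + 0.5*lam2*||b||^2,
    assuming the columns of X have unit L2 norm.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()        # full residual y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]    # remove feature j -> partial residual
            z_j = X[:, j] @ r         # project partial residual onto x_j
            beta[j] = soft_threshold(z_j, lam1) / (1.0 + lam2)
            r -= X[:, j] * beta[j]    # restore residual with updated beta_j
    return beta
```

With orthonormal columns the coordinate updates decouple, so the result matches the closed-form orthonormal-design solution $S(\hat{\beta}_j^{\text{OLS}}, \lambda_1)/(1+\lambda_2)$.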
Choosing Hyperparameters
Elastic net has two hyperparameters ($\lambda_1, \lambda_2$, or equivalently $\lambda, \alpha$). The standard approach:
- Fix $\alpha$ (the mixing parameter) at a few values, e.g., $\alpha \in \{0.1, 0.5, 0.9\}$
- For each $\alpha$, use cross-validation over $\lambda$ (the overall penalty strength) to find the best $\lambda$
- Select the pair $(\alpha, \lambda)$ with lowest CV error

In practice, $\alpha = 0.5$ is a common default that balances sparsity and grouping equally. Smaller $\alpha$ (more ridge) is appropriate when many features are relevant; larger $\alpha$ (more lasso) when you expect a sparse model.
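This grid-plus-CV recipe is what ElasticNetCV in scikit-learn automates (a sketch on synthetic data; note the naming trap that scikit-learn calls the overall strength alpha and the mixing parameter l1_ratio):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                      # sparse ground truth
y = X @ beta_true + rng.normal(size=n)

# l1_ratio is the mixing parameter (alpha in this section's notation);
# for each value, a path of 50 penalty strengths is cross-validated.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], n_alphas=50, cv=5)
model.fit(X, y)
print(model.l1_ratio_, model.alpha_)     # selected mix and strength
```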
Canonical Examples
Elastic net in the orthonormal design case
When $X^\top X = I$ (orthonormal columns), elastic net gives:

$$\hat{\beta}_j = \frac{S(\hat{\beta}_j^{\text{OLS}}, \lambda_1)}{1 + \lambda_2}.$$

Compare: lasso gives $S(\hat{\beta}_j^{\text{OLS}}, \lambda_1)$ (soft threshold, no denominator). Ridge gives $\hat{\beta}_j^{\text{OLS}} / (1 + \lambda_2)$ (shrinkage, no thresholding).
Elastic net first soft-thresholds (setting small coefficients to zero), then shrinks the survivors (pulling them toward zero). It combines variable selection and coefficient shrinkage in a single step.
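The three closed forms are easy to compare side by side (an illustrative sketch with made-up OLS estimates):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

beta_ols = np.array([3.0, 1.2, 0.4, -2.5])   # hypothetical OLS estimates
lam1, lam2 = 1.0, 0.5

lasso = soft_threshold(beta_ols, lam1)                  # select only
ridge = beta_ols / (1.0 + lam2)                         # shrink only
enet = soft_threshold(beta_ols, lam1) / (1.0 + lam2)    # select, then shrink
enet_corrected = (1.0 + lam2) * enet                    # undo extra shrinkage

print(lasso)           # [ 2.   0.2  0.  -1.5]
print(enet)            # the lasso values, shrunk by 1.5
```

Note that in the orthonormal case the corrected elastic net coincides with lasso exactly; the correction only changes predictions in correlated designs.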
Correlated features: lasso vs. elastic net
Suppose $x_2 = x_1 + \delta$ where $\delta$ is small noise, and $y = x_1 + x_2 + \varepsilon$. The true coefficients are $\beta_1 = \beta_2 = 1$.

Lasso: might give $(\hat{\beta}_1, \hat{\beta}_2) \approx (2, 0)$ or $(0, 2)$, depending on the noise realization. Both solutions have the same L1 penalty.

Elastic net: gives $\hat{\beta} \approx (1, 1)$ (after correction). The L2 penalty breaks the symmetry by preferring the equal-weight solution ($1^2 + 1^2 = 2 < 2^2 + 0^2 = 4$).
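A sketch of this experiment using scikit-learn's ElasticNet (the specific constants are illustrative; scikit-learn's alpha is the overall strength and l1_ratio the mix):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)       # near-duplicate feature
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)    # true coefficients (1, 1)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)   # two nearly equal, positive coefficients
```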
Common Confusions
Elastic net is not just lasso plus ridge applied separately
Running lasso and ridge separately and averaging the results is not elastic net. Elastic net optimizes the L1 and L2 penalties jointly in a single objective. The interaction between the two penalties produces the grouping effect, which does not arise from averaging separate solutions.
The corrected elastic net matters for coefficient magnitude
The naive elastic net applies extra shrinkage from the L2 term, so its coefficients are systematically smaller in magnitude than lasso coefficients. Rescaling by $(1 + \lambda_2)$ corrects this, making the coefficient magnitudes comparable. Not all software applies this correction automatically, so it is important to know it exists.
alpha = 0 is ridge, alpha = 1 is lasso
In the $(\lambda, \alpha)$ parameterization, $\alpha$ controls the type of regularization (L1 vs. L2 mix) and $\lambda$ controls the amount. Students sometimes confuse the roles. Setting $\alpha = 0.5$ does not mean "half as much regularization" --- it means "equal balance between L1 and L2."
Summary
- Elastic net: $\min_\beta \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \tfrac{\lambda_2}{2}\|\beta\|_2^2$
- Combines lasso (sparsity) and ridge (grouping, stability)
- Grouping effect: correlated features get similar coefficients (with bound $\frac{\|y\|_1}{\lambda_2}\sqrt{2(1-\rho)}$)
- Lasso fails with correlated features (picks one arbitrarily); elastic net includes them as a group
- Coordinate descent with soft-thresholding is the standard fitting algorithm
- Corrected elastic net rescales by $(1 + \lambda_2)$ to undo extra shrinkage
- In $(\lambda, \alpha)$ form: $\alpha = 1$ is lasso, $\alpha = 0$ is ridge
- Choose $(\alpha, \lambda)$ by cross-validation, typically $\alpha$ from a small grid
Exercises
Problem
In the orthonormal design case ($X^\top X = I$), write down the elastic net solution for each coefficient in terms of the OLS estimate $\hat{\beta}_j^{\text{OLS}}$, $\lambda_1$, and $\lambda_2$. Verify that setting $\lambda_2 = 0$ recovers lasso and $\lambda_1 = 0$ recovers ridge.
Problem
Prove the grouping effect: if $x_i$ and $x_j$ are standardized to unit norm with correlation $\rho$, and both $\hat{\beta}_i$ and $\hat{\beta}_j$ are nonzero with the same sign, then $|\hat{\beta}_i - \hat{\beta}_j| \le \frac{\|y\|_2}{\lambda_2}\sqrt{2(1-\rho)}$.
Problem
In the $(\lambda, \alpha)$ parameterization, suppose you run 5-fold cross-validation over a grid of 50 $\lambda$ values for each of three values of $\alpha$. How many models must be fit in total? If the design matrix is $n \times p$ and coordinate descent takes $K$ passes over all features per $\lambda$ value, what is the total computational cost?
References
Canonical:
- Zou & Hastie, "Regularization and Variable Selection via the Elastic Net" (JRSS-B, 2005)
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3.4
Current:
- Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015), Chapter 4
- Friedman, Hastie, Tibshirani, "Regularization Paths for Generalized Linear Models via Coordinate Descent" (JSS, 2010) --- the glmnet paper
Next Topics
Building on elastic net:
- Bias-variance tradeoff: the general principle underlying all regularization
- Regularization theory: the broader framework connecting L1, L2, and other penalties
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Ridge Regression (Layer 2)
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in R^n (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Lasso Regression (Layer 2)