Optimization Function Classes
Regularization Theory
Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.
Why This Matters
Empirical risk minimization on a rich hypothesis class overfits: it finds a function that perfectly fits the training data, including its noise. Regularization is the primary tool for preventing this. Every practical ML algorithm uses regularization in some form -- explicit penalty terms, early stopping, dropout, data augmentation, or architectural constraints.
Understanding regularization theory tells you why these techniques work: they all control the complexity of the learned hypothesis, trading a small increase in bias for a large decrease in variance.
Mental Model
ERM minimizes $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$ -- the average training loss. If the hypothesis class $\mathcal{F}$ is too rich, the minimizer overfits. Regularization adds a penalty:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i) + \lambda\, \Omega(f)$$

The penalty $\Omega(f)$ measures the complexity of $f$. The regularization parameter $\lambda \ge 0$ controls the tradeoff:
- Large $\lambda$: strong penalty, simple models, high bias, low variance
- Small $\lambda$: weak penalty, complex models, low bias, high variance
The optimal $\lambda$ balances the bias-variance tradeoff to minimize total risk.
Tikhonov Regularization (L2 / Ridge)
Ridge Regression
For linear regression with squared loss, ridge regression solves:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

The L2 penalty $\lambda\|\beta\|_2^2$ shrinks all coefficients toward zero without setting any exactly to zero.
Ridge Regression Closed Form
Statement
The ridge regression solution is:

$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

In terms of the SVD $X = U \Sigma V^\top$:

$$\hat{\beta}^{\text{ridge}} = \sum_j \frac{\sigma_j}{\sigma_j^2 + \lambda}\, (u_j^\top y)\, v_j$$

where $\sigma_j$ are the singular values.
Intuition
Ridge regression shrinks the OLS coefficient in each singular direction by a factor $\sigma_j^2 / (\sigma_j^2 + \lambda)$. Directions with large singular values (well-determined by the data) are barely shrunk. Directions with small singular values (poorly determined) are shrunk heavily. This is exactly what you want: trust the data where it is informative, shrink toward zero where it is not.
Proof Sketch
Take the gradient of the ridge objective and set it to zero:

$$\nabla_\beta \left( \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right) = -2 X^\top (y - X\beta) + 2\lambda \beta = 0$$

Solving: $(X^\top X + \lambda I)\,\hat{\beta} = X^\top y$. The matrix $X^\top X + \lambda I$ is always invertible for $\lambda > 0$, even when $X^\top X$ is singular (more features than samples).
Why It Matters
Ridge regression resolves the ill-conditioning problem of OLS when $p > n$ or when features are nearly collinear. The addition of $\lambda I$ ensures the matrix $X^\top X + \lambda I$ is invertible. The SVD form shows exactly how regularization acts: it is a continuous shrinkage of the OLS coefficients, with more shrinkage in poorly determined directions.
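The closed form and its SVD version can be checked against each other numerically. This is a minimal sketch assuming a NumPy environment; the function names `ridge_closed_form` and `ridge_svd` are illustrative, not from any particular library.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam*I) beta = X^T y directly."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_svd(X, y, lam):
    """Equivalent solution via the SVD: each direction is
    scaled by sigma_j / (sigma_j^2 + lam)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)

# The two forms agree, and larger lambda means more shrinkage.
assert np.allclose(ridge_closed_form(X, y, 1.0), ridge_svd(X, y, 1.0))
```

The SVD form makes the shrinkage factors explicit, which is useful for computing the solution path over many values of $\lambda$ at the cost of a single decomposition.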
Failure Mode
Ridge never produces exactly sparse solutions. If the true model is sparse (few nonzero coefficients), ridge will keep all coefficients nonzero, which hurts interpretability and, in high-dimensional settings, can lead to worse prediction.
Sparsity and L1 Regularization (Lasso)
Lasso
The Lasso (Least Absolute Shrinkage and Selection Operator) solves:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

where $\|\beta\|_1 = \sum_j |\beta_j|$. The L1 penalty produces sparse solutions: many coefficients are set exactly to zero.
Lasso Produces Sparse Solutions
Statement
The Lasso solution is sparse: for sufficiently large $\lambda$, a subset of the coefficients are exactly zero. The set of nonzero coefficients is called the active set or support.
Intuition
The L1 penalty creates a diamond-shaped constraint region in parameter space. The corners of the diamond lie on the coordinate axes. The loss function (an ellipsoid for squared loss) typically first touches the diamond at a corner, where some coordinates are exactly zero. The L2 penalty creates a spherical constraint region with no corners, so the tangent point has all coordinates nonzero.
Proof Sketch
The subdifferential optimality condition for the Lasso is:

$$X^\top (X\hat{\beta} - y) + \lambda s = 0, \qquad s \in \partial \|\hat{\beta}\|_1$$

where $s_j = \operatorname{sign}(\hat{\beta}_j)$ if $\hat{\beta}_j \neq 0$, and $s_j \in [-1, 1]$ if $\hat{\beta}_j = 0$.

If $|x_j^\top (y - X\hat{\beta})| \le \lambda$, the only solution is $\hat{\beta}_j = 0$ (the gradient is not large enough to "escape" zero). This is the soft-thresholding mechanism that produces sparsity.
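The soft-thresholding mechanism can be made concrete with a small coordinate-descent solver. This is a sketch under the assumption of a NumPy environment; `soft_threshold` and `lasso_coordinate_descent` are illustrative names, not library functions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: proximal map of t * |.|.
    Returns exactly 0 whenever |z| <= t -- the source of sparsity."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Cyclic coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual with feature j's contribution removed.
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.5]        # sparse ground truth
y = X @ true_beta + 0.1 * rng.normal(size=100)

beta_hat = lasso_coordinate_descent(X, y, lam=10.0)
# Irrelevant coefficients come out exactly zero, not merely small.
```

Note that the zeros are exact, not numerically tiny: whenever the correlation of a feature with the residual falls below $\lambda$, the operator clamps the coefficient to zero.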
Why It Matters
Sparsity is a powerful inductive bias for high-dimensional problems where $p \gg n$. If only $s \ll p$ features are relevant, the Lasso can identify them and achieve the rate $O(s \log p / n)$, which depends on the sparsity $s$ and only logarithmically on the total dimension $p$. Without sparsity, consistent estimation in the $p \gg n$ regime is impossible.
Failure Mode
The Lasso performs poorly when features are highly correlated (the "collinearity" problem). With correlated features, the Lasso tends to select one and zero out the others, even if all are relevant. Elastic net addresses this.
Elastic Net
Elastic Net
The elastic net combines L1 and L2 penalties:

$$\hat{\beta} = \arg\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$
This produces sparse solutions (from L1) while handling correlated features gracefully (from L2). When correlated features are present, elastic net tends to select them as a group rather than picking one arbitrarily.
The Bayesian Interpretation
There is a deep and exact correspondence between regularization penalties and Bayesian priors:
L2 regularization = Gaussian prior: Minimizing $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$ is equivalent to finding the MAP (maximum a posteriori) estimate under the prior $\beta \sim \mathcal{N}(0, \tau^2 I)$ with Gaussian likelihood.
L1 regularization = Laplace prior: Minimizing $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$ is equivalent to MAP estimation with a Laplace prior $p(\beta_j) \propto \exp(-|\beta_j|/b)$. The sharp peak of the Laplace density at zero is what produces sparsity.
MAP Estimation and Regularization
The MAP estimate is:

$$\hat{\beta}_{\text{MAP}} = \arg\max_{\beta} \; \log p(y \mid X, \beta) + \log p(\beta)$$

With Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$, adding a prior is exactly adding a regularization penalty:
- $\beta_j \sim \mathcal{N}(0, \tau^2)$: gives $\lambda \|\beta\|_2^2$ with $\lambda = \sigma^2/\tau^2$ (ridge)
- $\beta_j \sim \text{Laplace}(0, b)$: gives $\lambda \|\beta\|_1$ with $\lambda = 2\sigma^2/b$ (lasso)
The regularization strength $\lambda = \sigma^2/\tau^2$ (for ridge) connects the noise variance $\sigma^2$ and the prior variance $\tau^2$.
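The ridge/MAP correspondence can be verified numerically: minimizing the negative log posterior under the Gaussian model gives exactly the ridge solution with $\lambda = \sigma^2/\tau^2$. A minimal sketch assuming NumPy; the function names are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def gaussian_map(X, y, sigma2, tau2):
    """MAP under y ~ N(X b, sigma2 I) with prior b ~ N(0, tau2 I).
    Setting the gradient of the negative log posterior to zero gives
    (X^T X / sigma2 + I / tau2) b = X^T y / sigma2."""
    p = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(p) / tau2
    return np.linalg.solve(A, X.T @ y / sigma2)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)

sigma2, tau2 = 0.5, 2.0
# MAP estimate equals ridge with lam = sigma2 / tau2
assert np.allclose(gaussian_map(X, y, sigma2, tau2), ridge(X, y, sigma2 / tau2))
```

Multiplying the MAP normal equations through by $\sigma^2$ recovers the ridge normal equations, which is why the two solvers agree to machine precision.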
Implicit Regularization
Not all regularization comes from explicit penalty terms.
Early stopping: training a model with gradient descent and stopping before convergence is a form of regularization. In linear regression, gradient descent with step size $\eta$ stopped after $t$ steps produces a solution approximately equivalent to ridge regression with $\lambda \approx 1/(\eta t)$. Fewer iterations means stronger regularization.
Dropout: randomly zeroing out neurons during training is equivalent (in expectation for linear models) to an adaptive L2 penalty on the weights. For nonlinear networks, dropout acts as an approximate Bayesian ensemble.
Data augmentation: training on augmented data is equivalent to adding a regularization term that encourages invariance to the augmentation transformations.
Batch normalization, weight decay, finite learning rate: all induce implicit regularization effects that can be analyzed formally in specific settings.
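The early-stopping claim for linear regression can be checked directly: gradient descent iterates stay inside a shrunken ball that grows toward the OLS solution, much like decreasing $\lambda$ in ridge. This sketch assumes NumPy and full-batch gradient descent; the correspondence with $\lambda \approx 1/(\eta t)$ is approximate, not exact.

```python
import numpy as np

def gd_least_squares(X, y, eta, t):
    """t steps of gradient descent on (1/2)||y - X b||^2 from b = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(t):
        beta -= eta * X.T @ (X @ beta - y)
    return beta

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = rng.normal(size=80)
eta = 0.5 / np.linalg.norm(X, 2) ** 2   # step size below 1/L for stability

t = 50
b_gd = gd_least_squares(X, y, eta, t)
b_ridge = ridge(X, y, 1.0 / (eta * t))  # the heuristic lambda ~ 1/(eta*t)

# The two solutions point in nearly the same direction.
cos = b_gd @ b_ridge / (np.linalg.norm(b_gd) * np.linalg.norm(b_ridge))
assert cos > 0.9
```

In the SVD basis, $t$ steps of gradient descent shrink direction $j$ by $1 - (1 - \eta\sigma_j^2)^t$, while ridge shrinks it by $\sigma_j^2/(\sigma_j^2 + \lambda)$; both profiles suppress small-$\sigma_j$ directions first, which is the sense in which the iteration count acts like $1/\lambda$.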
Bias-Variance Tradeoff with Lambda
The regularization strength $\lambda$ directly controls the bias-variance tradeoff:
For ridge regression:
- Bias: $\|\mathbb{E}[\hat{\beta}_\lambda] - \beta^*\| = \lambda \|(X^\top X + \lambda I)^{-1} \beta^*\|$ -- increases with $\lambda$
- Variance: $\sigma^2 \operatorname{tr}\!\left[ (X^\top X + \lambda I)^{-1} X^\top X \, (X^\top X + \lambda I)^{-1} \right]$ -- decreases with $\lambda$
The optimal $\lambda$ minimizes the total risk. In practice, $\lambda$ is chosen by cross-validation.
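The monotone behavior of bias and variance in $\lambda$ can be evaluated exactly for a fixed design. A sketch assuming NumPy; the true coefficients `beta_star` and noise variance `sigma2` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 60, 8, 1.0
X = rng.normal(size=(n, p))
beta_star = rng.normal(size=p)           # hypothetical ground truth

def ridge_bias_sq(lam):
    """Squared bias: lam^2 * ||(X^T X + lam I)^{-1} beta_star||^2."""
    M = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return lam**2 * np.sum((M @ beta_star) ** 2)

def ridge_variance(lam):
    """Total variance: sigma2 * tr[M X^T X M], M = (X^T X + lam I)^{-1}."""
    M = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * np.trace(M @ X.T @ X @ M)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
bias2 = [ridge_bias_sq(l) for l in lams]
var = [ridge_variance(l) for l in lams]
# bias^2 rises and variance falls as lambda grows
```

In the eigenbasis of $X^\top X$ each squared-bias term is $(\lambda/(d_i + \lambda))^2 c_i^2$ (increasing in $\lambda$) and each variance term is $\sigma^2 d_i/(d_i + \lambda)^2$ (decreasing), which is why the sweep is monotone in both directions.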
Cross-Validation for Choosing Lambda
K-Fold Cross-Validation for Lambda
To choose $\lambda$ by K-fold cross-validation:
- Split the training data into $K$ folds
- For each candidate $\lambda$ and each fold $k$:
  - Train on all folds except $k$ with regularization $\lambda$
  - Evaluate the loss on fold $k$
- Average the loss over all $K$ folds for each $\lambda$
- Choose $\lambda$ minimizing the average cross-validation loss
Standard choices: $K = 5$ or $K = 10$. Leave-one-out ($K = n$) has a closed-form solution for ridge regression.
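The procedure above can be sketched for ridge regression in a few lines. This assumes NumPy; `cv_ridge` is an illustrative name, and the candidate grid is arbitrary.

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_ridge(X, y, lams, K=5, seed=0):
    """Return (best lambda, per-lambda CV losses) via K-fold CV."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    scores = []
    for lam in lams:
        fold_mse = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            beta = ridge(X[train], y[train], lam)
            fold_mse.append(np.mean((y[test] - X[test] @ beta) ** 2))
        scores.append(np.mean(fold_mse))     # average loss over folds
    return lams[int(np.argmin(scores))], scores

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 30))
y = X @ rng.normal(size=30) + rng.normal(size=100)

best_lam, scores = cv_ridge(X, y, lams=[0.01, 0.1, 1.0, 10.0, 100.0])
```

The grid is usually taken logarithmically spaced; a common refinement is the "one-standard-error rule," choosing the largest $\lambda$ whose CV loss is within one standard error of the minimum.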
Connection to Algorithmic Stability
Regularization has a formal connection to stability:
A learning algorithm is $\beta$-uniformly stable if replacing one training example changes the loss of its output by at most $\beta$ (uniformly over inputs). Strong regularization makes algorithms more stable, because the penalty prevents the solution from depending too sensitively on any single data point.
For ridge regression with parameter $\lambda$, the uniform stability is $\beta = O(1/(\lambda n))$. Strong regularization (large $\lambda$) gives small $\beta$ (high stability), which in turn gives tight generalization bounds.
This creates a clean theoretical chain: regularization implies stability implies generalization.
Canonical Examples
Ridge vs. OLS when p is close to n
Consider $n = 100$ observations and $p = 90$ features. OLS overfits badly because $X^\top X$ is nearly singular (its condition number is huge). Ridge with $\lambda > 0$ adds $\lambda I$ to $X^\top X$, stabilizing the inversion. The SVD form shows that the direction of the 90th (smallest, nearly zero) singular value, which OLS amplifies enormously, gets shrunk heavily, while the first few large singular values are barely affected. Test MSE drops dramatically.
Lasso for feature selection
In a genomics dataset with $n$ patients and $p = 10{,}000$ genes, where $n \ll p$, most genes are irrelevant. Lasso with a cross-validated $\lambda$ selects 15 genes (nonzero coefficients) and zeros out the rest. The selected genes are interpretable biomarkers. Ridge regression would keep all 10,000 genes with small but nonzero coefficients, losing interpretability.
Common Confusions
Regularization does not always mean a penalty term
Early stopping, dropout, data augmentation, and even using a small network are all forms of regularization. The common thread is restricting the effective complexity of the learned function, not the specific mechanism. The explicit penalty is just the most mathematically tractable form.
L1 does not always outperform L2
L1 (lasso) is better when the true model is sparse. L2 (ridge) is better when all features contribute roughly equally with small coefficients. In practice, you do not know which setting you are in, which is why elastic net (L1 + L2) and cross-validation are the safe defaults.
Weight decay is not exactly L2 regularization with Adam
For SGD, weight decay (the update $w \leftarrow (1 - \eta\lambda) w - \eta \nabla L(w)$) and L2 regularization (adding $\frac{\lambda}{2}\|w\|_2^2$ to the loss) are equivalent. For adaptive optimizers like Adam, they differ because the adaptive learning rate rescales the gradient and the regularization term differently. "Decoupled weight decay" (AdamW) applies weight decay directly to the weights, which is the correct implementation for Adam.
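The SGD equivalence is a one-line algebraic identity and can be confirmed numerically. A sketch on a toy quadratic loss, assuming NumPy; the function names are illustrative. (Reproducing the Adam discrepancy would require implementing the adaptive moments, so only the plain-SGD case is shown.)

```python
import numpy as np

def sgd_weight_decay(grad, w, eta, lam, steps):
    """SGD with decoupled weight decay: w <- w - eta*grad(w) - eta*lam*w."""
    for _ in range(steps):
        w = w - eta * grad(w) - eta * lam * w
    return w

def sgd_l2(grad, w, eta, lam, steps):
    """SGD on the L2-regularized loss: the gradient gains a lam*w term."""
    for _ in range(steps):
        w = w - eta * (grad(w) + lam * w)
    return w

# Toy quadratic loss L(w) = 0.5 * ||w - 1||^2, so grad(w) = w - 1.
grad = lambda w: w - 1.0
w0 = np.array([5.0, -3.0])

a = sgd_weight_decay(grad, w0, eta=0.1, lam=0.5, steps=100)
b = sgd_l2(grad, w0, eta=0.1, lam=0.5, steps=100)
assert np.allclose(a, b)   # identical trajectories for plain SGD
```

With Adam, the `lam * w` term inside the gradient would be divided by the per-coordinate adaptive scale while the decoupled decay would not, so the two variants diverge; that is precisely the correction AdamW makes.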
Summary
- Regularization prevents overfitting by penalizing model complexity
- L2 (ridge): shrinks all coefficients, never zeros any out, Gaussian prior
- L1 (lasso): shrinks and zeros out coefficients, sparse solutions, Laplace prior
- Elastic net: combines L1 and L2, handles correlated features
- Regularization = Bayesian prior: L2 = Gaussian, L1 = Laplace
- Lambda controls bias-variance tradeoff: large lambda = more bias, less variance
- Cross-validation is the standard method for choosing lambda
- Early stopping, dropout, and data augmentation are implicit regularization
- Strong regularization implies algorithmic stability implies generalization
Exercises
Problem
For ridge regression, show that $\hat{\beta}^{\text{ridge}} \to \hat{\beta}^{\text{OLS}}$ as $\lambda \to 0$ and $\hat{\beta}^{\text{ridge}} \to 0$ as $\lambda \to \infty$ (assuming $X^\top X$ is invertible).
Problem
Show that L2 regularization corresponds to a Gaussian prior on the coefficients. Specifically, prove that the ridge solution is the MAP estimate when $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and $\beta \sim \mathcal{N}(0, \tau^2 I)$, and express $\lambda$ in terms of $\sigma^2$ and $\tau^2$.
References
Canonical:
- Hoerl & Kennard, "Ridge Regression" (1970) -- the original ridge paper
- Tibshirani, "Regression Shrinkage and Selection via the Lasso" (1996)
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapters 3-4
Current:
- Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015)
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019) -- AdamW
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5
Next Topics
Building on regularization:
- Algorithmic stability: formalizing the regularization-generalization connection
- Kernels and RKHS: regularization in infinite-dimensional function spaces
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Common Probability Distributions (Layer 0A)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
Builds on This
- Grokking (Layer 4)
- Regularization in Practice (Layer 1)