
Optimization Function Classes

Regularization Theory

Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.


Why This Matters

Empirical risk minimization on a rich hypothesis class overfits: it finds a function that perfectly fits the training data, including its noise. Regularization is the primary tool for preventing this. Every practical ML algorithm uses regularization in some form -- explicit penalty terms, early stopping, dropout, data augmentation, or architectural constraints.

Understanding regularization theory tells you why these techniques work: they all control the complexity of the learned hypothesis, trading a small increase in bias for a large decrease in variance.

Mental Model

ERM minimizes \hat{R}_n(h) -- the average training loss. If \mathcal{H} is too rich, the minimizer overfits. Regularization adds a penalty:

\hat{h}_\lambda = \arg\min_{h \in \mathcal{H}} \left[\hat{R}_n(h) + \lambda \Omega(h)\right]

The penalty \Omega(h) measures the complexity of h. The regularization parameter \lambda > 0 controls the tradeoff:

  • Large \lambda: strong penalty, simple models, high bias, low variance
  • Small \lambda: weak penalty, complex models, low bias, high variance

The optimal \lambda balances the bias-variance tradeoff to minimize total risk.
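A tiny numpy sketch (illustrative, with made-up data) shows the knob in action for ridge regression: as \lambda grows, the fitted coefficients shrink and the training fit degrades -- the model is forced to be simpler.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

def ridge(X, y, lam):
    """Minimizer of (1/n)||y - X b||^2 + lam * ||b||_2^2 (closed form)."""
    return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 0.1, 10.0):
    b = ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  ||b||={np.linalg.norm(b):6.3f}  "
          f"train MSE={np.mean((y - X @ b) ** 2):6.3f}")
```

Training error is monotonically nondecreasing in \lambda and the coefficient norm nonincreasing; the interesting quantity, test error, is minimized at some intermediate \lambda.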

Tikhonov Regularization (L2 / Ridge)

Definition

Ridge Regression

For linear regression with squared loss, ridge regression solves:

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left[\frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_2^2\right]

The penalty \|\beta\|_2^2 = \sum_j \beta_j^2 shrinks all coefficients toward zero without setting any exactly to zero.

Proposition

Ridge Regression Closed Form

Statement

The ridge regression solution is:

\hat{\beta}_{\text{ridge}} = (X^\top X + n\lambda I)^{-1} X^\top y

In terms of the SVD X = U \Sigma V^\top:

\hat{\beta}_{\text{ridge}} = \sum_{j=1}^p \frac{\sigma_j^2}{\sigma_j^2 + n\lambda} \cdot \frac{u_j^\top y}{\sigma_j} v_j

where \sigma_j are the singular values.

Intuition

Ridge regression shrinks each coefficient by a factor \sigma_j^2/(\sigma_j^2 + n\lambda). Directions with large singular values (well determined by the data) are barely shrunk. Directions with small singular values (poorly determined) are shrunk heavily. This is exactly what you want: trust the data where it is informative, shrink toward zero where it is not.

Proof Sketch

Take the gradient of the ridge objective and set it to zero:

-\frac{2}{n}X^\top(y - X\beta) + 2\lambda\beta = 0

Solving gives \beta = (X^\top X + n\lambda I)^{-1} X^\top y. The matrix X^\top X + n\lambda I is always invertible for \lambda > 0, even when X^\top X is singular (more features than samples).

Why It Matters

Ridge regression resolves the ill-conditioning of OLS when p \approx n or p > n. The addition of n\lambda I ensures the matrix is invertible. The SVD form shows exactly how regularization acts: it is a continuous shrinkage of the OLS coefficients, with more shrinkage in poorly determined directions.
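Both forms of the proposition are easy to check numerically. The numpy sketch below (synthetic data, illustrative only) computes the closed-form solution and the SVD form, confirms they agree, and exposes the per-direction shrinkage factors \sigma_j^2/(\sigma_j^2 + n\lambda).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 8
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.3

# Closed form: (X^T X + n*lam*I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# SVD form: sum_j [sigma_j^2 / (sigma_j^2 + n*lam)] * (u_j^T y / sigma_j) * v_j
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + n * lam)            # shrinkage factor per direction
beta_svd = Vt.T @ (shrink * (U.T @ y) / s)

assert np.allclose(beta_closed, beta_svd)
print(shrink)   # near 1 for large sigma_j, smaller for small sigma_j
```

Since numpy returns singular values in descending order, the printed factors decrease: well-determined directions are kept, poorly determined ones are damped.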

Failure Mode

Ridge never produces exactly sparse solutions. If the true model is sparse (few nonzero coefficients), ridge will keep all coefficients nonzero, which hurts interpretability and, in high-dimensional settings, can lead to worse prediction.

Sparsity and L1 Regularization (Lasso)

Definition

Lasso

The Lasso (Least Absolute Shrinkage and Selection Operator) solves:

\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \left[\frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_1\right]

where \|\beta\|_1 = \sum_j |\beta_j|. The L1 penalty produces sparse solutions: many coefficients are set exactly to zero.

Proposition

Lasso Produces Sparse Solutions

Statement

The Lasso solution \hat{\beta}_{\text{lasso}} is sparse: for sufficiently large \lambda, a subset of the coefficients are exactly zero. The set of nonzero coefficients \{j : \hat\beta_j \neq 0\} is called the active set or support.

Intuition

The L1 penalty creates a diamond-shaped constraint region in parameter space. The corners of the diamond lie on the coordinate axes. The loss function (an ellipsoid for squared loss) typically first touches the diamond at a corner, where some coordinates are exactly zero. The L2 penalty creates a spherical constraint region with no corners, so the tangent point has all coordinates nonzero.

Proof Sketch

The subdifferential optimality condition for the Lasso is:

-\frac{2}{n}X_j^\top(y - X\hat{\beta}) + \lambda s_j = 0

where s_j \in \partial|\hat\beta_j|: s_j = \text{sign}(\hat\beta_j) if \hat\beta_j \neq 0, and s_j \in [-1, 1] if \hat\beta_j = 0.

If |\frac{2}{n}X_j^\top(y - X\hat\beta)| < \lambda, the only solution is \hat\beta_j = 0 (the gradient is not large enough to "escape" zero). This is the soft-thresholding mechanism that produces sparsity.
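The soft-thresholding mechanism can be made concrete with a minimal coordinate-descent solver (an illustrative numpy sketch on synthetic data, not a production implementation): each coordinate update soft-thresholds the correlation of that feature with the partial residual, so coordinates whose correlation stays below the threshold land at exactly zero.

```python
import numpy as np

def soft_threshold(z, t):
    """argmin_b (b - z)^2 + 2*t*|b|: shrink z toward zero, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    a = (X**2).sum(axis=0) / n               # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # residual excluding feature j
            z = X[:, j] @ r / n              # correlation with that residual
            b[j] = soft_threshold(z, lam / 2) / a[j]
    return b

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)

b = lasso_cd(X, y, lam=0.5)
print("support:", np.flatnonzero(b))
```

On this synthetic sparse problem the solver zeros out the irrelevant features exactly -- not approximately -- which is the defining difference from ridge.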

Why It Matters

Sparsity is a powerful inductive bias for high-dimensional problems where p \gg n. If only s \ll p features are relevant, the Lasso can identify them and achieve the rate O(s \log p / n), which depends on the sparsity s and only logarithmically on the total dimension p. Without a structural assumption such as sparsity, consistent estimation in the p > n regime is impossible.

Failure Mode

The Lasso performs poorly when features are highly correlated (the "collinearity" problem). With correlated features, the Lasso tends to select one and zero out the others, even if all are relevant. Elastic net addresses this.

Elastic Net

Definition

Elastic Net

The elastic net combines L1 and L2 penalties:

\hat{\beta}_{\text{EN}} = \arg\min_{\beta} \left[\frac{1}{n}\|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2\right]

This produces sparse solutions (from L1) while handling correlated features gracefully (from L2). When correlated features are present, elastic net tends to select them as a group rather than picking one arbitrarily.
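The grouping effect can be sketched directly (illustrative numpy coordinate descent on synthetic data with a duplicated feature; the penalty values are arbitrary choices): with \lambda_2 > 0 the two identical columns receive equal coefficients, whereas the pure Lasso puts all the weight on one of them.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_cd(X, y, lam1, lam2, n_iter=300):
    """Coordinate descent for (1/n)||y - X b||^2 + lam1*||b||_1 + lam2*||b||_2^2."""
    n, p = X.shape
    b = np.zeros(p)
    a = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam1 / 2) / (a[j] + lam2)
    return b

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=(n, 3))])   # columns 0 and 1 identical
y = 2.0 * x + 0.1 * rng.normal(size=n)

b_lasso = enet_cd(X, y, lam1=0.2, lam2=0.0)   # lam2 = 0 recovers the Lasso
b_enet  = enet_cd(X, y, lam1=0.2, lam2=1.0)
print("lasso:", b_lasso[:2])
print("enet :", b_enet[:2])
```

The strictly convex L2 term makes the minimizer unique, forcing identical features to share weight equally; the Lasso objective is indifferent between the twins, and coordinate descent simply keeps the first one it visits.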

The Bayesian Interpretation

There is a deep and exact correspondence between regularization penalties and Bayesian priors:

L2 regularization = Gaussian prior: Minimizing \frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_2^2 is equivalent to finding the MAP (maximum a posteriori) estimate under the prior \beta_j \sim \mathcal{N}(0, 1/(n\lambda)) with Gaussian likelihood (taking noise variance \sigma^2 = 1).

L1 regularization = Laplace prior: Minimizing \frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_1 is equivalent to MAP estimation with a Laplace prior p(\beta_j) \propto e^{-n\lambda|\beta_j|/2}. The sharp peak of the Laplace distribution at zero is what produces sparsity.

Definition

MAP Estimation and Regularization

The MAP estimate is:

\hat\beta_{\text{MAP}} = \arg\max_\beta \left[\log p(y \mid \beta) + \log p(\beta)\right]

With Gaussian likelihood \log p(y \mid \beta) \propto -\frac{1}{2\sigma^2}\|y - X\beta\|^2, adding a prior \log p(\beta) is exactly adding a regularization penalty:

  • p(\beta) = \mathcal{N}(0, \tau^2 I): gives \log p(\beta) \propto -\|\beta\|_2^2/(2\tau^2) (ridge)
  • p(\beta) = \text{Laplace}(0, b): gives \log p(\beta) \propto -\|\beta\|_1/b (lasso)

With the \frac{1}{n}-normalized loss used throughout, the regularization strength satisfies n\lambda = \sigma^2/\tau^2 (for ridge), connecting the noise variance and the prior variance.
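The correspondence is easy to verify numerically. A hedged sketch (synthetic data; \sigma^2 and \tau^2 are assumed known): the posterior mode under the Gaussian prior coincides with the ridge solution when n\lambda = \sigma^2/\tau^2.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

sigma2, tau2 = 0.5, 2.0   # noise variance and prior variance (assumed known)

# MAP under beta ~ N(0, tau2*I), y|beta ~ N(X beta, sigma2*I):
# maximize -||y - X b||^2 / (2*sigma2) - ||b||^2 / (2*tau2)
beta_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ y)

# Ridge with the matching strength n*lam = sigma2/tau2
lam = sigma2 / (tau2 * n)
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

assert np.allclose(beta_map, beta_ridge)
```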

Implicit Regularization

Not all regularization comes from explicit penalty terms.

Early stopping: training a model with gradient descent and stopping before convergence is a form of regularization. In linear regression, gradient descent with step size \eta after T steps produces a solution approximately equivalent to ridge regression with \lambda \approx 1/(\eta T). Fewer iterations means stronger regularization.
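A quick numpy check of the direction of the effect (illustrative synthetic data; the ridge correspondence is only approximate, so the sketch just verifies that fewer gradient steps yield a smaller-norm, more heavily regularized solution):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def gd(X, y, eta, T):
    """T gradient-descent steps on (1/n)||y - X b||^2, starting from b = 0."""
    b = np.zeros(X.shape[1])
    for _ in range(T):
        b += (2 * eta / len(y)) * (X.T @ (y - X @ b))
    return b

for T in (10, 100, 1000):
    b = gd(X, y, eta=0.01, T=T)
    print(f"T={T:5d}  ||b||={np.linalg.norm(b):.3f}")
```

Starting from zero, each iterate moves monotonically toward the OLS solution, so the coefficient norm grows with T: stopping early is shrinkage.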

Dropout: randomly zeroing out neurons during training is equivalent (in expectation for linear models) to an adaptive L2 penalty on the weights. For nonlinear networks, dropout acts as an approximate Bayesian ensemble.

Data augmentation: training on augmented data is equivalent to adding a regularization term that encourages invariance to the augmentation transformations.

Batch normalization, weight decay, finite learning rate: all induce implicit regularization effects that can be analyzed formally in specific settings.

Bias-Variance Tradeoff with Lambda

The regularization strength λ\lambda directly controls the bias-variance tradeoff:

\text{Risk}(\lambda) = \text{Bias}^2(\lambda) + \text{Variance}(\lambda)

For ridge regression, measuring risk by the coefficient error \mathbb{E}\|\hat\beta - \beta\|_2^2:

  • Bias: \text{Bias}^2 = (n\lambda)^2 \beta^\top (X^\top X + n\lambda I)^{-2} \beta -- increases with \lambda
  • Variance: \text{Var} = \sigma^2 \, \text{tr}\left(X^\top X (X^\top X + n\lambda I)^{-2}\right) -- decreases with \lambda

The optimal \lambda^* minimizes the total risk. In practice, \lambda^* is chosen by cross-validation.
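The closed-form bias and variance of the coefficient error can be checked against simulation. An illustrative numpy sketch (fixed design, known \beta and \sigma^2; the grid values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
sigma2 = 1.0

def bias2_var(lam):
    """Closed-form bias^2 and variance of E||b_hat - beta||^2 for ridge."""
    A = np.linalg.inv(X.T @ X + n * lam * np.eye(p))
    bias2 = (n * lam) ** 2 * beta @ A @ A @ beta
    var = sigma2 * np.trace(X.T @ X @ A @ A)
    return bias2, var

# Monte Carlo check at one lambda: average ||b_hat - beta||^2 over noise draws
lam = 0.1
M = np.linalg.inv(X.T @ X + n * lam * np.eye(p)) @ X.T
Y = X @ beta + np.sqrt(sigma2) * rng.normal(size=(2000, n))   # 2000 noise draws
mc = np.mean(np.sum((Y @ M.T - beta) ** 2, axis=1))
b2, v = bias2_var(lam)
print(f"formula: {b2 + v:.3f}   monte carlo: {mc:.3f}")
```

Sweeping \lambda with `bias2_var` shows the two terms moving in opposite directions, which is exactly the tradeoff the decomposition describes.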

Cross-Validation for Choosing Lambda

Definition

K-Fold Cross-Validation for Lambda

To choose \lambda by K-fold cross-validation:

  1. Split the training data into K folds
  2. For each candidate \lambda and each fold k:
    • Train on all folds except k with regularization \lambda
    • Evaluate the loss on fold k
  3. Average the loss over all K folds for each \lambda
  4. Choose \lambda^* minimizing the average cross-validation loss

Standard choices: K = 5 or K = 10. Leave-one-out (K = n) has a closed-form solution for ridge regression.
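The recipe above, as a short numpy sketch for ridge regression (synthetic data; the candidate grid and K = 5 are arbitrary choices). Note the folds are fixed once, so every \lambda is scored on the same splits:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 15
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]), X.T @ y)

K = 5
folds = np.array_split(rng.permutation(n), K)   # one fixed split for all lambdas

def cv_loss(lam):
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        b = ridge(X[train], y[train], lam)
        losses.append(np.mean((y[test] - X[test] @ b) ** 2))
    return np.mean(losses)

grid = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best = min(grid, key=cv_loss)
print("chosen lambda:", best)
```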

Connection to Algorithmic Stability

Regularization has a formal connection to stability:

A learning algorithm is \beta-uniformly stable if replacing one training example changes the predictions by at most \beta (in a suitable norm). Strong regularization makes algorithms more stable, because the penalty prevents the solution from depending too sensitively on any single data point.

For ridge regression with parameter \lambda, the uniform stability is \beta = O(1/(n\lambda)). Strong regularization (large \lambda) gives small \beta (high stability), which in turn gives tight generalization bounds.

This creates a clean theoretical chain: regularization implies stability implies generalization.
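The 1/(n\lambda) scaling can be seen directly in a small experiment (illustrative numpy sketch on synthetic data): replace one training label and measure how much the ridge solution moves at different regularization strengths.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 60, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]), X.T @ y)

y2 = y.copy()
y2[0] += 5.0                              # perturb a single training label

for lam in (0.01, 0.1, 1.0, 10.0):
    shift = np.linalg.norm(ridge(X, y, lam) - ridge(X, y2, lam))
    print(f"lambda={lam:5.2f}  coefficient change = {shift:.4f}")
```

The coefficient change shrinks monotonically as \lambda grows: the penalty term dampens the influence of any single example, which is uniform stability in action.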

Canonical Examples

Example

Ridge vs. OLS when p is close to n

Consider n = 100 observations and p = 90 features. OLS overfits badly because X^\top X is nearly singular (its condition number is huge). Ridge with \lambda = 0.1 adds n\lambda I = 10 I to X^\top X, stabilizing the inversion. The SVD form shows that the 90th (smallest) singular value direction, nearly unusable for OLS, gets shrunk heavily, while the first few large singular values are barely affected. Test MSE drops dramatically.

Example

Lasso for feature selection

In a genomics dataset with n = 200 patients and p = 10{,}000 genes, most genes are irrelevant. Lasso with cross-validated \lambda selects 15 genes (nonzero coefficients) and zeros out the rest. The selected genes are interpretable biomarkers. Ridge regression would keep all 10,000 genes with small but nonzero coefficients, losing interpretability.

Common Confusions

Watch Out

Regularization does not always mean a penalty term

Early stopping, dropout, data augmentation, and even using a small network are all forms of regularization. The common thread is restricting the effective complexity of the learned function, not the specific mechanism. The explicit penalty \lambda\Omega(h) is just the most mathematically tractable form.

Watch Out

L1 does not always outperform L2

L1 (lasso) is better when the true model is sparse. L2 (ridge) is better when all features contribute roughly equally with small coefficients. In practice, you do not know which setting you are in, which is why elastic net (L1 + L2) and cross-validation are the safe defaults.

Watch Out

Weight decay is not exactly L2 regularization with Adam

For SGD, weight decay (\theta \leftarrow (1 - \eta\lambda)\theta - \eta\nabla L) and L2 regularization (a gradient step on L + \lambda\|\theta\|^2/2) are equivalent. For adaptive optimizers like Adam, they differ because the adaptive learning rate rescales the gradient and the penalty term differently. "Decoupled weight decay" (AdamW) applies weight decay directly, which is the correct implementation for Adam.
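The plain-SGD equivalence is a one-line identity, checked numerically below (illustrative numpy sketch; \eta, \lambda, and the stand-in gradient are arbitrary): one weight-decay step equals one gradient step on the L2-penalized loss.

```python
import numpy as np

rng = np.random.default_rng(9)
theta = rng.normal(size=5)
grad_L = rng.normal(size=5)     # stand-in for the loss gradient at theta
eta, lam = 0.1, 0.01

# Weight decay: theta <- (1 - eta*lam) * theta - eta * grad_L
step_decay = (1 - eta * lam) * theta - eta * grad_L

# L2 penalty: gradient of L + (lam/2) * ||theta||^2 is grad_L + lam * theta
step_l2 = theta - eta * (grad_L + lam * theta)

assert np.allclose(step_decay, step_l2)
```

With Adam the two differ, because grad_L would be rescaled by the running second-moment estimate while the decay term is not; that is exactly the gap AdamW closes.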

Summary

  • Regularization prevents overfitting by penalizing model complexity
  • L2 (ridge): shrinks all coefficients, never zeros any out, Gaussian prior
  • L1 (lasso): shrinks and zeros out coefficients, sparse solutions, Laplace prior
  • Elastic net: combines L1 and L2, handles correlated features
  • Regularization = Bayesian prior: L2 = Gaussian, L1 = Laplace
  • Lambda controls bias-variance tradeoff: large lambda = more bias, less variance
  • Cross-validation is the standard method for choosing lambda
  • Early stopping, dropout, and data augmentation are implicit regularization
  • Strong regularization implies algorithmic stability implies generalization

Exercises

ExerciseCore

Problem

For ridge regression, show that \hat\beta_{\text{ridge}}(\lambda) \to 0 as \lambda \to \infty and \hat\beta_{\text{ridge}}(\lambda) \to \hat\beta_{\text{OLS}} as \lambda \to 0^+ (assuming X^\top X is invertible).

ExerciseAdvanced

Problem

Show that L2 regularization corresponds to a Gaussian prior on the coefficients. Specifically, prove that the ridge solution is the MAP estimate when \beta \sim \mathcal{N}(0, \tau^2 I) and y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I), and express \lambda in terms of \sigma^2 and \tau^2.


References

Canonical:

  • Hoerl & Kennard, "Ridge Regression" (1970) -- the original ridge paper
  • Tibshirani, "Regression Shrinkage and Selection via the Lasso" (1996)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapters 3-4

Current:

  • Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015)
  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019) -- AdamW
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

Next Topics

Building on regularization:

  • Algorithmic stability: formalizing the regularization-generalization connection
  • Kernels and RKHS: regularization in infinite-dimensional function spaces

Last reviewed: April 2026
