
Optimization Function Classes

Regularization Theory

Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.


Why This Matters

Empirical risk minimization on a rich hypothesis class overfits: it finds a function that perfectly fits the training data, including its noise. Regularization is the primary tool for preventing this. Every practical ML algorithm uses regularization in some form -- explicit penalty terms, early stopping, dropout, data augmentation, or architectural constraints.

Understanding regularization theory tells you why these techniques work: they all control the complexity of the learned hypothesis, trading a small increase in bias for a large decrease in variance.

Mental Model

ERM minimizes \hat{R}_n(h) -- the average training loss. If \mathcal{H} is too rich, the minimizer overfits. Regularization adds a penalty:

\hat{h}_\lambda = \arg\min_{h \in \mathcal{H}} \left[\hat{R}_n(h) + \lambda \Omega(h)\right]

The penalty \Omega(h) measures the complexity of h. The regularization parameter \lambda > 0 controls the tradeoff:

  • Large \lambda: strong penalty, simple models, high bias, low variance
  • Small \lambda: weak penalty, complex models, low bias, high variance

The optimal \lambda balances the bias-variance tradeoff to minimize total risk.
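A tiny numpy sketch (illustrative, with made-up data) shows the knob in action for ridge regression: as \lambda grows, the fitted coefficients shrink and the training fit degrades -- the model is forced to be simpler.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

def ridge(X, y, lam):
    """Minimizer of (1/n)||y - X b||^2 + lam * ||b||_2^2 (closed form)."""
    return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 0.1, 10.0):
    b = ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  ||b||={np.linalg.norm(b):6.3f}  "
          f"train MSE={np.mean((y - X @ b) ** 2):6.3f}")
```

Training error is monotonically nondecreasing in \lambda and the coefficient norm nonincreasing; the interesting quantity, test error, is minimized at some intermediate \lambda.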

Tikhonov Regularization (L2 / Ridge)

Definition

Ridge Regression

For linear regression with squared loss, ridge regression solves:

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left[\frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_2^2\right]

The penalty \|\beta\|_2^2 = \sum_j \beta_j^2 shrinks all coefficients toward zero without setting any exactly to zero.

Proposition

Ridge Regression Closed Form

Statement

The ridge regression solution is:

\hat{\beta}_{\text{ridge}} = (X^\top X + n\lambda I)^{-1} X^\top y

In terms of the SVD X = U \Sigma V^\top:

\hat{\beta}_{\text{ridge}} = \sum_{j=1}^p \frac{\sigma_j^2}{\sigma_j^2 + n\lambda} \cdot \frac{u_j^\top y}{\sigma_j} v_j

where \sigma_j are the singular values.

Intuition

Ridge regression shrinks each coefficient by a factor \sigma_j^2/(\sigma_j^2 + n\lambda). Directions with large singular values (well determined by the data) are barely shrunk. Directions with small singular values (poorly determined) are shrunk heavily. This is exactly what you want: trust the data where it is informative, shrink toward zero where it is not.

Proof Sketch

Take the gradient of the ridge objective and set it to zero:

-\frac{2}{n}X^\top(y - X\beta) + 2\lambda\beta = 0

Solving gives \beta = (X^\top X + n\lambda I)^{-1} X^\top y. The matrix X^\top X + n\lambda I is always invertible for \lambda > 0, even when X^\top X is singular (more features than samples).

Why It Matters

Ridge regression resolves the ill-conditioning of OLS when p \approx n or p > n. The addition of n\lambda I ensures the matrix is invertible. The SVD form shows exactly how regularization acts: it is a continuous shrinkage of the OLS coefficients, with more shrinkage in poorly determined directions.
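Both forms of the proposition are easy to check numerically. The numpy sketch below (synthetic data, illustrative only) computes the closed-form solution and the SVD form, confirms they agree, and exposes the per-direction shrinkage factors \sigma_j^2/(\sigma_j^2 + n\lambda).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 8
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.3

# Closed form: (X^T X + n*lam*I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# SVD form: sum_j [sigma_j^2 / (sigma_j^2 + n*lam)] * (u_j^T y / sigma_j) * v_j
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + n * lam)            # shrinkage factor per direction
beta_svd = Vt.T @ (shrink * (U.T @ y) / s)

assert np.allclose(beta_closed, beta_svd)
print(shrink)   # near 1 for large sigma_j, smaller for small sigma_j
```

Since numpy returns singular values in descending order, the printed factors decrease: well-determined directions are kept, poorly determined ones are damped.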

Failure Mode

Ridge never produces exactly sparse solutions. If the true model is sparse (few nonzero coefficients), ridge will keep all coefficients nonzero, which hurts interpretability and, in high-dimensional settings, can lead to worse prediction.

Sparsity and L1 Regularization (Lasso)

Definition

Lasso

The Lasso (Least Absolute Shrinkage and Selection Operator) solves:

\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \left[\frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_1\right]

where \|\beta\|_1 = \sum_j |\beta_j|. The L1 penalty produces sparse solutions: many coefficients are set exactly to zero.

Proposition

Lasso Produces Sparse Solutions

Statement

The Lasso solution \hat{\beta}_{\text{lasso}} is sparse: for sufficiently large \lambda, a subset of the coefficients are exactly zero. The set of nonzero coefficients \{j : \hat\beta_j \neq 0\} is called the active set or support.

Intuition

The L1 penalty creates a diamond-shaped constraint region in parameter space. The corners of the diamond lie on the coordinate axes. The loss function (an ellipsoid for squared loss) typically first touches the diamond at a corner, where some coordinates are exactly zero. The L2 penalty creates a spherical constraint region with no corners, so the tangent point has all coordinates nonzero.

Proof Sketch

The subdifferential optimality condition for the Lasso is:

-\frac{2}{n}X_j^\top(y - X\hat{\beta}) + \lambda s_j = 0

where s_j \in \partial|\hat\beta_j|: s_j = \text{sign}(\hat\beta_j) if \hat\beta_j \neq 0, and s_j \in [-1, 1] if \hat\beta_j = 0.

If |\frac{2}{n}X_j^\top(y - X\hat\beta)| < \lambda, the only solution is \hat\beta_j = 0 (the gradient is not large enough to "escape" zero). This is the soft-thresholding mechanism that produces sparsity.
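The soft-thresholding mechanism can be made concrete with a minimal coordinate-descent solver (an illustrative numpy sketch on synthetic data, not a production implementation): each coordinate update soft-thresholds the correlation of that feature with the partial residual, so coordinates whose correlation stays below the threshold land at exactly zero.

```python
import numpy as np

def soft_threshold(z, t):
    """argmin_b (b - z)^2 + 2*t*|b|: shrink z toward zero, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    a = (X**2).sum(axis=0) / n               # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # residual excluding feature j
            z = X[:, j] @ r / n              # correlation with that residual
            b[j] = soft_threshold(z, lam / 2) / a[j]
    return b

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)

b = lasso_cd(X, y, lam=0.5)
print("support:", np.flatnonzero(b))
```

On this synthetic sparse problem the solver zeros out the irrelevant features exactly -- not approximately -- which is the defining difference from ridge.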

Why It Matters

Sparsity is a powerful inductive bias for high-dimensional problems where p \gg n. If only s \ll p features are relevant, the Lasso can identify them and achieve the rate O(s \log p / n), which depends on the sparsity s and only logarithmically on the total dimension p. Without a structural assumption such as sparsity, consistent estimation in the p > n regime is impossible.

Failure Mode

The Lasso performs poorly when features are highly correlated (the "collinearity" problem). With correlated features, the Lasso tends to select one and zero out the others, even if all are relevant. Elastic net addresses this.

Elastic Net

Definition

Elastic Net

The elastic net combines L1 and L2 penalties:

\hat{\beta}_{\text{EN}} = \arg\min_{\beta} \left[\frac{1}{n}\|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2\right]

This produces sparse solutions (from L1) while handling correlated features gracefully (from L2). When correlated features are present, elastic net tends to select them as a group rather than picking one arbitrarily.
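The grouping effect can be sketched directly (illustrative numpy coordinate descent on synthetic data with a duplicated feature; the penalty values are arbitrary choices): with \lambda_2 > 0 the two identical columns receive equal coefficients, whereas the pure Lasso puts all the weight on one of them.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_cd(X, y, lam1, lam2, n_iter=300):
    """Coordinate descent for (1/n)||y - X b||^2 + lam1*||b||_1 + lam2*||b||_2^2."""
    n, p = X.shape
    b = np.zeros(p)
    a = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam1 / 2) / (a[j] + lam2)
    return b

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=(n, 3))])   # columns 0 and 1 identical
y = 2.0 * x + 0.1 * rng.normal(size=n)

b_lasso = enet_cd(X, y, lam1=0.2, lam2=0.0)   # lam2 = 0 recovers the Lasso
b_enet  = enet_cd(X, y, lam1=0.2, lam2=1.0)
print("lasso:", b_lasso[:2])
print("enet :", b_enet[:2])
```

The strictly convex L2 term makes the minimizer unique, forcing identical features to share weight equally; the Lasso objective is indifferent between the twins, and coordinate descent simply keeps the first one it visits.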

The Bayesian Interpretation

There is a deep and exact correspondence between regularization penalties and Bayesian priors:

L2 regularization = Gaussian prior: Minimizing \frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_2^2 is equivalent to finding the MAP (maximum a posteriori) estimate under the prior \beta_j \sim \mathcal{N}(0, 1/(n\lambda)) with Gaussian likelihood (taking noise variance \sigma^2 = 1).

L1 regularization = Laplace prior: Minimizing \frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|_1 is equivalent to MAP estimation with a Laplace prior p(\beta_j) \propto e^{-n\lambda|\beta_j|/2}. The sharp peak of the Laplace distribution at zero is what produces sparsity.

Definition

MAP Estimation and Regularization

The MAP estimate is:

\hat\beta_{\text{MAP}} = \arg\max_\beta \left[\log p(y \mid \beta) + \log p(\beta)\right]

With Gaussian likelihood \log p(y \mid \beta) \propto -\frac{1}{2\sigma^2}\|y - X\beta\|^2, adding a prior \log p(\beta) is exactly adding a regularization penalty:

  • p(\beta) = \mathcal{N}(0, \tau^2 I): gives \log p(\beta) \propto -\|\beta\|_2^2/(2\tau^2) (ridge)
  • p(\beta) = \text{Laplace}(0, b): gives \log p(\beta) \propto -\|\beta\|_1/b (lasso)

With the \frac{1}{n}-normalized loss used throughout, the regularization strength satisfies n\lambda = \sigma^2/\tau^2 (for ridge), connecting the noise variance and the prior variance.
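The correspondence is easy to verify numerically. A hedged sketch (synthetic data; \sigma^2 and \tau^2 are assumed known): the posterior mode under the Gaussian prior coincides with the ridge solution when n\lambda = \sigma^2/\tau^2.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

sigma2, tau2 = 0.5, 2.0   # noise variance and prior variance (assumed known)

# MAP under beta ~ N(0, tau2*I), y|beta ~ N(X beta, sigma2*I):
# maximize -||y - X b||^2 / (2*sigma2) - ||b||^2 / (2*tau2)
beta_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ y)

# Ridge with the matching strength n*lam = sigma2/tau2
lam = sigma2 / (tau2 * n)
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

assert np.allclose(beta_map, beta_ridge)
```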

Implicit Regularization

Not all regularization comes from explicit penalty terms.

Early stopping: training a model with gradient descent and stopping before convergence is a form of regularization. In linear regression, gradient descent with step size \eta after T steps produces a solution approximately equivalent to ridge regression with \lambda \approx 1/(\eta T). Fewer iterations means stronger regularization.
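A quick numpy check of the direction of the effect (illustrative synthetic data; the ridge correspondence is only approximate, so the sketch just verifies that fewer gradient steps yield a smaller-norm, more heavily regularized solution):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def gd(X, y, eta, T):
    """T gradient-descent steps on (1/n)||y - X b||^2, starting from b = 0."""
    b = np.zeros(X.shape[1])
    for _ in range(T):
        b += (2 * eta / len(y)) * (X.T @ (y - X @ b))
    return b

for T in (10, 100, 1000):
    b = gd(X, y, eta=0.01, T=T)
    print(f"T={T:5d}  ||b||={np.linalg.norm(b):.3f}")
```

Starting from zero, each iterate moves monotonically toward the OLS solution, so the coefficient norm grows with T: stopping early is shrinkage.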

Dropout: randomly zeroing out neurons during training is equivalent (in expectation for linear models) to an adaptive L2 penalty on the weights. For nonlinear networks, dropout acts as an approximate Bayesian ensemble.

Data augmentation: training on augmented data is equivalent to adding a regularization term that encourages invariance to the augmentation transformations.

Batch normalization, weight decay, finite learning rate: all induce implicit regularization effects that can be analyzed formally in specific settings.

Bias-Variance Tradeoff with Lambda

The regularization strength λ\lambda directly controls the bias-variance tradeoff:

\text{Risk}(\lambda) = \text{Bias}^2(\lambda) + \text{Variance}(\lambda)

For ridge regression, measuring risk by the coefficient error \mathbb{E}\|\hat\beta - \beta\|_2^2:

  • Bias: \text{Bias}^2 = (n\lambda)^2 \beta^\top (X^\top X + n\lambda I)^{-2} \beta -- increases with \lambda
  • Variance: \text{Var} = \sigma^2 \, \text{tr}\left(X^\top X (X^\top X + n\lambda I)^{-2}\right) -- decreases with \lambda

The optimal \lambda^* minimizes the total risk. In practice, \lambda^* is chosen by cross-validation.
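The closed-form bias and variance of the coefficient error can be checked against simulation. An illustrative numpy sketch (fixed design, known \beta and \sigma^2; the grid values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
sigma2 = 1.0

def bias2_var(lam):
    """Closed-form bias^2 and variance of E||b_hat - beta||^2 for ridge."""
    A = np.linalg.inv(X.T @ X + n * lam * np.eye(p))
    bias2 = (n * lam) ** 2 * beta @ A @ A @ beta
    var = sigma2 * np.trace(X.T @ X @ A @ A)
    return bias2, var

# Monte Carlo check at one lambda: average ||b_hat - beta||^2 over noise draws
lam = 0.1
M = np.linalg.inv(X.T @ X + n * lam * np.eye(p)) @ X.T
Y = X @ beta + np.sqrt(sigma2) * rng.normal(size=(2000, n))   # 2000 noise draws
mc = np.mean(np.sum((Y @ M.T - beta) ** 2, axis=1))
b2, v = bias2_var(lam)
print(f"formula: {b2 + v:.3f}   monte carlo: {mc:.3f}")
```

Sweeping \lambda with `bias2_var` shows the two terms moving in opposite directions, which is exactly the tradeoff the decomposition describes.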

Cross-Validation for Choosing Lambda

Definition

K-Fold Cross-Validation for Lambda

To choose \lambda by K-fold cross-validation:

  1. Split the training data into K folds
  2. For each candidate \lambda and each fold k:
    • Train on all folds except k with regularization \lambda
    • Evaluate the loss on fold k
  3. Average the loss over all K folds for each \lambda
  4. Choose \lambda^* minimizing the average cross-validation loss

Standard choices: K = 5 or K = 10. Leave-one-out (K = n) has a closed-form solution for ridge regression.
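The recipe above, as a short numpy sketch for ridge regression (synthetic data; the candidate grid and K = 5 are arbitrary choices). Note the folds are fixed once, so every \lambda is scored on the same splits:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 15
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]), X.T @ y)

K = 5
folds = np.array_split(rng.permutation(n), K)   # one fixed split for all lambdas

def cv_loss(lam):
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        b = ridge(X[train], y[train], lam)
        losses.append(np.mean((y[test] - X[test] @ b) ** 2))
    return np.mean(losses)

grid = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best = min(grid, key=cv_loss)
print("chosen lambda:", best)
```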

Connection to Algorithmic Stability

Regularization has a formal connection to stability:

A learning algorithm is \beta-uniformly stable if replacing one training example changes the predictions by at most \beta (in a suitable norm). Strong regularization makes algorithms more stable, because the penalty prevents the solution from depending too sensitively on any single data point.

For ridge regression with parameter \lambda, the uniform stability is \beta = O(1/(n\lambda)). Strong regularization (large \lambda) gives small \beta (high stability), which in turn gives tight generalization bounds.

This creates a clean theoretical chain: regularization implies stability implies generalization.
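The 1/(n\lambda) scaling can be seen directly in a small experiment (illustrative numpy sketch on synthetic data): replace one training label and measure how much the ridge solution moves at different regularization strengths.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 60, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]), X.T @ y)

y2 = y.copy()
y2[0] += 5.0                              # perturb a single training label

for lam in (0.01, 0.1, 1.0, 10.0):
    shift = np.linalg.norm(ridge(X, y, lam) - ridge(X, y2, lam))
    print(f"lambda={lam:5.2f}  coefficient change = {shift:.4f}")
```

The coefficient change shrinks monotonically as \lambda grows: the penalty term dampens the influence of any single example, which is uniform stability in action.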

Canonical Examples

Example

Ridge vs. OLS when p is close to n

Consider n = 100 observations and p = 90 features. OLS overfits badly because X^\top X is nearly singular (its condition number is huge). Ridge with \lambda = 0.1 adds n\lambda I = 10 I to X^\top X, stabilizing the inversion. The SVD form shows that the 90th (smallest) singular value direction, nearly unusable for OLS, gets shrunk heavily, while the first few large singular values are barely affected. Test MSE drops dramatically.

Example

Lasso for feature selection

In a genomics dataset with n = 200 patients and p = 10{,}000 genes, most genes are irrelevant. Lasso with cross-validated \lambda selects 15 genes (nonzero coefficients) and zeros out the rest. The selected genes are interpretable biomarkers. Ridge regression would keep all 10,000 genes with small but nonzero coefficients, losing interpretability.

Common Confusions

Watch Out

Regularization does not always mean a penalty term

Early stopping, dropout, data augmentation, and even using a small network are all forms of regularization. The common thread is restricting the effective complexity of the learned function, not the specific mechanism. The explicit penalty \lambda\Omega(h) is just the most mathematically tractable form.

Watch Out

L1 does not always outperform L2

L1 (lasso) is better when the true model is sparse. L2 (ridge) is better when all features contribute roughly equally with small coefficients. In practice, you do not know which setting you are in, which is why elastic net (L1 + L2) and cross-validation are the safe defaults.

Watch Out

Weight decay is not exactly L2 regularization with Adam

For SGD, weight decay (\theta \leftarrow (1 - \eta\lambda)\theta - \eta\nabla L) and L2 regularization (a gradient step on L + \lambda\|\theta\|^2/2) are equivalent. For adaptive optimizers like Adam, they differ because the adaptive learning rate rescales the gradient and the penalty term differently. "Decoupled weight decay" (AdamW) applies weight decay directly, which is the correct implementation for Adam.
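The plain-SGD equivalence is a one-line identity, checked numerically below (illustrative numpy sketch; \eta, \lambda, and the stand-in gradient are arbitrary): one weight-decay step equals one gradient step on the L2-penalized loss.

```python
import numpy as np

rng = np.random.default_rng(9)
theta = rng.normal(size=5)
grad_L = rng.normal(size=5)     # stand-in for the loss gradient at theta
eta, lam = 0.1, 0.01

# Weight decay: theta <- (1 - eta*lam) * theta - eta * grad_L
step_decay = (1 - eta * lam) * theta - eta * grad_L

# L2 penalty: gradient of L + (lam/2) * ||theta||^2 is grad_L + lam * theta
step_l2 = theta - eta * (grad_L + lam * theta)

assert np.allclose(step_decay, step_l2)
```

With Adam the two differ, because grad_L would be rescaled by the running second-moment estimate while the decay term is not; that is exactly the gap AdamW closes.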

Summary

  • Regularization prevents overfitting by penalizing model complexity
  • L2 (ridge): shrinks all coefficients, never zeros any out, Gaussian prior
  • L1 (lasso): shrinks and zeros out coefficients, sparse solutions, Laplace prior
  • Elastic net: combines L1 and L2, handles correlated features
  • Regularization = Bayesian prior: L2 = Gaussian, L1 = Laplace
  • Lambda controls bias-variance tradeoff: large lambda = more bias, less variance
  • Cross-validation is the standard method for choosing lambda
  • Early stopping, dropout, and data augmentation are implicit regularization
  • Strong regularization implies algorithmic stability implies generalization

Exercises

ExerciseCore

Problem

For ridge regression, show that \hat\beta_{\text{ridge}}(\lambda) \to 0 as \lambda \to \infty and \hat\beta_{\text{ridge}}(\lambda) \to \hat\beta_{\text{OLS}} as \lambda \to 0^+ (assuming X^\top X is invertible).

ExerciseAdvanced

Problem

Show that L2 regularization corresponds to a Gaussian prior on the coefficients. Specifically, prove that the ridge solution is the MAP estimate when \beta \sim \mathcal{N}(0, \tau^2 I) and y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I), and express \lambda in terms of \sigma^2 and \tau^2.


References

Canonical:

  • Hoerl & Kennard, "Ridge Regression" (1970) -- the original ridge paper
  • Tibshirani, "Regression Shrinkage and Selection via the Lasso" (1996)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapters 3-4

Current:

  • Hastie, Tibshirani, Wainwright, Statistical Learning with Sparsity (2015)
  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019) -- AdamW
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

Next Topics

Building on regularization:

  • Algorithmic stability: formalizing the regularization-generalization connection
  • Kernels and RKHS: regularization in infinite-dimensional function spaces

Last reviewed: April 2026
