
Training Techniques

Regularization in Practice

Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.


Why This Matters

A model that memorizes the training set is useless. Regularization is the toolkit for preventing this. Every production ML model uses at least one form of regularization, and most use several simultaneously. The theoretical foundations are covered in the regularization theory topic. This page focuses on practical application: what each technique does, when to use it, and how to combine them.

Mental Model

All regularization techniques share one goal: reduce the gap between training performance and test performance (the bias-variance tradeoff). They do this by constraining the effective complexity of the model, either explicitly (penalty terms, architectural constraints) or implicitly (noise injection, early termination).

L2 Regularization (Weight Decay)

Definition

L2 Regularization

Add a penalty proportional to the squared L2 norm of the parameters to the loss:

$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$

where $\lambda > 0$ controls the penalty strength.

Proposition

L2 Regularization Shrinks Weights

Statement

With L2 regularization, the gradient descent update becomes:

$$\theta_{t+1} = \theta_t - \eta(\nabla L(\theta_t) + \lambda \theta_t) = (1 - \eta\lambda)\theta_t - \eta \nabla L(\theta_t)$$

Each step multiplies the weights by $(1 - \eta\lambda) < 1$ before applying the gradient update. Weights that are not reinforced by the gradient shrink exponentially toward zero.

Intuition

L2 regularization applies a constant friction to all weights. Large weights are penalized more (the penalty is quadratic), so the optimizer prefers solutions with many small weights over solutions with a few large weights. This prevents the model from relying too heavily on any single feature.
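The shrinkage dynamics can be sketched numerically; the values for $\eta$ and $\lambda$ below are illustrative, not from the text:

```python
import numpy as np

# Gradient descent with an L2 penalty on a weight that receives no data
# gradient: the weight decays geometrically by (1 - eta * lam) each step.
eta, lam = 0.1, 0.5   # hypothetical learning rate and penalty strength
theta = 2.0
for _ in range(50):
    data_grad = 0.0                          # not reinforced by the loss
    theta = (1 - eta * lam) * theta - eta * data_grad

# Closed form after t steps: theta_0 * (1 - eta*lam)**t
print(theta)  # 2.0 * 0.95**50, roughly 0.154
```

A weight that does receive a data gradient settles where the gradient balances the decay, which is the friction picture above.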

Proof Sketch

Differentiate $L_{\text{reg}}$ with respect to $\theta$: $\nabla L_{\text{reg}} = \nabla L + \lambda \theta$. Substitute into the gradient descent update rule and factor out $\theta_t$.

Why It Matters

L2 regularization is the most widely used explicit regularizer. In PyTorch, the weight_decay parameter in optimizers implements exactly this. A typical value is $\lambda = 10^{-4}$ to $10^{-2}$.

Failure Mode

L2 regularization shrinks all weights toward zero but never sets them exactly to zero. If you need sparse models (feature selection), L2 is the wrong choice. Use L1 instead. Also, for Adam, naive L2 regularization and weight decay are not equivalent. Use AdamW (decoupled weight decay) for correct behavior.

L1 Regularization (Sparsity)

Definition

L1 Regularization

Add a penalty proportional to the L1 norm of the parameters:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda \|\theta\|_1 = L(\theta) + \lambda \sum_j |\theta_j|$$

L1 regularization drives small weights to exactly zero, producing sparse models. This is useful when you believe many features are irrelevant and want the model to automatically select a subset.

Why L1 Gives Sparsity: The Geometry

The geometric explanation is clearer than the algebraic one. Consider minimizing $L(\theta)$ subject to the constraint $\|\theta\|_1 \leq t$ (equivalent to the penalized form for some $\lambda$). The L1 ball $\|\theta\|_1 \leq t$ is a diamond (a rhombus in 2D, a cross-polytope in higher dimensions) with corners on the coordinate axes. The loss function has elliptical contours (for quadratic losses). The constrained optimum occurs where the loss contour first touches the constraint set. Because the L1 ball has sharp corners on the axes, the contact point typically lands on a corner, which means one or more coordinates are exactly zero.

Contrast with L2: the L2 ball $\|\theta\|_2 \leq t$ is a sphere. Elliptical contours meet a sphere at a point that is generically not on any axis. L2 shrinks all weights but sets none to zero.

The subgradient argument makes this precise: the subgradient of $|\theta_j|$ at $\theta_j = 0$ is any value in $[-1, 1]$. The optimality condition $0 \in \partial L / \partial \theta_j + \lambda \cdot [-1, 1]$ is satisfied at $\theta_j = 0$ whenever $|\partial L / \partial \theta_j| \leq \lambda$. For L2, the gradient of $\theta_j^2$ at zero is $0$, which provides no thresholding effect.
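The thresholding behavior can be sketched with the soft-thresholding (proximal) operator that this subgradient condition implies; the weight values below are made up:

```python
import numpy as np

# The proximal operator of lam*|theta| is soft-thresholding: coordinates
# with magnitude below lam are set exactly to zero, the effect L2 lacks.
def soft_threshold(theta, lam):
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

w = np.array([0.8, -0.05, 0.03, -1.2])
print(soft_threshold(w, lam=0.1))  # 0.7, 0.0, 0.0, -1.1
```

The two small coordinates land exactly at zero, while the surviving ones are shrunk by $\lambda$.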

Typical values: $\lambda = 10^{-5}$ to $10^{-3}$ for neural networks. L1 is more commonly used in linear models (Lasso) than in deep learning.

Elastic Net

Definition

Elastic Net Regularization

The elastic net combines L1 and L2 penalties:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$$

where $\lambda_1$ controls sparsity and $\lambda_2$ controls weight magnitude.

Elastic net solves a specific failure mode of pure L1: when features are correlated, L1 tends to select one and zero out the rest. The L2 term groups correlated features together. In linear models, the elastic net penalty is $\alpha \|\theta\|_1 + (1-\alpha)\|\theta\|_2^2/2$, where $\alpha \in [0,1]$ interpolates between ridge ($\alpha = 0$) and lasso ($\alpha = 1$).
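The $\alpha$-parameterized penalty can be sketched in a few lines (the function name and test values are mine):

```python
import numpy as np

# Elastic net penalty in the alpha-interpolated linear-model form:
# alpha = 1 recovers the lasso penalty, alpha = 0 the ridge penalty.
def elastic_net_penalty(theta, alpha, lam=1.0):
    l1 = np.sum(np.abs(theta))
    l2 = 0.5 * np.sum(theta**2)
    return lam * (alpha * l1 + (1 - alpha) * l2)

w = np.array([1.0, -2.0, 0.0])
print(elastic_net_penalty(w, alpha=1.0))  # pure lasso: |1| + |-2| = 3.0
print(elastic_net_penalty(w, alpha=0.0))  # pure ridge: 0.5*(1 + 4) = 2.5
```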

For deep learning, elastic net is less common because the model already has enough capacity to distribute weight across correlated features. It is standard in linear models and frequently used in genomics and high-dimensional statistics.

Dropout

Definition

Dropout

During training, randomly set each neuron's output to zero with probability $p$ (typically 0.1 to 0.5). During inference, use all neurons but scale outputs by $(1 - p)$ (or equivalently, scale training outputs by $1/(1-p)$ using inverted dropout).
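Inverted dropout can be sketched in a few lines; this is an illustrative NumPy version, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, training=True):
    # Inverted dropout: scale at training time by 1/(1-p) so that the
    # expected activation matches the no-dropout forward pass, and no
    # rescaling is needed at inference.
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p       # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones(100_000)
out = dropout(x, p=0.3)
print(out.mean())   # close to 1.0: the expectation is preserved
```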

Proposition

Dropout as Approximate Model Averaging

Statement

A neural network with $d$ dropout-eligible neurons implicitly defines $2^d$ sub-networks (one per possible binary mask). Training with dropout approximately trains all $2^d$ sub-networks with shared weights. At inference, the weight-scaled full network approximates the average of all sub-network predictions:

$$p_{\text{ensemble}}(y \mid x) \approx \frac{1}{2^d} \sum_{m} p_m(y \mid x)$$

where the sum is over all masks $m$.

Intuition

Dropout forces the network to learn redundant representations. No neuron can rely on any other specific neuron being present, so each neuron must be useful on its own. This prevents co-adaptation (neurons specializing only in combination with specific other neurons).

Proof Sketch

Each training step samples a mask $m \sim \text{Bernoulli}(1-p)^d$ and trains the corresponding sub-network. The weight sharing means updates to one sub-network affect all sub-networks that share those weights. At inference, the weight-scaled full network gives the same expected output as the mean over sub-networks for linear models, and approximates it for non-linear models.

Why It Matters

Dropout is the most common regularizer for neural networks after weight decay. It adds negligible computational cost during training (just masking operations) and is simple to implement.

Failure Mode

Dropout is less effective (and sometimes harmful) with batch normalization. The interaction between dropout noise and batch statistics can destabilize training. In practice, many modern architectures (ResNets, Transformers) use weight decay and batch/layer normalization instead of dropout, except in fully connected layers.

Early Stopping

Definition

Early Stopping

Monitor validation loss during training. Stop training when validation loss has not improved for a specified number of epochs (the patience). Return the model parameters from the epoch with the lowest validation loss.

Early stopping works because training loss decreases monotonically (with enough data and small enough learning rate), but validation loss eventually increases as the model overfits. The point of divergence is the optimal stopping time.
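The patience logic can be sketched as follows; the validation-loss sequence is synthetic, chosen only to show the mechanism:

```python
# Patience-based early stopping: return the index of the best epoch,
# stopping once `patience` epochs pass without improvement.
def early_stop_index(val_losses, patience):
    best, best_i = float("inf"), 0
    for i, v in enumerate(val_losses):
        if v < best:
            best, best_i = v, i
        elif i - best_i >= patience:       # no improvement for `patience` epochs
            break
    return best_i                          # epoch with lowest validation loss

val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.68, 0.70, 0.75]
print(early_stop_index(val_losses, patience=3))  # -> 3
```

In a real training loop, the parameters from the best epoch would also be checkpointed and restored.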

Early Stopping as Implicit Regularization

Early stopping is equivalent to L2 regularization in certain settings.

Proposition

Early Stopping and L2 Regularization Equivalence

Statement

For linear regression with gradient descent starting from $\theta_0 = 0$ and learning rate $\eta$, the iterate $\theta_t$ after $t$ steps matches the L2-regularized solution with penalty $\lambda = 1/(\eta t)$ in the following sense: both suppress the $k$-th eigendirection of $X^T X$ by a factor that depends on the eigenvalue $\sigma_k$. Early stopping fits a fraction $1 - (1 - \eta\sigma_k)^t$ of the unregularized solution in that direction, while L2 fits a fraction $\sigma_k/(\sigma_k + \lambda)$.

Intuition

Gradient descent with limited steps cannot fully fit directions with small eigenvalues (slow-learning directions). L2 regularization penalizes large weights, which also suppresses directions with small eigenvalues. Both methods effectively truncate the low-variance components of the model. The implicit regularization strength is inversely proportional to the number of training steps.

Proof Sketch

In the eigenbasis of $X^T X$, the $k$-th component of $\theta_t$ evolves as $\theta_t^{(k)} = (1 - (1-\eta\sigma_k)^t) \theta_*^{(k)}$, where $\theta_*$ is the OLS solution. The L2 solution gives $\hat{\theta}^{(k)} = \sigma_k/(\sigma_k + \lambda) \cdot \theta_*^{(k)}$. For $\eta\sigma_k$ small, $(1-\eta\sigma_k)^t \approx e^{-\eta\sigma_k t}$, and matching the suppression factors gives $\lambda \approx 1/(\eta t)$.
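A quick numeric check of this matching, with illustrative values; as the proof sketch suggests, the two factors agree only approximately:

```python
# Compare the fitted fraction of one eigendirection under early stopping
# (t gradient steps) vs ridge with lambda = 1/(eta * t).
eta, t, sigma = 0.01, 1000, 0.05

early_stop_factor = 1 - (1 - eta * sigma) ** t   # fraction of OLS fit reached
lam = 1.0 / (eta * t)
ridge_factor = sigma / (sigma + lam)

print(early_stop_factor, ridge_factor)  # roughly 0.39 vs 0.33: same scale
```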

Why It Matters

This equivalence explains why early stopping is effective even without an explicit penalty term. It also explains why the number of training steps is a regularization hyperparameter: fewer steps means stronger regularization.

Failure Mode

The equivalence is exact only for quadratic losses with linear models. For neural networks, the loss landscape is non-quadratic, the Hessian changes during training, and the implicit bias of SGD interacts with the stopping time in ways that do not reduce to L2 regularization.

Data Augmentation as Implicit Regularization

Data augmentation (random crops, flips, rotations for images; synonym replacement for text) expands the effective training set. This is implicit regularization because it constrains the model to be invariant to the augmentation transformations.

A model trained on randomly cropped images cannot memorize pixel-level details of training images, because those details change every epoch. The model is forced to learn features that are robust to small spatial perturbations.
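A random-crop sketch in NumPy (the padding size and reflect mode are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, pad=4):
    # Pad the image, then cut out a randomly offset window of the original
    # size. Pixel positions shift every call, so exact pixel-level details
    # cannot be memorized.
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
print(random_crop(img).shape)   # -> (32, 32)
```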

Batch Normalization as Implicit Regularization

Batch normalization normalizes activations using batch statistics (mean and variance computed over the mini-batch). The mini-batch statistics are noisy estimates of the population statistics, and this noise acts as a regularizer. Larger batch sizes reduce this noise, which is one reason large-batch training sometimes generalizes worse.
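A minimal sketch of this noise source: the same example, normalized inside two different mini-batches, produces different outputs (batches here are random toy data):

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm(x, eps=1e-5):
    # Normalize each feature with the mini-batch mean and variance
    # (learnable scale/shift omitted for brevity).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

x0 = np.array([1.0])                        # the same example...
batch_a = np.vstack([x0, rng.normal(size=(7, 1))])
batch_b = np.vstack([x0, rng.normal(size=(7, 1))])
out_a = batchnorm(batch_a)[0, 0]            # ...normalized in two batches
out_b = batchnorm(batch_b)[0, 0]
print(out_a, out_b)                         # different: batch-stat noise
```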

Weight Decay vs L2 Regularization in Adaptive Optimizers

For SGD, weight decay and L2 regularization produce identical updates:

$$\theta \leftarrow (1 - \eta\lambda)\theta - \eta \nabla L(\theta) = \theta - \eta(\nabla L(\theta) + \lambda\theta)$$

For Adam and other adaptive methods, they diverge. Adam divides the gradient by the running estimate of its second moment $\sqrt{v_t}$. With L2 regularization, the penalty gradient $\lambda\theta$ is also divided by $\sqrt{v_t}$, which reduces the effective regularization on parameters with large gradients. With decoupled weight decay (AdamW), the shrinkage $(1 - \eta\lambda)\theta$ is applied separately from the adaptive step, so all parameters are regularized equally regardless of gradient magnitude.

The practical difference: Adam with L2 regularization under-regularizes parameters in frequently updated directions and over-regularizes parameters in rarely updated directions. AdamW avoids this asymmetry. For transformer training, AdamW with weight decay $\lambda = 0.01$ to $0.1$ is the standard choice.
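A single-parameter sketch of the difference (a simplified Adam step with bias correction omitted; all values are illustrative):

```python
import numpy as np

# One update step for a parameter with a large second-moment estimate v
# and no data gradient, so only the regularization term acts.
eta, lam, eps = 1e-3, 0.1, 1e-8
theta, grad, v = 1.0, 0.0, 4.0

# Adam + L2: the penalty gradient lam*theta is divided by sqrt(v),
# weakening the effective penalty when v is large.
adam_l2 = theta - eta * (grad + lam * theta) / (np.sqrt(v) + eps)

# AdamW: shrink the weight first, then take the adaptive step on the
# data gradient alone -- the decay is independent of v.
adamw = (1 - eta * lam) * theta - eta * grad / (np.sqrt(v) + eps)

print(1.0 - adam_l2, 1.0 - adamw)  # shrinkage: about 5e-5 vs 1e-4
```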

Dropout as Approximate Bayesian Inference

Watch Out

Dropout is not just ensembling

The ensemble interpretation (averaging $2^d$ sub-networks) is the standard explanation. A deeper connection: Gal and Ghahramani (2016) showed that a network trained with dropout is approximately performing variational inference over the weights. Running dropout at test time and averaging predictions approximates the posterior predictive distribution. This gives calibrated uncertainty estimates: the variance of predictions across dropout masks estimates the model's epistemic uncertainty. This technique is called MC Dropout.

The approximation holds when dropout is applied before every weight layer (not just FC layers) and the variational distribution is a mixture of two point masses (zero and the learned weight). The quality of the approximation degrades for very deep networks and small dropout rates.
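MC Dropout can be sketched on a toy linear "network" (weights and inputs below are made up; this illustrates the procedure, not the Gal and Ghahramani implementation):

```python
import numpy as np

# Keep dropout active at test time, average predictions over masks, and
# use the spread across masks as an epistemic-uncertainty estimate.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 1))    # toy weights
x = rng.normal(size=(1, 16))    # one toy input
p = 0.2                         # dropout rate

preds = []
for _ in range(200):            # 200 stochastic forward passes
    mask = rng.random(x.shape) >= p
    preds.append(((x * mask / (1 - p)) @ W).item())

preds = np.array(preds)
print(preds.mean(), preds.std())  # predictive mean and uncertainty
```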

Practical Guidelines for Regularization Strength

Choosing $\lambda$ (or dropout rate $p$, or patience for early stopping) is a model selection problem. There is no closed-form answer. The standard approach:

  1. Grid search or random search over $\lambda$ on a log scale: $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.
  2. Monitor the train-val gap: if the gap is large, increase regularization. If both train and val loss are high, decrease regularization (you are underfitting).
  3. Scale with model size: larger models typically need stronger regularization. For transformers, weight decay of $0.1$ is common; for small CNNs, $10^{-4}$ to $10^{-3}$.
  4. Combine orthogonal regularizers: weight decay constrains weight magnitude, dropout prevents co-adaptation, data augmentation enforces invariances. These address different failure modes and compose well.
  5. Do not regularize biases: the standard practice is to exclude bias terms from weight decay. Biases have one parameter per neuron and contribute negligibly to overfitting. Regularizing them can harm optimization by restricting the network's ability to shift activation distributions.
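The bias-exclusion convention in point 5 can be sketched as a name-based split; the parameter names are hypothetical, and in PyTorch the two lists would feed separate optimizer param groups:

```python
# Split parameters into a decayed group and an undecayed group by name.
# Biases (and often normalization parameters) go in the undecayed group.
params = {
    "layer1.weight": "...", "layer1.bias": "...",
    "layer2.weight": "...", "layer2.bias": "...",
}

decay = [n for n in params if not n.endswith("bias")]
no_decay = [n for n in params if n.endswith("bias")]
print(decay)     # -> ['layer1.weight', 'layer2.weight']
print(no_decay)  # -> ['layer1.bias', 'layer2.bias']
```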

Comparison Table of Regularization Methods

| Method | Mechanism | Sparsity? | Typical strength | Best for | Watch out |
|---|---|---|---|---|---|
| L2 (weight decay) | Quadratic penalty on weight magnitude | No | $\lambda = 10^{-4}$ to $10^{-1}$ | Default for all models | Use AdamW, not Adam with L2 |
| L1 | Absolute value penalty | Yes | $\lambda = 10^{-5}$ to $10^{-3}$ | Feature selection, sparse linear models | Unstable with correlated features |
| Elastic net | L1 + L2 combined | Partial | $\alpha \in [0,1]$ | Correlated features, genomics | Two hyperparameters to tune |
| Dropout | Random neuron masking during training | No | $p = 0.1$ to $0.5$ | FC layers in neural networks | Interacts badly with batch norm |
| Early stopping | Stop training before convergence | No | Patience 5-20 epochs | Universal safety net | Implicitly L2-like for linear models |
| Data augmentation | Expand training set with transforms | No | Domain-specific | Vision, NLP | Must preserve label semantics |
| Batch normalization | Normalize activations using mini-batch stats | No | Always on | Deep CNNs, transformers | Not strictly regularization; noisy batch stats provide implicit regularization |
| Label smoothing | Soften one-hot targets to $(1-\epsilon, \epsilon/(K-1))$ | No | $\epsilon = 0.1$ | Classification with many classes | Prevents overconfident predictions |

Each method addresses a different failure mode. L2 prevents large weights. Dropout prevents co-adaptation. Data augmentation prevents memorization of surface-level patterns. Combining orthogonal methods is more effective than increasing the strength of a single method.
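The label-smoothing row from the table can be sketched directly (the function name is mine):

```python
import numpy as np

# Soften one-hot targets: (1 - eps) on the true class and eps/(K - 1)
# spread over the remaining classes, so each row still sums to 1.
def smooth_labels(y, num_classes, eps=0.1):
    targets = np.full((len(y), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

print(smooth_labels(np.array([2]), num_classes=4))
# true class gets 0.9, each other class gets about 0.0333
```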

Example

Regularization recipe for a ResNet on CIFAR-10

A typical ResNet-18 trained on CIFAR-10 uses all of the following simultaneously:

  1. AdamW with weight decay $\lambda = 0.01$ (L2 on all non-bias parameters)
  2. Random horizontal flips and random crops with 4-pixel padding (data augmentation)
  3. Batch normalization after every convolutional layer (implicit regularization from noisy batch statistics)
  4. Early stopping with patience 20 epochs monitoring validation loss
  5. Label smoothing with $\epsilon = 0.1$

No dropout is used because batch normalization already provides noise injection. Each technique targets a different source of overfitting, and removing any one typically degrades test accuracy by 0.5 to 2 percentage points.

How to Choose

Start with this recipe and adjust based on validation performance:

  1. Always use weight decay ($\lambda = 10^{-4}$ for Adam/AdamW, $\lambda = 10^{-3}$ to $10^{-2}$ for SGD)
  2. Add dropout ($p = 0.1$ to $0.3$) in fully connected layers if overfitting persists
  3. Use early stopping with patience of 5-10 epochs as a safety net
  4. Add data augmentation appropriate to the domain
  5. If still overfitting: get more data, reduce model size, or increase $\lambda$

Common Confusions

Watch Out

Weight decay and L2 regularization are not identical for Adam

For SGD, weight decay ($\theta \leftarrow (1-\eta\lambda)\theta - \eta g$) and L2 regularization ($\theta \leftarrow \theta - \eta(g + \lambda\theta)$) produce the same update. For Adam, they differ because Adam scales the gradient by the second moment, but the weight decay term should not be scaled. AdamW implements the correct decoupled weight decay. Using Adam with L2 regularization (the weight_decay parameter in PyTorch's Adam optimizer) gives suboptimal results.

Watch Out

More regularization is not always better

Too much regularization underfits. A model with $\lambda = 10$ will have all weights near zero and make nearly constant predictions. Regularization strength must be tuned, typically via validation performance.

Exercises

ExerciseCore

Problem

You train a neural network with no regularization. Training accuracy is 99%, validation accuracy is 72%. Propose three regularization techniques to try and the order in which you would try them.

ExerciseAdvanced

Problem

Explain why early stopping after $T$ gradient steps with learning rate $\eta$ is approximately equivalent to L2 regularization with $\lambda = 1/(\eta T)$ for linear regression trained by gradient descent. What assumption breaks this equivalence for neural networks?

ExerciseCore

Problem

Draw the constraint sets $\|\theta\|_1 \leq 1$ and $\|\theta\|_2 \leq 1$ in 2D. Explain geometrically why an elliptical loss contour is more likely to touch the L1 ball at a corner (axis) than at a generic point.

ExerciseAdvanced

Problem

A transformer model is trained with Adam (not AdamW) and weight_decay=0.01. Explain why this does not implement proper weight decay. What is the effective regularization on a parameter whose second-moment estimate $v_t$ is large?


References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 3.1 (L2 regularization), Chapter 5.5 (regularization in neural networks)
  • Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (2014), JMLR, Sections 1-7
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 3.4 (L1/Lasso), Chapter 3.4.2 (elastic net)

Current:

  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019), ICLR (AdamW, decoupled weight decay)
  • Gal & Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" (2016), ICML
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 7 (regularization for deep learning)
  • Balestriero et al., "A Cookbook of Self-Supervised Learning" (2023), for data augmentation as regularization

Next Topics

  • Batch normalization: normalization as implicit regularization and training stabilizer
  • Data augmentation theory: formal analysis of augmentation as regularization

Last reviewed: April 2026
