
Training Techniques

Regularization in Practice

Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.


Why This Matters

A model that memorizes the training set is useless. Regularization is the toolkit for preventing this. Every production ML model uses at least one form of regularization, and most use several simultaneously. The theoretical foundations are covered in the regularization theory topic. This page focuses on practical application: what each technique does, when to use it, and how to combine them.

Mental Model

All regularization techniques share one goal: reduce the gap between training performance and test performance (the bias-variance tradeoff). They do this by constraining the effective complexity of the model, either explicitly (penalty terms, architectural constraints) or implicitly (noise injection, early termination).

L2 Regularization (Weight Decay)

Definition

L2 Regularization

Add a penalty proportional to the squared L2 norm of the parameters to the loss:

$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$

where $\lambda > 0$ controls the penalty strength.

Proposition

L2 Regularization Shrinks Weights

Statement

With L2 regularization, the gradient descent update becomes:

$$\theta_{t+1} = \theta_t - \eta(\nabla L(\theta_t) + \lambda \theta_t) = (1 - \eta\lambda)\theta_t - \eta \nabla L(\theta_t)$$

Each step multiplies the weights by $(1 - \eta\lambda) < 1$ before applying the gradient update. Weights that are not reinforced by the gradient shrink exponentially toward zero.

Intuition

L2 regularization applies a constant friction to all weights. Large weights are penalized more (the penalty is quadratic), so the optimizer prefers solutions with many small weights over solutions with a few large weights. This prevents the model from relying too heavily on any single feature.
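The shrinkage dynamics can be sketched numerically; the values for $\eta$ and $\lambda$ below are illustrative, not from the text:

```python
import numpy as np

# Gradient descent with an L2 penalty on a weight that receives no data
# gradient: the weight decays geometrically by (1 - eta * lam) each step.
eta, lam = 0.1, 0.5   # hypothetical learning rate and penalty strength
theta = 2.0
for _ in range(50):
    data_grad = 0.0                          # not reinforced by the loss
    theta = (1 - eta * lam) * theta - eta * data_grad

# Closed form after t steps: theta_0 * (1 - eta*lam)**t
print(theta)  # 2.0 * 0.95**50, roughly 0.154
```

A weight that does receive a data gradient settles where the gradient balances the decay, which is the friction picture above.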

Proof Sketch

Differentiate $L_{\text{reg}}$ with respect to $\theta$: $\nabla L_{\text{reg}} = \nabla L + \lambda \theta$. Substitute into the gradient descent update rule and factor out $\theta_t$.

Why It Matters

L2 regularization is the most widely used explicit regularizer. In PyTorch, the weight_decay parameter in optimizers implements exactly this. A typical value is $\lambda = 10^{-4}$ to $10^{-2}$.

Failure Mode

L2 regularization shrinks all weights toward zero but never sets them exactly to zero. If you need sparse models (feature selection), L2 is the wrong choice. Use L1 instead. Also, for Adam, naive L2 regularization and weight decay are not equivalent. Use AdamW (decoupled weight decay) for correct behavior.

L1 Regularization (Sparsity)

Definition

L1 Regularization

Add a penalty proportional to the L1 norm of the parameters:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda \|\theta\|_1 = L(\theta) + \lambda \sum_j |\theta_j|$$

L1 regularization drives small weights to exactly zero, producing sparse models. This is useful when you believe many features are irrelevant and want the model to automatically select a subset.

Why L1 Gives Sparsity: The Geometry

The geometric explanation is clearer than the algebraic one. Consider minimizing $L(\theta)$ subject to the constraint $\|\theta\|_1 \leq t$ (equivalent to the penalized form for some $\lambda$). The L1 ball $\|\theta\|_1 \leq t$ is a diamond (a rhombus in 2D, a cross-polytope in higher dimensions) with corners on the coordinate axes. The loss function has elliptical contours (for quadratic losses). The constrained optimum occurs where the loss contour first touches the constraint set. Because the L1 ball has sharp corners on the axes, the contact point typically lands on a corner, which means one or more coordinates are exactly zero.

Contrast with L2: the L2 ball $\|\theta\|_2 \leq t$ is a sphere. Elliptical contours meet a sphere at a point that is generically not on any axis. L2 shrinks all weights but sets none to zero.

The subgradient argument makes this precise: the subgradient of $|\theta_j|$ at $\theta_j = 0$ is any value in $[-1, 1]$. The optimality condition $0 \in \partial L / \partial \theta_j + \lambda \cdot [-1, 1]$ is satisfied at $\theta_j = 0$ whenever $|\partial L / \partial \theta_j| \leq \lambda$. For L2, the gradient of $\theta_j^2$ at zero is $0$, which provides no thresholding effect.
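The thresholding behavior can be sketched with the soft-thresholding (proximal) operator that this subgradient condition implies; the weight values below are made up:

```python
import numpy as np

# The proximal operator of lam*|theta| is soft-thresholding: coordinates
# with magnitude below lam are set exactly to zero, the effect L2 lacks.
def soft_threshold(theta, lam):
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

w = np.array([0.8, -0.05, 0.03, -1.2])
print(soft_threshold(w, lam=0.1))  # 0.7, 0.0, 0.0, -1.1
```

The two small coordinates land exactly at zero, while the surviving ones are shrunk by $\lambda$.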

Typical values: $\lambda = 10^{-5}$ to $10^{-3}$ for neural networks. L1 is more commonly used in linear models (Lasso) than in deep learning.

Elastic Net

Definition

Elastic Net Regularization

The elastic net combines L1 and L2 penalties:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$$

where $\lambda_1$ controls sparsity and $\lambda_2$ controls weight magnitude.

Elastic net solves a specific failure mode of pure L1: when features are correlated, L1 tends to select one and zero out the rest. The L2 term groups correlated features together. In linear models, the elastic net penalty is $\alpha \|\theta\|_1 + (1-\alpha)\|\theta\|_2^2/2$, where $\alpha \in [0,1]$ interpolates between ridge ($\alpha = 0$) and lasso ($\alpha = 1$).
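The $\alpha$-parameterized penalty can be sketched in a few lines (the function name and test values are mine):

```python
import numpy as np

# Elastic net penalty in the alpha-interpolated linear-model form:
# alpha = 1 recovers the lasso penalty, alpha = 0 the ridge penalty.
def elastic_net_penalty(theta, alpha, lam=1.0):
    l1 = np.sum(np.abs(theta))
    l2 = 0.5 * np.sum(theta**2)
    return lam * (alpha * l1 + (1 - alpha) * l2)

w = np.array([1.0, -2.0, 0.0])
print(elastic_net_penalty(w, alpha=1.0))  # pure lasso: |1| + |-2| = 3.0
print(elastic_net_penalty(w, alpha=0.0))  # pure ridge: 0.5*(1 + 4) = 2.5
```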

For deep learning, elastic net is less common because the model already has enough capacity to distribute weight across correlated features. It is standard in linear models and frequently used in genomics and high-dimensional statistics.

Dropout

Definition

Dropout

During training, randomly set each neuron's output to zero with probability $p$ (typically 0.1 to 0.5). During inference, use all neurons but scale outputs by $(1 - p)$ (or equivalently, scale training outputs by $1/(1-p)$ using inverted dropout).
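Inverted dropout can be sketched in a few lines; this is an illustrative NumPy version, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, training=True):
    # Inverted dropout: scale at training time by 1/(1-p) so that the
    # expected activation matches the no-dropout forward pass, and no
    # rescaling is needed at inference.
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p       # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones(100_000)
out = dropout(x, p=0.3)
print(out.mean())   # close to 1.0: the expectation is preserved
```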

Proposition

Dropout as Approximate Model Averaging

Statement

A neural network with $d$ dropout-eligible neurons implicitly defines $2^d$ sub-networks (one per possible binary mask). Training with dropout approximately trains all $2^d$ sub-networks with shared weights. At inference, the weight-scaled full network approximates the average of all sub-network predictions:

$$p_{\text{ensemble}}(y \mid x) \approx \frac{1}{2^d} \sum_{m} p_m(y \mid x)$$

where the sum is over all masks $m$.

Intuition

Dropout forces the network to learn redundant representations. No neuron can rely on any other specific neuron being present, so each neuron must be useful on its own. This prevents co-adaptation (neurons specializing only in combination with specific other neurons).

Proof Sketch

Each training step samples a mask $m \sim \text{Bernoulli}(1-p)^d$ and trains the corresponding sub-network. The weight sharing means updates to one sub-network affect all sub-networks that share those weights. At inference, the weight-scaled full network gives the same expected output as the mean over sub-networks for linear models, and approximates it for non-linear models.

Why It Matters

Dropout is the most common regularizer for neural networks after weight decay. It adds negligible computational cost during training (just masking operations) and is simple to implement.

Failure Mode

Dropout is less effective (and sometimes harmful) with batch normalization. The interaction between dropout noise and batch statistics can destabilize training. In practice, many modern architectures (ResNets, Transformers) use weight decay and batch/layer normalization instead of dropout, except in fully connected layers.

Early Stopping

Definition

Early Stopping

Monitor validation loss during training. Stop training when validation loss has not improved for a specified number of epochs (the patience). Return the model parameters from the epoch with the lowest validation loss.

Early stopping works because training loss decreases monotonically (with enough data and small enough learning rate), but validation loss eventually increases as the model overfits. The point of divergence is the optimal stopping time.
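The patience logic can be sketched as follows; the validation-loss sequence is synthetic, chosen only to show the mechanism:

```python
# Patience-based early stopping: return the index of the best epoch,
# stopping once `patience` epochs pass without improvement.
def early_stop_index(val_losses, patience):
    best, best_i = float("inf"), 0
    for i, v in enumerate(val_losses):
        if v < best:
            best, best_i = v, i
        elif i - best_i >= patience:       # no improvement for `patience` epochs
            break
    return best_i                          # epoch with lowest validation loss

val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.68, 0.70, 0.75]
print(early_stop_index(val_losses, patience=3))  # -> 3
```

In a real training loop, the parameters from the best epoch would also be checkpointed and restored.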

Early Stopping as Implicit Regularization

Early stopping is equivalent to L2 regularization in certain settings.

Proposition

Early Stopping and L2 Regularization Equivalence

Statement

For linear regression with gradient descent starting from $\theta_0 = 0$ and learning rate $\eta$, the iterate $\theta_t$ after $t$ steps matches the L2-regularized solution with penalty $\lambda = 1/(\eta t)$ in the following sense: both suppress the $k$-th eigendirection of $X^T X$ by a factor that depends on the eigenvalue $\sigma_k$. Early stopping fits a fraction $1 - (1 - \eta\sigma_k)^t$ of the unregularized solution in that direction, while L2 fits a fraction $\sigma_k/(\sigma_k + \lambda)$.

Intuition

Gradient descent with limited steps cannot fully fit directions with small eigenvalues (slow-learning directions). L2 regularization penalizes large weights, which also suppresses directions with small eigenvalues. Both methods effectively truncate the low-variance components of the model. The implicit regularization strength is inversely proportional to the number of training steps.

Proof Sketch

In the eigenbasis of $X^T X$, the $k$-th component of $\theta_t$ evolves as $\theta_t^{(k)} = (1 - (1-\eta\sigma_k)^t) \theta_*^{(k)}$, where $\theta_*$ is the OLS solution. The L2 solution gives $\hat{\theta}^{(k)} = \sigma_k/(\sigma_k + \lambda) \cdot \theta_*^{(k)}$. For $\eta\sigma_k$ small, $(1-\eta\sigma_k)^t \approx e^{-\eta\sigma_k t}$, and matching the suppression factors gives $\lambda \approx 1/(\eta t)$.
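A quick numeric check of this matching, with illustrative values; as the proof sketch suggests, the two factors agree only approximately:

```python
# Compare the fitted fraction of one eigendirection under early stopping
# (t gradient steps) vs ridge with lambda = 1/(eta * t).
eta, t, sigma = 0.01, 1000, 0.05

early_stop_factor = 1 - (1 - eta * sigma) ** t   # fraction of OLS fit reached
lam = 1.0 / (eta * t)
ridge_factor = sigma / (sigma + lam)

print(early_stop_factor, ridge_factor)  # roughly 0.39 vs 0.33: same scale
```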

Why It Matters

This equivalence explains why early stopping is effective even without an explicit penalty term. It also explains why the number of training steps is a regularization hyperparameter: fewer steps means stronger regularization.

Failure Mode

The equivalence is exact only for quadratic losses with linear models. For neural networks, the loss landscape is non-quadratic, the Hessian changes during training, and the implicit bias of SGD interacts with the stopping time in ways that do not reduce to L2 regularization.

Data Augmentation as Implicit Regularization

Data augmentation (random crops, flips, rotations for images; synonym replacement for text) expands the effective training set. This is implicit regularization because it constrains the model to be invariant to the augmentation transformations.

A model trained on randomly cropped images cannot memorize pixel-level details of training images, because those details change every epoch. The model is forced to learn features that are robust to small spatial perturbations.
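A random-crop sketch in NumPy (the padding size and reflect mode are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, pad=4):
    # Pad the image, then cut out a randomly offset window of the original
    # size. Pixel positions shift every call, so exact pixel-level details
    # cannot be memorized.
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
print(random_crop(img).shape)   # -> (32, 32)
```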

Batch Normalization as Implicit Regularization

Batch normalization normalizes activations using batch statistics (mean and variance computed over the mini-batch). The mini-batch statistics are noisy estimates of the population statistics, and this noise acts as a regularizer. Larger batch sizes reduce this noise, which is one reason large-batch training sometimes generalizes worse.
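A minimal sketch of this noise source: the same example, normalized inside two different mini-batches, produces different outputs (batches here are random toy data):

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm(x, eps=1e-5):
    # Normalize each feature with the mini-batch mean and variance
    # (learnable scale/shift omitted for brevity).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

x0 = np.array([1.0])                        # the same example...
batch_a = np.vstack([x0, rng.normal(size=(7, 1))])
batch_b = np.vstack([x0, rng.normal(size=(7, 1))])
out_a = batchnorm(batch_a)[0, 0]            # ...normalized in two batches
out_b = batchnorm(batch_b)[0, 0]
print(out_a, out_b)                         # different: batch-stat noise
```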

Weight Decay vs L2 Regularization in Adaptive Optimizers

For SGD, weight decay and L2 regularization produce identical updates:

$$\theta \leftarrow (1 - \eta\lambda)\theta - \eta \nabla L(\theta) = \theta - \eta(\nabla L(\theta) + \lambda\theta)$$

For Adam and other adaptive methods, they diverge. Adam divides the gradient by the running estimate of its second moment $\sqrt{v_t}$. With L2 regularization, the penalty gradient $\lambda\theta$ is also divided by $\sqrt{v_t}$, which reduces the effective regularization on parameters with large gradients. With decoupled weight decay (AdamW), the shrinkage $(1 - \eta\lambda)\theta$ is applied separately from the adaptive step, so all parameters are regularized equally regardless of gradient magnitude.

The practical difference: Adam with L2 regularization under-regularizes parameters in frequently updated directions and over-regularizes parameters in rarely updated directions. AdamW avoids this asymmetry. For transformer training, AdamW with weight decay $\lambda = 0.01$ to $0.1$ is the standard choice.
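A single-parameter sketch of the difference (a simplified Adam step with bias correction omitted; all values are illustrative):

```python
import numpy as np

# One update step for a parameter with a large second-moment estimate v
# and no data gradient, so only the regularization term acts.
eta, lam, eps = 1e-3, 0.1, 1e-8
theta, grad, v = 1.0, 0.0, 4.0

# Adam + L2: the penalty gradient lam*theta is divided by sqrt(v),
# weakening the effective penalty when v is large.
adam_l2 = theta - eta * (grad + lam * theta) / (np.sqrt(v) + eps)

# AdamW: shrink the weight first, then take the adaptive step on the
# data gradient alone -- the decay is independent of v.
adamw = (1 - eta * lam) * theta - eta * grad / (np.sqrt(v) + eps)

print(1.0 - adam_l2, 1.0 - adamw)  # shrinkage: about 5e-5 vs 1e-4
```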

Dropout as Approximate Bayesian Inference

Watch Out

Dropout is not just ensembling

The ensemble interpretation (averaging $2^d$ sub-networks) is the standard explanation. A deeper connection: Gal and Ghahramani (2016) showed that a network trained with dropout is approximately performing variational inference over the weights. Running dropout at test time and averaging predictions approximates the posterior predictive distribution. This gives calibrated uncertainty estimates: the variance of predictions across dropout masks estimates the model's epistemic uncertainty. This technique is called MC Dropout.

The approximation holds when dropout is applied before every weight layer (not just FC layers) and the variational distribution is a mixture of two point masses (zero and the learned weight). The quality of the approximation degrades for very deep networks and small dropout rates.
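MC Dropout can be sketched on a toy linear "network" (weights and inputs below are made up; this illustrates the procedure, not the Gal and Ghahramani implementation):

```python
import numpy as np

# Keep dropout active at test time, average predictions over masks, and
# use the spread across masks as an epistemic-uncertainty estimate.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 1))    # toy weights
x = rng.normal(size=(1, 16))    # one toy input
p = 0.2                         # dropout rate

preds = []
for _ in range(200):            # 200 stochastic forward passes
    mask = rng.random(x.shape) >= p
    preds.append(((x * mask / (1 - p)) @ W).item())

preds = np.array(preds)
print(preds.mean(), preds.std())  # predictive mean and uncertainty
```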

Practical Guidelines for Regularization Strength

Choosing $\lambda$ (or dropout rate $p$, or patience for early stopping) is a model selection problem. There is no closed-form answer. The standard approach:

  1. Grid search or random search over $\lambda$ on a log scale: $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.
  2. Monitor the train-val gap: if the gap is large, increase regularization. If both train and val loss are high, decrease regularization (you are underfitting).
  3. Scale with model size: larger models typically need stronger regularization. For transformers, weight decay of $0.1$ is common; for small CNNs, $10^{-4}$ to $10^{-3}$.
  4. Combine orthogonal regularizers: weight decay constrains weight magnitude, dropout prevents co-adaptation, data augmentation enforces invariances. These address different failure modes and compose well.
  5. Do not regularize biases: the standard practice is to exclude bias terms from weight decay. Biases have one parameter per neuron and contribute negligibly to overfitting. Regularizing them can harm optimization by restricting the network's ability to shift activation distributions.
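The bias-exclusion convention in point 5 can be sketched as a name-based split; the parameter names are hypothetical, and in PyTorch the two lists would feed separate optimizer param groups:

```python
# Split parameters into a decayed group and an undecayed group by name.
# Biases (and often normalization parameters) go in the undecayed group.
params = {
    "layer1.weight": "...", "layer1.bias": "...",
    "layer2.weight": "...", "layer2.bias": "...",
}

decay = [n for n in params if not n.endswith("bias")]
no_decay = [n for n in params if n.endswith("bias")]
print(decay)     # -> ['layer1.weight', 'layer2.weight']
print(no_decay)  # -> ['layer1.bias', 'layer2.bias']
```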

Comparison Table of Regularization Methods

| Method | Mechanism | Sparsity? | Typical strength | Best for | Watch out |
|---|---|---|---|---|---|
| L2 (weight decay) | Quadratic penalty on weight magnitude | No | $\lambda = 10^{-4}$ to $10^{-1}$ | Default for all models | Use AdamW, not Adam with L2 |
| L1 | Absolute value penalty | Yes | $\lambda = 10^{-5}$ to $10^{-3}$ | Feature selection, sparse linear models | Unstable with correlated features |
| Elastic net | L1 + L2 combined | Partial | $\alpha \in [0,1]$ | Correlated features, genomics | Two hyperparameters to tune |
| Dropout | Random neuron masking during training | No | $p = 0.1$ to $0.5$ | FC layers in neural networks | Interacts badly with batch norm |
| Early stopping | Stop training before convergence | No | Patience 5-20 epochs | Universal safety net | Implicitly L2-like for linear models |
| Data augmentation | Expand training set with transforms | No | Domain-specific | Vision, NLP | Must preserve label semantics |
| Batch normalization | Normalize activations using mini-batch stats | No | Always on | Deep CNNs, transformers | Not strictly regularization; noisy batch stats provide implicit regularization |
| Label smoothing | Soften one-hot targets to $(1-\epsilon, \epsilon/(K-1))$ | No | $\epsilon = 0.1$ | Classification with many classes | Prevents overconfident predictions |

Each method addresses a different failure mode. L2 prevents large weights. Dropout prevents co-adaptation. Data augmentation prevents memorization of surface-level patterns. Combining orthogonal methods is more effective than increasing the strength of a single method.
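The label-smoothing row from the table can be sketched directly (the function name is mine):

```python
import numpy as np

# Soften one-hot targets: (1 - eps) on the true class and eps/(K - 1)
# spread over the remaining classes, so each row still sums to 1.
def smooth_labels(y, num_classes, eps=0.1):
    targets = np.full((len(y), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

print(smooth_labels(np.array([2]), num_classes=4))
# true class gets 0.9, each other class gets about 0.0333
```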

Example

Regularization recipe for a ResNet on CIFAR-10

A typical ResNet-18 trained on CIFAR-10 uses all of the following simultaneously:

  1. AdamW with weight decay $\lambda = 0.01$ (L2 on all non-bias parameters)
  2. Random horizontal flips and random crops with 4-pixel padding (data augmentation)
  3. Batch normalization after every convolutional layer (implicit regularization from noisy batch statistics)
  4. Early stopping with patience 20 epochs monitoring validation loss
  5. Label smoothing with $\epsilon = 0.1$

No dropout is used because batch normalization already provides noise injection. Each technique targets a different source of overfitting, and removing any one typically degrades test accuracy by 0.5 to 2 percentage points.

How to Choose

Start with this recipe and adjust based on validation performance:

  1. Always use weight decay ($\lambda = 10^{-4}$ for Adam/AdamW, $\lambda = 10^{-3}$ to $10^{-2}$ for SGD)
  2. Add dropout ($p = 0.1$ to $0.3$) in fully connected layers if overfitting persists
  3. Use early stopping with patience of 5-10 epochs as a safety net
  4. Add data augmentation appropriate to the domain
  5. If still overfitting: get more data, reduce model size, or increase $\lambda$

Common Confusions

Watch Out

Weight decay and L2 regularization are not identical for Adam

For SGD, weight decay ($\theta \leftarrow (1-\eta\lambda)\theta - \eta g$) and L2 regularization ($\theta \leftarrow \theta - \eta(g + \lambda\theta)$) produce the same update. For Adam, they differ because Adam scales the gradient by the second moment, but the weight decay term should not be scaled. AdamW implements the correct decoupled weight decay. Using Adam with L2 regularization (the weight_decay parameter in PyTorch's Adam optimizer) gives suboptimal results.

Watch Out

More regularization is not always better

Too much regularization underfits. A model with $\lambda = 10$ will have all weights near zero and make nearly constant predictions. Regularization strength must be tuned, typically via validation performance.

Exercises

ExerciseCore

Problem

You train a neural network with no regularization. Training accuracy is 99%, validation accuracy is 72%. Propose three regularization techniques to try and the order in which you would try them.

ExerciseAdvanced

Problem

Explain why early stopping after $T$ gradient steps with learning rate $\eta$ is approximately equivalent to L2 regularization with $\lambda = 1/(\eta T)$ for linear regression trained by gradient descent. What assumption breaks this equivalence for neural networks?

ExerciseCore

Problem

Draw the constraint sets $\|\theta\|_1 \leq 1$ and $\|\theta\|_2 \leq 1$ in 2D. Explain geometrically why an elliptical loss contour is more likely to touch the L1 ball at a corner (axis) than at a generic point.

ExerciseAdvanced

Problem

A transformer model is trained with Adam (not AdamW) and weight_decay=0.01. Explain why this does not implement proper weight decay. What is the effective regularization on a parameter whose second-moment estimate $v_t$ is large?


References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 3.1 (L2 regularization), Chapter 5.5 (regularization in neural networks)
  • Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (2014), JMLR, Sections 1-7
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 3.4 (L1/Lasso), Chapter 3.4.2 (elastic net)

Current:

  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019), ICLR (AdamW, decoupled weight decay)
  • Gal & Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" (2016), ICML
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 7 (regularization for deep learning)
  • Balestriero et al., "A Cookbook of Self-Supervised Learning" (2023), for data augmentation as regularization

Next Topics

  • Batch normalization: normalization as implicit regularization and training stabilizer
  • Data augmentation theory: formal analysis of augmentation as regularization

Last reviewed: April 2026
