Training Techniques
Regularization in Practice
Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.
Prerequisites
Why This Matters
A model that memorizes the training set is useless. Regularization is the toolkit for preventing this. Every production ML model uses at least one form of regularization, and most use several simultaneously. The theoretical foundations are covered in the regularization theory topic. This page focuses on practical application: what each technique does, when to use it, and how to combine them.
Mental Model
All regularization techniques share one goal: reduce the gap between training performance and test performance (the generalization gap, governed by the bias-variance tradeoff). They do this by constraining the effective complexity of the model, either explicitly (penalty terms, architectural constraints) or implicitly (noise injection, early termination).
L2 Regularization (Weight Decay)
L2 Regularization
Add a penalty proportional to the squared L2 norm of the parameters to the loss:

$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2,$$

where $\lambda$ controls the penalty strength.
L2 Regularization Shrinks Weights
Statement
With L2 regularization, the gradient descent update becomes:

$$\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta\nabla L(\theta_t).$$

Each step multiplies the weights by $(1 - \eta\lambda)$ before applying the gradient update. Weights that are not reinforced by the gradient shrink exponentially toward zero.
Intuition
L2 regularization applies a constant friction to all weights. Large weights are penalized more (the penalty is quadratic), so the optimizer prefers solutions with many small weights over solutions with a few large weights. This prevents the model from relying too heavily on any single feature.
Proof Sketch
Differentiate $\frac{\lambda}{2}\|\theta\|_2^2$ with respect to $\theta$: the gradient is $\lambda\theta$. Substitute into the gradient descent update rule $\theta_{t+1} = \theta_t - \eta(\nabla L(\theta_t) + \lambda\theta_t)$ and factor out $\theta_t$.
Why It Matters
L2 regularization is the most widely used explicit regularizer. In PyTorch, the weight_decay parameter in optimizers implements exactly this. A typical value is $10^{-5}$ to $10^{-2}$.
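To make the update rule concrete, here is a minimal numpy sketch (with assumed toy values for $\eta$ and $\lambda$) showing that, for plain gradient descent, adding the L2 penalty to the loss and applying multiplicative weight decay produce the same update:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * ||w - target||^2, so grad L(w) = w - target.
rng = np.random.default_rng(0)
w0 = rng.normal(size=5)
target = rng.normal(size=5)
eta, lam = 0.1, 0.01  # learning rate and penalty strength (assumed values)

# Variant A: explicit L2 penalty in the loss; its gradient adds lam * w.
w_l2 = w0 - eta * ((w0 - target) + lam * w0)

# Variant B: weight decay; shrink by (1 - eta * lam), then step on the raw gradient.
w_wd = (1 - eta * lam) * w0 - eta * (w0 - target)

# For SGD the two variants coincide exactly (this breaks for Adam; see below).
```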
Failure Mode
L2 regularization shrinks all weights toward zero but never sets them exactly to zero. If you need sparse models (feature selection), L2 is the wrong choice. Use L1 instead. Also, for Adam, naive L2 regularization and weight decay are not equivalent. Use AdamW (decoupled weight decay) for correct behavior.
L1 Regularization (Sparsity)
L1 Regularization
Add a penalty proportional to the L1 norm of the parameters:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda\|\theta\|_1.$$
L1 regularization drives small weights to exactly zero, producing sparse models. This is useful when you believe many features are irrelevant and want the model to automatically select a subset.
Why L1 Gives Sparsity: The Geometry
The geometric explanation is clearer than the algebraic one. Consider minimizing $L(\theta)$ subject to the constraint $\|\theta\|_1 \le t$ (equivalent to the penalized form for some $\lambda$). The L1 ball is a diamond (a rhombus in 2D, a cross-polytope in higher dimensions) with corners on the coordinate axes. The loss function has elliptical contours (for quadratic losses). The constrained optimum occurs where the loss contour first touches the constraint set. Because the L1 ball has sharp corners on the axes, the contact point typically lands on a corner, which means one or more coordinates are exactly zero.
Contrast with L2: the L2 ball is a sphere. Elliptical contours meet a sphere at a point that is generically not on any axis. L2 shrinks all weights but sets none to zero.
The subgradient argument makes this precise: the subgradient of $|\theta_i|$ at $\theta_i = 0$ is any value in $[-1, 1]$. The optimality condition is satisfied at $\theta_i = 0$ whenever $|\nabla_i L(\theta)| \le \lambda$. For L2, the gradient of $\frac{\lambda}{2}\theta_i^2$ at zero is $0$, which provides no thresholding effect.
Typical values: $\lambda = 10^{-5}$ to $10^{-3}$ for neural networks. L1 is more commonly used in linear models (Lasso) than in deep learning.
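The thresholding behavior described above can be seen directly in the proximal operator of the L1 penalty (soft-thresholding), sketched here in numpy with an assumed $\lambda = 0.1$:

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_1: shrinks each entry toward zero
    and clamps entries with |v_i| <= lam to exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

w = np.array([0.8, -0.05, 0.03, -1.2, 0.0])
w_sparse = soft_threshold(w, lam=0.1)
# Entries with magnitude <= 0.1 become exactly zero; the rest shrink by 0.1.
# w_sparse is [0.7, 0.0, 0.0, -1.1, 0.0] -- exact zeros, unlike L2 shrinkage.
```

This is the update used by proximal gradient methods for the Lasso; plain subgradient descent on the L1 penalty would shrink weights but rarely land exactly on zero.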
Elastic Net
Elastic Net Regularization
The elastic net combines L1 and L2 penalties:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda_1\|\theta\|_1 + \frac{\lambda_2}{2}\|\theta\|_2^2,$$

where $\lambda_1$ controls sparsity and $\lambda_2$ controls weight magnitude.
Elastic net solves a specific failure mode of pure L1: when features are correlated, L1 tends to select one and zero out the rest. The L2 term groups correlated features together. In linear models, the elastic net penalty is $\lambda\left(\alpha\|\theta\|_1 + \frac{1-\alpha}{2}\|\theta\|_2^2\right)$, where $\alpha \in [0, 1]$ interpolates between ridge ($\alpha = 0$) and lasso ($\alpha = 1$).
For deep learning, elastic net is less common because the model already has enough capacity to distribute weight across correlated features. It is standard in linear models and frequently used in genomics and high-dimensional statistics.
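The $\alpha$-parameterization above can be written out directly; a minimal numpy sketch (with assumed values for $\lambda$ and $\theta$) showing the two endpoints:

```python
import numpy as np

def elastic_net_penalty(theta, lam, alpha):
    """Elastic net penalty lam * (alpha * ||theta||_1 + (1 - alpha)/2 * ||theta||_2^2).
    alpha = 1 recovers the lasso penalty; alpha = 0 recovers ridge."""
    l1 = np.sum(np.abs(theta))
    l2 = np.sum(theta ** 2)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)

theta = np.array([1.0, -2.0, 0.5])
ridge = elastic_net_penalty(theta, lam=0.1, alpha=0.0)  # 0.1 * 0.5 * 5.25 = 0.2625
lasso = elastic_net_penalty(theta, lam=0.1, alpha=1.0)  # 0.1 * 3.5 = 0.35
```

This matches the convention used by scikit-learn's ElasticNet (its l1_ratio plays the role of $\alpha$).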
Dropout
Dropout
During training, randomly set each neuron's output to zero with probability $p$ (typically 0.1 to 0.5). During inference, use all neurons but scale outputs by $1 - p$ (or equivalently, scale training outputs by $1/(1-p)$ using inverted dropout).
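The inverted-dropout variant can be sketched in a few lines of numpy; the $1/(1-p)$ rescaling keeps the expected activation unchanged, so inference needs no correction:

```python
import numpy as np

def inverted_dropout(x, p, rng, training=True):
    """Zero each activation with probability p during training, rescaling the
    survivors by 1/(1-p); at inference return x unchanged."""
    if not training or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
y = inverted_dropout(x, p=0.5, rng=rng)
# E[y] = E[x]: the mean of y is ~1.0 despite half the entries being zeroed.
```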
Dropout as Approximate Model Averaging
Statement
A neural network with $n$ dropout-eligible neurons implicitly defines $2^n$ sub-networks (one per possible binary mask). Training with dropout approximately trains all sub-networks with shared weights. At inference, the scaled full network approximates the geometric mean of all sub-network predictions:

$$\log p(y \mid x) \approx \frac{1}{2^n} \sum_{m \in \{0,1\}^n} \log p(y \mid x, m),$$

where the sum is over all $2^n$ masks $m$.
Intuition
Dropout forces the network to learn redundant representations. No neuron can rely on any other specific neuron being present, so each neuron must be useful on its own. This prevents co-adaptation (neurons specializing only in combination with specific other neurons).
Proof Sketch
Each training step samples a mask and trains the corresponding sub-network. The weight sharing means updates to one sub-network affect all sub-networks that share those weights. At inference, the weight-scaled full network gives the same expected output as the mean over sub-networks for linear models, and approximates it for non-linear models.
Why It Matters
Dropout is the most common regularizer for neural networks after weight decay. It adds negligible computational cost during training (just masking operations) and is trivially simple to implement.
Failure Mode
Dropout is less effective (and sometimes harmful) with batch normalization. The interaction between dropout noise and batch statistics can destabilize training. In practice, many modern architectures (ResNets, Transformers) use weight decay and batch/layer normalization instead of dropout, except in fully connected layers.
Early Stopping
Early Stopping
Monitor validation loss during training. Stop training when validation loss has not improved for a specified number of epochs (the patience). Return the model parameters from the epoch with the lowest validation loss.
Early stopping works because training loss decreases monotonically (with enough data and small enough learning rate), but validation loss eventually increases as the model overfits. The point of divergence is the optimal stopping time.
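The patience-based procedure described above reduces to a short loop; here is a minimal framework-agnostic sketch, where `train_step` and `val_loss_fn` are hypothetical callables supplied by the training code:

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=100, patience=10):
    """Stop when validation loss has not improved for `patience` epochs;
    return the parameters and loss from the best epoch seen."""
    best_loss, best_params, epochs_since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        params = train_step()        # one epoch of training; returns current parameters
        loss = val_loss_fn(params)   # validation loss after this epoch
        if loss < best_loss:
            best_loss, best_params, epochs_since_best = loss, params, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                # patience exhausted: stop training
    return best_params, best_loss
```

Note that the best parameters are returned, not the final ones: the model from the last epoch before stopping has already started to overfit.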
Early Stopping as Implicit Regularization
Early stopping is equivalent to L2 regularization in certain settings.
Early Stopping and L2 Regularization Equivalence
Statement
For linear regression with gradient descent starting from $\theta_0 = 0$ and learning rate $\eta$, the iterate $\theta_t$ after $t$ steps matches the L2-regularized solution with penalty $\lambda \approx 1/(\eta t)$ in the following sense: both suppress the $i$-th eigendirection of $X^\top X$ by a factor that depends on the eigenvalue $s_i$. Early stopping gives suppression $1 - (1 - \eta s_i)^t$, while L2 gives $s_i/(s_i + \lambda)$.
Intuition
Gradient descent with limited steps cannot fully fit directions with small eigenvalues (slow-learning directions). L2 regularization penalizes large weights, which also suppresses directions with small eigenvalues. Both methods effectively truncate the low-variance components of the model. The implicit regularization strength is inversely proportional to the number of training steps.
Proof Sketch
In the eigenbasis of $X^\top X$, the $i$-th component of $\theta_t$ evolves as $\theta_t^{(i)} = \left(1 - (1 - \eta s_i)^t\right)\theta_{\text{OLS}}^{(i)}$, where $\theta_{\text{OLS}}$ is the OLS solution. The L2 solution gives $\theta_\lambda^{(i)} = \frac{s_i}{s_i + \lambda}\,\theta_{\text{OLS}}^{(i)}$. For $\eta s_i$ small, $(1 - \eta s_i)^t \approx e^{-\eta s_i t}$, and matching the suppression factors gives $\lambda \approx 1/(\eta t)$.
Why It Matters
This equivalence explains why early stopping is effective even without an explicit penalty term. It also explains why the number of training steps is a regularization hyperparameter: fewer steps means stronger regularization.
Failure Mode
The equivalence is exact only for quadratic losses with linear models. For neural networks, the loss landscape is non-quadratic, the Hessian changes during training, and the implicit bias of SGD interacts with the stopping time in ways that do not reduce to L2 regularization.
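The two suppression factors can be compared numerically; a small numpy sketch with assumed values for $\eta$, $t$, and the eigenvalues, illustrating that the match is approximate rather than exact:

```python
import numpy as np

eta, t = 0.01, 500
lam = 1.0 / (eta * t)                  # matched penalty: lambda ~ 1 / (eta * t)
s = np.array([0.01, 0.1, 1.0, 10.0])   # assumed eigenvalues of X^T X

early = 1.0 - (1.0 - eta * s) ** t     # early-stopping suppression per eigendirection
ridge = s / (s + lam)                  # L2 suppression per eigendirection

# Both factors lie in [0, 1], increase with the eigenvalue, and agree only
# roughly (within ~0.2 here): the equivalence is approximate, not exact.
```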
Data Augmentation as Implicit Regularization
Data augmentation (random crops, flips, rotations for images; synonym replacement for text) expands the effective training set. This is implicit regularization because it constrains the model to be invariant to the augmentation transformations.
A model trained on randomly cropped images cannot memorize pixel-level details of training images, because those details change every epoch. The model is forced to learn features that are robust to small spatial perturbations.
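The flip-and-crop recipe mentioned above fits in a few lines of numpy; a sketch for single-channel images (the 4-pixel padding is the common CIFAR convention):

```python
import numpy as np

def augment(img, rng, pad=4):
    """Random horizontal flip, then a random crop back to the original size
    after reflection padding -- a standard image augmentation recipe."""
    h, w = img.shape
    if rng.random() < 0.5:
        img = img[:, ::-1]                         # horizontal flip
    padded = np.pad(img, ((pad, pad), (pad, pad)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)             # random crop offsets
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
img = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
out = augment(img, rng)
# Same shape as the input, but pixel positions shift every call,
# so per-pixel memorization of the training image is impossible.
```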
Batch Normalization as Implicit Regularization
Batch normalization normalizes activations using batch statistics (mean and variance computed over the mini-batch). The mini-batch statistics are noisy estimates of the population statistics, and this noise acts as a regularizer. Larger batch sizes reduce this noise, which is one reason large-batch training sometimes generalizes worse.
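The batch-size dependence of this noise follows the usual $1/\sqrt{B}$ scaling of a sample mean; a small numpy sketch (with an assumed synthetic population) makes it concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=2.0, scale=1.0, size=1_000_000)  # assumed activations

def batch_mean_noise(batch_size, n_batches=2000):
    """Spread of per-mini-batch mean estimates around the population mean."""
    batches = rng.choice(population, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

noise_small = batch_mean_noise(8)
noise_large = batch_mean_noise(512)
# noise_small / noise_large is roughly sqrt(512 / 8) = 8: a 64x larger batch
# gives 8x less statistics noise, hence weaker implicit regularization.
```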
Weight Decay vs L2 Regularization in Adaptive Optimizers
For SGD, weight decay and L2 regularization produce identical updates:

$$\theta_{t+1} = \theta_t - \eta\left(\nabla L(\theta_t) + \lambda\theta_t\right) = (1 - \eta\lambda)\,\theta_t - \eta\nabla L(\theta_t).$$

For Adam and other adaptive methods, they diverge. Adam divides the gradient by $\sqrt{\hat{v}_t}$, the running estimate of its second moment. With L2 regularization, the penalty gradient $\lambda\theta_t$ is also divided by $\sqrt{\hat{v}_t}$, which reduces the effective regularization on parameters with large gradients. With decoupled weight decay (AdamW), the shrinkage is applied before the adaptive step, so all parameters are regularized equally regardless of gradient magnitude.
The practical difference: Adam with L2 regularization under-regularizes parameters in frequently updated directions and over-regularizes parameters in rarely updated directions. AdamW avoids this asymmetry. For transformer training, AdamW with weight decay 0.01 to 0.1 is the standard choice.
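The asymmetry shows up even in a single simplified update step; a numpy sketch (one parameter, momentum and bias correction omitted, values assumed) comparing the two decay paths:

```python
import numpy as np

def adam_l2_step(w, grad, v, eta=1e-3, lam=0.01, eps=1e-8):
    """Simplified Adam step (no momentum/bias correction) with L2 folded into
    the gradient: the penalty lam*w also gets divided by sqrt(v)."""
    g = grad + lam * w
    return w - eta * g / (np.sqrt(v) + eps)

def adamw_step(w, grad, v, eta=1e-3, lam=0.01, eps=1e-8):
    """Simplified AdamW step: decay applied directly, independent of sqrt(v)."""
    return w - eta * grad / (np.sqrt(v) + eps) - eta * lam * w

w, grad = 1.0, 0.0  # zero loss gradient isolates the regularization effect
shrink_l2_small_v = 1.0 - adam_l2_step(w, grad, v=1e-6)   # ~0.01: amplified
shrink_l2_large_v = 1.0 - adam_l2_step(w, grad, v=100.0)  # ~1e-6: suppressed
shrink_adamw = 1.0 - adamw_step(w, grad, v=100.0)         # 1e-5, regardless of v
```

With Adam + L2 the shrinkage per step varies by four orders of magnitude depending on the second-moment estimate; with AdamW it is constant.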
Dropout as Approximate Bayesian Inference
Dropout is not just ensembling
The ensemble interpretation (averaging sub-networks) is the standard explanation. A deeper connection: Gal and Ghahramani (2016) showed that a network trained with dropout is approximately performing variational inference over the weights. Running dropout at test time and averaging predictions approximates the posterior predictive distribution. This gives calibrated uncertainty estimates: the variance of predictions across dropout masks estimates the model's epistemic uncertainty. This technique is called MC Dropout.
The approximation holds when dropout is applied before every weight layer (not just FC layers) and the variational distribution is a mixture of two point masses (zero and the learned weight). The quality of the approximation degrades for very deep networks and small dropout rates.
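MC Dropout amounts to keeping dropout active at test time and aggregating many stochastic forward passes; a toy numpy sketch with hypothetical weights W1, W2 for a two-layer network:

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p, n_samples, rng):
    """MC Dropout: run n_samples stochastic forward passes with dropout on.
    The mean approximates the posterior predictive; the std across masks
    estimates epistemic uncertainty. (Toy network, hypothetical weights.)"""
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)                    # ReLU hidden layer
        mask = (rng.random(h.shape) >= p) / (1.0 - p)  # inverted dropout mask
        preds.append((h * mask) @ W2)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 32)) / 2.0
W2 = rng.normal(size=(32, 1)) / 32.0
mean, std = mc_dropout_predict(np.ones((1, 4)), W1, W2, p=0.2, n_samples=200, rng=rng)
# std > 0: the spread across dropout masks is the uncertainty estimate.
```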
Practical Guidelines for Regularization Strength
Choosing $\lambda$ (or the dropout rate $p$, or the patience for early stopping) is a model selection problem. There is no closed-form answer. The standard approach:
- Grid search or random search over $\lambda$ on a log scale: $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.
- Monitor the train-val gap: if the gap is large, increase regularization. If both train and val loss are high, decrease regularization (you are underfitting).
- Scale with model size: larger models typically need stronger regularization. For transformers, weight decay of 0.01 to 0.1 is common; for small CNNs, $10^{-4}$ to $10^{-3}$.
- Combine orthogonal regularizers: weight decay constrains weight magnitude, dropout prevents co-adaptation, data augmentation enforces invariances. These address different failure modes and compose well.
- Do not regularize biases: the standard practice is to exclude bias terms from weight decay. Biases have one parameter per neuron and contribute negligibly to overfitting. Regularizing them can harm optimization by restricting the network's ability to shift activation distributions.
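The bias-exclusion convention in the last bullet is usually implemented by splitting parameters into decayed and non-decayed groups; a minimal pure-Python sketch with a hypothetical name-based rule (normalization parameters are conventionally excluded too):

```python
import numpy as np

def split_decay_groups(named_params):
    """Partition parameter names into a decayed group and a non-decayed group,
    excluding biases and normalization parameters from weight decay."""
    decay, no_decay = [], []
    for name in named_params:
        if name.endswith("bias") or "norm" in name:
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay

params = {
    "layer1.weight": np.zeros((8, 4)),
    "layer1.bias": np.zeros(8),
    "norm1.weight": np.zeros(8),
    "layer2.weight": np.zeros((2, 8)),
}
decay, no_decay = split_decay_groups(params)
# decay   -> ["layer1.weight", "layer2.weight"]
# no_decay -> ["layer1.bias", "norm1.weight"]
```

In PyTorch, the analogous pattern is passing two optimizer parameter groups, one with the chosen weight_decay and one with weight_decay=0.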
Comparison Table of Regularization Methods
| Method | Mechanism | Sparsity? | Typical strength | Best for | Watch out |
|---|---|---|---|---|---|
| L2 (weight decay) | Quadratic penalty on weight magnitude | No | $10^{-5}$ to $10^{-2}$ | Default for all models | Use AdamW, not Adam with L2 |
| L1 | Absolute value penalty | Yes | $10^{-5}$ to $10^{-3}$ | Feature selection, sparse linear models | Unstable with correlated features |
| Elastic net | L1 + L2 combined | Partial | Problem-dependent | Correlated features, genomics | Two hyperparameters to tune |
| Dropout | Random neuron masking during training | No | $p = 0.1$ to $0.5$ | FC layers in neural networks | Interacts badly with batch norm |
| Early stopping | Stop training before convergence | No | Patience 5-20 epochs | Universal safety net | Implicitly L2-like for linear models |
| Data augmentation | Expand training set with transforms | No | Domain-specific | Vision, NLP | Must preserve label semantics |
| Batch normalization | Normalize activations using mini-batch stats | No | Always on | Deep CNNs, transformers | Not strictly regularization; noisy batch stats provide implicit regularization |
| Label smoothing | Soften one-hot targets to $1 - \epsilon$ | No | $\epsilon = 0.1$ | Classification with many classes | Prevents overconfident predictions |
Each method addresses a different failure mode. L2 prevents large weights. Dropout prevents co-adaptation. Data augmentation prevents memorization of surface-level patterns. Combining orthogonal methods is more effective than increasing the strength of a single method.
Regularization recipe for a ResNet on CIFAR-10
A typical ResNet-18 trained on CIFAR-10 uses all of the following simultaneously:
- AdamW with weight decay, for example $10^{-2}$ (decoupled L2 on all non-bias parameters)
- Random horizontal flips and random crops with 4-pixel padding (data augmentation)
- Batch normalization after every convolutional layer (implicit regularization from noisy batch statistics)
- Early stopping with patience 20 epochs monitoring validation loss
- Label smoothing with $\epsilon = 0.1$
No dropout is used because batch normalization already provides noise injection. Each technique targets a different source of overfitting, and removing any one typically degrades test accuracy by 0.5 to 2 percentage points.
How to Choose
Start with this recipe and adjust based on validation performance:
- Always use weight decay ($10^{-2}$ for Adam/AdamW, $10^{-4}$ to $10^{-3}$ for SGD)
- Add dropout ($p = 0.1$ to $0.5$) in fully connected layers if overfitting persists
- Use early stopping with patience of 5-10 epochs as a safety net
- Add data augmentation appropriate to the domain
- If still overfitting: get more data, reduce model size, or increase $\lambda$
Common Confusions
Weight decay and L2 regularization are not identical for Adam
For SGD, weight decay ($\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta\nabla L(\theta_t)$) and L2 regularization (adding $\frac{\lambda}{2}\|\theta\|_2^2$ to the loss) produce the same update. For Adam, they differ because Adam scales the gradient by the second moment, but weight decay should not be scaled. AdamW implements the correct decoupled weight decay. Using Adam with L2 regularization (the weight_decay parameter in PyTorch's Adam optimizer) gives suboptimal results.
More regularization is not always better
Too much regularization underfits. A model with $\lambda \to \infty$ will have all weights near zero and make nearly constant predictions. Regularization strength must be tuned, typically via validation performance.
Exercises
Problem
You train a neural network with no regularization. Training accuracy is 99%, validation accuracy is 72%. Propose three regularization techniques to try and the order in which you would try them.
Problem
Explain why early stopping after $t$ steps with learning rate $\eta$ is approximately equivalent to L2 regularization with $\lambda \approx 1/(\eta t)$ for linear regression with gradient descent. What assumption breaks this equivalence for neural networks?
Problem
Draw the constraint sets $\|\theta\|_1 \le 1$ and $\|\theta\|_2 \le 1$ in 2D. Explain geometrically why an elliptical loss contour is more likely to touch the L1 ball at a corner (on an axis) than at a generic point.
Problem
A transformer model is trained with Adam (not AdamW) and weight_decay=0.01. Explain why this does not implement proper weight decay. What is the effective regularization on a parameter whose second moment estimate is large?
Related Comparisons
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 3.1 (L2 regularization), Chapter 5.5 (regularization in neural networks)
- Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (2014), JMLR, Sections 1-7
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 3.4 (L1/Lasso), Chapter 3.4.2 (elastic net)
Current:
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019), ICLR (AdamW, decoupled weight decay)
- Gal & Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" (2016), ICML
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 7 (regularization for deep learning)
- Balestriero et al., "A Cookbook of Self-Supervised Learning" (2023), for data augmentation as regularization
Next Topics
- Batch normalization: normalization as implicit regularization and training stabilizer
- Data augmentation theory: formal analysis of augmentation as regularization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Regularization Theory (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Common Probability Distributions (Layer 0A)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)