Training Techniques
Dropout
Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.
Why This Matters
Dropout (Srivastava et al. 2014) was one of the defining regularization techniques of the pre-transformer deep-learning era. Randomly zeroing out activations during training reduces overfitting substantially on vision and fully-connected architectures. In modern LLM pretraining it is used less frequently: weight decay, layer normalization, large-scale data, and architectural inductive biases carry most of the regularization load, and dropout in attention layers can interfere with learned patterns. It remains standard for smaller models, MLPs, and fine-tuning, and the theoretical ideas it introduced (implicit ensembles, noise injection, Bayesian interpretation via MC dropout) continue to appear throughout the field.
Mental Model
During each training step, every hidden unit is independently "dropped" (set to zero) with probability $1-p$, where $p$ is the keep probability. This means each training step uses a different random sub-network. At test time, you use the full network but scale the activations to match the expected values during training. The result is approximately equivalent to averaging the predictions of exponentially many sub-networks.
The Dropout Procedure
Dropout (Training)
During training, for a layer with activation vector $h \in \mathbb{R}^n$:
- Sample a binary mask $m \in \{0,1\}^n$ where each $m_i \sim \mathrm{Bernoulli}(p)$ independently
- Compute the masked activation $\tilde{h} = m \odot h$
where $\odot$ denotes elementwise multiplication. The keep probability $p$ is typically 0.5 for hidden layers and 0.8 for input layers.
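The masking step can be sketched in a few lines of NumPy (a minimal illustration, not from the source; $p = 0.5$ matches the hidden-layer default mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                           # keep probability for a hidden layer
h = rng.standard_normal(10_000)   # a batch of pre-dropout activations

# Sample the binary mask: each unit is kept independently with probability p
m = rng.random(h.shape) < p

h_dropped = np.where(m, h, 0.0)   # dropped units become exactly zero

keep_frac = m.mean()              # empirical keep rate, should be close to p
```

With 10,000 units, the fraction of kept units concentrates tightly around $p$.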
Inverted Dropout
Inverted dropout scales the surviving activations by $1/p$ during training:
$$\tilde{h} = \frac{1}{p}\,(m \odot h)$$
This ensures $\mathbb{E}[\tilde{h}] = h$, so at test time you use the network unchanged (no scaling needed). This is the standard implementation in all modern frameworks. The alternative, scaling the weights by $p$ at test time, is mathematically equivalent but less convenient.
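A minimal NumPy sketch of inverted dropout, empirically checking that the mask-averaged output recovers the original activations, i.e. $\mathbb{E}[\tilde{h}] = h$ (the value $p = 0.8$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.8                        # keep probability (illustrative)
h = rng.standard_normal(500)   # one layer's activations

def inverted_dropout(h, p, rng):
    """Mask with Bernoulli(p) and scale survivors by 1/p."""
    m = (rng.random(h.shape) < p).astype(h.dtype)
    return m * h / p

# Averaging over many random masks recovers h, since E[m_i / p] = 1
samples = np.stack([inverted_dropout(h, p, rng) for _ in range(20_000)])
max_err = np.abs(samples.mean(axis=0) - h).max()
```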
Why Dropout Works
1. Implicit Ensemble Interpretation
Dropout as Ensemble Averaging
Statement
A network with $n$ hidden units and dropout creates an implicit ensemble of $2^n$ sub-networks (one for each binary mask pattern). Each sub-network shares weights with the full network. At test time, the (scaled) full network computes a geometric average of the predictions of all sub-networks.
For a single-layer network with softmax output, the test-time prediction is exactly the normalized geometric mean of the sub-network predictions:
$$p_{\text{test}}(y \mid x) \;\propto\; \prod_{m} p_m(y \mid x)^{P(m)}$$
where $P(m)$ is the probability of mask $m$ under the Bernoulli sampling.
Intuition
Each training step optimizes a random sub-network. The sub-networks share weights, so they are correlated but not identical. At test time, using the full network with scaled weights approximately averages their predictions. Ensembles reduce variance, so this averaging reduces overfitting.
Proof Sketch
Each mask $m$ defines a sub-network with output $f_m(x)$. The test-time network uses the mean mask $\mathbb{E}[m] = p\mathbf{1}$, giving $f_{\text{test}}(x) = f_{\mathbb{E}[m]}(x)$. For linear activations, $f_{\mathbb{E}[m]}(x) = \mathbb{E}_m[f_m(x)]$ exactly. For nonlinear activations, this is an approximation (the "weight scaling inference rule"). Baldi and Sadowski (2013) showed this is exact for the geometric mean in the softmax case.
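For a purely linear layer the weight scaling rule is exact, which can be verified by enumerating all $2^n$ masks of a small layer. This is an illustrative NumPy check (layer sizes are arbitrary), not an implementation from the source:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n, p = 6, 0.5                     # small layer so all 2^n masks can be enumerated
W = rng.standard_normal((3, n))   # linear output layer
h = rng.standard_normal(n)        # hidden activations subject to dropout

outputs, probs = [], []
for bits in product([0, 1], repeat=n):
    m = np.array(bits, dtype=float)
    probs.append(np.prod(np.where(m == 1.0, p, 1.0 - p)))  # P(mask) under Bernoulli(p)
    outputs.append(W @ (m * h))                            # this sub-network's output

ensemble_avg = np.average(outputs, axis=0, weights=probs)  # expectation over sub-networks
weight_scaled = W @ (p * h)                                # full network, activations scaled by p
gap = np.abs(ensemble_avg - weight_scaled).max()
```

For a linear map the two quantities agree to floating-point precision; with nonlinearities between layers they would not.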
Why It Matters
This explains why dropout prevents co-adaptation of features. No hidden unit can rely on any other specific unit being present, since any unit might be dropped. This forces each unit to learn independently useful features, leading to more robust representations.
Failure Mode
The ensemble interpretation is approximate for multi-layer networks with nonlinear activations. The "weight scaling inference rule" (using weights $pW$ at test time) is exact only for single hidden layers. For deep networks, it is an approximation whose quality degrades with depth.
2. Noise Injection as Regularization
Dropout injects multiplicative Bernoulli noise into activations. For a hidden unit $h_i$, the noisy version is $\tilde{h}_i = \frac{m_i}{p}\,h_i$ where $m_i \sim \mathrm{Bernoulli}(p)$.
The variance of this noise is:
$$\mathrm{Var}[\tilde{h}_i] = \frac{1-p}{p}\,h_i^2$$
This multiplicative noise has a regularization effect analogous to adding a data-dependent penalty to the loss. Units with large activations receive proportionally larger noise, penalizing large activation magnitudes.
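The variance formula can be checked with a small Monte Carlo sketch (the activation value $h = 2$ and $p = 0.5$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.5
h = 2.0                                    # a single activation value
m = (rng.random(1_000_000) < p).astype(float)
noisy = (m / p) * h                        # inverted-dropout noise on one unit

empirical_var = noisy.var()
predicted_var = h**2 * (1 - p) / p         # Var[(m/p) h] = h^2 (1-p)/p
```

Note the quadratic dependence on $h$: doubling the activation quadruples the injected noise variance, which is the sense in which large activations are penalized.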
3. Bayesian Connection: MC Dropout
Gal and Ghahramani (2016) showed that a network trained with dropout can be interpreted as an approximate Bayesian neural network. At test time, instead of using the full network, you run multiple forward passes with dropout active and average the predictions:
$$\hat{y}(x) \approx \frac{1}{T} \sum_{t=1}^{T} f(x; m_t), \qquad m_t \sim \mathrm{Bernoulli}(p)$$
This is Monte Carlo dropout. The variance across forward passes gives an estimate of model uncertainty. This is one of the simplest methods for uncertainty quantification in deep learning.
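A toy sketch of MC dropout on a hypothetical two-layer ReLU network (the weights, sizes, and $T$ are made up for illustration; a real use would load a trained model):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical trained weights for a tiny 4 -> 32 -> 1 MLP
W1 = rng.standard_normal((32, 4)) * 0.5
W2 = rng.standard_normal((1, 32)) * 0.5
p = 0.8                                    # keep probability

def forward(x, rng, p):
    h = np.maximum(W1 @ x, 0.0)            # ReLU hidden layer
    m = (rng.random(h.shape) < p) / p      # dropout stays ON at prediction time
    return (W2 @ (m * h))[0]

x = rng.standard_normal(4)
T = 2000                                   # number of stochastic forward passes
preds = np.array([forward(x, rng, p) for _ in range(T)])

mc_mean = preds.mean()                     # MC-dropout point prediction
mc_std = preds.std()                       # spread across passes ~ model uncertainty
```

The mean plays the role of the prediction; the standard deviation across passes is the uncertainty estimate.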
The Key Equivalence
Dropout in Linear Models is L2 Regularization
Statement
For a linear model with squared loss and inverted dropout applied to the input with keep probability $p$, the expected training loss under dropout is:
$$\mathbb{E}_m\!\left[\frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^\top \tilde{x}_i\big)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 + \frac{1-p}{p}\sum_{j} w_j^2 \cdot \frac{1}{n}\sum_{i} x_{ij}^2$$
When the features have zero mean and unit variance, this simplifies to:
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 + \frac{1-p}{p}\,\lVert w \rVert_2^2$$
which is ridge regression with $\lambda = \frac{1-p}{p}$.
Intuition
Randomly zeroing out inputs (and scaling survivors by $1/p$) injects noise whose variance scales with $\frac{1-p}{p}$. In expectation, this noise acts like a penalty on large weights. For the linear case, it is exactly L2 regularization. For $p = 0.5$, the implicit $\lambda$ is 1.
Proof Sketch
With inverted dropout on input $x$, the noisy input is $\tilde{x} = \frac{1}{p}(m \odot x)$ where $m_j \sim \mathrm{Bernoulli}(p)$. The noisy prediction is $w^\top \tilde{x}$.
We have $\mathbb{E}[w^\top \tilde{x}] = w^\top x$ and $\mathrm{Var}[w^\top \tilde{x}] = \frac{1-p}{p}\sum_j w_j^2 x_j^2$. Since the mean squared error of an unbiased noisy prediction decomposes as squared error plus variance, $\mathbb{E}\big[(y - w^\top \tilde{x})^2\big] = (y - w^\top x)^2 + \frac{1-p}{p}\sum_j w_j^2 x_j^2$.
Summing over samples gives the stated result.
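The equivalence can be checked numerically by comparing a Monte Carlo estimate of the expected dropout loss against the ridge objective with $\lambda = (1-p)/p$ (synthetic data with illustrative sizes, not from the source):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, p = 200, 5, 0.5
X = rng.standard_normal((n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # zero-mean, unit-variance features
y = rng.standard_normal(n)
w = rng.standard_normal(d)                 # an arbitrary fixed weight vector

# Monte Carlo estimate of the expected mean squared loss under inverted input dropout
S = 20_000
total = 0.0
for _ in range(S):
    M = (rng.random((n, d)) < p) / p       # inverted-dropout mask on the inputs
    total += np.mean((y - (M * X) @ w) ** 2)
dropout_loss = total / S

# Ridge objective with lambda = (1 - p) / p on the clean data
lam = (1 - p) / p
ridge_loss = np.mean((y - X @ w) ** 2) + lam * np.sum(w ** 2)
rel_gap = abs(dropout_loss - ridge_loss) / ridge_loss
```

The two objectives agree for any fixed $w$, not just at the optimum, which is why they have the same minimizer.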
Why It Matters
This provides a precise characterization of dropout as a regularizer in the simplest setting. It shows that dropout strength is controlled by $p$: smaller $p$ (more dropout) means stronger regularization. For nonlinear networks, the equivalence is approximate but the intuition carries over.
Failure Mode
The exact equivalence to L2 holds only for linear models with squared loss. In deep nonlinear networks, dropout induces a more complex, data-dependent regularizer that does not reduce to simple weight decay. The regularization effect interacts with the network architecture in ways that are not fully understood theoretically.
Common Confusions
Dropout zeros activations, not weights
Dropout sets hidden unit activations to zero, not the weight parameters themselves. The weights remain; they are just multiplied by a zero activation during the forward pass. Zeroing weights would be a different (destructive) operation. The gradient update still applies to the weights of dropped units, but the gradient through a zeroed activation is zero by the chain rule, so in practice those weights receive no gradient signal for that step.
Dropout is sometimes called dilution, but the terms are not interchangeable
Some older literature (particularly in statistical physics and ensemble methods) uses "dilution" to describe randomly removing connections or units from a network. In that context, "model dilution" refers to thinning a network by removing structure. Dropout in the Srivastava et al. (2014) sense is a specific training procedure: stochastic masking during training with inverted scaling, full network at test time. The term "dilution" is broader and less precise. If someone calls dropout "model dilution," they are describing the effect (a thinned sub-network), not the full procedure (training with random masks, scaling, and the implicit ensemble property). Use "dropout" when you mean the Srivastava training procedure. Use "dilution" only when discussing the general concept of removing network components.
Dropout at test time is MC dropout, not standard dropout
Standard practice: dropout is OFF at test time (use the full network with scaled weights). MC dropout: dropout is ON at test time, and you average multiple stochastic forward passes. These are different procedures with different purposes. Standard dropout gives a point prediction; MC dropout gives a distribution over predictions for uncertainty estimation.
Summary
- Dropout: independently zero each activation with probability $1-p$
- Inverted dropout: scale surviving activations by $1/p$ during training
- Implicit ensemble: approximately averages $2^n$ sub-networks
- Prevents co-adaptation: no unit can rely on specific other units
- MC dropout: keep dropout on at test time for uncertainty estimates
- Linear model + squared loss: dropout with keep probability $p$ = L2 penalty with $\lambda = \frac{1-p}{p}$
Exercises
Problem
Show that for a linear model with squared loss and inverted dropout applied to the input with keep probability $p$, the expected loss over the dropout mask equals the ridge regression objective with $\lambda = \frac{1-p}{p}$. Assume features have zero mean and unit variance.
Problem
Why does dropout work poorly with batch normalization? What is the tension between the two techniques?
References
Canonical:
- Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR, 2014). The original paper with extensive experiments.
- Hinton et al., "Improving neural networks by preventing co-adaptation of feature detectors" (2012). The original proposal (shorter, earlier version).
Theory:
- Wager, Wang, Liang, "Dropout Training as Adaptive Regularization" (NeurIPS, 2013). Proves dropout on linear models is equivalent to adaptive L2 regularization.
- Baldi & Sadowski, "Understanding Dropout" (NeurIPS, 2013). Geometric mean of sub-networks equals the full network for linear models.
- Gal & Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" (ICML, 2016). MC dropout for uncertainty estimation.
Extensions:
- Wan et al., "Regularization of Neural Networks using DropConnect" (ICML, 2013). Drops weights instead of activations.
- Ma et al., "Dropout as a Low-Rank Regularizer for Matrix Factorization" (AISTATS, 2017). Formal connection between dropout and nuclear norm regularization.
- Ghiasi, Lin, Le, "DropBlock: A Regularization Method for Convolutional Networks" (NeurIPS, 2018). Dropping contiguous regions instead of individual neurons.
Next Topics
Dropout connects to several advanced topics:
- Bayesian neural networks: MC dropout as approximate Bayesian inference
- Regularization theory: the broader landscape of implicit and explicit regularizers
- Batch normalization: another training stabilizer that interacts with dropout in complex ways
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Common Probability Distributions (Layer 0A)