

Dropout

Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.


Why This Matters

[Interactive demo: a network with an input layer, two hidden layers, and an output layer. Clicking "Apply Dropout" randomly masks hidden neurons; input and output layers are never dropped.]

Dropout (Srivastava et al. 2014) was one of the defining regularization techniques of the pre-transformer deep-learning era. Randomly zeroing out activations during training reduces overfitting substantially on vision and fully-connected architectures. In modern LLM pretraining it is used less frequently: weight decay, layer normalization, large-scale data, and architectural inductive biases carry most of the regularization load, and dropout in attention layers can interfere with learned patterns. It remains standard for smaller models, MLPs, and fine-tuning, and the theoretical ideas it introduced (implicit ensembles, noise injection, Bayesian interpretation via MC dropout) continue to appear throughout the field.

Mental Model

During each training step, every hidden unit is independently "dropped" (set to zero) with probability $1-p$, where $p$ is the keep probability. This means each training step uses a different random sub-network. At test time, you use the full network but scale the activations to match their expected values during training. The result is approximately equivalent to averaging the predictions of exponentially many sub-networks.

The Dropout Procedure

Definition

Dropout (Training)

During training, for a layer with activation vector $h \in \mathbb{R}^d$:

  1. Sample a binary mask $r \in \{0,1\}^d$ where each $r_j \sim \text{Bernoulli}(p)$ independently
  2. Compute the masked activation $\tilde{h} = r \odot h$

where $\odot$ denotes elementwise multiplication. The keep probability $p$ is typically 0.5 for hidden layers and 0.8 for input layers.

Definition

Inverted Dropout

Inverted dropout scales the surviving activations by $1/p$ during training:

$$\tilde{h} = \frac{1}{p} \cdot r \odot h$$

This ensures $\mathbb{E}[\tilde{h}] = h$, so at test time you use the network unchanged (no scaling needed). This is the standard implementation in all modern frameworks. The alternative, scaling by $p$ at test time, is mathematically equivalent but less convenient.
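A minimal NumPy sketch of the training-time procedure defined above; the function name, layer size, and keep probability are illustrative choices, not from any particular library:

```python
import numpy as np

def inverted_dropout(h, p=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout on an activation array h with keep probability p."""
    if not training or p >= 1.0:
        return h                               # test time: full network, no scaling needed
    r = rng.binomial(1, p, size=h.shape)       # r_j ~ Bernoulli(p), the binary mask
    return r * h / p                           # scale survivors by 1/p so E[h_tilde] = h

# The masked activations are unbiased: averaging over many masks recovers h
h = np.array([1.0, 2.0, 3.0, 4.0])
samples = np.stack([inverted_dropout(h) for _ in range(10_000)])
print(samples.mean(axis=0))                    # close to [1, 2, 3, 4]
```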

Why Dropout Works

1. Implicit Ensemble Interpretation

Theorem

Dropout as Ensemble Averaging

Statement

A network with $d$ hidden units and dropout creates an implicit ensemble of $2^d$ sub-networks (one for each binary mask pattern). Each sub-network shares weights with the full network. At test time, the (scaled) full network computes a geometric average of the predictions of all $2^d$ sub-networks.

For a single-layer network with softmax output, the test-time prediction is exactly the geometric mean of the sub-network predictions:

$$p_{\text{test}}(y \mid x) \propto \exp\left(\frac{1}{2^d}\sum_{r \in \{0,1\}^d} \log p_r(y \mid x)\right)$$

Intuition

Each training step optimizes a random sub-network. The sub-networks share weights, so they are correlated but not identical. At test time, using the full network with scaled weights approximately averages their predictions. Ensembles reduce variance, so this averaging reduces overfitting.

Proof Sketch

Each mask $r$ defines a sub-network with output $f_r(x) = W_2(\text{diag}(r) \cdot \sigma(W_1 x))$. The test-time network uses $\mathbb{E}[r] = p \cdot \mathbf{1}$, giving $f_{\text{test}}(x) = W_2(p \cdot \sigma(W_1 x))$. For linear activations, $f_{\text{test}} = \mathbb{E}_r[f_r(x)]$ exactly. For nonlinear activations, this is an approximation (the "weight scaling inference rule"). Baldi and Sadowski (2013) showed this is exact for the geometric mean in the softmax case.
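A quick numerical check of the linear case, enumerating all $2^d$ masks for a tiny hidden width (the sizes and seed below are arbitrary choices for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, p = 6, 0.5                                # small width so all 2^d masks can be enumerated
W1 = rng.standard_normal((d, 3))
W2 = rng.standard_normal((2, d))
x = rng.standard_normal(3)

h = W1 @ x                                   # identity (linear) activation

# Expected sub-network output: average over all 2^d masks, weighted by their probability
expected = np.zeros(2)
for bits in itertools.product([0, 1], repeat=d):
    r = np.array(bits)
    prob = p ** r.sum() * (1 - p) ** (d - r.sum())
    expected += prob * (W2 @ (r * h))

# Weight-scaling inference rule: full network with activations scaled by p
scaled = W2 @ (p * h)

print(np.allclose(expected, scaled))         # True: exact for linear activations
```

Swapping the identity for a ReLU makes the two quantities close but no longer identical, which is the approximation the failure mode below refers to.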

Why It Matters

This explains why dropout prevents co-adaptation of features. No hidden unit can rely on any other specific unit being present, since any unit might be dropped. This forces each unit to learn independently useful features, leading to more robust representations.

Failure Mode

The ensemble interpretation is approximate for multi-layer networks with nonlinear activations. The "weight scaling inference rule" (using $p \cdot w$ at test time) is exact only in special cases, such as networks of linear units or a single softmax layer. For deep nonlinear networks, it is an approximation whose quality degrades with depth.

2. Noise Injection as Regularization

Dropout injects multiplicative Bernoulli noise into activations. For a hidden unit $h_j$, the noisy version is $\tilde{h}_j = \frac{r_j}{p} h_j$ where $r_j \sim \text{Bernoulli}(p)$.

The variance of this noise is:

$$\text{Var}\left[\frac{r_j}{p}\right] = \frac{1}{p^2}\text{Var}[r_j] = \frac{1}{p^2} \cdot p(1-p) = \frac{1-p}{p}$$

This multiplicative noise has a regularization effect analogous to adding a data-dependent penalty to the loss. Units with large activations receive proportionally larger noise, penalizing large activation magnitudes.
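A two-line empirical check of this variance; the keep probability 0.8 is just an example value:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8
noise = rng.binomial(1, p, size=1_000_000) / p   # multiplicative factor r_j / p

print(noise.mean())   # ~1.0, since E[r_j / p] = 1 (activations stay unbiased)
print(noise.var())    # ~0.25, matching (1 - p) / p = 0.2 / 0.8
```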

3. Bayesian Connection: MC Dropout

Gal and Ghahramani (2016) showed that a network trained with dropout can be interpreted as an approximate Bayesian neural network. At test time, instead of using the full network, you run multiple forward passes with dropout active and average the predictions:

$$\hat{p}(y \mid x) = \frac{1}{T}\sum_{t=1}^T p(y \mid x, r_t)$$

This is Monte Carlo dropout. The variance across forward passes gives an estimate of model uncertainty. This is one of the simplest methods for uncertainty quantification in deep learning.
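A sketch of MC dropout in PyTorch, assuming a small hypothetical classifier (the architecture and sizes are illustrative). Note that `nn.Dropout(p=...)` takes the drop probability, whereas $p$ in this article denotes the keep probability:

```python
import torch
import torch.nn as nn

# Hypothetical classifier; the architecture is illustrative only
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, T=100):
    """Average T stochastic forward passes with dropout kept active."""
    model.train()                              # keep dropout ON at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0), probs.std(dim=0)

mean, std = mc_dropout_predict(model, torch.randn(1, 20))
print(mean)   # averaged class probabilities: the MC estimate of p_hat(y | x)
print(std)    # spread across passes: a simple uncertainty signal
```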

The Key Equivalence

Theorem

Dropout in Linear Models is L2 Regularization

Statement

For a linear model $y = w^T x$ with squared loss and dropout applied to the input with keep probability $p$, the expected training loss under dropout is:

$$\mathbb{E}_r[\ell_{\text{dropout}}] = \|y - Xw\|^2 + \frac{1-p}{p}\sum_j (X^T X)_{jj}\, w_j^2$$

When the features are normalized so that $(X^T X)_{jj} = 1$ for every $j$, this simplifies to:

$$\mathbb{E}_r[\ell_{\text{dropout}}] = \|y - Xw\|^2 + \frac{1-p}{p}\|w\|^2$$

which is ridge regression with $\lambda = \frac{1-p}{p}$.

Intuition

Randomly zeroing out inputs (and scaling by $1/p$) adds noise proportional to $w_j^2$. In expectation, this noise acts like a penalty on large weights. For the linear case, it is exactly L2 regularization. For $p = 0.5$, the implicit $\lambda$ is 1.

Proof Sketch

With inverted dropout on input $x$, the noisy input is $\tilde{x}_j = \frac{r_j}{p} x_j$ where $r_j \sim \text{Bernoulli}(p)$. The noisy prediction is $w^T \tilde{x}$.

$\mathbb{E}[\tilde{x}_j] = x_j$ and $\text{Var}[\tilde{x}_j] = \frac{1-p}{p} x_j^2$.

$$\mathbb{E}[(y - w^T \tilde{x})^2] = (y - w^T x)^2 + \text{Var}[w^T \tilde{x}]$$

$$= (y - w^T x)^2 + \sum_j w_j^2 \cdot \frac{1-p}{p} x_j^2$$

Summing over samples gives the stated result.
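A Monte Carlo check of the proof sketch, comparing the simulated expected dropout loss against the closed-form penalty (data sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 4, 0.5
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Monte Carlo estimate of E_r[ || y - X_tilde w ||^2 ] under inverted input dropout
losses = []
for _ in range(20_000):
    X_tilde = rng.binomial(1, p, size=X.shape) * X / p
    losses.append(np.sum((y - X_tilde @ w) ** 2))
mc_estimate = np.mean(losses)

# Closed form: || y - X w ||^2 + (1 - p)/p * sum_j (X^T X)_jj * w_j^2
penalty = (1 - p) / p * np.sum(np.diag(X.T @ X) * w ** 2)
closed_form = np.sum((y - X @ w) ** 2) + penalty

print(mc_estimate, closed_form)   # the two agree up to Monte Carlo noise
```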

Why It Matters

This provides a precise characterization of dropout as a regularizer in the simplest setting. It shows that dropout strength is controlled by $p$: smaller $p$ (more dropout) means stronger regularization. For nonlinear networks, the equivalence is approximate but the intuition carries over.

Failure Mode

The exact equivalence to L2 holds only for linear models with squared loss. In deep nonlinear networks, dropout induces a more complex, data-dependent regularizer that does not reduce to simple weight decay. The regularization effect interacts with the network architecture in ways that are not fully understood theoretically.

Common Confusions

Watch Out

Dropout zeros activations, not weights

Dropout sets hidden unit activations to zero, not the weight parameters themselves. The weights remain intact; their contributions are simply multiplied by a zero activation during the forward pass. Zeroing weights would be a different (destructive) operation, which is what DropConnect (listed in the references) does. Because the gradient through a zeroed activation is zero by the chain rule, the weights attached to a dropped unit receive no gradient signal for that step, even though they remain in the model.
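A small PyTorch illustration of this point; the layer sizes and the example mask are arbitrary. The weights feeding a dropped unit keep their values but get a zero gradient for that step:

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.randn(4)
W1 = torch.randn(6, 4, requires_grad=True)       # weights into the hidden layer
W2 = torch.randn(2, 6, requires_grad=True)       # weights out of the hidden layer

h = torch.relu(W1 @ x)
r = torch.tensor([1., 0., 1., 0., 1., 1.])       # fixed example mask (sampled Bernoulli(p) in practice)
loss = (W2 @ (r * h / p)).sum()                  # inverted dropout, then a dummy loss
loss.backward()

dropped = r == 0
print(W1.grad[dropped])                          # all zeros: no gradient signal this step
print(W1[dropped])                               # the weights themselves are untouched
```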

Watch Out

Dropout is sometimes called dilution, but the terms are not interchangeable

Some older literature (particularly in statistical physics and ensemble methods) uses "dilution" to describe randomly removing connections or units from a network. In that context, "model dilution" refers to thinning a network by removing structure. Dropout in the Srivastava et al. (2014) sense is a specific training procedure: stochastic masking during training with inverted scaling, full network at test time. The term "dilution" is broader and less precise. If someone calls dropout "model dilution," they are describing the effect (a thinned sub-network), not the full procedure (training with random masks, scaling, and the implicit ensemble property). Use "dropout" when you mean the Srivastava training procedure. Use "dilution" only when discussing the general concept of removing network components.

Watch Out

Dropout at test time is MC dropout, not standard dropout

Standard practice: dropout is OFF at test time (use the full network with scaled weights). MC dropout: dropout is ON at test time, and you average multiple stochastic forward passes. These are different procedures with different purposes. Standard dropout gives a point prediction; MC dropout gives a distribution over predictions for uncertainty estimation.

Summary

  • Dropout: independently zero each activation with probability $1-p$
  • Inverted dropout: scale surviving activations by $1/p$ during training
  • Implicit ensemble: averages $2^d$ sub-networks (approximately)
  • Prevents co-adaptation: no unit can rely on specific other units
  • MC dropout: keep dropout on at test time for uncertainty estimates
  • Linear model + squared loss: dropout with keep probability $p$ = L2 penalty with $\lambda = (1-p)/p$

Exercises

ExerciseCore

Problem

Show that for a linear model $y = w^T x$ with squared loss and inverted dropout applied to the input with keep probability $p = 0.5$, the expected loss over the dropout mask equals the ridge regression objective with $\lambda = 1$. Assume the features are normalized so that $(X^T X)_{jj} = 1$ for each $j$.

ExerciseAdvanced

Problem

Why does dropout work poorly with batch normalization? What is the tension between the two techniques?


References

Canonical:

  • Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR, 2014). The original paper with extensive experiments.
  • Hinton et al., "Improving neural networks by preventing co-adaptation of feature detectors" (2012). The original proposal (shorter, earlier version).

Theory:

  • Wager, Wang, Liang, "Dropout Training as Adaptive Regularization" (NeurIPS, 2013). Proves dropout on linear models is equivalent to adaptive L2 regularization.
  • Baldi & Sadowski, "Understanding Dropout" (NeurIPS, 2013). Geometric mean of sub-networks equals the full network for linear models.
  • Gal & Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" (ICML, 2016). MC dropout for uncertainty estimation.

Extensions:

  • Wan et al., "Regularization of Neural Networks using DropConnect" (ICML, 2013). Drops weights instead of activations.
  • Ma et al., "Dropout as a Low-Rank Regularizer for Matrix Factorization" (AISTATS, 2017). Formal connection between dropout and nuclear norm regularization.
  • Ghiasi, Lin, Le, "DropBlock: A Regularization Method for Convolutional Networks" (NeurIPS, 2018). Dropping contiguous regions instead of individual neurons.
