

Dropout

Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.


Why This Matters

[Interactive demo: a network with an input layer, two hidden layers, and an output layer. Clicking "Apply Dropout" randomly masks hidden neurons; input and output layers are never dropped.]

Dropout (Srivastava et al. 2014) was one of the defining regularization techniques of the pre-transformer deep-learning era. Randomly zeroing out activations during training reduces overfitting substantially on vision and fully-connected architectures. In modern LLM pretraining it is used less frequently: weight decay, layer normalization, large-scale data, and architectural inductive biases carry most of the regularization load, and dropout in attention layers can interfere with learned patterns. It remains standard for smaller models, MLPs, and fine-tuning, and the theoretical ideas it introduced (implicit ensembles, noise injection, Bayesian interpretation via MC dropout) continue to appear throughout the field.

Mental Model

During each training step, every hidden unit is independently "dropped" (set to zero) with probability $1-p$, where $p$ is the keep probability. This means each training step uses a different random sub-network. At test time, you use the full network but scale the activations to match their expected values during training. The result is approximately equivalent to averaging the predictions of exponentially many sub-networks.

The Dropout Procedure

Definition

Dropout (Training)

During training, for a layer with activation vector $h \in \mathbb{R}^d$:

  1. Sample a binary mask $r \in \{0,1\}^d$ where each $r_j \sim \text{Bernoulli}(p)$ independently
  2. Compute the masked activation $\tilde{h} = r \odot h$

where $\odot$ denotes elementwise multiplication. The keep probability $p$ is typically 0.5 for hidden layers and 0.8 for input layers.

Definition

Inverted Dropout

Inverted dropout scales the surviving activations by $1/p$ during training:

$$\tilde{h} = \frac{1}{p} \cdot r \odot h$$

This ensures $\mathbb{E}[\tilde{h}] = h$, so at test time you use the network unchanged (no scaling needed). This is the standard implementation in all modern frameworks. The alternative, scaling by $p$ at test time, is mathematically equivalent but less convenient.
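A minimal NumPy sketch of the training-time procedure defined above; the function name, layer size, and keep probability are illustrative choices, not from any particular library:

```python
import numpy as np

def inverted_dropout(h, p=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout on an activation array h with keep probability p."""
    if not training or p >= 1.0:
        return h                               # test time: full network, no scaling needed
    r = rng.binomial(1, p, size=h.shape)       # r_j ~ Bernoulli(p), the binary mask
    return r * h / p                           # scale survivors by 1/p so E[h_tilde] = h

# The masked activations are unbiased: averaging over many masks recovers h
h = np.array([1.0, 2.0, 3.0, 4.0])
samples = np.stack([inverted_dropout(h) for _ in range(10_000)])
print(samples.mean(axis=0))                    # close to [1, 2, 3, 4]
```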

Why Dropout Works

1. Implicit Ensemble Interpretation

Theorem

Dropout as Ensemble Averaging

Statement

A network with $d$ hidden units and dropout creates an implicit ensemble of $2^d$ sub-networks (one for each binary mask pattern). Each sub-network shares weights with the full network. At test time, the (scaled) full network computes a geometric average of the predictions of all $2^d$ sub-networks.

For a single-layer network with softmax output, the test-time prediction is exactly the geometric mean of the sub-network predictions:

$$p_{\text{test}}(y \mid x) \propto \exp\left(\frac{1}{2^d}\sum_{r \in \{0,1\}^d} \log p_r(y \mid x)\right)$$

Intuition

Each training step optimizes a random sub-network. The sub-networks share weights, so they are correlated but not identical. At test time, using the full network with scaled weights approximately averages their predictions. Ensembles reduce variance, so this averaging reduces overfitting.

Proof Sketch

Each mask $r$ defines a sub-network with output $f_r(x) = W_2(\text{diag}(r) \cdot \sigma(W_1 x))$. The test-time network uses $\mathbb{E}[r] = p \cdot \mathbf{1}$, giving $f_{\text{test}}(x) = W_2(p \cdot \sigma(W_1 x))$. For linear activations, $f_{\text{test}} = \mathbb{E}_r[f_r(x)]$ exactly. For nonlinear activations, this is an approximation (the "weight scaling inference rule"). Baldi and Sadowski (2013) showed this is exact for the geometric mean in the softmax case.
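A quick numerical check of the linear case, enumerating all $2^d$ masks for a tiny hidden width (the sizes and seed below are arbitrary choices for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, p = 6, 0.5                                # small width so all 2^d masks can be enumerated
W1 = rng.standard_normal((d, 3))
W2 = rng.standard_normal((2, d))
x = rng.standard_normal(3)

h = W1 @ x                                   # identity (linear) activation

# Expected sub-network output: average over all 2^d masks, weighted by their probability
expected = np.zeros(2)
for bits in itertools.product([0, 1], repeat=d):
    r = np.array(bits)
    prob = p ** r.sum() * (1 - p) ** (d - r.sum())
    expected += prob * (W2 @ (r * h))

# Weight-scaling inference rule: full network with activations scaled by p
scaled = W2 @ (p * h)

print(np.allclose(expected, scaled))         # True: exact for linear activations
```

Swapping the identity for a ReLU makes the two quantities close but no longer identical, which is the approximation the failure mode below refers to.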

Why It Matters

This explains why dropout prevents co-adaptation of features. No hidden unit can rely on any other specific unit being present, since any unit might be dropped. This forces each unit to learn independently useful features, leading to more robust representations.

Failure Mode

The ensemble interpretation is approximate for multi-layer networks with nonlinear activations. The "weight scaling inference rule" (using $p \cdot w$ at test time) is exact only in special cases, such as networks of linear units or a single softmax layer. For deep nonlinear networks, it is an approximation whose quality degrades with depth.

2. Noise Injection as Regularization

Dropout injects multiplicative Bernoulli noise into activations. For a hidden unit $h_j$, the noisy version is $\tilde{h}_j = \frac{r_j}{p} h_j$ where $r_j \sim \text{Bernoulli}(p)$.

The variance of this noise is:

$$\text{Var}\left[\frac{r_j}{p}\right] = \frac{1}{p^2}\text{Var}[r_j] = \frac{1}{p^2} \cdot p(1-p) = \frac{1-p}{p}$$

This multiplicative noise has a regularization effect analogous to adding a data-dependent penalty to the loss. Units with large activations receive proportionally larger noise, penalizing large activation magnitudes.
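A two-line empirical check of this variance; the keep probability 0.8 is just an example value:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8
noise = rng.binomial(1, p, size=1_000_000) / p   # multiplicative factor r_j / p

print(noise.mean())   # ~1.0, since E[r_j / p] = 1 (activations stay unbiased)
print(noise.var())    # ~0.25, matching (1 - p) / p = 0.2 / 0.8
```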

3. Bayesian Connection: MC Dropout

Gal and Ghahramani (2016) showed that a network trained with dropout can be interpreted as an approximate Bayesian neural network. At test time, instead of using the full network, you run multiple forward passes with dropout active and average the predictions:

$$\hat{p}(y \mid x) = \frac{1}{T}\sum_{t=1}^T p(y \mid x, r_t)$$

This is Monte Carlo dropout. The variance across forward passes gives an estimate of model uncertainty. This is one of the simplest methods for uncertainty quantification in deep learning.
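A sketch of MC dropout in PyTorch, assuming a small hypothetical classifier (the architecture and sizes are illustrative). Note that `nn.Dropout(p=...)` takes the drop probability, whereas $p$ in this article denotes the keep probability:

```python
import torch
import torch.nn as nn

# Hypothetical classifier; the architecture is illustrative only
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, T=100):
    """Average T stochastic forward passes with dropout kept active."""
    model.train()                              # keep dropout ON at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0), probs.std(dim=0)

mean, std = mc_dropout_predict(model, torch.randn(1, 20))
print(mean)   # averaged class probabilities: the MC estimate of p_hat(y | x)
print(std)    # spread across passes: a simple uncertainty signal
```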

The Key Equivalence

Theorem

Dropout in Linear Models is L2 Regularization

Statement

For a linear model $y = w^T x$ with squared loss and dropout applied to the input with keep probability $p$, the expected training loss under dropout is:

$$\mathbb{E}_r[\ell_{\text{dropout}}] = \|y - Xw\|^2 + \frac{1-p}{p}\sum_j (X^T X)_{jj}\, w_j^2$$

When the features are normalized so that $(X^T X)_{jj} = 1$ for every $j$, this simplifies to:

$$\mathbb{E}_r[\ell_{\text{dropout}}] = \|y - Xw\|^2 + \frac{1-p}{p}\|w\|^2$$

which is ridge regression with $\lambda = \frac{1-p}{p}$.

Intuition

Randomly zeroing out inputs (and scaling by $1/p$) adds noise proportional to $w_j^2$. In expectation, this noise acts like a penalty on large weights. For the linear case, it is exactly L2 regularization. For $p = 0.5$, the implicit $\lambda$ is 1.

Proof Sketch

With inverted dropout on input $x$, the noisy input is $\tilde{x}_j = \frac{r_j}{p} x_j$ where $r_j \sim \text{Bernoulli}(p)$. The noisy prediction is $w^T \tilde{x}$.

$\mathbb{E}[\tilde{x}_j] = x_j$ and $\text{Var}[\tilde{x}_j] = \frac{1-p}{p} x_j^2$.

$$\mathbb{E}[(y - w^T \tilde{x})^2] = (y - w^T x)^2 + \text{Var}[w^T \tilde{x}]$$

$$= (y - w^T x)^2 + \sum_j w_j^2 \cdot \frac{1-p}{p} x_j^2$$

Summing over samples gives the stated result.
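A Monte Carlo check of the proof sketch, comparing the simulated expected dropout loss against the closed-form penalty (data sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 4, 0.5
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Monte Carlo estimate of E_r[ || y - X_tilde w ||^2 ] under inverted input dropout
losses = []
for _ in range(20_000):
    X_tilde = rng.binomial(1, p, size=X.shape) * X / p
    losses.append(np.sum((y - X_tilde @ w) ** 2))
mc_estimate = np.mean(losses)

# Closed form: || y - X w ||^2 + (1 - p)/p * sum_j (X^T X)_jj * w_j^2
penalty = (1 - p) / p * np.sum(np.diag(X.T @ X) * w ** 2)
closed_form = np.sum((y - X @ w) ** 2) + penalty

print(mc_estimate, closed_form)   # the two agree up to Monte Carlo noise
```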

Why It Matters

This provides a precise characterization of dropout as a regularizer in the simplest setting. It shows that dropout strength is controlled by $p$: smaller $p$ (more dropout) means stronger regularization. For nonlinear networks, the equivalence is approximate but the intuition carries over.

Failure Mode

The exact equivalence to L2 holds only for linear models with squared loss. In deep nonlinear networks, dropout induces a more complex, data-dependent regularizer that does not reduce to simple weight decay. The regularization effect interacts with the network architecture in ways that are not fully understood theoretically.

Common Confusions

Watch Out

Dropout zeros activations, not weights

Dropout sets hidden unit activations to zero, not the weight parameters themselves. The weights remain intact; their contributions are simply multiplied by a zero activation during the forward pass. Zeroing weights would be a different (destructive) operation, which is what DropConnect (listed in the references) does. Because the gradient through a zeroed activation is zero by the chain rule, the weights attached to a dropped unit receive no gradient signal for that step, even though they remain in the model.
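A small PyTorch illustration of this point; the layer sizes and the example mask are arbitrary. The weights feeding a dropped unit keep their values but get a zero gradient for that step:

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.randn(4)
W1 = torch.randn(6, 4, requires_grad=True)       # weights into the hidden layer
W2 = torch.randn(2, 6, requires_grad=True)       # weights out of the hidden layer

h = torch.relu(W1 @ x)
r = torch.tensor([1., 0., 1., 0., 1., 1.])       # fixed example mask (sampled Bernoulli(p) in practice)
loss = (W2 @ (r * h / p)).sum()                  # inverted dropout, then a dummy loss
loss.backward()

dropped = r == 0
print(W1.grad[dropped])                          # all zeros: no gradient signal this step
print(W1[dropped])                               # the weights themselves are untouched
```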

Watch Out

Dropout is sometimes called dilution, but the terms are not interchangeable

Some older literature (particularly in statistical physics and ensemble methods) uses "dilution" to describe randomly removing connections or units from a network. In that context, "model dilution" refers to thinning a network by removing structure. Dropout in the Srivastava et al. (2014) sense is a specific training procedure: stochastic masking during training with inverted scaling, full network at test time. The term "dilution" is broader and less precise. If someone calls dropout "model dilution," they are describing the effect (a thinned sub-network), not the full procedure (training with random masks, scaling, and the implicit ensemble property). Use "dropout" when you mean the Srivastava training procedure. Use "dilution" only when discussing the general concept of removing network components.

Watch Out

Dropout at test time is MC dropout, not standard dropout

Standard practice: dropout is OFF at test time (use the full network with scaled weights). MC dropout: dropout is ON at test time, and you average multiple stochastic forward passes. These are different procedures with different purposes. Standard dropout gives a point prediction; MC dropout gives a distribution over predictions for uncertainty estimation.

Summary

  • Dropout: independently zero each activation with probability $1-p$
  • Inverted dropout: scale surviving activations by $1/p$ during training
  • Implicit ensemble: averages $2^d$ sub-networks (approximately)
  • Prevents co-adaptation: no unit can rely on specific other units
  • MC dropout: keep dropout on at test time for uncertainty estimates
  • Linear model + squared loss: dropout with keep probability $p$ = L2 penalty with $\lambda = (1-p)/p$

Exercises

ExerciseCore

Problem

Show that for a linear model $y = w^T x$ with squared loss and inverted dropout applied to the input with keep probability $p = 0.5$, the expected loss over the dropout mask equals the ridge regression objective with $\lambda = 1$. Assume the features are normalized so that $(X^T X)_{jj} = 1$ for each $j$.

ExerciseAdvanced

Problem

Why does dropout work poorly with batch normalization? What is the tension between the two techniques?


References

Canonical:

  • Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR, 2014). The original paper with extensive experiments.
  • Hinton et al., "Improving neural networks by preventing co-adaptation of feature detectors" (2012). The original proposal (shorter, earlier version).

Theory:

  • Wager, Wang, Liang, "Dropout Training as Adaptive Regularization" (NeurIPS, 2013). Proves dropout on linear models is equivalent to adaptive L2 regularization.
  • Baldi & Sadowski, "Understanding Dropout" (NeurIPS, 2013). Geometric mean of sub-networks equals the full network for linear models.
  • Gal & Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" (ICML, 2016). MC dropout for uncertainty estimation.

Extensions:

  • Wan et al., "Regularization of Neural Networks using DropConnect" (ICML, 2013). Drops weights instead of activations.
  • Ma et al., "Dropout as a Low-Rank Regularizer for Matrix Factorization" (AISTATS, 2017). Formal connection between dropout and nuclear norm regularization.
  • Ghiasi, Lin, Le, "DropBlock: A Regularization Method for Convolutional Networks" (NeurIPS, 2018). Dropping contiguous regions instead of individual neurons.
