Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Training Techniques

Weight Initialization

Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.

Core · Tier 1 · Stable · ~40 min

Why This Matters

[Figure: log activation magnitude vs. layer depth. Too-large initialization explodes, too-small vanishes; Xavier/He keeps the magnitude flat across layers.]

A neural network with bad initialization cannot train. If weights are too large, activations and gradients explode exponentially with depth. If weights are too small, they vanish exponentially. Proper initialization preserves gradient flow across layers. The entire field of deep learning was stuck on shallow networks for decades partly because of this problem. Xavier and He initialization solved it for standard architectures, and techniques like batch normalization further reduce sensitivity to the initial scale.

Mental Model

Consider a deep network as a chain of matrix multiplications. If each matrix has eigenvalues with magnitude greater than 1, the product grows exponentially. If less than 1, it shrinks exponentially. Good initialization sets the weight matrices so their effect on signal magnitude is approximately 1 per layer. The signal (activations in the forward pass, gradients in the backward pass) neither grows nor shrinks.

Formal Setup

Consider a feedforward network with $L$ layers and no activation function (for now):

$$a_l = W_l a_{l-1}, \quad l = 1, \ldots, L$$

where $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ and $a_0 = x$ is the input. The output is $a_L = W_L W_{L-1} \cdots W_1 x$.

Definition

Variance Preservation Property

An initialization scheme satisfies variance preservation if, for each layer $l$:

$$\text{Var}(a_l^{(j)}) = \text{Var}(a_{l-1}^{(j)})$$

where $a_l^{(j)}$ denotes the $j$-th component of the activation vector at layer $l$. Equivalently, the signal magnitude stays constant across layers at initialization.

The Problem with Naive Initialization

If $W_l$ has entries drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$ and $a_{l-1}$ has zero-mean entries with variance $v$, then:

$$\text{Var}(a_l^{(j)}) = n_{l-1} \cdot \sigma^2 \cdot v$$

Each layer multiplies the variance by $n_{l-1} \sigma^2$. After $L$ layers:

$$\text{Var}(a_L^{(j)}) \propto \left(\prod_{l=1}^{L} n_{l-1} \sigma^2\right) \cdot \text{Var}(x^{(j)})$$

If $n_{l-1} \sigma^2 > 1$ for all layers, activations explode. If $n_{l-1} \sigma^2 < 1$, they vanish. For a network with 50 layers and $n\sigma^2 = 1.1$, the activation variance grows by $1.1^{50} \approx 117$. At $n\sigma^2 = 0.9$, it shrinks by $0.9^{50} \approx 0.005$.
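The variance recursion can be checked numerically. Below is a sketch with an illustrative width and depth: a pure linear stack (no activations), comparing multipliers slightly above, below, and at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 256, 30  # illustrative width and depth

def final_variance(sigma):
    """Push a unit-variance input through L linear layers, return output variance."""
    a = rng.standard_normal(n)
    for _ in range(L):
        W = rng.normal(0.0, sigma, size=(n, n))
        a = W @ a
    return a.var()

v_big   = final_variance(np.sqrt(1.1 / n))  # n*sigma^2 = 1.1 -> explodes
v_small = final_variance(np.sqrt(0.9 / n))  # n*sigma^2 = 0.9 -> vanishes
v_good  = final_variance(np.sqrt(1.0 / n))  # n*sigma^2 = 1.0 -> preserved
print(v_big, v_small, v_good)
```

With 30 layers the effect is already dramatic: the first run grows by roughly $1.1^{30} \approx 17$, the second shrinks by roughly $0.9^{30} \approx 0.04$, and the third stays near 1 up to sampling noise.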

Main Theorems

Theorem

Xavier/Glorot Initialization

Statement

For a layer with $n_{\text{in}}$ input units and $n_{\text{out}}$ output units, initializing weights as:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

approximately preserves variance in both the forward pass (activations) and the backward pass (gradients). It is a compromise: the forward-pass condition alone requires $\text{Var}(W) = 1/n_{\text{in}}$, while the backward-pass condition requires $\text{Var}(W) = 1/n_{\text{out}}$.

Intuition

Forward pass: each activation is a sum of $n_{\text{in}}$ terms. To keep variance at 1, each term should have variance $1/n_{\text{in}}$, so $\text{Var}(W) = 1/n_{\text{in}}$. Backward pass: each gradient component is a sum of $n_{\text{out}}$ terms, requiring $\text{Var}(W) = 1/n_{\text{out}}$. Xavier takes the harmonic mean of these two requirements: $2/(n_{\text{in}} + n_{\text{out}})$.

Proof Sketch

For the forward pass, $a_l^{(j)} = \sum_{k=1}^{n_{\text{in}}} W_{jk} a_{l-1}^{(k)}$. Since $W$ and $a$ are independent and zero-mean, $\text{Var}(a_l^{(j)}) = n_{\text{in}} \cdot \text{Var}(W) \cdot \text{Var}(a_{l-1})$. Setting this equal to $\text{Var}(a_{l-1})$ gives $\text{Var}(W) = 1/n_{\text{in}}$. The backward-pass derivation is symmetric with $n_{\text{out}}$.
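Both conditions, and the compromise, can be verified empirically for a single linear layer. A sketch with arbitrary fan sizes and batch size:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, batch = 400, 300, 20000  # illustrative sizes

a_prev = rng.standard_normal((batch, n_in))   # zero-mean activations, Var = 1
delta  = rng.standard_normal((batch, n_out))  # upstream gradients, Var = 1

W_fwd = rng.normal(0, np.sqrt(1.0 / n_in), (n_in, n_out))             # forward condition
W_xav = rng.normal(0, np.sqrt(2.0 / (n_in + n_out)), (n_in, n_out))   # Xavier compromise

var_fwd   = (a_prev @ W_fwd).var()    # ~1.00: forward variance exactly preserved
var_xav_f = (a_prev @ W_xav).var()    # n_in * 2/(n_in+n_out) ~ 1.14
var_xav_b = (delta @ W_xav.T).var()   # n_out * 2/(n_in+n_out) ~ 0.86
print(var_fwd, var_xav_f, var_xav_b)
```

Neither Xavier direction is exactly 1 when $n_{\text{in}} \neq n_{\text{out}}$, but both are close, which is the whole point of the compromise.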

Why It Matters

Xavier initialization enabled training of deep networks with sigmoid and tanh activations. Before Xavier, networks deeper than ~5 layers were considered impractical to train. The paper (Glorot and Bengio, 2010) was a turning point for deep learning.

Failure Mode

Xavier assumes linear or tanh activations near zero. ReLU sets half of its inputs to zero, which halves the effective fan-in. Xavier underestimates the required variance for ReLU networks, leading to signal decay. He initialization fixes this.

Theorem

He/Kaiming Initialization

Statement

For a layer with $n_{\text{in}}$ input units followed by a ReLU activation, initializing weights as:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

preserves the variance of activations across layers.

Intuition

ReLU zeros out negative inputs, keeping only the positive half. For a zero-mean symmetric input, this halves the second moment: $\mathbb{E}[\text{ReLU}(z)^2] = \mathbb{E}[z^2]/2$ (the output is no longer zero-mean, so the second moment, not the variance, is the quantity that propagates). To compensate, we double the weight variance compared to Xavier (using $2/n_{\text{in}}$ instead of $1/n_{\text{in}}$).

Proof Sketch

Let $z = \sum_{k=1}^{n_{\text{in}}} W_{jk} a_{l-1}^{(k)}$, so $\text{Var}(z) = n_{\text{in}} \cdot \text{Var}(W) \cdot \mathbb{E}[(a_{l-1}^{(k)})^2]$ (the pre-activation is zero-mean because $W$ is). After ReLU: $a_l = \max(0, z)$. For $z \sim \mathcal{N}(0, \sigma^2)$, $\mathbb{E}[\max(0, z)^2] = \sigma^2/2$. Setting $\mathbb{E}[(a_l^{(j)})^2] = \mathbb{E}[(a_{l-1}^{(k)})^2]$ gives $n_{\text{in}} \cdot \text{Var}(W) / 2 = 1$, so $\text{Var}(W) = 2/n_{\text{in}}$.
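A numerical sketch of this second-moment argument, with illustrative sizes: He initialization holds $\mathbb{E}[a^2]$ near 1 through a deep ReLU stack, while the forward-Xavier variance $1/n$ decays by roughly half per layer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, depth = 512, 20  # illustrative width and depth

def second_moment_after(weight_var):
    """Propagate a batch through `depth` ReLU layers; return E[a^2] at the end."""
    a = rng.standard_normal((1000, n))
    for _ in range(depth):
        W = rng.normal(0, np.sqrt(weight_var), (n, n))
        a = np.maximum(0.0, a @ W)
    return (a ** 2).mean()

m_he  = second_moment_after(2.0 / n)  # He: stays near 1
m_xav = second_moment_after(1.0 / n)  # halves per layer: ~(1/2)^20
print(m_he, m_xav)
```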

Why It Matters

He initialization made it possible to train very deep ReLU networks (100+ layers). It was a key ingredient in the ResNet paper (He et al., 2015), which demonstrated that networks with 152 layers could train successfully.

Failure Mode

He initialization assumes standard ReLU. For leaky ReLU with slope α\alpha on the negative side, the correction factor is 2/(1+α2)2/(1 + \alpha^2) instead of 2. For GELU, SiLU, or other smooth activations, the exact correction differs but He initialization is still a reasonable starting point.

The Symmetry Breaking Argument

Proposition

Zero Initialization Fails

Statement

If all weights in a layer are initialized to the same value (including zero), then all neurons in that layer compute the same function, receive the same gradient, and remain identical throughout training. The layer effectively has only one neuron regardless of its width.

Intuition

At initialization, every neuron in a layer computes $w^T x + b$ with identical $w$ and $b$. The outputs are identical, so the loss gradient with respect to each neuron's weights is identical. After the gradient update, all weights remain equal. This symmetry is never broken by gradient descent.

Proof Sketch

By induction on the training step. At step 0, all neurons in layer $l$ have weights $w$ and bias $b$, so $a_l^{(j)} = w^T a_{l-1} + b$ for all $j$. The gradient $\partial L / \partial w_l^{(j)}$ depends on $a_l^{(j)}$ and $\partial L / \partial a_l^{(j)}$. Since all $a_l^{(j)}$ are equal and the downstream computation treats all neurons symmetrically, all gradients are equal. The update $w \leftarrow w - \eta g$ preserves equality.
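The induction step can be watched directly in a toy two-neuron layer. This is a minimal numpy sketch with made-up numbers; the two-layer net, squared-error loss, and learning rate are all illustrative choices.

```python
import numpy as np

x = np.array([1.0, -2.0])     # one training input
W1 = np.full((2, 2), 0.5)     # constant init: both hidden neurons identical
w2 = np.array([0.3, 0.3])     # symmetric second layer
y = 1.0                       # target

h = np.tanh(W1 @ x)           # hidden activations: identical components
pred = w2 @ h

# Manual backprop through squared-error loss (pred - y)^2:
d_pred = 2 * (pred - y)
d_h = d_pred * w2
d_W1 = np.outer(d_h * (1 - h ** 2), x)  # gradient rows are identical too
W1 -= 0.1 * d_W1                        # one gradient step

print(h, d_W1, W1, sep="\n")  # rows of W1 are still equal after the update
```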

Why It Matters

This is why random initialization is necessary, not optional. The randomness serves one purpose: break the symmetry between neurons so they can specialize to different features during training. The magnitude of the random initialization then determines signal propagation (Xavier/He), but the randomness itself is the mechanism for expressivity.

Failure Mode

Biases can safely be initialized to zero. They do not cause symmetry problems because different neurons in the same layer share the same bias value but have different weight vectors (due to random weight initialization). The bias just shifts the activation; it does not contribute to the symmetry between neurons. The exception is LSTM forget gate biases, which are typically initialized to 1 to encourage gradient flow at the start of training.

What Happens with All-Zero Initialization

Setting all weights to zero is the most catastrophic initialization. Beyond the symmetry problem, zero weights mean zero activations at every layer (for networks without bias). The gradient of the loss with respect to the weights is also zero (since activations are zero), so gradient descent makes no progress. The network is stuck at its initial state permanently.

With biases but zero weights, the network produces the constant output $b_L$ regardless of input. Gradients are nonzero but identical across neurons in each layer, so the symmetry is never broken.

Concrete Examples of Bad Initialization

Example

Variance explosion with naive initialization

Consider a 20-layer ReLU network with width 512 at every layer. Initialize weights as $W_{ij} \sim \mathcal{N}(0, 0.1^2)$. The variance multiplier per layer is $n \sigma^2 / 2 = 512 \times 0.01 / 2 = 2.56$ (the factor of 2 accounts for ReLU zeroing negative inputs). After 20 layers: $2.56^{20} \approx 1.5 \times 10^8$. An input with unit variance produces activations with variance around $10^8$ at the output layer; make the network somewhat deeper, or $\sigma$ slightly larger, and the activations overflow float32 entirely. The loss is NaN on the first forward pass.

Reducing to $\sigma = 0.01$: the multiplier becomes $512 \times 0.0001 / 2 = 0.0256$. After 20 layers: $0.0256^{20} \approx 10^{-32}$. Activations vanish to zero. Gradients are effectively zero. Training makes no progress.

He initialization sets $\sigma^2 = 2/512 \approx 0.0039$. The multiplier is $512 \times 0.0039 / 2 = 1.0$. After 20 layers: $1.0^{20} = 1$. Variance is preserved exactly.
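The arithmetic in this example can be checked in a few lines:

```python
# Per-layer ReLU variance multiplier n * sigma^2 / 2 for the three cases above.
n, depth = 512, 20
for sigma2 in (0.1 ** 2, 0.01 ** 2, 2.0 / n):
    mult = n * sigma2 / 2
    print(f"sigma^2 = {sigma2:.6f}: multiplier {mult:.4f}, "
          f"after {depth} layers {mult ** depth:.3e}")
```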

Example

GPT-style transformer initialization

Large transformer models use a scaled initialization for residual-stream contributions. A transformer with $L$ layers adds $2L$ residual branches to the stream (one attention and one MLP block per layer). If each addition has variance $\sigma^2$, the residual-stream variance grows as $2L\sigma^2$. GPT-2 scales the output projection of each attention and MLP block by $1/\sqrt{2L}$, so the total variance contributed by all $2L$ branches is $2L \times \sigma^2/(2L) = \sigma^2$. This keeps the residual-stream variance bounded regardless of depth. Without this scaling, a 96-layer GPT-3 would have activations growing by a factor of roughly $\sqrt{2 \times 96} \approx 14$ from residual accumulation alone.
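The depth-independence of the scaled scheme can be checked with a toy residual stream, where independent Gaussian branch outputs stand in for attention/MLP blocks; all sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
L, d, batch = 48, 64, 4000  # illustrative depth, width, batch

def stream_variance(branch_scale):
    """Add 2L unit-variance branch outputs (scaled) to a unit-variance stream."""
    x = rng.standard_normal((batch, d))        # Var = 1
    for _ in range(2 * L):                     # attention + MLP per layer
        x = x + branch_scale * rng.standard_normal((batch, d))
    return x.var()

v_unscaled = stream_variance(1.0)                   # ~1 + 2L = 97
v_scaled   = stream_variance(1.0 / np.sqrt(2 * L))  # ~1 + 1 = 2
print(v_unscaled, v_scaled)
```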

Watch Out

Initialization interacts with residual connections

In a ResNet, the output of each block is $x + F(x)$. If $F(x)$ has variance $v$ and $x$ has variance $V$, the output has variance $V + v$ (assuming independence). After $L$ blocks, the variance is $V_0 + Lv$. This linear growth (instead of the exponential growth without skip connections) makes ResNets much less sensitive to initialization. But for very deep ResNets (1000+ layers), even linear growth can cause problems. Fixup initialization and ReZero (initializing residual branches to zero) address this.

Xavier Derivation: Forward and Backward Conditions

The Xavier initialization variance $2/(n_{\text{in}} + n_{\text{out}})$ is a compromise between two conflicting requirements.

Forward pass condition. For $a_l^{(j)} = \sum_{k=1}^{n_{\text{in}}} W_{jk} a_{l-1}^{(k)}$ with independent zero-mean weights and activations:

$$\text{Var}(a_l^{(j)}) = n_{\text{in}} \cdot \text{Var}(W) \cdot \text{Var}(a_{l-1}^{(k)})$$

Preserving variance requires $\text{Var}(W) = 1/n_{\text{in}}$.

Backward pass condition. The gradient flows as $\delta_{l-1}^{(k)} = \sum_{j=1}^{n_{\text{out}}} W_{jk} \delta_l^{(j)}$, where $\delta_l = \partial L / \partial a_l$. By the same argument:

$$\text{Var}(\delta_{l-1}^{(k)}) = n_{\text{out}} \cdot \text{Var}(W) \cdot \text{Var}(\delta_l^{(j)})$$

Preserving gradient variance requires $\text{Var}(W) = 1/n_{\text{out}}$.

These two conditions are incompatible unless $n_{\text{in}} = n_{\text{out}}$. Xavier's solution is the harmonic mean: $\text{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$, which approximately preserves both forward and backward signal magnitudes. For layers where $n_{\text{in}} \approx n_{\text{out}}$ (common in practice), this is close to both $1/n_{\text{in}}$ and $1/n_{\text{out}}$.

The uniform variant samples from $U(-a, a)$ with $a = \sqrt{6/(n_{\text{in}} + n_{\text{out}})}$, since a uniform distribution on $[-a, a]$ has variance $a^2/3$.
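A quick check of the uniform bound, with hypothetical fan sizes:

```python
import numpy as np

# U(-a, a) with a = sqrt(6/(n_in + n_out)) has variance a^2/3 = 2/(n_in + n_out),
# matching the Gaussian Xavier variant.
n_in, n_out = 256, 128  # illustrative fan sizes
a = np.sqrt(6.0 / (n_in + n_out))
W = np.random.default_rng(4).uniform(-a, a, size=(n_in, n_out))
target = 2.0 / (n_in + n_out)
print(W.var(), target)  # both ~0.0052
```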

Signal Propagation Theory

Signal propagation theory generalizes the Xavier/He analysis to arbitrary architectures and activation functions. The central question: for a random input $x$, what happens to the distribution of $a_l^{(j)}$ as $l$ grows?

Define the mean field quantities:

$$q_l = \mathbb{E}[(a_l^{(j)})^2], \quad c_l = \frac{\mathbb{E}[a_l^{(j)} \tilde{a}_l^{(j)}]}{\sqrt{q_l \tilde{q}_l}}$$

where $a_l$ and $\tilde{a}_l$ are activations from two different inputs. The quantity $q_l$ tracks signal magnitude, and $c_l$ tracks the correlation between representations of different inputs.

For stable training, we need:

  1. $q_l$ stays bounded and bounded away from zero (no explosion or collapse).
  2. $c_l$ does not converge to 1 (otherwise all inputs produce the same representation, and the network cannot distinguish them).

Both conditions place constraints on the weight variance $\sigma_w^2$ and bias variance $\sigma_b^2$. The critical line $\sigma_w^2 \, \mathbb{E}[\phi'(z)^2] = 1$ (where $\phi$ is the activation function and $z \sim \mathcal{N}(0, q^*)$) separates the ordered phase (signals collapse) from the chaotic phase (signals explode). Xavier and He initialization place the network near this critical line.
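The critical-line quantity $\mathbb{E}[\phi'(z)^2]$ is easy to estimate by Monte Carlo. A sketch for $\phi = \tanh$, evaluating at $q^* = 1$ for illustration (in the full theory, $q^*$ is determined self-consistently as a fixed point):

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(0.0, 1.0, 1_000_000)          # z ~ N(0, q*) with q* = 1 (illustrative)
phi_prime_sq = (1.0 - np.tanh(z) ** 2) ** 2  # tanh'(z) = 1 - tanh(z)^2
crit = phi_prime_sq.mean()                   # E[phi'(z)^2]
sigma_w2_critical = 1.0 / crit               # weight variance on the critical line
print(crit, sigma_w2_critical)
```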

Connection to NTK Parameterization

The Neural Tangent Kernel (NTK) parameterization scales the output of each layer by $1/\sqrt{n}$, where $n$ is the layer width:

$$f(x) = \frac{1}{\sqrt{n_L}} W_L \, \phi\!\left(\frac{1}{\sqrt{n_{L-1}}} W_{L-1} \, \phi(\cdots)\right)$$

with $W_l^{(ij)} \sim \mathcal{N}(0, 1)$ (unit variance, not scaled). The $1/\sqrt{n}$ factors are built into the architecture rather than the initialization. This is mathematically equivalent to initializing weights with variance $1/n$ (for ReLU networks, the scaling typically includes an extra factor of $\sqrt{2}$, recovering He initialization) but separates the scaling concern from the randomness.
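The equivalence is purely algebraic: folding the $1/\sqrt{n}$ factor into the weight matrix yields the identical function. A two-layer ReLU sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 128  # illustrative width
x = rng.standard_normal(n)
V1 = rng.standard_normal((n, n))  # unit-variance NTK weights
V2 = rng.standard_normal((n, n))
relu = lambda z: np.maximum(0.0, z)

# NTK parameterization: scale factors live in the architecture.
ntk_out = (1.0 / np.sqrt(n)) * (V2 @ relu((1.0 / np.sqrt(n)) * (V1 @ x)))
# Standard parameterization: scale factors folded into the weights.
std_out = (V2 / np.sqrt(n)) @ relu((V1 / np.sqrt(n)) @ x)
print(np.allclose(ntk_out, std_out))  # True
```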

The NTK parameterization is the standard in theoretical analyses of infinite-width networks, where the network's training dynamics converge to a kernel regression with a fixed kernel (the NTK). The practical consequence: understanding initialization in the NTK framework connects finite-width training dynamics to the well-understood theory of kernel methods.

Watch Out

NTK parameterization does not change the function class

The NTK parameterization and the standard parameterization with appropriately scaled weights represent the same set of functions. The difference is in how the learning rate scales with width. In the NTK parameterization, the gradient update $\Delta f$ has magnitude $O(1)$ regardless of width, which makes the infinite-width limit well-defined. In the standard parameterization, you must scale the learning rate as $\eta \propto 1/n$ to get the same behavior.

Connection to Random Matrix Theory

At initialization, the weight matrices $W_1, \ldots, W_L$ are random matrices. The product $W_L \cdots W_1$ determines how signals propagate. Random matrix theory tells us that the singular values of a product of $L$ random matrices with appropriate scaling concentrate around 1. Specifically, for $W_l$ with i.i.d. entries of variance $1/n$, the singular values of the product converge to a deterministic distribution as $n \to \infty$.

The connection: Xavier and He initialization are choosing the variance so that the expected squared singular value of each factor is 1. This ensures the product neither grows nor shrinks, which is exactly the condition for stable forward and backward signal propagation.
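This concentration can be glimpsed by tracking the norm of a fixed vector through a random product; sizes here are illustrative. With entry variance $1/n$ the norm stays $O(1)$; with variance $c/n$ it scales like $c^{L/2}$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, L = 200, 10  # illustrative width and depth

def norm_after_product(entry_var):
    """Multiply a roughly unit-norm vector by L random matrices; return its norm."""
    x = rng.standard_normal(n) / np.sqrt(n)  # ||x|| ~ 1
    for _ in range(L):
        x = rng.normal(0, np.sqrt(entry_var), (n, n)) @ x
    return np.linalg.norm(x)

norm_good = norm_after_product(1.0 / n)  # stays O(1)
norm_bad  = norm_after_product(4.0 / n)  # grows like 2^10 ~ 1000
print(norm_good, norm_bad)
```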

Common Confusions

Watch Out

Xavier is not wrong for ReLU, it is suboptimal

Xavier initialization with ReLU networks does not cause immediate divergence. It causes gradual signal decay because each ReLU layer halves the variance that Xavier predicts will be preserved. For shallow networks (5-10 layers), the decay is mild. For deep networks (50+ layers), the effect compounds and training fails. He initialization is the correct fix.

Watch Out

Initialization is about the first step, not the entire training

Good initialization ensures stable gradients at step 0. Once training begins, the weights move away from their initial values. Batch normalization and residual connections help maintain stability throughout training, reducing (but not eliminating) the importance of initialization.

Key Takeaways

  • Bad initialization causes exponential growth or decay of activations and gradients across layers
  • Xavier: $\text{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$, designed for linear/tanh activations
  • He: $\text{Var}(W) = 2/n_{\text{in}}$, designed for ReLU (doubles the forward-pass Xavier variance to compensate for zeroing negative inputs)
  • Both derive from a single principle: preserve variance across layers
  • Random matrix theory explains why these scalings keep the singular values of weight products near 1

Exercises

ExerciseCore

Problem

A network has 10 hidden layers, each with 512 units and ReLU activations. Compute the activation variance at layer 10 relative to the input variance under (a) Xavier initialization and (b) He initialization.

ExerciseAdvanced

Problem

Derive the correct initialization variance for a layer using Leaky ReLU with negative slope $\alpha = 0.01$. How much does it differ from standard He initialization?

ExerciseCore

Problem

A network has 3 hidden layers with widths 256, 128, 64 and uses tanh activations. Write the Xavier initialization variance for each layer's weight matrix.

ExerciseAdvanced

Problem

Explain why initializing all weights to the same nonzero constant (e.g., $W_{ij} = 0.01$ for all $i, j$) fails even though the weights are nonzero. How does this differ from initializing weights to a constant but biases randomly?

References

Canonical:

  • Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks" (2010), AISTATS, Sections 1-4
  • He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (2015), ICCV, Section 2 (initialization derivation)
  • LeCun et al., "Efficient BackProp" (1998), in Neural Networks: Tricks of the Trade, Section 4.6 (weight initialization heuristics)

Current:

  • Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (2018), NeurIPS (NTK parameterization)
  • Pennington & Worah, "Nonlinear Random Matrix Theory for Deep Learning" (2017), NeurIPS (signal propagation analysis)
  • Schoenholz et al., "Deep Information Propagation" (2017), ICLR (mean field theory for deep networks, edge of chaos)
  • Huang et al., "Improving Transformer Training with Orthogonal Initialization" (2020)

Last reviewed: April 2026
