Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Training Techniques

Weight Initialization

Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.

Core · Tier 1 · Stable · ~40 min

Why This Matters

[Figure: log activation magnitude vs. layer depth. Too-large initialization explodes, too-small vanishes; Xavier/He keeps the magnitude flat across layers.]

A neural network with bad initialization cannot train. If weights are too large, activations and gradients explode exponentially with depth. If weights are too small, they vanish exponentially. Proper initialization preserves gradient flow across layers. The entire field of deep learning was stuck on shallow networks for decades partly because of this problem. Xavier and He initialization solved it for standard architectures, and techniques like batch normalization further reduce sensitivity to the initial scale.

Mental Model

Consider a deep network as a chain of matrix multiplications. If each matrix has eigenvalues with magnitude greater than 1, the product grows exponentially. If less than 1, it shrinks exponentially. Good initialization sets the weight matrices so their effect on signal magnitude is approximately 1 per layer. The signal (activations in the forward pass, gradients in the backward pass) neither grows nor shrinks.

Formal Setup

Consider a feedforward network with $L$ layers and no activation function (for now):

$$a_l = W_l a_{l-1}, \quad l = 1, \ldots, L$$

where $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ and $a_0 = x$ is the input. The output is $a_L = W_L W_{L-1} \cdots W_1 x$.

Definition

Variance Preservation Property

An initialization scheme satisfies variance preservation if, for each layer $l$:

$$\text{Var}(a_l^{(j)}) = \text{Var}(a_{l-1}^{(j)})$$

where $a_l^{(j)}$ denotes the $j$-th component of the activation vector at layer $l$. Equivalently, the signal magnitude stays constant across layers at initialization.

The Problem with Naive Initialization

If $W_l$ has entries drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$ and $a_{l-1}$ has zero-mean entries with variance $v$, then:

$$\text{Var}(a_l^{(j)}) = n_{l-1} \cdot \sigma^2 \cdot v$$

Each layer multiplies the variance by $n_{l-1} \sigma^2$. After $L$ layers:

$$\text{Var}(a_L^{(j)}) \propto \left(\prod_{l=1}^{L} n_{l-1} \sigma^2\right) \cdot \text{Var}(x^{(j)})$$

If $n_{l-1} \sigma^2 > 1$ for all layers, activations explode. If $n_{l-1} \sigma^2 < 1$, they vanish. For a network with 50 layers and $n\sigma^2 = 1.1$, the activation variance grows by $1.1^{50} \approx 117$. At $n\sigma^2 = 0.9$, it shrinks by $0.9^{50} \approx 0.005$.
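The variance recursion can be checked numerically. Below is a sketch with an illustrative width and depth: a pure linear stack (no activations), comparing multipliers slightly above, below, and at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 256, 30  # illustrative width and depth

def final_variance(sigma):
    """Push a unit-variance input through L linear layers, return output variance."""
    a = rng.standard_normal(n)
    for _ in range(L):
        W = rng.normal(0.0, sigma, size=(n, n))
        a = W @ a
    return a.var()

v_big   = final_variance(np.sqrt(1.1 / n))  # n*sigma^2 = 1.1 -> explodes
v_small = final_variance(np.sqrt(0.9 / n))  # n*sigma^2 = 0.9 -> vanishes
v_good  = final_variance(np.sqrt(1.0 / n))  # n*sigma^2 = 1.0 -> preserved
print(v_big, v_small, v_good)
```

With 30 layers the effect is already dramatic: the first run grows by roughly $1.1^{30} \approx 17$, the second shrinks by roughly $0.9^{30} \approx 0.04$, and the third stays near 1 up to sampling noise.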

Main Theorems

Theorem

Xavier/Glorot Initialization

Statement

For a layer with $n_{\text{in}}$ input units and $n_{\text{out}}$ output units, initializing weights as:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

approximately preserves variance in both the forward pass (activations) and the backward pass (gradients). It is a compromise: the forward-pass condition alone requires $\text{Var}(W) = 1/n_{\text{in}}$, while the backward-pass condition requires $\text{Var}(W) = 1/n_{\text{out}}$.

Intuition

Forward pass: each activation is a sum of $n_{\text{in}}$ terms. To keep variance at 1, each term should have variance $1/n_{\text{in}}$, so $\text{Var}(W) = 1/n_{\text{in}}$. Backward pass: each gradient component is a sum of $n_{\text{out}}$ terms, requiring $\text{Var}(W) = 1/n_{\text{out}}$. Xavier takes the harmonic mean of these two requirements: $2/(n_{\text{in}} + n_{\text{out}})$.

Proof Sketch

For the forward pass, $a_l^{(j)} = \sum_{k=1}^{n_{\text{in}}} W_{jk} a_{l-1}^{(k)}$. Since $W$ and $a$ are independent and zero-mean, $\text{Var}(a_l^{(j)}) = n_{\text{in}} \cdot \text{Var}(W) \cdot \text{Var}(a_{l-1})$. Setting this equal to $\text{Var}(a_{l-1})$ gives $\text{Var}(W) = 1/n_{\text{in}}$. The backward-pass derivation is symmetric with $n_{\text{out}}$.
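Both conditions, and the compromise, can be verified empirically for a single linear layer. A sketch with arbitrary fan sizes and batch size:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, batch = 400, 300, 20000  # illustrative sizes

a_prev = rng.standard_normal((batch, n_in))   # zero-mean activations, Var = 1
delta  = rng.standard_normal((batch, n_out))  # upstream gradients, Var = 1

W_fwd = rng.normal(0, np.sqrt(1.0 / n_in), (n_in, n_out))             # forward condition
W_xav = rng.normal(0, np.sqrt(2.0 / (n_in + n_out)), (n_in, n_out))   # Xavier compromise

var_fwd   = (a_prev @ W_fwd).var()    # ~1.00: forward variance exactly preserved
var_xav_f = (a_prev @ W_xav).var()    # n_in * 2/(n_in+n_out) ~ 1.14
var_xav_b = (delta @ W_xav.T).var()   # n_out * 2/(n_in+n_out) ~ 0.86
print(var_fwd, var_xav_f, var_xav_b)
```

Neither Xavier direction is exactly 1 when $n_{\text{in}} \neq n_{\text{out}}$, but both are close, which is the whole point of the compromise.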

Why It Matters

Xavier initialization enabled training of deep networks with sigmoid and tanh activations. Before Xavier, networks deeper than ~5 layers were considered impractical to train. The paper (Glorot and Bengio, 2010) was a turning point for deep learning.

Failure Mode

Xavier assumes linear or tanh activations near zero. ReLU sets half of its inputs to zero, which halves the effective fan-in. Xavier underestimates the required variance for ReLU networks, leading to signal decay. He initialization fixes this.

Theorem

He/Kaiming Initialization

Statement

For a layer with $n_{\text{in}}$ input units followed by a ReLU activation, initializing weights as:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

preserves the variance of activations across layers.

Intuition

ReLU zeros out negative inputs, keeping only the positive half. For a zero-mean symmetric input, this halves the second moment: $\mathbb{E}[\text{ReLU}(z)^2] = \mathbb{E}[z^2]/2$ (the output is no longer zero-mean, so the second moment, not the variance, is the quantity that propagates). To compensate, we double the weight variance compared to Xavier (using $2/n_{\text{in}}$ instead of $1/n_{\text{in}}$).

Proof Sketch

Let $z = \sum_{k=1}^{n_{\text{in}}} W_{jk} a_{l-1}^{(k)}$, so $\text{Var}(z) = n_{\text{in}} \cdot \text{Var}(W) \cdot \mathbb{E}[(a_{l-1}^{(k)})^2]$ (the pre-activation is zero-mean because $W$ is). After ReLU: $a_l = \max(0, z)$. For $z \sim \mathcal{N}(0, \sigma^2)$, $\mathbb{E}[\max(0, z)^2] = \sigma^2/2$. Setting $\mathbb{E}[(a_l^{(j)})^2] = \mathbb{E}[(a_{l-1}^{(k)})^2]$ gives $n_{\text{in}} \cdot \text{Var}(W) / 2 = 1$, so $\text{Var}(W) = 2/n_{\text{in}}$.
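A numerical sketch of this second-moment argument, with illustrative sizes: He initialization holds $\mathbb{E}[a^2]$ near 1 through a deep ReLU stack, while the forward-Xavier variance $1/n$ decays by roughly half per layer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, depth = 512, 20  # illustrative width and depth

def second_moment_after(weight_var):
    """Propagate a batch through `depth` ReLU layers; return E[a^2] at the end."""
    a = rng.standard_normal((1000, n))
    for _ in range(depth):
        W = rng.normal(0, np.sqrt(weight_var), (n, n))
        a = np.maximum(0.0, a @ W)
    return (a ** 2).mean()

m_he  = second_moment_after(2.0 / n)  # He: stays near 1
m_xav = second_moment_after(1.0 / n)  # halves per layer: ~(1/2)^20
print(m_he, m_xav)
```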

Why It Matters

He initialization made it possible to train very deep ReLU networks (100+ layers). It was a key ingredient in the ResNet paper (He et al., 2015), which demonstrated that networks with 152 layers could train successfully.

Failure Mode

He initialization assumes standard ReLU. For leaky ReLU with slope α\alpha on the negative side, the correction factor is 2/(1+α2)2/(1 + \alpha^2) instead of 2. For GELU, SiLU, or other smooth activations, the exact correction differs but He initialization is still a reasonable starting point.

The Symmetry Breaking Argument

Proposition

Zero Initialization Fails

Statement

If all weights in a layer are initialized to the same value (including zero), then all neurons in that layer compute the same function, receive the same gradient, and remain identical throughout training. The layer effectively has only one neuron regardless of its width.

Intuition

At initialization, every neuron in a layer computes $w^T x + b$ with identical $w$ and $b$. The outputs are identical, so the loss gradient with respect to each neuron's weights is identical. After the gradient update, all weights remain equal. This symmetry is never broken by gradient descent.

Proof Sketch

By induction on the training step. At step 0, all neurons in layer $l$ have weights $w$ and bias $b$, so $a_l^{(j)} = w^T a_{l-1} + b$ for all $j$. The gradient $\partial L / \partial w_l^{(j)}$ depends on $a_l^{(j)}$ and $\partial L / \partial a_l^{(j)}$. Since all $a_l^{(j)}$ are equal and the downstream computation treats all neurons symmetrically, all gradients are equal. The update $w \leftarrow w - \eta g$ preserves equality.
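The induction step can be watched directly in a toy two-neuron layer. This is a minimal numpy sketch with made-up numbers; the two-layer net, squared-error loss, and learning rate are all illustrative choices.

```python
import numpy as np

x = np.array([1.0, -2.0])     # one training input
W1 = np.full((2, 2), 0.5)     # constant init: both hidden neurons identical
w2 = np.array([0.3, 0.3])     # symmetric second layer
y = 1.0                       # target

h = np.tanh(W1 @ x)           # hidden activations: identical components
pred = w2 @ h

# Manual backprop through squared-error loss (pred - y)^2:
d_pred = 2 * (pred - y)
d_h = d_pred * w2
d_W1 = np.outer(d_h * (1 - h ** 2), x)  # gradient rows are identical too
W1 -= 0.1 * d_W1                        # one gradient step

print(h, d_W1, W1, sep="\n")  # rows of W1 are still equal after the update
```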

Why It Matters

This is why random initialization is necessary, not optional. The randomness serves one purpose: break the symmetry between neurons so they can specialize to different features during training. The magnitude of the random initialization then determines signal propagation (Xavier/He), but the randomness itself is the mechanism for expressivity.

Failure Mode

Biases can safely be initialized to zero. They do not cause symmetry problems because different neurons in the same layer share the same bias value but have different weight vectors (due to random weight initialization). The bias just shifts the activation; it does not contribute to the symmetry between neurons. The exception is LSTM forget gate biases, which are typically initialized to 1 to encourage gradient flow at the start of training.

What Happens with All-Zero Initialization

Setting all weights to zero is the most catastrophic initialization. Beyond the symmetry problem, zero weights mean zero activations at every layer (for networks without bias). The gradient of the loss with respect to the weights is also zero (since activations are zero), so gradient descent makes no progress. The network is stuck at its initial state permanently.

With biases but zero weights, the network produces the constant output $b_L$ regardless of input. Gradients are nonzero but identical across neurons in each layer, so the symmetry is never broken.

Concrete Examples of Bad Initialization

Example

Variance explosion with naive initialization

Consider a 20-layer ReLU network with width 512 at every layer. Initialize weights as $W_{ij} \sim \mathcal{N}(0, 0.1^2)$. The variance multiplier per layer is $n \sigma^2 / 2 = 512 \times 0.01 / 2 = 2.56$ (the factor of 2 accounts for ReLU zeroing negative inputs). After 20 layers: $2.56^{20} \approx 1.5 \times 10^8$. An input with unit variance produces activations with variance around $10^8$ at the output layer; make the network somewhat deeper, or $\sigma$ slightly larger, and the activations overflow float32 entirely. The loss is NaN on the first forward pass.

Reducing to $\sigma = 0.01$: the multiplier becomes $512 \times 0.0001 / 2 = 0.0256$. After 20 layers: $0.0256^{20} \approx 10^{-32}$. Activations vanish to zero. Gradients are effectively zero. Training makes no progress.

He initialization sets $\sigma^2 = 2/512 \approx 0.0039$. The multiplier is $512 \times 0.0039 / 2 = 1.0$. After 20 layers: $1.0^{20} = 1$. Variance is preserved exactly.
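The arithmetic in this example can be checked in a few lines:

```python
# Per-layer ReLU variance multiplier n * sigma^2 / 2 for the three cases above.
n, depth = 512, 20
for sigma2 in (0.1 ** 2, 0.01 ** 2, 2.0 / n):
    mult = n * sigma2 / 2
    print(f"sigma^2 = {sigma2:.6f}: multiplier {mult:.4f}, "
          f"after {depth} layers {mult ** depth:.3e}")
```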

Example

GPT-style transformer initialization

Large transformer models use a scaled initialization for residual-stream contributions. A transformer with $L$ layers adds $2L$ residual branches to the stream (one attention and one MLP block per layer). If each addition has variance $\sigma^2$, the residual-stream variance grows as $2L\sigma^2$. GPT-2 scales the output projection of each attention and MLP block by $1/\sqrt{2L}$, so the total variance contributed by all $2L$ branches is $2L \times \sigma^2/(2L) = \sigma^2$. This keeps the residual-stream variance bounded regardless of depth. Without this scaling, a 96-layer GPT-3 would have activations growing by a factor of roughly $\sqrt{2 \times 96} \approx 14$ from residual accumulation alone.
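The depth-independence of the scaled scheme can be checked with a toy residual stream, where independent Gaussian branch outputs stand in for attention/MLP blocks; all sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
L, d, batch = 48, 64, 4000  # illustrative depth, width, batch

def stream_variance(branch_scale):
    """Add 2L unit-variance branch outputs (scaled) to a unit-variance stream."""
    x = rng.standard_normal((batch, d))        # Var = 1
    for _ in range(2 * L):                     # attention + MLP per layer
        x = x + branch_scale * rng.standard_normal((batch, d))
    return x.var()

v_unscaled = stream_variance(1.0)                   # ~1 + 2L = 97
v_scaled   = stream_variance(1.0 / np.sqrt(2 * L))  # ~1 + 1 = 2
print(v_unscaled, v_scaled)
```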

Watch Out

Initialization interacts with residual connections

In a ResNet, the output of each block is $x + F(x)$. If $F(x)$ has variance $v$ and $x$ has variance $V$, the output has variance $V + v$ (assuming independence). After $L$ blocks, the variance is $V_0 + Lv$. This linear growth (instead of the exponential growth without skip connections) makes ResNets much less sensitive to initialization. But for very deep ResNets (1000+ layers), even linear growth can cause problems. Fixup initialization and ReZero (initializing residual branches to zero) address this.

Xavier Derivation: Forward and Backward Conditions

The Xavier initialization variance $2/(n_{\text{in}} + n_{\text{out}})$ is a compromise between two conflicting requirements.

Forward pass condition. For $a_l^{(j)} = \sum_{k=1}^{n_{\text{in}}} W_{jk} a_{l-1}^{(k)}$ with independent zero-mean weights and activations:

$$\text{Var}(a_l^{(j)}) = n_{\text{in}} \cdot \text{Var}(W) \cdot \text{Var}(a_{l-1}^{(k)})$$

Preserving variance requires $\text{Var}(W) = 1/n_{\text{in}}$.

Backward pass condition. The gradient flows as $\delta_{l-1}^{(k)} = \sum_{j=1}^{n_{\text{out}}} W_{jk} \delta_l^{(j)}$, where $\delta_l = \partial L / \partial a_l$. By the same argument:

$$\text{Var}(\delta_{l-1}^{(k)}) = n_{\text{out}} \cdot \text{Var}(W) \cdot \text{Var}(\delta_l^{(j)})$$

Preserving gradient variance requires $\text{Var}(W) = 1/n_{\text{out}}$.

These two conditions are incompatible unless $n_{\text{in}} = n_{\text{out}}$. Xavier's solution is the harmonic mean: $\text{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$, which approximately preserves both forward and backward signal magnitudes. For layers where $n_{\text{in}} \approx n_{\text{out}}$ (common in practice), this is close to both $1/n_{\text{in}}$ and $1/n_{\text{out}}$.

The uniform variant samples from $U(-a, a)$ with $a = \sqrt{6/(n_{\text{in}} + n_{\text{out}})}$, since a uniform distribution on $[-a, a]$ has variance $a^2/3$.
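A quick check of the uniform bound, with hypothetical fan sizes:

```python
import numpy as np

# U(-a, a) with a = sqrt(6/(n_in + n_out)) has variance a^2/3 = 2/(n_in + n_out),
# matching the Gaussian Xavier variant.
n_in, n_out = 256, 128  # illustrative fan sizes
a = np.sqrt(6.0 / (n_in + n_out))
W = np.random.default_rng(4).uniform(-a, a, size=(n_in, n_out))
target = 2.0 / (n_in + n_out)
print(W.var(), target)  # both ~0.0052
```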

Signal Propagation Theory

Signal propagation theory generalizes the Xavier/He analysis to arbitrary architectures and activation functions. The central question: for a random input $x$, what happens to the distribution of $a_l^{(j)}$ as $l$ grows?

Define the mean field quantities:

$$q_l = \mathbb{E}[(a_l^{(j)})^2], \quad c_l = \frac{\mathbb{E}[a_l^{(j)} \tilde{a}_l^{(j)}]}{\sqrt{q_l \tilde{q}_l}}$$

where $a_l$ and $\tilde{a}_l$ are activations from two different inputs. The quantity $q_l$ tracks signal magnitude, and $c_l$ tracks the correlation between representations of different inputs.

For stable training, we need:

  1. $q_l$ stays bounded and bounded away from zero (no explosion or collapse).
  2. $c_l$ does not converge to 1 (otherwise all inputs produce the same representation, and the network cannot distinguish them).

Both conditions place constraints on the weight variance $\sigma_w^2$ and bias variance $\sigma_b^2$. The critical line $\sigma_w^2 \, \mathbb{E}[\phi'(z)^2] = 1$ (where $\phi$ is the activation function and $z \sim \mathcal{N}(0, q^*)$) separates the ordered phase (signals collapse) from the chaotic phase (signals explode). Xavier and He initialization place the network near this critical line.
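The critical-line quantity $\mathbb{E}[\phi'(z)^2]$ is easy to estimate by Monte Carlo. A sketch for $\phi = \tanh$, evaluating at $q^* = 1$ for illustration (in the full theory, $q^*$ is determined self-consistently as a fixed point):

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(0.0, 1.0, 1_000_000)          # z ~ N(0, q*) with q* = 1 (illustrative)
phi_prime_sq = (1.0 - np.tanh(z) ** 2) ** 2  # tanh'(z) = 1 - tanh(z)^2
crit = phi_prime_sq.mean()                   # E[phi'(z)^2]
sigma_w2_critical = 1.0 / crit               # weight variance on the critical line
print(crit, sigma_w2_critical)
```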

Connection to NTK Parameterization

The Neural Tangent Kernel (NTK) parameterization scales the output of each layer by $1/\sqrt{n}$, where $n$ is the layer width:

$$f(x) = \frac{1}{\sqrt{n_L}} W_L \, \phi\!\left(\frac{1}{\sqrt{n_{L-1}}} W_{L-1} \, \phi(\cdots)\right)$$

with $W_l^{(ij)} \sim \mathcal{N}(0, 1)$ (unit variance, not scaled). The $1/\sqrt{n}$ factors are built into the architecture rather than the initialization. This is mathematically equivalent to initializing weights with variance $1/n$ (for ReLU networks, the scaling typically includes an extra factor of $\sqrt{2}$, recovering He initialization) but separates the scaling concern from the randomness.
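The equivalence is purely algebraic: folding the $1/\sqrt{n}$ factor into the weight matrix yields the identical function. A two-layer ReLU sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 128  # illustrative width
x = rng.standard_normal(n)
V1 = rng.standard_normal((n, n))  # unit-variance NTK weights
V2 = rng.standard_normal((n, n))
relu = lambda z: np.maximum(0.0, z)

# NTK parameterization: scale factors live in the architecture.
ntk_out = (1.0 / np.sqrt(n)) * (V2 @ relu((1.0 / np.sqrt(n)) * (V1 @ x)))
# Standard parameterization: scale factors folded into the weights.
std_out = (V2 / np.sqrt(n)) @ relu((V1 / np.sqrt(n)) @ x)
print(np.allclose(ntk_out, std_out))  # True
```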

The NTK parameterization is the standard in theoretical analyses of infinite-width networks, where the network's training dynamics converge to a kernel regression with a fixed kernel (the NTK). The practical consequence: understanding initialization in the NTK framework connects finite-width training dynamics to the well-understood theory of kernel methods.

Watch Out

NTK parameterization does not change the function class

The NTK parameterization and the standard parameterization with appropriately scaled weights represent the same set of functions. The difference is in how the learning rate scales with width. In the NTK parameterization, the gradient update $\Delta f$ has magnitude $O(1)$ regardless of width, which makes the infinite-width limit well-defined. In the standard parameterization, you must scale the learning rate as $\eta \propto 1/n$ to get the same behavior.

Connection to Random Matrix Theory

At initialization, the weight matrices $W_1, \ldots, W_L$ are random matrices. The product $W_L \cdots W_1$ determines how signals propagate. Random matrix theory tells us that the singular values of a product of $L$ random matrices with appropriate scaling concentrate around 1. Specifically, for $W_l$ with i.i.d. entries of variance $1/n$, the singular values of the product converge to a deterministic distribution as $n \to \infty$.

The connection: Xavier and He initialization are choosing the variance so that the expected squared singular value of each factor is 1. This ensures the product neither grows nor shrinks, which is exactly the condition for stable forward and backward signal propagation.
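This concentration can be glimpsed by tracking the norm of a fixed vector through a random product; sizes here are illustrative. With entry variance $1/n$ the norm stays $O(1)$; with variance $c/n$ it scales like $c^{L/2}$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, L = 200, 10  # illustrative width and depth

def norm_after_product(entry_var):
    """Multiply a roughly unit-norm vector by L random matrices; return its norm."""
    x = rng.standard_normal(n) / np.sqrt(n)  # ||x|| ~ 1
    for _ in range(L):
        x = rng.normal(0, np.sqrt(entry_var), (n, n)) @ x
    return np.linalg.norm(x)

norm_good = norm_after_product(1.0 / n)  # stays O(1)
norm_bad  = norm_after_product(4.0 / n)  # grows like 2^10 ~ 1000
print(norm_good, norm_bad)
```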

Common Confusions

Watch Out

Xavier is not wrong for ReLU, it is suboptimal

Xavier initialization with ReLU networks does not cause immediate divergence. It causes gradual signal decay because each ReLU layer halves the variance that Xavier predicts will be preserved. For shallow networks (5-10 layers), the decay is mild. For deep networks (50+ layers), the effect compounds and training fails. He initialization is the correct fix.

Watch Out

Initialization is about the first step, not the entire training

Good initialization ensures stable gradients at step 0. Once training begins, the weights move away from their initial values. Batch normalization and residual connections help maintain stability throughout training, reducing (but not eliminating) the importance of initialization.

Key Takeaways

  • Bad initialization causes exponential growth or decay of activations and gradients across layers
  • Xavier: $\text{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$, designed for linear/tanh activations
  • He: $\text{Var}(W) = 2/n_{\text{in}}$, designed for ReLU (doubles the forward-pass Xavier variance to compensate for zeroing negative inputs)
  • Both derive from a single principle: preserve variance across layers
  • Random matrix theory explains why these scalings keep the singular values of weight products near 1

Exercises

ExerciseCore

Problem

A network has 10 hidden layers, each with 512 units and ReLU activations. Compute the activation variance at layer 10 relative to the input variance under (a) Xavier initialization and (b) He initialization.

ExerciseAdvanced

Problem

Derive the correct initialization variance for a layer using Leaky ReLU with negative slope $\alpha = 0.01$. How much does it differ from standard He initialization?

ExerciseCore

Problem

A network has 3 hidden layers with widths 256, 128, 64 and uses tanh activations. Write the Xavier initialization variance for each layer's weight matrix.

ExerciseAdvanced

Problem

Explain why initializing all weights to the same nonzero constant (e.g., $W_{ij} = 0.01$ for all $i, j$) fails even though the weights are nonzero. How does this differ from initializing weights to a constant but biases randomly?

References

Canonical:

  • Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks" (2010), AISTATS, Sections 1-4
  • He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (2015), ICCV, Section 2 (initialization derivation)
  • LeCun et al., "Efficient BackProp" (1998), in Neural Networks: Tricks of the Trade, Section 4.6 (weight initialization heuristics)

Current:

  • Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (2018), NeurIPS (NTK parameterization)
  • Pennington & Worah, "Nonlinear Random Matrix Theory for Deep Learning" (2017), NeurIPS (signal propagation analysis)
  • Schoenholz et al., "Deep Information Propagation" (2017), ICLR (mean field theory for deep networks, edge of chaos)
  • Huang et al., "Improving Transformer Training with Orthogonal Initialization" (2020)

Last reviewed: April 2026
