Training Techniques
Weight Initialization
Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.
Why This Matters
A neural network with bad initialization cannot train. If weights are too large, activations and gradients explode exponentially with depth. If weights are too small, they vanish exponentially. Proper initialization preserves gradient flow across layers. The entire field of deep learning was stuck on shallow networks for decades partly because of this problem. Xavier and He initialization solved it for standard architectures, and techniques like batch normalization further reduce sensitivity to the initial scale.
Mental Model
Consider a deep network as a chain of matrix multiplications. If each matrix has eigenvalues with magnitude greater than 1, the product grows exponentially. If less than 1, it shrinks exponentially. Good initialization sets the weight matrices so their effect on signal magnitude is approximately 1 per layer. The signal (activations in the forward pass, gradients in the backward pass) neither grows nor shrinks.
Formal Setup
Consider a feedforward network with $L$ layers and no activation function (for now):

$$h^{(l)} = W^{(l)} h^{(l-1)}, \qquad l = 1, \dots, L$$

where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $h^{(0)} = x$ is the input. The output is $h^{(L)}$.
Variance Preservation Property
An initialization scheme satisfies variance preservation if, for each layer $l$:

$$\operatorname{Var}\big(h_i^{(l)}\big) = \operatorname{Var}\big(h_i^{(l-1)}\big)$$

where $h_i^{(l)}$ denotes the $i$-th component of the activation vector at layer $l$. Equivalently, the signal magnitude stays constant across layers at initialization.
The Problem with Naive Initialization
If $W^{(l)}$ has entries drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$ and $h^{(l-1)}$ has zero-mean entries with variance $v$, then:

$$\operatorname{Var}\big(h_i^{(l)}\big) = n_{l-1}\,\sigma^2\,v$$

Each layer multiplies the variance by $n_{l-1}\,\sigma^2$. After $L$ layers:

$$\operatorname{Var}\big(h_i^{(L)}\big) = \left(\prod_{l=1}^{L} n_{l-1}\,\sigma^2\right)\operatorname{Var}(x_i)$$

If $n\sigma^2 > 1$ for all layers, activations explode. If $n\sigma^2 < 1$, they vanish. For a network with 50 layers and $n\sigma^2 = 1.5$, the activation variance grows by $1.5^{50} \approx 6 \times 10^{8}$. At $n\sigma^2 = 0.5$, it shrinks by $0.5^{50} \approx 9 \times 10^{-16}$.
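This exponential behavior is easy to reproduce numerically; a minimal sketch with a 50-layer linear network (the width, depth, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50

def final_variance(sigma2):
    """Push a unit-variance input through `depth` linear layers with
    i.i.d. N(0, sigma2) weights and return the output variance."""
    h = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma2), size=(n, n))
        h = W @ h
    return h.var()

v_explode = final_variance(1.5 / n)  # n*sigma^2 = 1.5: grows like 1.5^50
v_vanish = final_variance(0.5 / n)   # n*sigma^2 = 0.5: shrinks like 0.5^50
v_stable = final_variance(1.0 / n)   # n*sigma^2 = 1.0: roughly preserved
print(f"{v_explode:.3e} {v_vanish:.3e} {v_stable:.3e}")
```

A single realization fluctuates around the mean-field prediction, but the three regimes are separated by dozens of orders of magnitude.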
Main Theorems
Xavier/Glorot Initialization
Statement
For a layer with $n_\text{in}$ input units and $n_\text{out}$ output units, initializing weights as:

$$W_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_\text{in} + n_\text{out}}\right)$$

preserves variance in both the forward pass (activations) and the backward pass (gradients) simultaneously, under the compromise that the forward pass condition requires $\sigma^2 = 1/n_\text{in}$ and the backward pass requires $\sigma^2 = 1/n_\text{out}$.
Intuition
Forward pass: each activation is a sum of $n_\text{in}$ terms. To keep variance at 1, each term should have variance $1/n_\text{in}$, so $\sigma^2 = 1/n_\text{in}$. Backward pass: each gradient component is a sum of $n_\text{out}$ terms, requiring $\sigma^2 = 1/n_\text{out}$. Xavier takes the harmonic mean of these two requirements: $\sigma^2 = 2/(n_\text{in} + n_\text{out})$.
Proof Sketch
For the forward pass, $\operatorname{Var}(h_i) = \operatorname{Var}\big(\sum_{j=1}^{n_\text{in}} W_{ij}\,x_j\big)$. Since $W_{ij}$ and $x_j$ are independent and zero-mean, $\operatorname{Var}(h_i) = n_\text{in}\,\sigma^2\,\operatorname{Var}(x_j)$. Setting this equal to $\operatorname{Var}(x_j)$ gives $\sigma^2 = 1/n_\text{in}$. The backward pass derivation is symmetric with $n_\text{out}$.
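Both conditions can be verified empirically; a minimal sketch with arbitrary (unequal) widths, showing the forward condition satisfied and the backward condition violated by the same matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, batch = 256, 128, 10_000

x = rng.standard_normal((batch, n_in))                       # Var(x_j) = 1
W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in)) # sigma^2 = 1/n_in
h = x @ W.T                                                  # h_i = sum_j W_ij x_j
print(f"forward:  Var(x) = {x.var():.3f}, Var(h) = {h.var():.3f}")  # both ~ 1

# Backward through the same W: variance picks up n_out * sigma^2 instead,
# illustrating why one sigma^2 cannot satisfy both conditions when n_in != n_out.
delta = rng.standard_normal((batch, n_out))
g = delta @ W                                                # g_j = sum_i W_ij delta_i
print(f"backward: Var(g) = {g.var():.3f}")                   # ~ n_out/n_in = 0.5
```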
Why It Matters
Xavier initialization enabled training of deep networks with sigmoid and tanh activations. Before Xavier, networks deeper than ~5 layers were considered impractical to train. The paper (Glorot and Bengio, 2010) was a turning point for deep learning.
Failure Mode
Xavier assumes linear or tanh activations near zero. ReLU sets half of its inputs to zero, which halves the effective fan-in. Xavier underestimates the required variance for ReLU networks, leading to signal decay. He initialization fixes this.
He/Kaiming Initialization
Statement
For a layer with $n_\text{in}$ input units followed by a ReLU activation, initializing weights as:

$$W_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_\text{in}}\right)$$
preserves the variance of activations across layers.
Intuition
ReLU zeros out negative inputs, keeping only the positive half. For a symmetric zero-mean distribution, this halves the variance: $\mathbb{E}[\operatorname{ReLU}(z)^2] = \tfrac{1}{2}\mathbb{E}[z^2]$. To compensate, we double the weight variance compared to Xavier (using $2/n_\text{in}$ instead of $1/n_\text{in}$).
Proof Sketch
Let $z = \sum_{j=1}^{n_\text{in}} W_{ij}\,x_j$, so $\operatorname{Var}(z) = n_\text{in}\,\sigma^2\,\mathbb{E}[x_j^2]$. After ReLU: $y = \max(0, z)$. For symmetric zero-mean $z$, $\mathbb{E}[y^2] = \tfrac{1}{2}\mathbb{E}[z^2] = \tfrac{1}{2}\operatorname{Var}(z)$. Setting $\mathbb{E}[y^2] = \mathbb{E}[x_j^2]$ gives $\tfrac{1}{2}\,n_\text{in}\,\sigma^2 = 1$, so $\sigma^2 = 2/n_\text{in}$.
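Both steps of the sketch can be checked by simulation (width, depth, and batch size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: E[ReLU(z)^2] = Var(z)/2 for zero-mean symmetric z
z = rng.standard_normal(1_000_000)
half = np.mean(np.maximum(z, 0.0) ** 2)
print(half)  # ~ 0.5

# Step 2: a deep ReLU stack with He-initialized weights keeps E[h^2] near 1
n, depth, batch = 512, 20, 512
h = rng.standard_normal((batch, n))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    h = np.maximum(h @ W.T, 0.0)
ms = np.mean(h ** 2)
print(ms)    # stays O(1) instead of decaying or exploding
```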
Why It Matters
He initialization made it possible to train very deep ReLU networks (100+ layers). It was a key ingredient in the ResNet paper (He et al., 2015), which demonstrated that networks with 152 layers could train successfully.
Failure Mode
He initialization assumes standard ReLU. For leaky ReLU with slope $\alpha$ on the negative side, the correction factor is $2/(1+\alpha^2)$ instead of 2. For GELU, SiLU, or other smooth activations, the exact correction differs, but He initialization is still a reasonable starting point.
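The leaky-ReLU correction is easy to verify by Monte Carlo; a small sketch (the helper name `he_variance` is our illustrative choice, not a library function, and the slope 0.2 is arbitrary):

```python
import numpy as np

def he_variance(n_in: int, negative_slope: float = 0.0) -> float:
    """Generalized He variance 2 / ((1 + slope^2) * n_in).
    slope=0 recovers standard He initialization (illustrative helper)."""
    return 2.0 / ((1.0 + negative_slope ** 2) * n_in)

# Check the correction: E[leaky_relu(z)^2] = (1 + a^2)/2 for z ~ N(0,1)
rng = np.random.default_rng(3)
a = 0.2
z = rng.standard_normal(1_000_000)
lrelu_ms = np.mean(np.where(z > 0, z, a * z) ** 2)
print(lrelu_ms, (1 + a**2) / 2)  # both ~ 0.52
```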
The Symmetry Breaking Argument
Zero Initialization Fails
Statement
If all weights in a layer are initialized to the same value (including zero), then all neurons in that layer compute the same function, receive the same gradient, and remain identical throughout training. The layer effectively has only one neuron regardless of its width.
Intuition
At initialization, every neuron in a layer computes $\sigma(w^\top x + b)$ with identical $w$ and $b$. The outputs are identical, so the loss gradient with respect to each neuron's weights is identical. After the gradient update, all weights remain equal. This symmetry is never broken by gradient descent.
Proof Sketch
By induction on the training step. At step 0, all neurons in layer $l$ have the same weights $w$ and bias $b$, so the activations satisfy $a_i = a_j$ for all neurons $i, j$ in the layer. The gradient $\partial L / \partial w_i$ depends on $a_i$ and the downstream gradients. Since all $a_i$ are equal and the downstream computation treats all neurons symmetrically, all gradients are equal. The update preserves equality.
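The induction can be watched directly in a toy network; a minimal sketch with a one-hidden-layer tanh network trained by hand-coded gradient descent (the sizes, the constant 0.5, and the learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hid, lr = 4, 8, 0.01
x = rng.standard_normal(n_in)
t = 1.0                                   # scalar regression target

# constant (symmetric) initialization: every hidden neuron starts identical
W1 = np.full((n_hid, n_in), 0.5)
w2 = np.full(n_hid, 0.5)

for _ in range(50):
    h = np.tanh(W1 @ x)                   # identical hidden activations
    y = w2 @ h                            # scalar output
    err = y - t                           # gradient of (y - t)^2 / 2 w.r.t. y
    grad_w2 = err * h
    grad_W1 = np.outer(err * w2 * (1 - h**2), x)
    w2 -= lr * grad_w2
    W1 -= lr * grad_W1

rows_equal = np.allclose(W1, W1[0])
print(rows_equal, np.allclose(w2, w2[0]))  # True True: symmetry never broken
```

After 50 full gradient steps, every hidden neuron still carries exactly the same weight vector: the layer behaves as a single neuron.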
Why It Matters
This is why random initialization is necessary, not optional. The randomness serves one purpose: break the symmetry between neurons so they can specialize to different features during training. The magnitude of the random initialization then determines signal propagation (Xavier/He), but the randomness itself is the mechanism for expressivity.
Failure Mode
Biases can safely be initialized to zero. They do not cause symmetry problems because different neurons in the same layer share the same bias value but have different weight vectors (due to random weight initialization). The bias just shifts the activation; it does not contribute to the symmetry between neurons. The exception is LSTM forget gate biases, which are typically initialized to 1 to encourage gradient flow at the start of training.
What Happens with All-Zero Initialization
Setting all weights to zero is the most catastrophic initialization. Beyond the symmetry problem, zero weights mean zero activations at every layer (for networks without bias). The gradient of the loss with respect to the weights is also zero (since activations are zero), so gradient descent makes no progress. The network is stuck at its initial state permanently.
With biases but zero weights, the network produces constant output regardless of input. Gradients are nonzero but identical across neurons in each layer, so the symmetry is never broken.
Concrete Examples of Bad Initialization
Variance explosion with naive initialization
Consider a 20-layer ReLU network with width 512 at every layer. Initialize weights as $W_{ij} \sim \mathcal{N}(0, 1)$. The variance multiplier per layer is $512 \cdot 1 / 2 = 256$ (the factor of 2 accounts for ReLU zeroing negative inputs). After 20 layers: $256^{20} \approx 1.5 \times 10^{48}$. An input with unit variance produces activations with variance around $10^{48}$ at the output layer. In float32, the activation variance overflows to infinity after about 16 layers, since $256^{16} = 2^{128}$ already exceeds the float32 maximum of $\approx 3.4 \times 10^{38}$. The loss is NaN on the first forward pass.
Reducing to $\sigma = 0.01$: the multiplier becomes $512 \cdot 10^{-4} / 2 = 0.0256$. After 20 layers: $0.0256^{20} \approx 10^{-32}$. Activations vanish to zero. Gradients are effectively zero. Training makes no progress.
He initialization sets $\sigma^2 = 2/512$. The multiplier is $512 \cdot (2/512) / 2 = 1$. After 20 layers: $1^{20} = 1$. Variance is preserved exactly.
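The three regimes can be replicated numerically; a sketch in float64 so that even the exploding case stays finite (the batch size and the example values of $\sigma$ are our choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, depth = 512, 20

def output_ms(sigma2):
    """Mean squared activation after `depth` width-`n` ReLU layers,
    computed in float64 to sidestep float32 overflow."""
    h = rng.standard_normal((256, n))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma2), size=(n, n))
        h = np.maximum(h @ W.T, 0.0)
    return np.mean(h ** 2)

v_naive_big = output_ms(1.0)      # multiplier 256 per layer: explodes
v_naive_small = output_ms(1e-4)   # multiplier 0.0256 per layer: vanishes
v_he = output_ms(2.0 / n)         # multiplier 1 per layer: preserved
print(f"{v_naive_big:.3e} {v_naive_small:.3e} {v_he:.3e}")
```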
GPT-style transformer initialization
Large transformer models use a scaled initialization for residual stream contributions. In a transformer with $N$ layers, each layer adds its block output to the residual stream: $h \leftarrow h + f_l(h)$. If each addition has variance $\sigma^2$, the residual stream variance grows as $N\sigma^2$ after $N$ layers. GPT-2 scales the output projection of each attention and MLP block by $1/\sqrt{N}$, so the total variance contribution from all layers is $N \cdot \sigma^2/N = \sigma^2$. This keeps the residual stream variance bounded regardless of depth. Without this scaling, a 96-layer GPT-3 would have activations growing by a factor of roughly 96 in variance from residual accumulation alone.
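A toy sketch of this accumulation, assuming independent unit-variance block outputs (a deliberate simplification of real attention/MLP blocks; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_layers, batch = 256, 96, 512

def stream_variance(scale: bool) -> float:
    """Accumulate `n_layers` unit-variance block outputs on a residual
    stream, optionally with GPT-2-style 1/sqrt(N) output scaling."""
    h = np.zeros((batch, d))
    for _ in range(n_layers):
        update = rng.standard_normal((batch, d))  # stand-in block output
        if scale:
            update = update / np.sqrt(n_layers)
        h = h + update
    return h.var()

v_unscaled = stream_variance(False)
v_scaled = stream_variance(True)
print(f"unscaled: {v_unscaled:.1f}  scaled: {v_scaled:.2f}")  # ~96 vs ~1
```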
Initialization interacts with residual connections
In a ResNet, the output of each block is $h_{l+1} = h_l + F(h_l)$. If $h_l$ has variance $v$ and $F(h_l)$ has variance $\sigma^2$, the output has variance $v + \sigma^2$ (assuming independence). After $L$ blocks: variance is $v_0 + L\sigma^2$. This linear growth (instead of the exponential growth without skip connections) makes ResNets much less sensitive to initialization. But for very deep ResNets (1000+ layers), even linear growth can cause problems. Fixup initialization and ReZero (initializing residual branches to zero) address this.
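A rough numerical sketch of the linear growth, where the branch output is normalized to unit variance to stand in for the batch norm inside a real ResNet block (an assumption of this toy model, which also treats stream and branch as approximately independent):

```python
import numpy as np

rng = np.random.default_rng(7)
n, depth, batch = 256, 50, 512

def final_var(residual: bool) -> float:
    """Stack He-initialized ReLU blocks whose outputs are normalized to
    unit variance, with or without the skip connection."""
    h = rng.standard_normal((batch, n))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
        f = np.maximum(h @ W.T, 0.0)
        f = (f - f.mean()) / f.std()  # branch output: mean 0, variance 1
        h = h + f if residual else f
    return h.var()

v_plain = final_var(False)  # ~1: each block alone emits unit variance
v_res = final_var(True)     # ~1 + depth: linear growth, not exponential
print(f"plain: {v_plain:.2f}  residual: {v_res:.1f}")
```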
Xavier Derivation: Forward and Backward Conditions
The Xavier initialization variance is a compromise between two conflicting requirements.
Forward pass condition. For $h_i^{(l)} = \sum_{j} W_{ij}^{(l)} h_j^{(l-1)}$ with independent zero-mean weights and activations:

$$\operatorname{Var}\big(h_i^{(l)}\big) = n_\text{in}\,\sigma^2\,\operatorname{Var}\big(h_j^{(l-1)}\big)$$

Preserving variance requires $n_\text{in}\,\sigma^2 = 1$, i.e. $\sigma^2 = 1/n_\text{in}$.
Backward pass condition. The gradient flows as $\delta^{(l-1)} = (W^{(l)})^{\top}\delta^{(l)}$, where $\delta^{(l)} = \partial L / \partial h^{(l)}$. By the same argument:

$$\operatorname{Var}\big(\delta_j^{(l-1)}\big) = n_\text{out}\,\sigma^2\,\operatorname{Var}\big(\delta_i^{(l)}\big)$$

Preserving gradient variance requires $n_\text{out}\,\sigma^2 = 1$, i.e. $\sigma^2 = 1/n_\text{out}$.
These two conditions are incompatible unless $n_\text{in} = n_\text{out}$. Xavier's solution is the harmonic mean: $\sigma^2 = 2/(n_\text{in} + n_\text{out})$, which approximately preserves both forward and backward signal magnitudes. For layers where $n_\text{in} \approx n_\text{out}$ (common in practice), this is close to both $1/n_\text{in}$ and $1/n_\text{out}$.
The uniform variant samples from $U[-a, a]$ where $a = \sqrt{6/(n_\text{in}+n_\text{out})}$, since a uniform on $[-a, a]$ has variance $a^2/3 = 2/(n_\text{in}+n_\text{out})$.
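Both sampling variants can be sanity-checked numerically (the layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n_in, n_out = 300, 100
target = 2.0 / (n_in + n_out)  # Xavier variance = 0.005

W_normal = rng.normal(0.0, np.sqrt(target), size=(n_out, n_in))
a = np.sqrt(6.0 / (n_in + n_out))  # uniform half-width
W_uniform = rng.uniform(-a, a, size=(n_out, n_in))

# both samplers hit the same Xavier variance, since Var(U[-a, a]) = a^2/3
print(W_normal.var(), W_uniform.var(), target)
```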
Signal Propagation Theory
Signal propagation theory generalizes the Xavier/He analysis to arbitrary architectures and activation functions. The central question: for a random input $x$, what happens to the distribution of $h^{(l)}$ as $l$ grows?
Define the mean field quantities:

$$q^{(l)} = \frac{1}{n_l}\sum_{i=1}^{n_l} \mathbb{E}\big[(h_i^{(l)})^2\big], \qquad c^{(l)} = \frac{1}{n_l\,q^{(l)}}\sum_{i=1}^{n_l} \mathbb{E}\big[h_i^{(l)}(x)\,h_i^{(l)}(x')\big]$$

where $h^{(l)}(x)$ and $h^{(l)}(x')$ are activations from two different inputs. The quantity $q^{(l)}$ tracks signal magnitude, and $c^{(l)}$ tracks the correlation between representations of different inputs.
For stable training, we need:
- $q^{(l)}$ stays bounded and bounded away from zero (no explosion or collapse).
- $c^{(l)}$ does not converge to 1 (otherwise all inputs produce the same representation, and the network cannot distinguish them).
Both conditions place constraints on the weight variance $\sigma_w^2$ and bias variance $\sigma_b^2$. The critical line $\chi_1 = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal{N}(0,1)}\big[\phi'(\sqrt{q^*}\,z)^2\big] = 1$ (where $\phi$ is the activation function and $q^*$ is the fixed point of $q^{(l)}$) separates the ordered phase (signals collapse) from the chaotic phase (signals explode). Xavier and He initialization place the network near this critical line.
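A Monte Carlo sketch of $\chi_1$, assuming the mean-field convention in which weight entries have variance $\sigma_w^2/n$ (the helper name `chi1` and the choice $q^* = 1$ are ours):

```python
import numpy as np

rng = np.random.default_rng(9)

def chi1(sigma_w2, q_star, phi_prime, samples=1_000_000):
    """Monte Carlo estimate of chi_1 = sigma_w^2 * E[phi'(sqrt(q*) z)^2]
    for z ~ N(0, 1); chi_1 = 1 is the critical line."""
    z = rng.standard_normal(samples)
    return sigma_w2 * np.mean(phi_prime(np.sqrt(q_star) * z) ** 2)

tanh_deriv = lambda u: 1.0 / np.cosh(u) ** 2  # derivative of tanh

chi_linear = chi1(1.0, 1.0, lambda u: np.ones_like(u))  # linear net: chi_1 = sigma_w^2
chi_tanh = chi1(1.0, 1.0, tanh_deriv)                   # saturation pushes chi_1 below 1
print(chi_linear, chi_tanh)  # ~1.0 (critical) and <1 (ordered phase)
```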
Connection to NTK Parameterization
The Neural Tangent Kernel (NTK) parameterization scales the output of each layer by $1/\sqrt{n_{l-1}}$, where $n_{l-1}$ is the layer's input width:

$$h^{(l)} = \frac{1}{\sqrt{n_{l-1}}}\, W^{(l)} \phi\big(h^{(l-1)}\big)$$

with $W^{(l)}_{ij} \sim \mathcal{N}(0, 1)$ (unit variance, not scaled). The $1/\sqrt{n}$ factors are built into the architecture rather than the initialization. This is mathematically equivalent to He initialization (weights of effective variance $2/n_{l-1}$, with the factor of 2 absorbed by the ReLU correction) but separates the scaling concern from the randomness.
The NTK parameterization is the standard in theoretical analyses of infinite-width networks, where the network's training dynamics converge to a kernel regression with a fixed kernel (the NTK). The practical consequence: understanding initialization in the NTK framework connects finite-width training dynamics to the well-understood theory of kernel methods.
NTK parameterization does not change the function class
The NTK parameterization and the standard parameterization with He-initialized weights represent the same set of functions. The difference is in how the learning rate scales with width. In the NTK parameterization, the function-space change from one gradient step has magnitude $O(1)$ regardless of width, which makes the infinite-width limit well-defined. In the standard parameterization, you must scale the learning rate as $O(1/n)$ to get the same behavior.
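The equivalence of the two parameterizations at initialization can be checked directly; a sketch with a single linear layer (no ReLU gain, arbitrary width):

```python
import numpy as np

rng = np.random.default_rng(10)
n, batch = 1024, 1000
x = rng.standard_normal((batch, n))

# standard parameterization: the scale lives in the initialization
W_std = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
h_std = x @ W_std.T

# NTK parameterization: unit-variance weights, 1/sqrt(n) in the forward pass
W_ntk = rng.standard_normal((n, n))
h_ntk = (x @ W_ntk.T) / np.sqrt(n)

print(h_std.var(), h_ntk.var())  # identical output distributions: both ~1
```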
Connection to Random Matrix Theory
At initialization, the weight matrices are random matrices. The product $W^{(L)} \cdots W^{(1)}$ determines how signals propagate. Random matrix theory tells us that the singular values of a product of random matrices with appropriate scaling concentrate around 1. Specifically, for $W^{(l)} \in \mathbb{R}^{n \times n}$ with i.i.d. entries of variance $1/n$, the singular values of the product converge to a deterministic limiting distribution as $n \to \infty$.
The connection: Xavier and He initialization are choosing the variance so that the expected squared singular value of each factor is 1. This ensures the product neither grows nor shrinks, which is exactly the condition for stable forward and backward signal propagation.
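A quick finite-$n$ check of the scaling claim, using the mean squared singular value $\|P\|_F^2 / n$ as the summary statistic (dimensions and depth are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
n, L = 256, 10

def layer_product(sigma2):
    """Product W^(L) ... W^(1) of i.i.d. Gaussian weight matrices."""
    P = np.eye(n)
    for _ in range(L):
        P = rng.normal(0.0, np.sqrt(sigma2), size=(n, n)) @ P
    return P

s_scaled = np.linalg.svd(layer_product(1.0 / n), compute_uv=False)
s_raw = np.linalg.svd(layer_product(1.0), compute_uv=False)

# with variance-1/n entries the expected squared singular value stays ~1;
# with variance-1 entries it grows like n^L
print(np.mean(s_scaled ** 2), np.mean(s_raw ** 2))
```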
Common Confusions
Xavier is not wrong for ReLU, it is suboptimal
Xavier initialization with ReLU networks does not cause immediate divergence. It causes gradual signal decay because each ReLU layer halves the variance that Xavier predicts will be preserved. For shallow networks (5-10 layers), the decay is mild. For deep networks (50+ layers), the effect compounds and training fails. He initialization is the correct fix.
Initialization is about the first step, not the entire training
Good initialization ensures stable gradients at step 0. Once training begins, the weights move away from their initial values. Batch normalization and residual connections help maintain stability throughout training, reducing (but not eliminating) the importance of initialization.
Key Takeaways
- Bad initialization causes exponential growth or decay of activations and gradients across layers
- Xavier: $\sigma^2 = 2/(n_\text{in} + n_\text{out})$, designed for linear/tanh activations
- He: $\sigma^2 = 2/n_\text{in}$, designed for ReLU (doubles Xavier's fan-in condition to compensate for zeroing negative inputs)
- Both derive from a single principle: preserve variance across layers
- Random matrix theory explains why these scalings keep the singular values of weight products near 1
Exercises
Problem
A network has 10 hidden layers, each with 512 units and ReLU activations. Compute the activation variance at layer 10 relative to the input variance under (a) Xavier initialization and (b) He initialization.
Problem
Derive the correct initialization variance for a layer using Leaky ReLU with negative slope $\alpha$. How much does it differ from standard He initialization?
Problem
A network has 3 hidden layers with widths 256, 128, 64 and uses tanh activations. Write the Xavier initialization variance for each layer's weight matrix.
Problem
Explain why initializing all weights to the same nonzero constant (e.g., $W_{ij} = c$ for some $c \neq 0$ and all $i, j$) fails even though the weights are nonzero. How does this differ from initializing weights to a constant but biases randomly?
References
Canonical:
- Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks" (2010), AISTATS, Sections 1-4
- He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (2015), ICCV, Section 2 (initialization derivation)
- LeCun et al., "Efficient BackProp" (1998), in Neural Networks: Tricks of the Trade, Section 4.6 (weight initialization heuristics)
Current:
- Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (2018), NeurIPS (NTK parameterization)
- Pennington & Worah, "Nonlinear Random Matrix Theory for Deep Learning" (2017), NeurIPS (signal propagation analysis)
- Schoenholz et al., "Deep Information Propagation" (2017), ICLR (mean field theory for deep networks, edge of chaos)
- Huang et al., "Improving Transformer Optimization Through Better Initialization" (2020), ICML (T-Fixup)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)