Numerical Stability
Whitening and Decorrelation
Transform data to have identity covariance, removing correlations and normalizing scales. ZCA and PCA whitening, why whitening helps optimization, and connections to batch normalization.
Why This Matters
Correlated features with different scales make optimization hard. Gradient descent on an elongated elliptical loss surface zigzags instead of heading straight to the minimum. Whitening removes correlations and equalizes scales, turning that ellipse into a circle. The condition number drops to 1, and gradient descent converges in one step on quadratics.
Whitening also explains why batch normalization and layer normalization work: they are approximate, online versions of whitening applied inside neural networks.
Mental Model
Imagine your data lives in a tilted, stretched ellipsoid. Some directions have high variance, others have low variance, and the axes are rotated relative to the coordinate axes. Whitening transforms this ellipsoid into a unit sphere: all directions have equal variance and are uncorrelated.
There are many ways to map an ellipsoid to a sphere. PCA whitening rotates to the principal axes first, then scales. ZCA whitening scales while staying as close as possible to the original coordinate system.
Formal Setup and Notation
Let $x \in \mathbb{R}^d$ be a random vector with mean $\mu = \mathbb{E}[x]$ and covariance $\Sigma = \mathbb{E}[(x - \mu)(x - \mu)^\top]$.
Whitening Transform
A whitening transform is any linear transformation $z = W(x - \mu)$ such that the transformed variable has identity covariance:
$$\operatorname{Cov}(z) = W \Sigma W^\top = I.$$
This requires $W^\top W = \Sigma^{-1}$, so $W = Q \Sigma^{-1/2}$ for any orthogonal matrix $Q$. Different choices of $Q$ give different whitening transforms, all equally valid statistically but with different geometric properties.
PCA Whitening
Let $\Sigma = U \Lambda U^\top$ be the eigendecomposition of the covariance matrix, where $U$ is orthogonal and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$. PCA whitening is:
$$W_{\text{PCA}} = \Lambda^{-1/2} U^\top.$$
This first rotates to the principal component axes ($U^\top (x - \mu)$), then scales each axis by $1/\sqrt{\lambda_i}$. The result lives in PCA coordinates: it is decorrelated and normalized, but rotated away from the original coordinate system.
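A minimal NumPy sketch of PCA whitening on sampled data (the covariance matrix here is an arbitrary illustrative choice, not one from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated 2-D data.
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=10_000)

# Estimate mean and covariance from the sample.
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Eigendecomposition S = U diag(lam) U^T (eigh assumes symmetric input).
lam, U = np.linalg.eigh(S)

# PCA whitening: rotate to principal axes, then scale by 1/sqrt(lambda_i).
W_pca = np.diag(1.0 / np.sqrt(lam)) @ U.T
Z = (X - mu) @ W_pca.T

# The whitened sample covariance is the identity.
cov_Z = np.cov(Z, rowvar=False)
```

Because `W_pca` is built from the sample covariance itself, `cov_Z` equals the identity up to floating-point error, not just in expectation.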
ZCA Whitening
ZCA (Zero-phase Component Analysis) whitening is:
$$W_{\text{ZCA}} = \Sigma^{-1/2} = U \Lambda^{-1/2} U^\top,$$
where $\Sigma^{-1/2}$ is the symmetric matrix square root of $\Sigma^{-1}$. ZCA whitening produces the whitened data that is closest to the centered original data in the least-squares sense. It preserves the orientation of the original coordinate system as much as possible.
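The symmetric square root can be computed directly from the eigendecomposition; a short sketch with an illustrative covariance:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])  # illustrative covariance

# Sigma^{-1/2} via the eigendecomposition Sigma = U diag(lam) U^T.
lam, U = np.linalg.eigh(Sigma)
W_zca = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

# W_zca is symmetric, and W_zca @ Sigma @ W_zca^T = I.
check = W_zca @ Sigma @ W_zca.T
```

Symmetry of `W_zca` is exactly what distinguishes ZCA from the other members of the $Q\Sigma^{-1/2}$ family.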
Core Definitions
The relationship between PCA and ZCA whitening is simple. PCA whitening applies $W_{\text{PCA}} = \Lambda^{-1/2} U^\top$. ZCA whitening applies $W_{\text{ZCA}} = U \Lambda^{-1/2} U^\top$. They differ by the rotation $U$:
$$W_{\text{ZCA}} = U \, W_{\text{PCA}}.$$
ZCA whitening rotates back to the original coordinate system after PCA whitening. This means ZCA-whitened images still look like images (just with enhanced edges), while PCA-whitened images are in abstract principal component space.
Why whitening helps optimization: for the quadratic objective $f(w) = \tfrac{1}{2} w^\top H w - b^\top w$, gradient descent converges at a rate that depends on the condition number $\kappa(H) = \lambda_{\max}/\lambda_{\min}$. After whitening, $H = I$ and $\kappa = 1$, so gradient descent (with step size 1) converges in one step. For general objectives, whitening the data (or the gradient) reduces the effective condition number, speeding up convergence.
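The one-step claim is easy to verify on a small quadratic (the values of $b$ and the starting point below are arbitrary):

```python
import numpy as np

# f(w) = 0.5 w^T H w - b^T w has gradient H w - b and minimizer
# w* = H^{-1} b.  With H = I (perfectly conditioned, as after
# whitening), one gradient step of size 1 lands on w* from any start.
H = np.eye(3)
b = np.array([1.0, -2.0, 0.5])
w = np.array([10.0, 10.0, 10.0])   # arbitrary starting point

w = w - 1.0 * (H @ w - b)          # single gradient-descent step
w_star = np.linalg.solve(H, b)
```

With an ill-conditioned $H$ the same step size would overshoot along the high-curvature directions, which is exactly the zigzagging described above.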
Main Theorems
Whitening Produces Identity Covariance
Statement
Let $\Sigma$ be the covariance of $x$ and $W$ be any matrix satisfying $W \Sigma W^\top = I$. Then $z = W(x - \mu)$ has:
$$\mathbb{E}[z] = 0, \qquad \operatorname{Cov}(z) = I.$$
Among all such $W$, the ZCA whitening matrix $W_{\text{ZCA}} = \Sigma^{-1/2}$ minimizes $\mathbb{E}\,\|z - (x - \mu)\|^2$, i.e., it produces whitened data closest to the centered original data.
Intuition
Any $W$ satisfying $W \Sigma W^\top = I$ maps the covariance ellipsoid to the unit sphere. There are infinitely many such $W$ (differing by orthogonal rotations). ZCA picks the one that rotates the least: it finds the "shortest path" from the original coordinate system to a whitened one.
Proof Sketch
$\operatorname{Cov}(z) = W \Sigma W^\top = I$ by construction. For the optimality of ZCA: any $W$ satisfying $W \Sigma W^\top = I$ can be written as $W = Q \Sigma^{-1/2}$ for orthogonal $Q$. Then $\mathbb{E}\,\|z - (x - \mu)\|^2 = \operatorname{tr}\!\big((W - I)\,\Sigma\,(W - I)^\top\big)$. Expanding and using $W \Sigma W^\top = I$, this equals $d + \operatorname{tr}(\Sigma) - 2\operatorname{tr}(Q \Sigma^{1/2})$, which is minimized when $Q = I$ (since $\operatorname{tr}(Q \Sigma^{1/2}) \le \operatorname{tr}(\Sigma^{1/2})$ for orthogonal $Q$), giving $W = \Sigma^{-1/2}$.
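The optimality argument can be sanity-checked numerically: rotating the ZCA matrix by any orthogonal $Q \ne I$ should only increase the expected squared distance. The covariance and rotation angle below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])  # illustrative covariance

lam, U = np.linalg.eigh(Sigma)
W_zca = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T   # Sigma^{-1/2}

def expected_sq_dist(W, Sigma):
    # E||W(x-mu) - (x-mu)||^2 = tr((W - I) Sigma (W - I)^T)
    D = W - np.eye(len(Sigma))
    return np.trace(D @ Sigma @ D.T)

# Any other whitening matrix is Q @ W_zca for some orthogonal Q;
# a random 2-D rotation should never beat ZCA on this distance.
theta = rng.uniform(0.1, 2 * np.pi - 0.1)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

d_zca = expected_sq_dist(W_zca, Sigma)
d_rot = expected_sq_dist(Q @ W_zca, Sigma)
```

Both matrices whiten perfectly; they differ only in how far they move the data.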
Why It Matters
This formalizes why whitening is a natural preprocessing step. It removes all second-order structure from the data, leaving only higher-order dependencies for the model to learn. For linear models, whitening makes the optimization landscape perfectly conditioned.
Failure Mode
Whitening requires estimating $\Sigma$ and computing its inverse square root, which costs $O(d^3)$. For high-dimensional data (large $d$), this is prohibitively expensive. Also, if some eigenvalues are near zero, $\Lambda^{-1/2}$ amplifies noise in those directions. Regularization (adding $\varepsilon I$ to $\Sigma$) is necessary in practice.
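A sketch of the regularized version (function name and $\varepsilon$ value are my own illustrative choices). The gain applied to a near-null direction is capped at $1/\sqrt{\varepsilon}$ instead of blowing up as $1/\sqrt{\lambda}$:

```python
import numpy as np

def zca_whitening_matrix(Sigma, eps=1e-4):
    # (Sigma + eps*I)^{-1/2}: each direction is scaled by
    # 1/sqrt(lambda + eps), so tiny eigenvalues no longer explode.
    lam, U = np.linalg.eigh(Sigma)
    return U @ np.diag(1.0 / np.sqrt(lam + eps)) @ U.T

# Nearly singular covariance: second direction has variance 1e-10.
Sigma = np.diag([1.0, 1e-10])

gain_raw = 1.0 / np.sqrt(1e-10)            # ~1e5: would amplify noise
W = zca_whitening_matrix(Sigma, eps=1e-4)
gain_reg = W[1, 1]                         # ~1/sqrt(1e-4) = 100
```

The trade-off: regularized whitening no longer produces exactly identity covariance, but it is far more robust when $\Sigma$ is estimated from limited data.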
Canonical Examples
Whitening 2D correlated data
Let $x \in \mathbb{R}^2$ with covariance $\Sigma = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$. The eigendecomposition gives $\lambda_1 = 1 + 0.9 = 1.9$, $\lambda_2 = 1 - 0.9 = 0.1$, and the condition number is $\kappa = 19$. After ZCA whitening, the covariance is $I$ and the condition number is 1. Gradient descent on any quadratic in the whitened coordinates converges about $19\times$ faster.
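A quick NumPy check of a 2-D correlated example (using correlation $0.9$ as an illustrative value):

```python
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# Eigenvalues of a 2x2 equicorrelation matrix are 1 - rho and 1 + rho.
lam = np.linalg.eigvalsh(Sigma)                 # [0.1, 1.9]
kappa_before = lam.max() / lam.min()            # condition number 19

# ZCA-whiten and measure the condition number again.
lam_full, U = np.linalg.eigh(Sigma)
W = U @ np.diag(1.0 / np.sqrt(lam_full)) @ U.T
Sigma_white = W @ Sigma @ W.T

lam_after = np.linalg.eigvalsh(Sigma_white)
kappa_after = lam_after.max() / lam_after.min() # 1
```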
Connection to batch normalization
Batch normalization in neural networks normalizes each feature independently (subtracting mean, dividing by standard deviation) within a mini-batch. This is like diagonal whitening: it equalizes scales but does not remove correlations. Full whitening would decorrelate features too, but it is too expensive inside a network. Batch normalization is a practical approximation that captures most of the benefit.
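The "equalizes scales but keeps correlations" point is easy to demonstrate with a batch-norm-style per-feature normalization (the covariance below is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])  # illustrative covariance
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=100_000)

# Batch-norm-style normalization: per-feature mean and std only,
# i.e. diagonal whitening with no rotation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

C = np.cov(Z, rowvar=False)
# Diagonal entries are ~1, but the off-diagonal correlation
# (here 2 / sqrt(4 * 3) ~ 0.577) survives untouched.
```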
Common Confusions
Whitening is not standardization
Standardization (z-scoring) centers each feature and divides by its standard deviation. This removes marginal variance differences but does not remove correlations. Whitening goes further: it also removes all pairwise correlations, making the covariance exactly $I$. Standardization is diagonal whitening; full whitening includes the off-diagonal structure.
Whitening does not make data Gaussian
Whitening makes the covariance $I$, but it does not change the distribution family. If $x$ is non-Gaussian, $z = W(x - \mu)$ is also non-Gaussian (just with identity covariance). Whitening removes second-order structure only. Higher-order structure (skewness, kurtosis, nonlinear dependencies) survives whitening.
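A sketch that makes this concrete: whiten correlated uniform data (strongly platykurtic, excess kurtosis $-1.2$ per source) and check that the covariance becomes $I$ while the negative excess kurtosis remains. The mixing matrix is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

# Non-Gaussian sources: independent uniforms, then mixed to correlate.
S = rng.uniform(-1.0, 1.0, size=(200_000, 2))
A = np.array([[2.0, 1.0], [0.0, 1.0]])       # illustrative mixing
X = S @ A.T

# ZCA-whiten using the sample covariance.
mu = X.mean(axis=0)
lam, U = np.linalg.eigh(np.cov(X, rowvar=False))
W = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T
Z = (X - mu) @ W.T

def excess_kurtosis(v):
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

cov_Z = np.cov(Z, rowvar=False)   # identity: second-order structure gone
kurt = excess_kurtosis(Z[:, 0])   # still clearly negative: non-Gaussian
```

This is exactly the gap that ICA-style methods exploit: whitening fixes the covariance, and the remaining higher-order structure determines the rotation.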
Summary
- Whitening: transform data to have $\operatorname{Cov}(z) = I$
- PCA whitening: rotate to principal axes, then scale; $W_{\text{PCA}} = \Lambda^{-1/2} U^\top$
- ZCA whitening: closest to original data; $W_{\text{ZCA}} = \Sigma^{-1/2} = U \Lambda^{-1/2} U^\top$
- Whitening improves optimization by reducing the condition number to 1
- Batch normalization is an approximate, online, diagonal form of whitening
- Regularize: use $(\Sigma + \varepsilon I)^{-1/2}$ to avoid amplifying noise
Exercises
Problem
Given $\Sigma = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$ (diagonal), what is the ZCA whitening matrix? What does it do geometrically?
Problem
Show that for any whitening matrix $W$ with $W \Sigma W^\top = I$, there exists an orthogonal matrix $Q$ such that $W = Q \Sigma^{-1/2}$.
References
Canonical:
- Kessy, Lewin, Strimmer, "Optimal Whitening and Decorrelation," The American Statistician (2018)
- Bell & Sejnowski, "The 'Independent Components' of Natural Scenes are Edge Filters" (1997)
Current:
- Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)
Next Topics
The natural next steps from whitening and decorrelation:
- Batch normalization: applying whitening-like operations inside networks
- Natural gradient: using the Fisher information to whiten the gradient
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Principal Component Analysis (Layer 1)
- Singular Value Decomposition (Layer 0A)