
Numerical Stability

Whitening and Decorrelation

Transform data to have identity covariance, removing correlations and normalizing scales. ZCA and PCA whitening, why whitening helps optimization, and connections to batch normalization.


Why This Matters

Correlated features with different scales make optimization hard. Gradient descent on an elongated elliptical loss surface zigzags instead of heading straight to the minimum. Whitening removes correlations and equalizes scales, turning that ellipse into a circle. The condition number drops to 1, and gradient descent converges in one step on quadratics.

Whitening also explains why batch normalization and layer normalization work: they are approximate, online versions of whitening applied inside neural networks.

Mental Model

Imagine your data lives in a tilted, stretched ellipsoid. Some directions have high variance, others have low variance, and the axes are rotated relative to the coordinate axes. Whitening transforms this ellipsoid into a unit sphere: all directions have equal variance and are uncorrelated.

There are many ways to map an ellipsoid to a sphere. PCA whitening rotates to the principal axes first, then scales. ZCA whitening scales while staying as close as possible to the original coordinate system.

Formal Setup and Notation

Let $x \in \mathbb{R}^d$ be a random vector with mean $\mu = \mathbb{E}[x]$ and covariance $\Sigma = \mathbb{E}[(x - \mu)(x - \mu)^T]$.

Definition

Whitening Transform

A whitening transform is any linear transformation $W$ such that the transformed variable $z = W(x - \mu)$ has identity covariance:

$\mathbb{E}[zz^T] = I$

This requires $W \Sigma W^T = I$, so $W = U\Sigma^{-1/2}$ for any orthogonal matrix $U$. Different choices of $U$ give different whitening transforms, all equally valid statistically but with different geometric properties.

Definition

PCA Whitening

Let $\Sigma = V \Lambda V^T$ be the eigendecomposition of the covariance matrix, where $V$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. PCA whitening is:

$z = \Lambda^{-1/2} V^T (x - \mu)$

This first rotates to the principal component axes ($V^T(x-\mu)$), then scales each axis by $1/\sqrt{\lambda_j}$. The result lives in PCA coordinates: it is decorrelated and normalized, but rotated away from the original coordinate system.
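The definition translates directly into NumPy. The sketch below is illustrative (the covariance matrix and sample size are arbitrary choices): because the whitening matrix is built from the sample covariance itself, the whitened sample covariance comes out as the identity up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data; this covariance matrix is just an example.
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[4.0, 3.0], [3.0, 9.0]], size=50_000)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# Eigendecomposition Sigma = V diag(lam) V^T (eigh is for symmetric matrices).
lam, V = np.linalg.eigh(Sigma)

# PCA whitening: rotate onto the principal axes, then rescale each axis.
Z = (X - mu) @ V / np.sqrt(lam)

cov_Z = np.cov(Z, rowvar=False)   # approximately the identity
```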

Definition

ZCA Whitening

ZCA (Zero-phase Component Analysis) whitening is:

$z = \Sigma^{-1/2}(x - \mu) = V \Lambda^{-1/2} V^T (x - \mu)$

where $\Sigma^{-1/2} = V \Lambda^{-1/2} V^T$ is the symmetric matrix square root of $\Sigma^{-1}$. ZCA whitening produces the whitened data that is closest to the original data in the $L_2$ sense. It preserves the orientation of the original coordinate system as much as possible.

Core Definitions

The relationship between PCA and ZCA whitening is simple. PCA whitening applies $W_{\text{PCA}} = \Lambda^{-1/2}V^T$. ZCA whitening applies $W_{\text{ZCA}} = V\Lambda^{-1/2}V^T$. They differ by a rotation $V$:

$W_{\text{ZCA}} = V \, W_{\text{PCA}}$

ZCA whitening rotates back to the original coordinate system after PCA whitening. This means ZCA-whitened images still look like images (just with enhanced edges), while PCA-whitened images are in abstract principal component space.
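Both transforms and the rotation relating them can be checked directly in NumPy (a small sketch using the $2 \times 2$ covariance from this page's running example):

```python
import numpy as np

Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])
lam, V = np.linalg.eigh(Sigma)

W_pca = np.diag(lam ** -0.5) @ V.T        # Lambda^{-1/2} V^T
W_zca = V @ np.diag(lam ** -0.5) @ V.T    # V Lambda^{-1/2} V^T

# ZCA is PCA whitening followed by rotating back with V ...
assert np.allclose(W_zca, V @ W_pca)
# ... and both satisfy W Sigma W^T = I.
assert np.allclose(W_pca @ Sigma @ W_pca.T, np.eye(2))
assert np.allclose(W_zca @ Sigma @ W_zca.T, np.eye(2))
```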

Why whitening helps optimization: for the quadratic objective $f(x) = \frac{1}{2}x^T\Sigma x$, gradient descent converges at a rate that depends on the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ of $\Sigma$. After whitening, $\kappa = 1$, so gradient descent converges in one step. For general objectives, whitening the data (or the gradient) reduces the effective condition number, speeding up convergence.
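A quick numerical illustration of the one-step claim (a sketch; the starting point and iteration count are arbitrary): on the raw quadratic, gradient descent with a conservative step size still has residual error after ten iterations, while in whitened coordinates $f(z) = \frac{1}{2}z^Tz$ and one step of size 1 lands exactly at the minimum.

```python
import numpy as np

Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])

# Gradient descent on f(x) = 0.5 x^T Sigma x (gradient: Sigma x).
x = np.array([1.0, -1.0])
step = 1.0 / np.linalg.eigvalsh(Sigma).max()  # step sized by largest eigenvalue
for _ in range(10):
    x = x - step * (Sigma @ x)
residual_raw = np.linalg.norm(x)              # still nonzero

# In whitened coordinates f(z) = 0.5 z^T z, so one step of size 1 suffices.
z = np.array([1.0, -1.0])
z = z - 1.0 * z
residual_white = np.linalg.norm(z)            # exactly 0
```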

Main Theorems

Proposition

Whitening Produces Identity Covariance

Statement

Let $\Sigma$ be the covariance of $x$ and $W$ be any matrix satisfying $W\Sigma W^T = I$. Then $z = W(x - \mu)$ has:

$\mathbb{E}[z] = 0, \quad \mathrm{Cov}(z) = I$

Among all such $W$, the ZCA whitening matrix $W = \Sigma^{-1/2}$ minimizes $\mathbb{E}[\|z - (x - \mu)\|^2]$, i.e., it produces whitened data closest to the centered original data.

Intuition

Any $W$ satisfying $W\Sigma W^T = I$ maps the covariance ellipsoid to the unit sphere. There are infinitely many such $W$ (differing by orthogonal rotations). ZCA picks the one that rotates the least: it finds the "shortest path" from the original coordinate system to a whitened one.

Proof Sketch

$\mathrm{Cov}(z) = W \mathrm{Cov}(x) W^T = W\Sigma W^T = I$ by construction. For the optimality of ZCA: any $W$ satisfying $W\Sigma W^T = I$ can be written as $W = U\Sigma^{-1/2}$ for orthogonal $U$. Then $\mathbb{E}[\|Wx' - x'\|^2] = \mathrm{tr}((W-I)\Sigma(W-I)^T)$, where $x' = x - \mu$. Expanding and using $W\Sigma W^T = I$, this is minimized when $U = I$, giving $W = \Sigma^{-1/2}$.
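The optimality claim can also be probed numerically: among whitening matrices $W = U\Sigma^{-1/2}$ with random orthogonal $U$, the choice $U = I$ (ZCA) should achieve the smallest distortion $\mathrm{tr}((W-I)\Sigma(W-I)^T)$. A sketch (the sample count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])
lam, V = np.linalg.eigh(Sigma)
zca = V @ np.diag(lam ** -0.5) @ V.T      # Sigma^{-1/2}, i.e. the U = I choice

def distortion(W, Sigma):
    # E ||W x' - x'||^2 = tr((W - I) Sigma (W - I)^T)
    D = W - np.eye(len(Sigma))
    return np.trace(D @ Sigma @ D.T)

zca_cost = distortion(zca, Sigma)
for _ in range(200):
    U, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal U
    # Every competing whitening matrix distorts at least as much as ZCA.
    assert distortion(U @ zca, Sigma) >= zca_cost - 1e-9
```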

Why It Matters

This formalizes why whitening is a natural preprocessing step. It removes all second-order structure from the data, leaving only higher-order dependencies for the model to learn. For linear models, whitening makes the optimization landscape perfectly conditioned.

Failure Mode

Whitening requires estimating $\Sigma$ and computing its inverse square root, which is $O(d^3)$. For high-dimensional data ($d > 10{,}000$), this is prohibitively expensive. Also, if some eigenvalues are near zero, $\Sigma^{-1/2}$ amplifies noise in those directions. Regularization (adding $\epsilon I$ to $\Sigma$) is necessary in practice.
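A sketch of the regularized version, using $(\Sigma + \epsilon I)^{-1/2}$ so that near-zero eigenvalue directions are not blown up (the helper name and constants here are illustrative, not from a library):

```python
import numpy as np

def zca_whiten(X, eps=1e-3):
    """ZCA-whiten the rows of X using (Sigma + eps*I)^{-1/2}."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    lam, V = np.linalg.eigh(Sigma)
    # Adding eps caps the per-direction gain at 1/sqrt(eps).
    W = V @ np.diag(1.0 / np.sqrt(lam + eps)) @ V.T
    return (X - mu) @ W.T

rng = np.random.default_rng(2)
# Nearly rank-deficient data: feature 2 is almost a copy of feature 1,
# so one eigenvalue of the covariance is close to zero.
a = rng.standard_normal(10_000)
X = np.column_stack([a, a + 1e-4 * rng.standard_normal(10_000)])
Z = zca_whiten(X)   # with eps = 0 the near-null direction would explode
```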

Canonical Examples

Example

Whitening 2D correlated data

Let $x \sim N(0, \Sigma)$ with $\Sigma = \begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}$. The eigendecomposition gives $\lambda_1 \approx 10.41$, $\lambda_2 \approx 2.59$, and the condition number is $\kappa \approx 4.01$. After ZCA whitening, the covariance is $I$ and the condition number is 1. Gradient descent on any quadratic in the whitened coordinates converges about $4\times$ faster.
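These numbers are easy to verify (the eigenvalues are $(13 \pm \sqrt{61})/2$):

```python
import numpy as np

Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])
lam = np.linalg.eigvalsh(Sigma)    # ascending order: [~2.59, ~10.41]
kappa = lam[-1] / lam[0]           # condition number, ~4.01
```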

Example

Connection to batch normalization

Batch normalization in neural networks normalizes each feature independently (subtracting mean, dividing by standard deviation) within a mini-batch. This is like diagonal whitening: it equalizes scales but does not remove correlations. Full whitening would decorrelate features too, but it is too expensive inside a network. Batch normalization is a practical approximation that captures most of the benefit.

Common Confusions

Watch Out

Whitening is not standardization

Standardization (z-scoring) centers each feature and divides by its standard deviation. This removes marginal variance differences but does not remove correlations. Whitening goes further: it also removes all pairwise correlations, making the covariance exactly $I$. Standardization is diagonal whitening; full whitening includes the off-diagonal structure.
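A quick NumPy check of the distinction (a sketch; the covariance matrix is the page's running example, and the large sample size keeps the estimate tight):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0],
                            [[4.0, 3.0], [3.0, 9.0]], size=100_000)

# Standardization: per-feature centering and scaling only.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each feature now has unit variance, but the correlation
# rho = 3 / sqrt(4 * 9) = 0.5 survives standardization.
corr = np.corrcoef(X_std, rowvar=False)[0, 1]
```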

Watch Out

Whitening does not make data Gaussian

Whitening makes the covariance $I$, but it does not change the distribution family. If $x$ is non-Gaussian, $z = W(x-\mu)$ is also non-Gaussian (just with identity covariance). Whitening removes second-order structure only. Higher-order structure (skewness, kurtosis, nonlinear dependencies) survives whitening.

Summary

  • Whitening: transform data to have $\mathrm{Cov}(z) = I$
  • PCA whitening: rotate to principal axes, then scale; $z = \Lambda^{-1/2}V^T(x-\mu)$
  • ZCA whitening: closest to original data; $z = \Sigma^{-1/2}(x-\mu)$
  • Whitening improves optimization by reducing the condition number to 1
  • Batch normalization is an approximate, online, diagonal form of whitening
  • Regularize: use $(\Sigma + \epsilon I)^{-1/2}$ to avoid amplifying noise

Exercises

ExerciseCore

Problem

Given $\Sigma = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}$ (diagonal), what is the ZCA whitening matrix? What does it do geometrically?

ExerciseAdvanced

Problem

Show that for any whitening matrix $W$ with $W\Sigma W^T = I$, there exists an orthogonal matrix $U$ such that $W = U\Sigma^{-1/2}$.

References

Canonical:

  • Kessy, Lewin, Strimmer, "Optimal Whitening and Decorrelation," The American Statistician (2018)
  • Bell & Sejnowski, "The 'Independent Components' of Natural Scenes are Edge Filters" (1997)

Current:

  • Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training" (2015)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)

Next Topics

The natural next steps from whitening and decorrelation:

  • Batch normalization: applying whitening-like operations inside networks
  • Natural gradient: using the Fisher information to whiten the gradient

Last reviewed: April 2026
