
Numerical Stability

Whitening and Decorrelation

Transform data to have identity covariance, removing correlations and normalizing scales. ZCA and PCA whitening, why whitening helps optimization, and connections to batch normalization.


Why This Matters

Correlated features with different scales make optimization hard. Gradient descent on an elongated elliptical loss surface zigzags instead of heading straight to the minimum. Whitening removes correlations and equalizes scales, turning that ellipse into a circle. The condition number drops to 1, and gradient descent converges in one step on quadratics.

Whitening also explains why batch normalization and layer normalization work: they are approximate, online versions of whitening applied inside neural networks.

Mental Model

Imagine your data lives in a tilted, stretched ellipsoid. Some directions have high variance, others have low variance, and the axes are rotated relative to the coordinate axes. Whitening transforms this ellipsoid into a unit sphere: all directions have equal variance and are uncorrelated.

There are many ways to map an ellipsoid to a sphere. PCA whitening rotates to the principal axes first, then scales. ZCA whitening scales while staying as close as possible to the original coordinate system.

Formal Setup and Notation

Let $x \in \mathbb{R}^d$ be a random vector with mean $\mu = \mathbb{E}[x]$ and covariance $\Sigma = \mathbb{E}[(x - \mu)(x - \mu)^T]$.

Definition

Whitening Transform

A whitening transform is any linear transformation $W$ such that the transformed variable $z = W(x - \mu)$ has identity covariance:

$\mathbb{E}[zz^T] = I$

This requires $W \Sigma W^T = I$, so $W = U\Sigma^{-1/2}$ for any orthogonal matrix $U$. Different choices of $U$ give different whitening transforms, all equally valid statistically but with different geometric properties.

Definition

PCA Whitening

Let $\Sigma = V \Lambda V^T$ be the eigendecomposition of the covariance matrix, where $V$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. PCA whitening is:

$z = \Lambda^{-1/2} V^T (x - \mu)$

This first rotates to the principal component axes ($V^T(x-\mu)$), then scales each axis by $1/\sqrt{\lambda_j}$. The result lives in PCA coordinates: it is decorrelated and normalized, but rotated away from the original coordinate system.
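The definition translates directly into NumPy. The sketch below is illustrative (the covariance matrix and sample size are arbitrary choices): because the whitening matrix is built from the sample covariance itself, the whitened sample covariance comes out as the identity up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data; this covariance matrix is just an example.
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[4.0, 3.0], [3.0, 9.0]], size=50_000)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# Eigendecomposition Sigma = V diag(lam) V^T (eigh is for symmetric matrices).
lam, V = np.linalg.eigh(Sigma)

# PCA whitening: rotate onto the principal axes, then rescale each axis.
Z = (X - mu) @ V / np.sqrt(lam)

cov_Z = np.cov(Z, rowvar=False)   # approximately the identity
```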

Definition

ZCA Whitening

ZCA (Zero-phase Component Analysis) whitening is:

$z = \Sigma^{-1/2}(x - \mu) = V \Lambda^{-1/2} V^T (x - \mu)$

where $\Sigma^{-1/2} = V \Lambda^{-1/2} V^T$ is the symmetric matrix square root of $\Sigma^{-1}$. ZCA whitening produces the whitened data that is closest to the original data in the $L_2$ sense. It preserves the orientation of the original coordinate system as much as possible.

Core Definitions

The relationship between PCA and ZCA whitening is simple. PCA whitening applies $W_{\text{PCA}} = \Lambda^{-1/2}V^T$. ZCA whitening applies $W_{\text{ZCA}} = V\Lambda^{-1/2}V^T$. They differ by a rotation $V$:

$W_{\text{ZCA}} = V \, W_{\text{PCA}}$

ZCA whitening rotates back to the original coordinate system after PCA whitening. This means ZCA-whitened images still look like images (just with enhanced edges), while PCA-whitened images are in abstract principal component space.
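Both transforms and the rotation relating them can be checked directly in NumPy (a small sketch using the $2 \times 2$ covariance from this page's running example):

```python
import numpy as np

Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])
lam, V = np.linalg.eigh(Sigma)

W_pca = np.diag(lam ** -0.5) @ V.T        # Lambda^{-1/2} V^T
W_zca = V @ np.diag(lam ** -0.5) @ V.T    # V Lambda^{-1/2} V^T

# ZCA is PCA whitening followed by rotating back with V ...
assert np.allclose(W_zca, V @ W_pca)
# ... and both satisfy W Sigma W^T = I.
assert np.allclose(W_pca @ Sigma @ W_pca.T, np.eye(2))
assert np.allclose(W_zca @ Sigma @ W_zca.T, np.eye(2))
```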

Why whitening helps optimization: for the quadratic objective $f(x) = \frac{1}{2}x^T\Sigma x$, gradient descent converges at a rate that depends on the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ of $\Sigma$. After whitening, $\kappa = 1$, so gradient descent converges in one step. For general objectives, whitening the data (or the gradient) reduces the effective condition number, speeding up convergence.
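A quick numerical illustration of the one-step claim (a sketch; the starting point and iteration count are arbitrary): on the raw quadratic, gradient descent with a conservative step size still has residual error after ten iterations, while in whitened coordinates $f(z) = \frac{1}{2}z^Tz$ and one step of size 1 lands exactly at the minimum.

```python
import numpy as np

Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])

# Gradient descent on f(x) = 0.5 x^T Sigma x (gradient: Sigma x).
x = np.array([1.0, -1.0])
step = 1.0 / np.linalg.eigvalsh(Sigma).max()  # step sized by largest eigenvalue
for _ in range(10):
    x = x - step * (Sigma @ x)
residual_raw = np.linalg.norm(x)              # still nonzero

# In whitened coordinates f(z) = 0.5 z^T z, so one step of size 1 suffices.
z = np.array([1.0, -1.0])
z = z - 1.0 * z
residual_white = np.linalg.norm(z)            # exactly 0
```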

Main Theorems

Proposition

Whitening Produces Identity Covariance

Statement

Let $\Sigma$ be the covariance of $x$ and $W$ be any matrix satisfying $W\Sigma W^T = I$. Then $z = W(x - \mu)$ has:

$\mathbb{E}[z] = 0, \quad \mathrm{Cov}(z) = I$

Among all such $W$, the ZCA whitening matrix $W = \Sigma^{-1/2}$ minimizes $\mathbb{E}[\|z - (x - \mu)\|^2]$, i.e., it produces whitened data closest to the centered original data.

Intuition

Any $W$ satisfying $W\Sigma W^T = I$ maps the covariance ellipsoid to the unit sphere. There are infinitely many such $W$ (differing by orthogonal rotations). ZCA picks the one that rotates the least: it finds the "shortest path" from the original coordinate system to a whitened one.

Proof Sketch

$\mathrm{Cov}(z) = W \mathrm{Cov}(x) W^T = W\Sigma W^T = I$ by construction. For the optimality of ZCA: any $W$ satisfying $W\Sigma W^T = I$ can be written as $W = U\Sigma^{-1/2}$ for orthogonal $U$. Then $\mathbb{E}[\|Wx' - x'\|^2] = \mathrm{tr}((W-I)\Sigma(W-I)^T)$, where $x' = x - \mu$. Expanding and using $W\Sigma W^T = I$, this is minimized when $U = I$, giving $W = \Sigma^{-1/2}$.
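The optimality claim can also be probed numerically: among whitening matrices $W = U\Sigma^{-1/2}$ with random orthogonal $U$, the choice $U = I$ (ZCA) should achieve the smallest distortion $\mathrm{tr}((W-I)\Sigma(W-I)^T)$. A sketch (the sample count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])
lam, V = np.linalg.eigh(Sigma)
zca = V @ np.diag(lam ** -0.5) @ V.T      # Sigma^{-1/2}, i.e. the U = I choice

def distortion(W, Sigma):
    # E ||W x' - x'||^2 = tr((W - I) Sigma (W - I)^T)
    D = W - np.eye(len(Sigma))
    return np.trace(D @ Sigma @ D.T)

zca_cost = distortion(zca, Sigma)
for _ in range(200):
    U, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal U
    # Every competing whitening matrix distorts at least as much as ZCA.
    assert distortion(U @ zca, Sigma) >= zca_cost - 1e-9
```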

Why It Matters

This formalizes why whitening is a natural preprocessing step. It removes all second-order structure from the data, leaving only higher-order dependencies for the model to learn. For linear models, whitening makes the optimization landscape perfectly conditioned.

Failure Mode

Whitening requires estimating $\Sigma$ and computing its inverse square root, which is $O(d^3)$. For high-dimensional data ($d > 10{,}000$), this is prohibitively expensive. Also, if some eigenvalues are near zero, $\Sigma^{-1/2}$ amplifies noise in those directions. Regularization (adding $\epsilon I$ to $\Sigma$) is necessary in practice.
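A sketch of the regularized version, using $(\Sigma + \epsilon I)^{-1/2}$ so that near-zero eigenvalue directions are not blown up (the helper name and constants here are illustrative, not from a library):

```python
import numpy as np

def zca_whiten(X, eps=1e-3):
    """ZCA-whiten the rows of X using (Sigma + eps*I)^{-1/2}."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    lam, V = np.linalg.eigh(Sigma)
    # Adding eps caps the per-direction gain at 1/sqrt(eps).
    W = V @ np.diag(1.0 / np.sqrt(lam + eps)) @ V.T
    return (X - mu) @ W.T

rng = np.random.default_rng(2)
# Nearly rank-deficient data: feature 2 is almost a copy of feature 1,
# so one eigenvalue of the covariance is close to zero.
a = rng.standard_normal(10_000)
X = np.column_stack([a, a + 1e-4 * rng.standard_normal(10_000)])
Z = zca_whiten(X)   # with eps = 0 the near-null direction would explode
```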

Canonical Examples

Example

Whitening 2D correlated data

Let $x \sim N(0, \Sigma)$ with $\Sigma = \begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}$. The eigendecomposition gives $\lambda_1 \approx 10.41$, $\lambda_2 \approx 2.59$, and the condition number is $\kappa \approx 4.01$. After ZCA whitening, the covariance is $I$ and the condition number is 1. Gradient descent on any quadratic in the whitened coordinates converges about $4\times$ faster.
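These numbers are easy to verify (the eigenvalues are $(13 \pm \sqrt{61})/2$):

```python
import numpy as np

Sigma = np.array([[4.0, 3.0], [3.0, 9.0]])
lam = np.linalg.eigvalsh(Sigma)    # ascending order: [~2.59, ~10.41]
kappa = lam[-1] / lam[0]           # condition number, ~4.01
```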

Example

Connection to batch normalization

Batch normalization in neural networks normalizes each feature independently (subtracting mean, dividing by standard deviation) within a mini-batch. This is like diagonal whitening: it equalizes scales but does not remove correlations. Full whitening would decorrelate features too, but it is too expensive inside a network. Batch normalization is a practical approximation that captures most of the benefit.

Common Confusions

Watch Out

Whitening is not standardization

Standardization (z-scoring) centers each feature and divides by its standard deviation. This removes marginal variance differences but does not remove correlations. Whitening goes further: it also removes all pairwise correlations, making the covariance exactly $I$. Standardization is diagonal whitening; full whitening includes the off-diagonal structure.
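A quick NumPy check of the distinction (a sketch; the covariance matrix is the page's running example, and the large sample size keeps the estimate tight):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0],
                            [[4.0, 3.0], [3.0, 9.0]], size=100_000)

# Standardization: per-feature centering and scaling only.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each feature now has unit variance, but the correlation
# rho = 3 / sqrt(4 * 9) = 0.5 survives standardization.
corr = np.corrcoef(X_std, rowvar=False)[0, 1]
```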

Watch Out

Whitening does not make data Gaussian

Whitening makes the covariance $I$, but it does not change the distribution family. If $x$ is non-Gaussian, $z = W(x-\mu)$ is also non-Gaussian (just with identity covariance). Whitening removes second-order structure only. Higher-order structure (skewness, kurtosis, nonlinear dependencies) survives whitening.

Summary

  • Whitening: transform data to have $\mathrm{Cov}(z) = I$
  • PCA whitening: rotate to principal axes, then scale; $z = \Lambda^{-1/2}V^T(x-\mu)$
  • ZCA whitening: closest to original data; $z = \Sigma^{-1/2}(x-\mu)$
  • Whitening improves optimization by reducing the condition number to 1
  • Batch normalization is an approximate, online, diagonal form of whitening
  • Regularize: use $(\Sigma + \epsilon I)^{-1/2}$ to avoid amplifying noise

Exercises

ExerciseCore

Problem

Given $\Sigma = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}$ (diagonal), what is the ZCA whitening matrix? What does it do geometrically?

ExerciseAdvanced

Problem

Show that for any whitening matrix $W$ with $W\Sigma W^T = I$, there exists an orthogonal matrix $U$ such that $W = U\Sigma^{-1/2}$.

References

Canonical:

  • Kessy, Lewin, Strimmer, "Optimal Whitening and Decorrelation," The American Statistician (2018)
  • Bell & Sejnowski, "The 'Independent Components' of Natural Scenes are Edge Filters" (1997)

Current:

  • Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training" (2015)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)

Next Topics

The natural next steps from whitening and decorrelation:

  • Batch normalization: applying whitening-like operations inside networks
  • Natural gradient: using the Fisher information to whiten the gradient

Last reviewed: April 2026
