
ML Methods

Principal Component Analysis

Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.


Why This Matters

[Figure: a 2D data cloud with principal component axes PC1 (78% of variance) and PC2 (15%), alongside a scree plot of variance explained per component (PC1 78%, PC2 15%, PC3 4%, PC4 2%, PC5 1%) with a cumulative curve.]

PCA is the most widely used dimensionality reduction technique in all of data science. It appears everywhere: preprocessing for ML pipelines, visualization (project to 2D), noise reduction, feature extraction, genomics (population structure), finance (factor models), and image compression. Understanding PCA means understanding how three mathematical objects relate: the covariance matrix, its eigendecomposition, and the SVD of the centered data matrix.

Mental Model

You have $n$ data points in $\mathbb{R}^d$ and want to find a low-dimensional subspace that captures as much of the variation in the data as possible. PCA finds the directions (principal components) along which the data varies most, ordered by decreasing variance. Projecting onto the top $k$ principal components gives the best rank-$k$ approximation to the centered data.

Setup and Notation

Let $X \in \mathbb{R}^{n \times d}$ be the data matrix with each row $x_i^T$ a data point. Assume the data is centered: $\bar{x} = \frac{1}{n}\sum_i x_i = 0$. If not, subtract the mean first.

Definition

Sample Covariance Matrix

The sample covariance matrix is:

$$\hat{\Sigma} = \frac{1}{n} X^T X \in \mathbb{R}^{d \times d}$$

This matrix is symmetric positive semi-definite. Its eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ are the variances along the principal directions.
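As a quick sanity check, here is a minimal numpy sketch (the data here is synthetic, for illustration only) that centers a data matrix, forms $\hat{\Sigma} = \frac{1}{n}X^T X$, and confirms it is symmetric with nonnegative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 4))        # 200 synthetic points in R^4
X = X_raw - X_raw.mean(axis=0)           # center: subtract the column means

Sigma_hat = X.T @ X / X.shape[0]         # sample covariance, d x d

# Symmetric positive semi-definite: real, nonnegative eigenvalues
eigvals = np.linalg.eigvalsh(Sigma_hat)  # returned in ascending order
```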

Definition

Principal Components

The principal components are the eigenvectors $v_1, v_2, \ldots, v_d$ of $\hat{\Sigma}$, ordered by decreasing eigenvalue. The $k$-th principal component direction $v_k$ is the direction of the $k$-th largest variance in the data, subject to being orthogonal to $v_1, \ldots, v_{k-1}$.

PCA as Variance Maximization

Theorem

PCA Maximizes Projected Variance

Statement

The first principal component $v_1$ solves:

$$v_1 = \arg\max_{\|v\|=1} \frac{1}{n}\sum_{i=1}^n (v^T x_i)^2 = \arg\max_{\|v\|=1} v^T \hat{\Sigma} v$$

The maximum value is $\lambda_1$, the largest eigenvalue of $\hat{\Sigma}$. More generally, the $k$-th principal component maximizes projected variance subject to orthogonality to the first $k-1$ components.

Intuition

Among all unit directions in $\mathbb{R}^d$, $v_1$ is the one along which the data has the most spread (variance). The second PC is the direction of greatest remaining spread after removing the $v_1$ component, and so on. Each eigenvalue tells you how much variance its direction captures.

Proof Sketch

We want $\max_{\|v\|=1} v^T \hat{\Sigma} v$. Form the Lagrangian $L = v^T \hat{\Sigma} v - \lambda(v^T v - 1)$. Setting $\nabla_v L = 0$ gives $\hat{\Sigma} v = \lambda v$, an eigenvalue equation. The objective at any eigenvector $v_k$ equals $\lambda_k$, so the maximum is $\lambda_1$, achieved at $v_1$. For subsequent PCs, add orthogonality constraints and use induction.
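The variance-maximization claim can be checked numerically. A small sketch (synthetic anisotropic data, assumed only for illustration): the projected variance along the top eigenvector equals $\lambda_1$, and no random unit direction beats it.

```python
import numpy as np

rng = np.random.default_rng(1)
# Anisotropic cloud: stds 3.0, 1.0, 0.3 along the coordinate axes
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
X -= X.mean(axis=0)
Sigma_hat = X.T @ X / len(X)

lam, V = np.linalg.eigh(Sigma_hat)       # eigenvalues in ascending order
v1, lam1 = V[:, -1], lam[-1]             # top eigenpair

def projected_variance(v):
    """Variance of the data projected onto the unit direction v/||v||."""
    v = v / np.linalg.norm(v)
    return np.mean((X @ v) ** 2)
```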

Why It Matters

This shows PCA has a clear statistical interpretation: it finds the directions that explain the most variance in the data. The eigenvalues quantify exactly how much variance each component captures, enabling the scree plot for choosing the number of components.

Failure Mode

PCA maximizes variance, not necessarily "importance." If the signal lives in a low-variance direction (e.g., a rare but meaningful feature), PCA will discard it in favor of high-variance noise. PCA also only captures linear structure; it cannot find nonlinear manifolds.

PCA via SVD

The SVD of the centered data matrix is $X = U \Sigma V^T$, where $U \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times d}$, and $V \in \mathbb{R}^{d \times d}$.

The connection: $\hat{\Sigma} = \frac{1}{n}X^T X = V \frac{\Sigma^T \Sigma}{n} V^T$.

So the right singular vectors (the columns of $V$) are the principal component directions, and the squared singular values divided by $n$ are the eigenvalues of $\hat{\Sigma}$: $\lambda_k = \sigma_k^2 / n$.

In practice, always compute PCA via the SVD of $X$, not by forming $X^T X$ and computing its eigendecomposition. The SVD is numerically more stable because forming $X^T X$ squares the condition number.
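Both routes can be compared directly on well-conditioned synthetic data, where they agree to machine precision. A minimal sketch of the identity $\lambda_k = \sigma_k^2/n$ and the matching eigenvector/singular-vector directions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
X -= X.mean(axis=0)
n = X.shape[0]

# Route 1: eigendecomposition of the covariance (avoid in practice)
lam, V_eig = np.linalg.eigh(X.T @ X / n)
lam = lam[::-1]                          # descending order
V_eig = V_eig[:, ::-1]                   # columns reordered to match

# Route 2: SVD of the centered data (preferred)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
```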

Truncated PCA and Reconstruction

Definition

Truncated PCA

The rank-$k$ PCA approximation keeps only the top $k$ principal components. The projection of a data point $x$ onto the $k$-dimensional subspace is:

$$\tilde{x} = V_k V_k^T x$$

where $V_k = [v_1, \ldots, v_k] \in \mathbb{R}^{d \times k}$.
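In code, $\tilde{x} = V_k V_k^T x$ splits naturally into two steps: compute the $k$ PC coordinates (scores), then map them back to $\mathbb{R}^d$. A sketch on synthetic data, checking that the map is an orthogonal projection (idempotent, norm non-increasing):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Vk = Vt[:k].T                            # d x k, top-k principal directions

scores = X @ Vk                          # n x k coordinates in the PC basis
X_tilde = scores @ Vk.T                  # back in R^d: each row is Vk Vk^T x
```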

Theorem

Eckart-Young Theorem (Best Low-Rank Approximation)

Statement

The best rank-$k$ approximation to $X$ in the Frobenius norm is the truncated SVD $X_k = U_k \Sigma_k V_k^T$:

$$X_k = \arg\min_{\operatorname{rank}(M) \leq k} \|X - M\|_F^2$$

The reconstruction error is:

$$\|X - X_k\|_F^2 = \sum_{j=k+1}^{r} \sigma_j^2$$

where $r = \operatorname{rank}(X)$.

Intuition

Truncated PCA gives you the closest rank-$k$ matrix to your data. The reconstruction error is exactly the sum of the squared singular values you threw away. This is why you look at the eigenvalue spectrum to decide how many components to keep.

Proof Sketch

By the SVD, $X = \sum_j \sigma_j u_j v_j^T$. Any rank-$k$ matrix can capture at most $k$ singular directions. The squared Frobenius norm of $X$ is $\sum_j \sigma_j^2$ (Parseval). Keeping the largest $k$ singular values minimizes the leftover $\sum_{j=k+1}^r \sigma_j^2$.
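The error formula is easy to verify numerically: build the rank-$k$ truncation from the SVD and compare its squared Frobenius error against the sum of the discarded $\sigma_j^2$. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # truncated SVD: best rank-k approx

err = np.linalg.norm(X - Xk, "fro") ** 2  # squared reconstruction error
```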

Why It Matters

This theorem justifies truncated PCA as optimal dimensionality reduction in the least-squares sense. It also gives a precise formula for reconstruction error in terms of the discarded eigenvalues, which is what the scree plot visualizes.

Failure Mode

Optimality is in the Frobenius norm sense. If your downstream task cares about something other than squared reconstruction error (e.g., classification accuracy), PCA may not give the best low-dimensional representation.

Choosing the Number of Components

Scree plot: Plot the eigenvalues $\lambda_1, \lambda_2, \ldots$ versus index. Look for an "elbow" where the eigenvalues drop sharply and then level off. Keep the components before the elbow.

Proportion of variance explained: Keep $k$ components such that:

$$\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} \geq 0.95 \text{ (or some other threshold)}$$

Kaiser's rule: Keep components with $\lambda_k > \bar{\lambda}$ (the average eigenvalue). For PCA on the correlation matrix, this means $\lambda_k > 1$.
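Both selection rules reduce to a few lines of numpy. A sketch on synthetic data with three dominant directions (the spectrum below is an assumption chosen to produce a visible elbow):

```python
import numpy as np

rng = np.random.default_rng(5)
# Three strong directions, seven weak ones: a clear elbow at k = 3
X = rng.normal(size=(400, 10)) @ np.diag(
    [5.0, 4.0, 3.0, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1]
)
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
lam = s**2 / len(X)                          # eigenvalues of the covariance

explained = np.cumsum(lam) / lam.sum()       # cumulative variance explained
k_95 = int(np.searchsorted(explained, 0.95)) + 1   # smallest k reaching 95%

k_kaiser = int(np.sum(lam > lam.mean()))     # Kaiser: above-average eigenvalues
```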

Connection to Matrix Concentration

When does sample PCA approximate population PCA? If $x_i \sim \mathcal{D}$ with population covariance $\Sigma$, then $\hat{\Sigma} \to \Sigma$ as $n \to \infty$. But the rate matters.

Matrix concentration inequalities (see the matrix concentration topic) show that $\|\hat{\Sigma} - \Sigma\|_{\text{op}} \leq O(\sqrt{d/n})$ with high probability. The Davis-Kahan theorem then bounds the angle between sample and population eigenvectors. PCA is reliable when $n \gg d$ and the eigenvalue gaps are large. In high-dimensional settings ($d \sim n$ or $d \gg n$), sample PCA can be highly misleading: the top sample eigenvector may point in a completely wrong direction.

Canonical Examples

Example

PCA on 2D data

Suppose $n$ points in $\mathbb{R}^2$ form an elliptical cloud with major axis along $(1/\sqrt{2}, 1/\sqrt{2})$ and minor axis along $(-1/\sqrt{2}, 1/\sqrt{2})$. The first PC is the major-axis direction (most variance), and the second PC is the minor axis. Projecting onto just the first PC collapses the data to a line along the major axis, retaining the most variance possible in one dimension.
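This example can be reproduced directly. A sketch (specific standard deviations 3 and 1 are assumptions for illustration) that generates such a cloud and confirms the first PC recovers the major-axis direction up to sign:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
major = np.array([1.0, 1.0]) / np.sqrt(2)   # major-axis direction
minor = np.array([-1.0, 1.0]) / np.sqrt(2)  # minor-axis direction
# Elliptical cloud: std 3 along the major axis, std 1 along the minor axis
X = np.outer(rng.normal(0, 3, n), major) + np.outer(rng.normal(0, 1, n), minor)
X -= X.mean(axis=0)

_, s, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                                  # first PC (sign is arbitrary)
```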

Common Confusions

Watch Out

PCA and SVD are NOT the same thing

This is the most common confusion. SVD is a matrix factorization: $X = U\Sigma V^T$. PCA is a statistical procedure that involves (1) centering the data and (2) finding directions of maximum variance. PCA uses the SVD (of the centered data matrix) as its computational engine, but they are different concepts. SVD does not center the data; if you run SVD on uncentered data, you do not get PCA. Also, the eigenvalues of $\hat{\Sigma}$ are $\sigma_k^2/n$, not $\sigma_k$.
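The centering point is easy to demonstrate. In the sketch below (synthetic data, constructed for illustration), the mean offset lies along the first coordinate axis while most of the variance lies along the second; SVD of the raw matrix chases the mean, while PCA (center, then SVD) finds the true high-variance direction:

```python
import numpy as np

rng = np.random.default_rng(7)
# Large mean offset along e1, but most *variance* along e2
X = rng.normal(size=(500, 2)) @ np.diag([0.5, 2.0]) + np.array([10.0, 0.0])

# SVD of the raw data: top direction points roughly at the mean (e1)
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)

# PCA: center first, then SVD; top direction is the high-variance axis (e2)
Xc = X - X.mean(axis=0)
_, _, Vt_pca = np.linalg.svd(Xc, full_matrices=False)
```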

Watch Out

PCA components are not features

The principal components are linear combinations of the original features. The first PC score is $v_1^T x = \sum_j v_{1j} x_j$, which mixes all the original features. Interpreting what a principal component "means" requires examining the loadings $v_{1j}$, and high-dimensional PCA components are often uninterpretable.

Summary

  • PCA finds directions of maximum variance: $\max_{\|v\|=1} v^T \hat{\Sigma} v$
  • Principal components = eigenvectors of $\hat{\Sigma} = \frac{1}{n}X^T X$
  • Eigenvalues = variances along the principal directions
  • Compute via the SVD of $X$ (not the eigendecomposition of $X^T X$) for stability
  • Eckart-Young: the truncated SVD is the best low-rank approximation
  • Reconstruction error = sum of the discarded squared singular values (i.e., $n$ times the discarded eigenvalues)
  • PCA requires centering; SVD alone does not center

Exercises

Exercise (Core)

Problem

Show that the first principal component direction $v_1$ maximizes the projected variance $\frac{1}{n}\sum_i (v^T x_i)^2$ subject to $\|v\| = 1$. Use the method of Lagrange multipliers.

Exercise (Advanced)

Problem

Explain why computing PCA via the eigendecomposition of $X^T X$ is numerically inferior to computing it via the SVD of $X$. What goes wrong in floating-point arithmetic?

References

Canonical:

  • Jolliffe, Principal Component Analysis (2002), Chapters 1-3
  • Strang, Linear Algebra and Its Applications, Chapter on SVD

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 23

  • Vershynin, High-Dimensional Probability (2018), Chapter 4 (for matrix concentration and sample PCA)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14


Last reviewed: April 2026
