ML Methods
Principal Component Analysis
Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.
Why This Matters
PCA is the most widely used dimensionality reduction technique in all of data science. It appears everywhere: preprocessing for ML pipelines, visualization (project to 2D), noise reduction, feature extraction, genomics (population structure), finance (factor models), and image compression. Understanding PCA requires understanding how three mathematical objects relate: the covariance matrix, its eigendecomposition, and the SVD of the centered data matrix.
Mental Model
You have data points in $\mathbb{R}^d$ and want to find a low-dimensional subspace that captures as much of the variation in the data as possible. PCA finds the directions (principal components) along which the data varies most, ordered by decreasing variance. Projecting onto the top $k$ principal components gives the best rank-$k$ approximation to the centered data.
Setup and Notation
Let $X \in \mathbb{R}^{n \times d}$ be the data matrix with each row $x_i^\top$ a data point. Assume the data is centered: $\sum_{i=1}^{n} x_i = 0$. If not, subtract the mean first.
Sample Covariance Matrix
The sample covariance matrix is:
$$S = \frac{1}{n} X^\top X = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \in \mathbb{R}^{d \times d}.$$
(Some texts use $\frac{1}{n-1}$; the principal directions are unaffected.) This is symmetric positive semi-definite. Its eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0$ are the variances along the principal directions.
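As a concrete check, the covariance and its PSD property take a few lines of NumPy (a minimal sketch; the data and the $\frac{1}{n}$ convention are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # n = 200 points in d = 3 dimensions
X = X - X.mean(axis=0)          # center: subtract the column means

S = X.T @ X / X.shape[0]        # sample covariance, 1/n convention

# S is symmetric positive semi-definite:
eigvals = np.linalg.eigvalsh(S)
assert np.allclose(S, S.T)
assert (eigvals >= -1e-12).all()
```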
Principal Components
The principal components are the eigenvectors $v_1, \dots, v_d$ of $S$, ordered by decreasing eigenvalue. The $j$-th principal component direction $v_j$ is the direction of the $j$-th largest variance in the data, subject to being orthogonal to $v_1, \dots, v_{j-1}$.
PCA as Variance Maximization
PCA Maximizes Projected Variance
Statement
The first principal component solves:
$$v_1 = \arg\max_{\|v\|_2 = 1} v^\top S v = \arg\max_{\|v\|_2 = 1} \frac{1}{n} \sum_{i=1}^{n} (v^\top x_i)^2.$$
The maximum value is $\lambda_1$, the largest eigenvalue of $S$. More generally, the $j$-th principal component maximizes projected variance subject to orthogonality to the first $j-1$ components.
Intuition
Among all unit directions in $\mathbb{R}^d$, $v_1$ is the one along which the data has the most spread (variance). The second PC is the direction of most remaining spread after removing the $v_1$ component, and so on. Each eigenvalue $\lambda_j$ tells you how much variance that direction captures.
Proof Sketch
We want $\max_{v^\top v = 1} v^\top S v$. Form the Lagrangian $\mathcal{L}(v, \mu) = v^\top S v - \mu (v^\top v - 1)$. Setting $\nabla_v \mathcal{L} = 0$ gives $S v = \mu v$, an eigenvalue equation. The objective at any eigenvector $v_j$ equals $v_j^\top S v_j = \lambda_j$. So the maximum $\lambda_1$ is achieved at $v = v_1$. For subsequent PCs, add orthogonality constraints and use induction.
Why It Matters
This shows PCA has a clear statistical interpretation: it finds the directions that explain the most variance in the data. The eigenvalues quantify exactly how much variance each component captures, enabling the scree plot for choosing the number of components.
Failure Mode
PCA maximizes variance, not necessarily "importance." If the signal lives in a low-variance direction (e.g., a rare but meaningful feature), PCA will discard it in favor of high-variance noise. PCA also only captures linear structure; it cannot find nonlinear manifolds.
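The variance-maximization claim is easy to sanity-check numerically: the variance of the data projected onto the top eigenvector equals $\lambda_1$, and no other unit direction does better. A sketch with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Anisotropic cloud so the principal directions are well separated.
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)
S = X.T @ X / X.shape[0]

lam, V = np.linalg.eigh(S)      # eigenvalues in ascending order
v1, lam1 = V[:, -1], lam[-1]    # top eigenpair

# Projected variance along v1 equals the top eigenvalue.
assert np.isclose(np.var(X @ v1), lam1)

# No random unit direction gives more projected variance (Rayleigh quotient).
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
assert (np.var(X @ dirs.T, axis=0) <= lam1 + 1e-9).all()
```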
PCA via SVD
The SVD of the centered data matrix is $X = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times d}$ has orthonormal columns, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_d)$ with $\sigma_1 \ge \dots \ge \sigma_d \ge 0$, and $V \in \mathbb{R}^{d \times d}$ is orthogonal (thin SVD, assuming $n \ge d$).
The connection: $S = \frac{1}{n} X^\top X = \frac{1}{n} V \Sigma U^\top U \Sigma V^\top = V \, \frac{\Sigma^2}{n} \, V^\top$.
So the right singular vectors $v_j$ are the principal component directions, and the squared singular values divided by $n$ are the eigenvalues of $S$: $\lambda_j = \sigma_j^2 / n$.
In practice, always compute PCA via the SVD of , not by forming and computing its eigendecomposition. The SVD is numerically more stable because forming squares the condition number.
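A sketch comparing the two routes on well-conditioned synthetic data; both recover the same eigenvalues and (up to sign) the same directions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
X = X - X.mean(axis=0)
n = X.shape[0]

# Preferred route: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam_svd = s**2 / n               # eigenvalues of S, descending

# Less stable route: eigendecomposition of the covariance.
S = X.T @ X / n
lam_eig, V = np.linalg.eigh(S)   # ascending order
lam_eig, V = lam_eig[::-1], V[:, ::-1]

assert np.allclose(lam_svd, lam_eig)
# Directions agree up to sign.
assert np.allclose(np.abs(Vt), np.abs(V.T), atol=1e-6)
```

On this well-conditioned example the two routes agree; the SVD's advantage shows up when $X$ is ill-conditioned, because forming $X^\top X$ squares the condition number.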
Truncated PCA and Reconstruction
Truncated PCA
The rank-$k$ PCA approximation keeps only the top $k$ principal components. The projection of a data point $x \in \mathbb{R}^d$ onto the $k$-dimensional subspace is:
$$\hat{x} = V_k V_k^\top x = \sum_{j=1}^{k} (v_j^\top x) \, v_j,$$
where $V_k = [v_1, \dots, v_k] \in \mathbb{R}^{d \times k}$.
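A minimal sketch of this projection (data and $k$ are illustrative): $V_k V_k^\top$ is idempotent, and the residual is orthogonal to the retained subspace.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Vk = Vt[:k].T                    # d x k matrix [v_1, ..., v_k]

x = X[0]
x_hat = Vk @ (Vk.T @ x)          # projection onto span(v_1, ..., v_k)

# Projecting again changes nothing (V_k V_k^T is a projection matrix).
assert np.allclose(Vk @ (Vk.T @ x_hat), x_hat)
# The residual is orthogonal to the retained subspace.
assert np.allclose(Vk.T @ (x - x_hat), 0)
```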
Eckart-Young Theorem (Best Low-Rank Approximation)
Statement
The best rank-$k$ approximation to $X$ in the Frobenius norm is the truncated SVD $X_k = U_k \Sigma_k V_k^\top = \sum_{j=1}^{k} \sigma_j u_j v_j^\top$:
$$\|X - X_k\|_F \le \|X - B\|_F \quad \text{for every } B \text{ with } \mathrm{rank}(B) \le k.$$
The reconstruction error is:
$$\|X - X_k\|_F^2 = \sum_{j=k+1}^{r} \sigma_j^2 = n \sum_{j=k+1}^{r} \lambda_j,$$
where $r = \mathrm{rank}(X)$.
Intuition
Truncated PCA gives you the closest rank-$k$ matrix to your data. The reconstruction error is exactly the sum of squared singular values you threw away. This is why you look at the eigenvalue spectrum to decide how many components to keep.
Proof Sketch
By the SVD, $X = \sum_{j=1}^{r} \sigma_j u_j v_j^\top$. Any rank-$k$ matrix can capture at most $k$ singular directions. The squared Frobenius norm of $X$ is $\sum_{j=1}^{r} \sigma_j^2$ (Parseval). Keeping the $k$ largest singular values minimizes the discarded mass $\sum_{j=k+1}^{r} \sigma_j^2$.
Why It Matters
This theorem justifies truncated PCA as optimal dimensionality reduction in the least-squares sense. It also gives a precise formula for reconstruction error in terms of the discarded eigenvalues, which is what the scree plot visualizes.
Failure Mode
Optimality is in the Frobenius norm sense. If your downstream task cares about something other than squared reconstruction error (e.g., classification accuracy), PCA may not give the best low-dimensional representation.
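The error formula can be verified directly: the squared Frobenius error of the truncated SVD equals the sum of the discarded squared singular values, and a random rank-$k$ competitor never beats it. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_k = (U[:, :k] * s[:k]) @ Vt[:k]         # best rank-k approximation

err = np.linalg.norm(X - X_k, "fro")**2
assert np.isclose(err, np.sum(s[k:]**2))  # exactly the discarded spectrum

# Any other rank-k matrix does at least as badly (Eckart-Young).
B = rng.normal(size=(50, k)) @ rng.normal(size=(k, 8))
assert np.linalg.norm(X - B, "fro")**2 >= err
```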
Choosing the Number of Components
Scree plot: Plot eigenvalues versus index. Look for an "elbow" where the eigenvalues drop sharply and then level off. Keep components before the elbow.
Proportion of variance explained: Keep the smallest $k$ components such that:
$$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j} \ge \tau \quad \text{(e.g., } \tau = 0.90 \text{ or } 0.95\text{)}.$$
Kaiser's rule: Keep components with $\lambda_j > \bar{\lambda}$ (the average eigenvalue). For PCA on the correlation matrix, this means $\lambda_j > 1$.
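The two quantitative rules above are one-liners; a sketch on data with a planted spectrum (feature scales chosen so the answer is unambiguous):

```python
import numpy as np

rng = np.random.default_rng(5)
# Uncorrelated features with standard deviations 5, 4, 1, 0.5, 0.1,
# so the population eigenvalues are roughly 25, 16, 1, 0.25, 0.01.
X = rng.normal(size=(500, 5)) * np.array([5.0, 4.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
lam = s**2 / X.shape[0]                     # eigenvalues, descending

pve = np.cumsum(lam) / lam.sum()            # cumulative variance explained
k_90 = int(np.searchsorted(pve, 0.90)) + 1  # smallest k reaching 90%

k_kaiser = int((lam > lam.mean()).sum())    # Kaiser: above-average eigenvalues

assert k_90 == 2 and k_kaiser == 2          # the two large directions
```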
Connection to Matrix Concentration
When does sample PCA approximate population PCA? If $x_1, \dots, x_n$ are i.i.d. draws with population covariance $\Sigma$, then $S \to \Sigma$ as $n \to \infty$. But the rate matters.
Matrix concentration inequalities (see the matrix concentration topic) show that, for sub-Gaussian data, $\|S - \Sigma\| \lesssim \left(\sqrt{d/n} + d/n\right) \|\Sigma\|$ with high probability. The Davis-Kahan theorem then bounds the angle between sample and population eigenvectors. PCA is reliable when $n \gg d$ and the eigenvalue gaps are large. In high-dimensional settings ($d \asymp n$ or $d \gg n$), sample PCA can be highly misleading: the top sample eigenvector may point in a completely wrong direction.
Canonical Examples
PCA on 2D data
Suppose $n$ points in $\mathbb{R}^2$ form an elliptical cloud. The first PC is the major-axis direction (most variance), and the second PC is the minor axis. Projecting onto just the first PC collapses the data to a line along the major axis, retaining the most variance possible in one dimension.
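This example can be reproduced numerically; a sketch with a rotated anisotropic Gaussian (standard deviations 3 and 0.5, rotation angle arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.pi / 6                # rotate the ellipse by 30 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])) @ R.T
X = X - X.mean(axis=0)

_, s, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                       # first PC = estimated major-axis direction

# v1 recovers the rotated major axis up to sign.
major = R @ np.array([1.0, 0.0])
assert abs(v1 @ major) > 0.99

# The first PC alone retains most of the variance (about 9 / 9.25).
lam = s**2 / X.shape[0]
assert lam[0] / lam.sum() > 0.9
```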
Common Confusions
PCA and SVD are NOT the same thing
This is the most common confusion. SVD is a matrix factorization: $X = U \Sigma V^\top$. PCA is a statistical procedure that involves (1) centering the data and (2) finding directions of maximum variance. PCA uses SVD (of the centered data matrix) as its computational engine, but they are different concepts. SVD does not center the data. If you run SVD on uncentered data, you do not get PCA: for uncentered $X$, $\frac{1}{n} X^\top X$ is the second-moment matrix $\frac{1}{n} \sum_i x_i x_i^\top$, not the covariance $\frac{1}{n} \sum_i (x_i - \bar{x})(x_i - \bar{x})^\top$.
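A sketch making the difference concrete: on isotropic data with a large mean offset, the top singular vector of the raw matrix chases the mean, while PCA (center, then SVD) sees no dominant direction:

```python
import numpy as np

rng = np.random.default_rng(7)
# Isotropic variation around a mean far from the origin.
X = rng.normal(size=(500, 3)) + np.array([10.0, 0.0, 0.0])

# SVD of the *uncentered* matrix: top direction aligns with the mean.
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
assert abs(Vt_raw[0] @ np.array([1.0, 0.0, 0.0])) > 0.95

# PCA: center first. The mean direction is no longer special,
# and all eigenvalues are comparable (the data is isotropic).
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
lam = s**2 / Xc.shape[0]
assert lam[0] / lam[-1] < 2.0
```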
PCA components are not features
The principal components are linear combinations of the original features. The first PC score is $v_1^\top x = v_{11} x_1 + v_{12} x_2 + \dots + v_{1d} x_d$, which mixes all original features. Interpreting what a principal component "means" requires examining the loadings $v_{1j}$. High-dimensional PCA components are often uninterpretable.
Summary
- PCA finds directions of maximum variance: $v_1 = \arg\max_{\|v\|_2 = 1} v^\top S v$
- Principal components = eigenvectors of $S = \frac{1}{n} X^\top X$
- Eigenvalues $\lambda_j$ = variances along principal directions
- Compute via SVD of $X$ (not eigendecomposition of $X^\top X$) for stability
- Eckart-Young: truncated SVD is the best rank-$k$ approximation in Frobenius norm
- Reconstruction error = sum of discarded eigenvalues: $\|X - X_k\|_F^2 = n \sum_{j > k} \lambda_j$
- PCA requires centering; SVD alone does not center
Exercises
Problem
Show that the first principal component direction $v_1$ maximizes the projected variance $v^\top S v$ subject to $\|v\|_2 = 1$. Use the method of Lagrange multipliers.
Problem
Explain why computing PCA via the eigendecomposition of $\frac{1}{n} X^\top X$ is numerically inferior to computing PCA via the SVD of $X$. What goes wrong in floating-point arithmetic?
References
Canonical:
- Jolliffe, Principal Component Analysis (2002), Chapters 1-3
- Strang, Linear Algebra and Its Applications, Chapter on SVD
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 23
- Vershynin, High-Dimensional Probability (2018), Chapter 4 (for matrix concentration and sample PCA)
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 14
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 12
Next Topics
The natural next step from PCA:
- Random Matrix Theory Overview: what happens to PCA in high dimensions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Singular Value Decomposition (Layer 0A)
Builds on This
- Dimensionality Reduction Theory (Layer 2)
- Mechanistic Interpretability (Layer 4)
- t-SNE and UMAP (Layer 2)
- Whitening and Decorrelation (Layer 2)