
ML Methods

Principal Component Analysis

Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.


Why This Matters

[Figure: a 2D data cloud with principal component axes PC1 (78% of variance) and PC2 (15%), alongside a scree plot of variance explained per component (PC1 78%, PC2 15%, PC3 4%, PC4 2%, PC5 1%) with a cumulative curve.]

PCA is the most widely used dimensionality reduction technique in all of data science. It appears everywhere: preprocessing for ML pipelines, visualization (project to 2D), noise reduction, feature extraction, genomics (population structure), finance (factor models), and image compression. Understanding PCA means understanding how three mathematical objects relate: the covariance matrix, its eigendecomposition, and the SVD of the centered data matrix.

Mental Model

You have $n$ data points in $\mathbb{R}^d$ and want to find a low-dimensional subspace that captures as much of the variation in the data as possible. PCA finds the directions (principal components) along which the data varies most, ordered by decreasing variance. Projecting onto the top $k$ principal components gives the best rank-$k$ approximation to the centered data.

Setup and Notation

Let $X \in \mathbb{R}^{n \times d}$ be the data matrix with each row $x_i^T$ a data point. Assume the data is centered: $\bar{x} = \frac{1}{n}\sum_i x_i = 0$. If not, subtract the mean first.

Definition

Sample Covariance Matrix

The sample covariance matrix is:

$$\hat{\Sigma} = \frac{1}{n} X^T X \in \mathbb{R}^{d \times d}$$

This matrix is symmetric positive semi-definite. Its eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ are the variances along the principal directions.
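As a quick sanity check, here is a minimal numpy sketch (the data here is synthetic, for illustration only) that centers a data matrix, forms $\hat{\Sigma} = \frac{1}{n}X^T X$, and confirms it is symmetric with nonnegative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 4))        # 200 synthetic points in R^4
X = X_raw - X_raw.mean(axis=0)           # center: subtract the column means

Sigma_hat = X.T @ X / X.shape[0]         # sample covariance, d x d

# Symmetric positive semi-definite: real, nonnegative eigenvalues
eigvals = np.linalg.eigvalsh(Sigma_hat)  # returned in ascending order
```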

Definition

Principal Components

The principal components are the eigenvectors $v_1, v_2, \ldots, v_d$ of $\hat{\Sigma}$, ordered by decreasing eigenvalue. The $k$-th principal component direction $v_k$ is the direction of the $k$-th largest variance in the data, subject to being orthogonal to $v_1, \ldots, v_{k-1}$.

PCA as Variance Maximization

Theorem

PCA Maximizes Projected Variance

Statement

The first principal component $v_1$ solves:

$$v_1 = \arg\max_{\|v\|=1} \frac{1}{n}\sum_{i=1}^n (v^T x_i)^2 = \arg\max_{\|v\|=1} v^T \hat{\Sigma} v$$

The maximum value is $\lambda_1$, the largest eigenvalue of $\hat{\Sigma}$. More generally, the $k$-th principal component maximizes projected variance subject to orthogonality to the first $k-1$ components.

Intuition

Among all unit directions in $\mathbb{R}^d$, $v_1$ is the one along which the data has the most spread (variance). The second PC is the direction of greatest remaining spread after removing the $v_1$ component, and so on. Each eigenvalue tells you how much variance its direction captures.

Proof Sketch

We want $\max_{\|v\|=1} v^T \hat{\Sigma} v$. Form the Lagrangian $L = v^T \hat{\Sigma} v - \lambda(v^T v - 1)$. Setting $\nabla_v L = 0$ gives $\hat{\Sigma} v = \lambda v$, an eigenvalue equation. The objective at any eigenvector $v_k$ equals $\lambda_k$, so the maximum is $\lambda_1$, achieved at $v_1$. For subsequent PCs, add orthogonality constraints and use induction.
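The variance-maximization claim can be checked numerically. A small sketch (synthetic anisotropic data, assumed only for illustration): the projected variance along the top eigenvector equals $\lambda_1$, and no random unit direction beats it.

```python
import numpy as np

rng = np.random.default_rng(1)
# Anisotropic cloud: stds 3.0, 1.0, 0.3 along the coordinate axes
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
X -= X.mean(axis=0)
Sigma_hat = X.T @ X / len(X)

lam, V = np.linalg.eigh(Sigma_hat)       # eigenvalues in ascending order
v1, lam1 = V[:, -1], lam[-1]             # top eigenpair

def projected_variance(v):
    """Variance of the data projected onto the unit direction v/||v||."""
    v = v / np.linalg.norm(v)
    return np.mean((X @ v) ** 2)
```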

Why It Matters

This shows PCA has a clear statistical interpretation: it finds the directions that explain the most variance in the data. The eigenvalues quantify exactly how much variance each component captures, enabling the scree plot for choosing the number of components.

Failure Mode

PCA maximizes variance, not necessarily "importance." If the signal lives in a low-variance direction (e.g., a rare but meaningful feature), PCA will discard it in favor of high-variance noise. PCA also only captures linear structure; it cannot find nonlinear manifolds.

PCA via SVD

The SVD of the centered data matrix is $X = U \Sigma V^T$, where $U \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times d}$, and $V \in \mathbb{R}^{d \times d}$.

The connection: $\hat{\Sigma} = \frac{1}{n}X^T X = V \frac{\Sigma^T \Sigma}{n} V^T$.

So the right singular vectors (the columns of $V$) are the principal component directions, and the squared singular values divided by $n$ are the eigenvalues of $\hat{\Sigma}$: $\lambda_k = \sigma_k^2 / n$.

In practice, always compute PCA via the SVD of $X$, not by forming $X^T X$ and computing its eigendecomposition. The SVD is numerically more stable because forming $X^T X$ squares the condition number.
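Both routes can be compared directly on well-conditioned synthetic data, where they agree to machine precision. A minimal sketch of the identity $\lambda_k = \sigma_k^2/n$ and the matching eigenvector/singular-vector directions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
X -= X.mean(axis=0)
n = X.shape[0]

# Route 1: eigendecomposition of the covariance (avoid in practice)
lam, V_eig = np.linalg.eigh(X.T @ X / n)
lam = lam[::-1]                          # descending order
V_eig = V_eig[:, ::-1]                   # columns reordered to match

# Route 2: SVD of the centered data (preferred)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
```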

Truncated PCA and Reconstruction

Definition

Truncated PCA

The rank-$k$ PCA approximation keeps only the top $k$ principal components. The projection of a data point $x$ onto the $k$-dimensional subspace is:

$$\tilde{x} = V_k V_k^T x$$

where $V_k = [v_1, \ldots, v_k] \in \mathbb{R}^{d \times k}$.
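In code, $\tilde{x} = V_k V_k^T x$ splits naturally into two steps: compute the $k$ PC coordinates (scores), then map them back to $\mathbb{R}^d$. A sketch on synthetic data, checking that the map is an orthogonal projection (idempotent, norm non-increasing):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Vk = Vt[:k].T                            # d x k, top-k principal directions

scores = X @ Vk                          # n x k coordinates in the PC basis
X_tilde = scores @ Vk.T                  # back in R^d: each row is Vk Vk^T x
```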

Theorem

Eckart-Young Theorem (Best Low-Rank Approximation)

Statement

The best rank-$k$ approximation to $X$ in the Frobenius norm is the truncated SVD $X_k = U_k \Sigma_k V_k^T$:

$$X_k = \arg\min_{\operatorname{rank}(M) \leq k} \|X - M\|_F^2$$

The reconstruction error is:

$$\|X - X_k\|_F^2 = \sum_{j=k+1}^{r} \sigma_j^2$$

where $r = \operatorname{rank}(X)$.

Intuition

Truncated PCA gives you the closest rank-$k$ matrix to your data. The reconstruction error is exactly the sum of the squared singular values you threw away. This is why you look at the eigenvalue spectrum to decide how many components to keep.

Proof Sketch

By the SVD, $X = \sum_j \sigma_j u_j v_j^T$. Any rank-$k$ matrix can capture at most $k$ singular directions. The squared Frobenius norm of $X$ is $\sum_j \sigma_j^2$ (Parseval). Keeping the largest $k$ singular values minimizes the leftover $\sum_{j=k+1}^r \sigma_j^2$.
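The error formula is easy to verify numerically: build the rank-$k$ truncation from the SVD and compare its squared Frobenius error against the sum of the discarded $\sigma_j^2$. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # truncated SVD: best rank-k approx

err = np.linalg.norm(X - Xk, "fro") ** 2  # squared reconstruction error
```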

Why It Matters

This theorem justifies truncated PCA as optimal dimensionality reduction in the least-squares sense. It also gives a precise formula for reconstruction error in terms of the discarded eigenvalues, which is what the scree plot visualizes.

Failure Mode

Optimality is in the Frobenius norm sense. If your downstream task cares about something other than squared reconstruction error (e.g., classification accuracy), PCA may not give the best low-dimensional representation.

Choosing the Number of Components

Scree plot: Plot the eigenvalues $\lambda_1, \lambda_2, \ldots$ versus index. Look for an "elbow" where the eigenvalues drop sharply and then level off. Keep the components before the elbow.

Proportion of variance explained: Keep $k$ components such that:

$$\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} \geq 0.95 \text{ (or some other threshold)}$$

Kaiser's rule: Keep components with $\lambda_k > \bar{\lambda}$ (the average eigenvalue). For PCA on the correlation matrix, this means $\lambda_k > 1$.
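Both selection rules reduce to a few lines of numpy. A sketch on synthetic data with three dominant directions (the spectrum below is an assumption chosen to produce a visible elbow):

```python
import numpy as np

rng = np.random.default_rng(5)
# Three strong directions, seven weak ones: a clear elbow at k = 3
X = rng.normal(size=(400, 10)) @ np.diag(
    [5.0, 4.0, 3.0, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1]
)
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
lam = s**2 / len(X)                          # eigenvalues of the covariance

explained = np.cumsum(lam) / lam.sum()       # cumulative variance explained
k_95 = int(np.searchsorted(explained, 0.95)) + 1   # smallest k reaching 95%

k_kaiser = int(np.sum(lam > lam.mean()))     # Kaiser: above-average eigenvalues
```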

Connection to Matrix Concentration

When does sample PCA approximate population PCA? If $x_i \sim \mathcal{D}$ with population covariance $\Sigma$, then $\hat{\Sigma} \to \Sigma$ as $n \to \infty$. But the rate matters.

Matrix concentration inequalities (see the matrix concentration topic) show that $\|\hat{\Sigma} - \Sigma\|_{\text{op}} \leq O(\sqrt{d/n})$ with high probability. The Davis-Kahan theorem then bounds the angle between sample and population eigenvectors. PCA is reliable when $n \gg d$ and the eigenvalue gaps are large. In high-dimensional settings ($d \sim n$ or $d \gg n$), sample PCA can be highly misleading: the top sample eigenvector may point in a completely wrong direction.

Canonical Examples

Example

PCA on 2D data

Suppose $n$ points in $\mathbb{R}^2$ form an elliptical cloud with major axis along $(1/\sqrt{2}, 1/\sqrt{2})$ and minor axis along $(-1/\sqrt{2}, 1/\sqrt{2})$. The first PC is the major-axis direction (most variance), and the second PC is the minor axis. Projecting onto just the first PC collapses the data to a line along the major axis, retaining the most variance possible in one dimension.
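This example can be reproduced directly. A sketch (specific standard deviations 3 and 1 are assumptions for illustration) that generates such a cloud and confirms the first PC recovers the major-axis direction up to sign:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
major = np.array([1.0, 1.0]) / np.sqrt(2)   # major-axis direction
minor = np.array([-1.0, 1.0]) / np.sqrt(2)  # minor-axis direction
# Elliptical cloud: std 3 along the major axis, std 1 along the minor axis
X = np.outer(rng.normal(0, 3, n), major) + np.outer(rng.normal(0, 1, n), minor)
X -= X.mean(axis=0)

_, s, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                                  # first PC (sign is arbitrary)
```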

Common Confusions

Watch Out

PCA and SVD are NOT the same thing

This is the most common confusion. SVD is a matrix factorization: $X = U\Sigma V^T$. PCA is a statistical procedure that involves (1) centering the data and (2) finding directions of maximum variance. PCA uses the SVD (of the centered data matrix) as its computational engine, but they are different concepts. SVD does not center the data; if you run SVD on uncentered data, you do not get PCA. Also, the eigenvalues of $\hat{\Sigma}$ are $\sigma_k^2/n$, not $\sigma_k$.
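The centering point is easy to demonstrate. In the sketch below (synthetic data, constructed for illustration), the mean offset lies along the first coordinate axis while most of the variance lies along the second; SVD of the raw matrix chases the mean, while PCA (center, then SVD) finds the true high-variance direction:

```python
import numpy as np

rng = np.random.default_rng(7)
# Large mean offset along e1, but most *variance* along e2
X = rng.normal(size=(500, 2)) @ np.diag([0.5, 2.0]) + np.array([10.0, 0.0])

# SVD of the raw data: top direction points roughly at the mean (e1)
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)

# PCA: center first, then SVD; top direction is the high-variance axis (e2)
Xc = X - X.mean(axis=0)
_, _, Vt_pca = np.linalg.svd(Xc, full_matrices=False)
```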

Watch Out

PCA components are not features

The principal components are linear combinations of the original features. The first PC score is $v_1^T x = \sum_j v_{1j} x_j$, which mixes all the original features. Interpreting what a principal component "means" requires examining the loadings $v_{1j}$, and high-dimensional PCA components are often uninterpretable.

Summary

  • PCA finds directions of maximum variance: $\max_{\|v\|=1} v^T \hat{\Sigma} v$
  • Principal components = eigenvectors of $\hat{\Sigma} = \frac{1}{n}X^T X$
  • Eigenvalues = variances along the principal directions
  • Compute via the SVD of $X$ (not the eigendecomposition of $X^T X$) for stability
  • Eckart-Young: the truncated SVD is the best low-rank approximation
  • Reconstruction error = sum of the discarded squared singular values (i.e., $n$ times the discarded eigenvalues)
  • PCA requires centering; SVD alone does not center

Exercises

Exercise (Core)

Problem

Show that the first principal component direction $v_1$ maximizes the projected variance $\frac{1}{n}\sum_i (v^T x_i)^2$ subject to $\|v\| = 1$. Use the method of Lagrange multipliers.

Exercise (Advanced)

Problem

Explain why computing PCA via the eigendecomposition of $X^T X$ is numerically inferior to computing it via the SVD of $X$. What goes wrong in floating-point arithmetic?

References

Canonical:

  • Jolliffe, Principal Component Analysis (2002), Chapters 1-3
  • Strang, Linear Algebra and Its Applications, Chapter on SVD

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 23

  • Vershynin, High-Dimensional Probability (2018), Chapter 4 (for matrix concentration and sample PCA)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14


Last reviewed: April 2026
