Singular Value Decomposition
The SVD $A = U \Sigma V^\top$: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.
Why This Matters
The singular value decomposition is the Swiss army knife of linear algebra. It works for any matrix --- not just square matrices, not just symmetric matrices. Whenever you need to understand the geometry of a linear map, approximate a matrix by a simpler one, solve a least-squares problem, or compute a condition number, you reach for the SVD.
In machine learning, the SVD is behind PCA (truncated SVD of the centered data matrix), latent semantic analysis (SVD of the term-document matrix), recommender systems (low-rank matrix completion), the pseudoinverse (which gives the minimum-norm least-squares solution), and numerical stability analysis (condition numbers).
Mental Model
Every matrix transformation can be decomposed into three steps:
- Rotate (or reflect) the input space --- this is $V^\top$
- Scale along the coordinate axes --- this is $\Sigma$
- Rotate (or reflect) the output space --- this is $U$
The singular values $\sigma_i$ are the scaling factors. They tell you how much the matrix stretches or shrinks in each direction. The columns of $V$ are the input directions. The columns of $U$ are the output directions. This is the SVD: $A = U \Sigma V^\top$.
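This rotate-scale-rotate picture can be checked numerically. The following is a small sketch using NumPy; the matrix is an arbitrary random example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))

# Full SVD: A = U @ Sigma @ V^T
U, s, Vt = np.linalg.svd(A)

# U and V are orthogonal: they rotate/reflect, never stretch
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Vt @ Vt.T, np.eye(2))

# Applying A to x is: rotate by V^T, scale by the singular values, rotate by U
Sigma = np.zeros((3, 2))
Sigma[:2, :2] = np.diag(s)
x = rng.standard_normal(2)
assert np.allclose(A @ x, U @ (Sigma @ (Vt @ x)))
```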
Formal Setup
Singular Value Decomposition
Let $A \in \mathbb{R}^{m \times n}$ be any matrix. The singular value decomposition of $A$ is:

$$A = U \Sigma V^\top$$

where:
- $U \in \mathbb{R}^{m \times m}$ is orthogonal ($U^\top U = U U^\top = I_m$); its columns $u_1, \dots, u_m$ are the left singular vectors
- $V \in \mathbb{R}^{n \times n}$ is orthogonal ($V^\top V = V V^\top = I_n$); its columns $v_1, \dots, v_n$ are the right singular vectors
- $\Sigma \in \mathbb{R}^{m \times n}$ is "diagonal" with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0$ on the main diagonal (and zeros elsewhere)
The values $\sigma_i$ are the singular values of $A$.
Thin SVD: In practice, if $m \ge n$ (tall matrix), we often use the compact form $A = \hat{U} \hat{\Sigma} V^\top$, where $\hat{U} \in \mathbb{R}^{m \times n}$ has orthonormal columns, $\hat{\Sigma} = \mathrm{diag}(\sigma_1, \dots, \sigma_n) \in \mathbb{R}^{n \times n}$, and $V \in \mathbb{R}^{n \times n}$ is orthogonal.
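In NumPy, the `full_matrices` flag of `np.linalg.svd` switches between the full and thin forms. A sketch of the shapes for a tall 5×3 example:

```python
import numpy as np

A = np.arange(15, dtype=float).reshape(5, 3)  # tall: m=5, n=3

# Full SVD: U is 5x5, V is 3x3 (Sigma would be 5x3)
U, s, Vt = np.linalg.svd(A, full_matrices=True)
assert U.shape == (5, 5) and s.shape == (3,) and Vt.shape == (3, 3)

# Thin SVD: U-hat is 5x3, Sigma-hat is 3x3 -- still enough to rebuild A
Uh, sh, Vth = np.linalg.svd(A, full_matrices=False)
assert Uh.shape == (5, 3)
assert np.allclose(A, Uh @ np.diag(sh) @ Vth)
```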
Singular Values and Eigenvalues Relationship
The singular values of $A$ are the square roots of the eigenvalues of $A^\top A$ (or equivalently, of $A A^\top$):

$$\sigma_i = \sqrt{\lambda_i(A^\top A)}$$

The right singular vectors $v_i$ are eigenvectors of $A^\top A$. The left singular vectors $u_i$ are eigenvectors of $A A^\top$.
Why this works: $A^\top A$ is symmetric and positive semidefinite, so by the spectral theorem it has non-negative real eigenvalues and orthogonal eigenvectors. The SVD inherits this structure.
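A quick numerical check of this relationship, on an arbitrary illustrative matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

s = np.linalg.svd(A, compute_uv=False)      # singular values, descending
lam = np.linalg.eigvalsh(A.T @ A)[::-1]     # eigenvalues of A^T A, descending

# sigma_i = sqrt(lambda_i(A^T A))
assert np.allclose(s, np.sqrt(lam))
```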
Main Theorems
Existence and Uniqueness of the SVD
Statement
Every matrix $A \in \mathbb{R}^{m \times n}$ has a singular value decomposition $A = U \Sigma V^\top$. The singular values $\sigma_1 \ge \sigma_2 \ge \cdots$ are uniquely determined. If all singular values are distinct, the left and right singular vectors are unique up to sign.
Intuition
The SVD always exists because $A^\top A$ is always symmetric positive semidefinite, so the spectral theorem always applies to it. The SVD generalizes the eigendecomposition to non-square, non-symmetric matrices by factoring through the symmetric matrices $A^\top A$ and $A A^\top$.
Proof Sketch
$A^\top A$ is symmetric positive semidefinite ($x^\top A^\top A x = \|Ax\|^2 \ge 0$). By the spectral theorem, $A^\top A = V \Lambda V^\top$ with $V$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ where $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$.
For each $\lambda_i > 0$, set $\sigma_i = \sqrt{\lambda_i}$ and define $u_i = A v_i / \sigma_i$. Then $u_i$ is a unit vector: $\|u_i\|^2 = v_i^\top A^\top A v_i / \sigma_i^2 = \lambda_i / \sigma_i^2 = 1$. The $u_i$ are orthonormal: $u_i^\top u_j = v_i^\top A^\top A v_j / (\sigma_i \sigma_j) = (\lambda_j / \sigma_i \sigma_j)\, v_i^\top v_j = 0$ for $i \ne j$.
Extend $\{u_1, \dots, u_r\}$ (where $r = \mathrm{rank}(A)$) to an orthonormal basis of $\mathbb{R}^m$. Then $A = U \Sigma V^\top$ by construction, since $A v_i = \sigma_i u_i$ for $i \le r$ and $A v_i = 0$ for $i > r$.
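The construction in the sketch can be replayed numerically: take $V$ from the spectral decomposition of $A^\top A$, form $u_i = A v_i / \sigma_i$, and observe that the resulting columns are automatically orthonormal. A minimal sketch for a generic full-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))

# Spectral decomposition of A^T A gives V and the squared singular values
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]        # sort eigenpairs in descending order
lam, V = lam[order], V[:, order]
sigma = np.sqrt(lam)

# u_i = A v_i / sigma_i -- orthonormal by the argument above
U = A @ V / sigma
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(A, U @ np.diag(sigma) @ V.T)
```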
Why It Matters
The SVD is universal --- it works for every matrix, every shape, every rank. This makes it the default tool for understanding the geometry of linear maps. The eigendecomposition only works for square matrices, and its eigenvector matrix is only orthogonal for symmetric matrices. The SVD has no such limitations.
Failure Mode
When singular values are repeated ($\sigma_i = \sigma_{i+1}$), the corresponding singular vectors are not unique --- any orthonormal basis of the associated singular subspace works. This non-uniqueness can cause numerical instability or inconsistency across runs. In PCA, repeated eigenvalues mean the corresponding principal components can be any rotation within the eigenspace.
Eckart-Young-Mirsky Theorem (Best Low-Rank Approximation)
Statement
Let $A \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0$. The best rank-$k$ approximation to $A$ (in both the Frobenius norm and the spectral norm) is the truncated SVD:

$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$$

The approximation errors are:

$$\|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\textstyle\sum_{i=k+1}^{\min(m,n)} \sigma_i^2}$$

No rank-$k$ matrix achieves a smaller error.
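Both error formulas can be verified numerically: truncate the SVD at rank $k$ and compare the errors to the discarded singular values (the example matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A)
k = 2

# Truncated SVD: keep only the top-k singular triples
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k

# Spectral-norm error is sigma_{k+1}; Frobenius error is the root-sum of the tail
assert np.allclose(np.linalg.norm(A - A_k, 2), s[k])
assert np.allclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```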
Intuition
The SVD sorts the "importance" of each direction by singular value. The best rank-$k$ approximation keeps the $k$ most important directions and discards the rest. The discarded singular values exactly quantify the approximation error. This is optimal --- you cannot do better by choosing any other rank-$k$ matrix.
Proof Sketch
For the spectral norm: let $B$ be any rank-$k$ matrix. Then $\ker(B)$ has dimension $\ge n - k$. The subspace $\mathrm{span}(v_1, \dots, v_{k+1})$ has dimension $k + 1$. By dimension counting, their intersection contains a nonzero unit vector $w$ with $Bw = 0$. Then $\|A - B\|_2 \ge \|(A - B)w\|_2 = \|Aw\|_2 \ge \sigma_{k+1}$ (since $Bw = 0$ and $w$ lives in the span of the first $k + 1$ right singular vectors). The truncated SVD achieves this bound with equality.
For the Frobenius norm, use the fact that $\|A\|_F^2 = \sum_i \sigma_i^2$ and apply a similar subspace argument.
Why It Matters
Eckart-Young is why dimensionality reduction works. PCA keeps the top $k$ principal components, which is the truncated SVD of the centered data matrix. Latent semantic analysis uses truncated SVD of the term-document matrix. Matrix completion (Netflix problem) assumes the true rating matrix is approximately low-rank. In all these cases, Eckart-Young guarantees that truncated SVD is the optimal low-rank approximation.
The theorem also explains the "scree plot" in PCA: the singular values tell you how much information you lose by truncating at rank $k$. A sharp drop after $\sigma_k$ means the rank-$k$ approximation is good.
Failure Mode
Eckart-Young gives the best approximation in a global least-squares sense. It does not guarantee that the approximation preserves specific structure (sparsity, non-negativity, interpretability). For structured approximations, you need methods like non-negative matrix factorization (NMF) or sparse PCA, which solve harder optimization problems and do not have clean closed-form solutions.
Key Applications
Pseudoinverse. The Moore-Penrose pseudoinverse of $A = U \Sigma V^\top$ is $A^+ = V \Sigma^+ U^\top$, where $\Sigma^+ \in \mathbb{R}^{n \times m}$ replaces each nonzero diagonal entry $\sigma_i$ with $1/\sigma_i$. For the least-squares problem $\min_x \|Ax - b\|_2$, the minimum-norm solution is $x^* = A^+ b$.
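In NumPy, `np.linalg.pinv` implements this SVD recipe, and `np.linalg.lstsq` returns the same minimum-norm solution for an underdetermined system. A sketch on an arbitrary wide matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 5))   # wide: infinitely many least-squares solutions
b = rng.standard_normal(3)

# Pseudoinverse via the thin SVD: A^+ = V Sigma^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# x = A^+ b is the minimum-norm solution; lstsq agrees
x = A_pinv @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x, x_lstsq)
assert np.allclose(A @ x, b)      # wide full-rank system: exact fit, zero residual
```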
Condition number. The condition number of $A$ is:

$$\kappa_2(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$$

A large condition number means $A$ is nearly singular: small perturbations to the input cause large changes in the output. The condition number controls the numerical stability of solving $Ax = b$ and the convergence rate of iterative methods applied to $A$.
Matrix norms. The spectral norm (operator norm) is $\|A\|_2 = \sigma_1$. The Frobenius norm is $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$. The nuclear norm is $\|A\|_* = \sum_i \sigma_i$.
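All three norms and the condition number fall out of a single set of singular values; NumPy's built-ins agree with the SVD formulas:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
s = np.linalg.svd(A, compute_uv=False)   # descending singular values

assert np.allclose(np.linalg.norm(A, 2), s[0])                        # spectral
assert np.allclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))   # Frobenius
assert np.allclose(np.linalg.norm(A, 'nuc'), np.sum(s))               # nuclear
assert np.allclose(np.linalg.cond(A), s[0] / s[-1])                   # kappa_2
```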
Canonical Examples
SVD of a 2x2 matrix
Let $A = \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}$ with $\sigma_1 \ge \sigma_2 > 0$. This is already in SVD form: $U = I$, $V = I$, with singular values $\sigma_1$ and $\sigma_2$.
The condition number is $\kappa_2(A) = \sigma_1 / \sigma_2$. The best rank-1 approximation is $A_1 = \sigma_1 u_1 v_1^\top = \begin{pmatrix} \sigma_1 & 0 \\ 0 & 0 \end{pmatrix}$ with error $\|A - A_1\|_2 = \sigma_2$.
PCA as truncated SVD of the data matrix
Given a centered data matrix $X \in \mathbb{R}^{n \times d}$ (rows are observations), write $X = U \Sigma V^\top$. Then:
- The sample covariance is $C = \frac{1}{n-1} X^\top X = \frac{1}{n-1} V \Sigma^\top \Sigma V^\top$
- The eigenvectors of $C$ are the columns of $V$ (right singular vectors of $X$)
- The eigenvalues of $C$ are $\lambda_i = \sigma_i^2 / (n - 1)$ (squared singular values of $X$, scaled)
- The principal component scores are $XV = U\Sigma$ (the projections onto the PCs)
- Truncating to $k$ components uses $X_k = U_k \Sigma_k V_k^\top$ (Eckart-Young optimal)
So PCA and SVD are two views of the same computation. PCA works through the covariance matrix; SVD works directly with the data matrix.
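A sketch of the two equivalent computations --- covariance eigendecomposition versus direct SVD of the data matrix --- on a small random data set:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 4))
X = X - X.mean(axis=0)                   # center the data first

# Route 1: eigendecomposition of the sample covariance
C = X.T @ X / (X.shape[0] - 1)
evals = np.linalg.eigvalsh(C)[::-1]      # descending order

# Route 2: SVD of the data matrix itself
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of C are sigma_i^2 / (n - 1)
assert np.allclose(evals, s**2 / (X.shape[0] - 1))

# Scores: the projections X @ V equal U @ Sigma
scores = X @ Vt.T
assert np.allclose(scores, U * s)
```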
Common Confusions
SVD is not the same as eigendecomposition
The eigendecomposition $A = P \Lambda P^{-1}$ requires $A$ to be square and diagonalizable; $P$ is generally not orthogonal; and eigenvalues can be negative or complex. The SVD works for any matrix; $U$ and $V$ are always orthogonal; and singular values are always non-negative reals. For symmetric positive semidefinite matrices, the SVD and eigendecomposition coincide ($U = V$ and $\sigma_i = \lambda_i$).
Truncated SVD is not the SVD of the truncated matrix
The truncated SVD $A_k$ is formed by taking the full SVD of $A$ and zeroing out the smallest singular values. It is not computed by first truncating $A$ somehow and then computing an SVD. The approximation is optimal precisely because it uses the full SVD to determine which directions to keep.
The condition number involves the smallest nonzero singular value
The condition number uses the smallest nonzero singular value $\sigma_r$, where $r = \mathrm{rank}(A)$. If $A$ is rank-deficient ($r < \min(m, n)$), the condition number is infinite, reflecting the fact that $A$ is singular and cannot be inverted. Some authors define $\kappa$ using $\sigma_{\min(m,n)}$ (which is zero whenever $A$ is rank-deficient), so always check the convention.
Summary
- SVD exists for every matrix: $A = U \Sigma V^\top$ (orthogonal, diagonal, orthogonal)
- Geometric interpretation: rotate ($V^\top$), scale ($\Sigma$), rotate ($U$)
- Singular values of $A$ = square roots of eigenvalues of $A^\top A$
- Eckart-Young: truncated SVD $A_k$ is the best rank-$k$ approximation
- Condition number $\kappa_2(A) = \sigma_{\max}/\sigma_{\min}$ measures sensitivity to perturbations
- PCA = truncated SVD of the centered data matrix
- Pseudoinverse $A^+ = V \Sigma^+ U^\top$ gives the minimum-norm least-squares solution
Exercises
Problem
Compute the SVD of . What is its rank? What is the best rank-1 approximation?
Problem
Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = U \Sigma V^\top$. Show that the nuclear norm $\|A\|_* = \sum_i \sigma_i$ equals $\mathrm{tr}\big(\sqrt{A^\top A}\big)$. Why does the nuclear norm arise as the convex relaxation of the rank function in matrix completion problems?
References
Canonical:
- Golub & Van Loan, Matrix Computations (4th ed., 2013), Chapter 2
- Horn & Johnson, Matrix Analysis (2nd ed., 2012), Chapter 7
- Strang, Linear Algebra and Its Applications (4th ed., 2006), Chapter 6
Current:
- Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 4-5
- Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011) --- randomized SVD
Next Topics
Building on the SVD:
- Principal component analysis: SVD of the data matrix in action
- Conditioning and condition number: the ratio $\sigma_{\max}/\sigma_{\min}$ and numerical stability
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Distributional Semantics (Layer 2)
- Principal Component Analysis (Layer 1)