
Foundations

Singular Value Decomposition

The SVD A = U Sigma V^T: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.


Why This Matters

[Figure: the SVD $A_{m \times n} = U_{m \times m} \Sigma_{m \times n} V^\top_{n \times n}$, visualized as rotate, scale, rotate. $\Sigma$ is diagonal with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, where the $\sigma_i$ are the singular values and $r = \operatorname{rank}(A)$. Columns of $U$: left singular vectors. Columns of $V$: right singular vectors.]

The singular value decomposition is the Swiss army knife of linear algebra. It works for any matrix --- not just square matrices, not just symmetric matrices. Whenever you need to understand the geometry of a linear map, approximate a matrix by a simpler one, solve a least-squares problem, or compute a condition number, you reach for the SVD.

In machine learning, the SVD is behind PCA (truncated SVD of the centered data matrix), latent semantic analysis (SVD of the term-document matrix), recommender systems (low-rank matrix completion), the pseudoinverse (which gives the minimum-norm least-squares solution), and numerical stability analysis (condition numbers).

Mental Model

Every matrix transformation can be decomposed into three steps:

  1. Rotate (or reflect) the input space --- this is $V^\top$
  2. Scale along the coordinate axes --- this is $\Sigma$
  3. Rotate (or reflect) the output space --- this is $U$

The singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ are the scaling factors. They tell you how much the matrix stretches or shrinks in each direction. The columns of $V$ are the input directions. The columns of $U$ are the output directions. This is the SVD: $A = U\Sigma V^\top$.
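The rotate-scale-rotate picture can be checked numerically. A minimal sketch using NumPy (the matrix here is a made-up example):

```python
import numpy as np

# A hypothetical 2x3 matrix to illustrate the three-step decomposition.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 0.0]])

U, s, Vt = np.linalg.svd(A)           # full SVD: U is 2x2, Vt is 3x3
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)  # embed singular values in a 2x3 "diagonal"

# Rotate (V^T), scale (Sigma), rotate (U) reassembles A.
assert np.allclose(U @ Sigma @ Vt, A)

# U and V are orthogonal, and the singular values are sorted descending.
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(3))
assert np.all(np.diff(s) <= 0)
```

`np.linalg.svd` returns the singular values as a 1-D array `s` rather than the full matrix $\Sigma$, so the sketch embeds them in a rectangular diagonal by hand.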

Formal Setup

Definition

Singular Value Decomposition

Let $A \in \mathbb{R}^{m \times n}$ be any matrix. The singular value decomposition of $A$ is:

$$A = U \Sigma V^\top$$

where:

  • $U \in \mathbb{R}^{m \times m}$ is orthogonal ($U^\top U = I$); its columns are the left singular vectors
  • $V \in \mathbb{R}^{n \times n}$ is orthogonal ($V^\top V = I$); its columns are the right singular vectors
  • $\Sigma \in \mathbb{R}^{m \times n}$ is "diagonal" with non-negative entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ on the main diagonal (and zeros elsewhere)

The values $\sigma_i$ are the singular values of $A$.

Thin SVD: In practice, if $m > n$ (tall matrix), we often use the compact form $A = U_r \Sigma_r V_r^\top$ where $r = \operatorname{rank}(A)$, $U_r \in \mathbb{R}^{m \times r}$, $\Sigma_r \in \mathbb{R}^{r \times r}$, and $V_r \in \mathbb{R}^{n \times r}$.
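In NumPy, the thin form is obtained with `full_matrices=False`. A small sketch on a hypothetical tall matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))    # tall matrix, m > n

# full_matrices=False gives the thin SVD: U is 100x5, Vt is 5x5,
# instead of the full 100x100 U of the complete decomposition.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert U.shape == (100, 5) and s.shape == (5,) and Vt.shape == (5, 5)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```

For large tall matrices the thin SVD is the one you want: it avoids storing the $m \times m$ matrix $U$ when only the first $n$ columns carry information about $A$.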

Definition

Singular Values and Eigenvalues Relationship

The singular values of $A$ are the square roots of the eigenvalues of $A^\top A$ (or equivalently, of $A A^\top$):

$$\sigma_i = \sqrt{\lambda_i(A^\top A)}$$

The right singular vectors $v_i$ are eigenvectors of $A^\top A$. The left singular vectors $u_i$ are eigenvectors of $A A^\top$.

Why this works: $A^\top A$ is symmetric and positive semidefinite, so by the spectral theorem it has non-negative real eigenvalues and orthogonal eigenvectors. The SVD inherits this structure.
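The relationship $\sigma_i^2 = \lambda_i(A^\top A)$ is easy to verify numerically; a sketch with an arbitrary small matrix:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

s = np.linalg.svd(A, compute_uv=False)   # singular values only, descending

# eigvalsh handles the symmetric matrix A^T A; it returns ascending eigenvalues.
lam = np.linalg.eigvalsh(A.T @ A)

# Squared singular values match the eigenvalues of A^T A.
assert np.allclose(np.sort(s**2), lam)
```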

Main Theorems

Theorem

Existence and Uniqueness of the SVD

Statement

Every matrix $A \in \mathbb{R}^{m \times n}$ has a singular value decomposition $A = U\Sigma V^\top$. The singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ are uniquely determined. If all singular values are distinct, the left and right singular vectors are unique up to sign.

Intuition

The SVD always exists because $A^\top A$ is always symmetric positive semidefinite, so the spectral theorem always applies to it. The SVD generalizes eigendecomposition to non-square, non-symmetric matrices by factoring through the symmetric matrices $A^\top A$ and $A A^\top$.

Proof Sketch

$A^\top A$ is symmetric positive semidefinite ($x^\top A^\top A x = \|Ax\|^2 \geq 0$). By the spectral theorem, $A^\top A = V \Lambda V^\top$ with orthogonal $V$ and $\Lambda = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ where $\sigma_i \geq 0$.

For each $\sigma_i > 0$, define $u_i = A v_i / \sigma_i$. Then $u_i$ is a unit vector: $\|u_i\|^2 = v_i^\top A^\top A v_i / \sigma_i^2 = \sigma_i^2 / \sigma_i^2 = 1$. The $u_i$ are orthonormal: $u_i^\top u_j = v_i^\top A^\top A v_j / (\sigma_i \sigma_j) = \sigma_j^2 v_i^\top v_j / (\sigma_i \sigma_j) = 0$ for $i \neq j$.

Extend $\{u_1, \ldots, u_r\}$ to an orthonormal basis of $\mathbb{R}^m$. Then $A = U\Sigma V^\top$ by construction, since $A v_i = \sigma_i u_i$ for $i = 1, \ldots, r$ and $A v_i = 0$ for $i > r$.
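The construction in the proof can be carried out directly in code. A sketch that builds the thin SVD from the eigendecomposition of $A^\top A$, assuming a full-column-rank matrix so every $u_i = A v_i / \sigma_i$ is defined:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # a random matrix; full column rank a.s.

# Spectral decomposition of A^T A, sorted by descending eigenvalue.
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(lam)

# u_i = A v_i / sigma_i, as in the proof sketch (column-wise broadcasting).
U = (A @ V) / sigma

assert np.allclose(U.T @ U, np.eye(3))   # the u_i come out orthonormal
assert np.allclose((U * sigma) @ V.T, A) # thin SVD reconstructs A
```

This is a proof-of-concept, not how libraries compute the SVD in practice: forming $A^\top A$ squares the condition number, so production algorithms work on $A$ directly.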

Why It Matters

The SVD is universal --- it works for every matrix, every shape, every rank. This makes it the default tool for understanding the geometry of linear maps. The eigendecomposition only works for square matrices and is only orthogonal for symmetric matrices. The SVD has no such limitations.

Failure Mode

When singular values are repeated ($\sigma_i = \sigma_{i+1}$), the corresponding singular vectors are not unique --- any orthonormal basis of the associated singular subspace works. This non-uniqueness can cause numerical instability or inconsistency across runs. In PCA, repeated eigenvalues mean the corresponding principal components can be any rotation within the eigenspace.

Theorem

Eckart-Young-Mirsky Theorem (Best Low-Rank Approximation)

Statement

Let $A = U\Sigma V^\top$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$. The best rank-$k$ approximation to $A$ (in both the Frobenius norm and the spectral norm) is the truncated SVD:

$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$$

The approximation errors are:

$$\|A - A_k\|_F = \sqrt{\sum_{i=k+1}^r \sigma_i^2}, \qquad \|A - A_k\|_2 = \sigma_{k+1}$$

No rank-$k$ matrix achieves a smaller error.

Intuition

The SVD sorts the "importance" of each direction by singular value. The best rank-$k$ approximation keeps the $k$ most important directions and discards the rest. The discarded singular values exactly quantify the approximation error. This is optimal --- you cannot do better by choosing any other rank-$k$ matrix.

Proof Sketch

For the spectral norm: let $B$ be any rank-$k$ matrix. Then $\operatorname{null}(B)$ has dimension $\geq n - k$. The subspace $V_k = \operatorname{span}(v_1, \ldots, v_{k+1})$ has dimension $k+1$. By dimension counting, $\operatorname{null}(B) \cap V_k$ contains a nonzero vector $w$ with $\|w\| = 1$. Then $\|A - B\|_2 \geq \|(A - B)w\| = \|Aw\| \geq \sigma_{k+1}$ (since $Bw = 0$ and $w$ lives in the span of the first $k+1$ right singular vectors). The truncated SVD $A_k$ achieves this bound with equality.

For the Frobenius norm, use the fact that $\|A\|_F^2 = \sum_i \sigma_i^2$ and apply a similar subspace argument.
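Both error formulas can be checked numerically. A sketch on a random test matrix, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Truncated SVD: keep the top-k singular triplets.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Errors predicted by Eckart-Young-Mirsky:
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```

Note the index convention: with 0-based arrays, the spectral-norm error $\sigma_{k+1}$ of the theorem is `s[k]`.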

Why It Matters

Eckart-Young is why dimensionality reduction works. PCA keeps the top $k$ principal components, which is the truncated SVD of the centered data matrix. Latent semantic analysis uses truncated SVD of the term-document matrix. Matrix completion (the Netflix problem) assumes the true rating matrix is approximately low-rank. In all these cases, Eckart-Young guarantees that truncated SVD is the optimal low-rank approximation.

The theorem also explains the "scree plot" in PCA: the singular values $\sigma_i$ tell you how much information you lose by truncating at rank $k$. A sharp drop after $\sigma_k$ means the rank-$k$ approximation is good.
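The quantity behind the scree plot is the fraction of the squared Frobenius norm ("energy") captured by each truncation rank. A sketch with made-up singular values:

```python
import numpy as np

# Hypothetical singular values with a sharp drop after the second one.
s = np.array([10.0, 5.0, 1.0, 0.5])

# Cumulative energy fraction captured by a rank-k truncation (k = 1, 2, ...).
energy = np.cumsum(s**2) / np.sum(s**2)

assert np.isclose(energy[-1], 1.0)  # keeping everything loses nothing
assert energy[1] > 0.98             # two components capture ~99% here
```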

Failure Mode

Eckart-Young gives the best approximation in a global least-squares sense. It does not guarantee that the approximation preserves specific structure (sparsity, non-negativity, interpretability). For structured approximations, you need methods like non-negative matrix factorization (NMF) or sparse PCA, which solve harder optimization problems and do not have clean closed-form solutions.

Key Applications

Pseudoinverse. The Moore-Penrose pseudoinverse of $A$ is $A^+ = V \Sigma^+ U^\top$, where $\Sigma^+$ replaces each nonzero diagonal entry $\sigma_i$ with $1/\sigma_i$. For the least-squares problem $\min_x \|Ax - b\|$, the minimum-norm solution is $x^+ = A^+ b$.
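The formula $A^+ = V \Sigma^+ U^\top$ can be compared against NumPy's built-in `pinv`; a sketch for the full-rank case, where every $\sigma_i$ is nonzero:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))   # full column rank almost surely
b = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T   # A^+ = V Sigma^+ U^T

# Matches the library pseudoinverse.
assert np.allclose(A_pinv, np.linalg.pinv(A))

# x = A^+ b satisfies the normal equations A^T A x = A^T b,
# i.e. it is the least-squares solution of min ||Ax - b||.
x = A_pinv @ b
assert np.allclose(A.T @ A @ x, A.T @ b)
```

In the rank-deficient case one inverts only the nonzero $\sigma_i$ (in floating point, those above a tolerance), which is exactly what `np.linalg.pinv` does internally.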

Condition number. The condition number of $A$ is:

$$\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_r}$$

A large condition number means $A$ is nearly singular: small perturbations to the input cause large changes in the output. The condition number controls the numerical stability of solving $Ax = b$ and the convergence rate of iterative methods applied to $A^\top A$.

Matrix norms. The spectral norm (operator norm) is $\|A\|_2 = \sigma_1$. The Frobenius norm is $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$. The nuclear norm is $\|A\|_* = \sum_i \sigma_i$.
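All three norms and the condition number fall out of the singular values directly; a sketch on a diagonal matrix where the answers are easy to read off:

```python
import numpy as np

A = np.diag([4.0, 2.0, 1.0])              # singular values are 4, 2, 1
s = np.linalg.svd(A, compute_uv=False)

assert np.isclose(np.linalg.norm(A, 2), s[0])                  # spectral: sigma_1
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))
assert np.isclose(np.linalg.norm(A, 'nuc'), np.sum(s))         # nuclear: sum
assert np.isclose(np.linalg.cond(A, 2), s[0] / s[-1])          # kappa = s1/sr
```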

Canonical Examples

Example

SVD of a 2x2 matrix

Let $A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}$. This is already in SVD form: $U = V = I$, $\Sigma = A$, with $\sigma_1 = 3$ and $\sigma_2 = 1$.

The condition number is $\kappa = 3/1 = 3$. The best rank-1 approximation is $A_1 = \sigma_1 u_1 v_1^\top = 3 e_1 e_1^\top = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix}$ with error $\|A - A_1\|_F = \sigma_2 = 1$.

Example

PCA as truncated SVD of the data matrix

Given a centered data matrix $X \in \mathbb{R}^{n \times d}$ (rows are observations), write $X = U\Sigma V^\top$. Then:

  • The sample covariance is $\hat{\Sigma} = \frac{1}{n} X^\top X = V \frac{\Sigma^2}{n} V^\top$
  • The eigenvectors of $\hat{\Sigma}$ are the columns of $V$ (the right singular vectors of $X$)
  • The eigenvalues of $\hat{\Sigma}$ are $\sigma_i^2 / n$ (squared singular values of $X$, scaled)
  • The principal component scores are $XV = U\Sigma$ (the projections onto the PCs)
  • Truncating to $k$ components uses $X_k = U_k \Sigma_k V_k^\top$ (Eckart-Young optimal)

So PCA and SVD are two views of the same computation. PCA works through the covariance matrix; SVD works directly with the data matrix.
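The equivalence of the two routes can be verified on synthetic data; a sketch assuming NumPy, with made-up anisotropic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 200 observations in 3 dimensions with unequal spread.
X = rng.standard_normal((200, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)               # center: PCA requires a centered matrix

# SVD route.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
n = X.shape[0]

# Covariance route: eigenvalues of (1/n) X^T X, sorted descending.
lam = np.linalg.eigvalsh(X.T @ X / n)[::-1]

assert np.allclose(lam, s**2 / n)    # covariance eigenvalues = sigma_i^2 / n
assert np.allclose(X @ Vt.T, U * s)  # PC scores: XV = U Sigma
```

The SVD route is usually preferred numerically: forming $X^\top X$ squares the condition number of the problem, while the SVD works on $X$ directly.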

Common Confusions

Watch Out

SVD is not the same as eigendecomposition

The eigendecomposition $A = P D P^{-1}$ requires $A$ to be square and diagonalizable; $P$ is generally not orthogonal; and eigenvalues can be negative or complex. The SVD $A = U\Sigma V^\top$ works for any matrix; $U$ and $V$ are always orthogonal; and singular values are always non-negative reals. For symmetric positive semidefinite matrices, the SVD and eigendecomposition coincide ($U = V = Q$ and $\Sigma = \Lambda$).

Watch Out

Truncated SVD is not the SVD of the truncated matrix

The truncated SVD $A_k$ is formed by taking the full SVD and zeroing out the smallest $r - k$ singular values. It is not computed by first truncating $A$ somehow and then computing an SVD. The approximation is optimal precisely because it uses the full SVD to determine which directions to keep.

Watch Out

The condition number involves the smallest nonzero singular value

The condition number $\kappa = \sigma_1 / \sigma_r$ uses the smallest nonzero singular value. If $A$ is rank-deficient (so that $\sigma_{\min(m,n)} = 0$), the condition number with respect to inversion is infinite, reflecting the fact that $A$ is singular and cannot be inverted. Some authors define $\kappa$ using $\sigma_n$ (which might be zero for tall matrices), so always check the convention.

Summary

  • SVD exists for every matrix: $A = U\Sigma V^\top$ (orthogonal, diagonal, orthogonal)
  • Geometric interpretation: rotate, scale, rotate
  • Singular values of $A$ = square roots of eigenvalues of $A^\top A$
  • Eckart-Young: truncated SVD is the best rank-$k$ approximation
  • Condition number $\kappa = \sigma_{\max}/\sigma_{\min}$ measures sensitivity to perturbations
  • PCA = truncated SVD of the centered data matrix
  • Pseudoinverse $A^+ = V \Sigma^+ U^\top$ gives the minimum-norm least-squares solution

Exercises

ExerciseCore

Problem

Compute the SVD of $A = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$. What is its rank? What is the best rank-1 approximation?

ExerciseAdvanced

Problem

Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^\top$. Show that the nuclear norm $\|A\|_* = \sum_i \sigma_i$ equals $\max\{\operatorname{tr}(B^\top A) : \|B\|_2 \leq 1\}$. Why does the nuclear norm arise as the convex relaxation of the rank function in matrix completion problems?

References

Canonical:

  • Golub & Van Loan, Matrix Computations (4th ed., 2013), Chapter 2
  • Horn & Johnson, Matrix Analysis (2nd ed., 2012), Chapter 7
  • Strang, Linear Algebra and Its Applications (4th ed., 2006), Chapter 6

Current:

  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 4-5
  • Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011) --- randomized SVD

Next Topics

Building on the SVD:

  • Principal component analysis: SVD of the data matrix in action
  • Conditioning and condition number: the ratio $\sigma_1/\sigma_r$ and numerical stability

Last reviewed: April 2026
