
Foundations

Singular Value Decomposition

The SVD A = U Sigma V^T: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.


Why This Matters

[Figure: the SVD $A_{m \times n} = U_{m \times m} \Sigma_{m \times n} V^\top_{n \times n}$, visualized as rotate, scale, rotate. $\Sigma$ is diagonal with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, where the $\sigma_i$ are the singular values and $r = \operatorname{rank}(A)$. Columns of $U$: left singular vectors. Columns of $V$: right singular vectors.]

The singular value decomposition is the Swiss army knife of linear algebra. It works for any matrix --- not just square matrices, not just symmetric matrices. Whenever you need to understand the geometry of a linear map, approximate a matrix by a simpler one, solve a least-squares problem, or compute a condition number, you reach for the SVD.

In machine learning, the SVD is behind PCA (truncated SVD of the centered data matrix), latent semantic analysis (SVD of the term-document matrix), recommender systems (low-rank matrix completion), the pseudoinverse (which gives the minimum-norm least-squares solution), and numerical stability analysis (condition numbers).

Mental Model

Every matrix transformation can be decomposed into three steps:

  1. Rotate (or reflect) the input space --- this is $V^\top$
  2. Scale along the coordinate axes --- this is $\Sigma$
  3. Rotate (or reflect) the output space --- this is $U$

The singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ are the scaling factors. They tell you how much the matrix stretches or shrinks in each direction. The columns of $V$ are the input directions. The columns of $U$ are the output directions. This is the SVD: $A = U\Sigma V^\top$.
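The rotate-scale-rotate picture can be checked numerically. A minimal sketch using NumPy (the matrix here is a made-up example):

```python
import numpy as np

# A hypothetical 2x3 matrix to illustrate the three-step decomposition.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 0.0]])

U, s, Vt = np.linalg.svd(A)           # full SVD: U is 2x2, Vt is 3x3
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)  # embed singular values in a 2x3 "diagonal"

# Rotate (V^T), scale (Sigma), rotate (U) reassembles A.
assert np.allclose(U @ Sigma @ Vt, A)

# U and V are orthogonal, and the singular values are sorted descending.
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(3))
assert np.all(np.diff(s) <= 0)
```

`np.linalg.svd` returns the singular values as a 1-D array `s` rather than the full matrix $\Sigma$, so the sketch embeds them in a rectangular diagonal by hand.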

Formal Setup

Definition

Singular Value Decomposition

Let $A \in \mathbb{R}^{m \times n}$ be any matrix. The singular value decomposition of $A$ is:

$$A = U \Sigma V^\top$$

where:

  • $U \in \mathbb{R}^{m \times m}$ is orthogonal ($U^\top U = I$); its columns are the left singular vectors
  • $V \in \mathbb{R}^{n \times n}$ is orthogonal ($V^\top V = I$); its columns are the right singular vectors
  • $\Sigma \in \mathbb{R}^{m \times n}$ is "diagonal" with non-negative entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ on the main diagonal (and zeros elsewhere)

The values $\sigma_i$ are the singular values of $A$.

Thin SVD: In practice, if $m > n$ (tall matrix), we often use the compact form $A = U_r \Sigma_r V_r^\top$ where $r = \operatorname{rank}(A)$, $U_r \in \mathbb{R}^{m \times r}$, $\Sigma_r \in \mathbb{R}^{r \times r}$, and $V_r \in \mathbb{R}^{n \times r}$.
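In NumPy, the thin form is obtained with `full_matrices=False`. A small sketch on a hypothetical tall matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))    # tall matrix, m > n

# full_matrices=False gives the thin SVD: U is 100x5, Vt is 5x5,
# instead of the full 100x100 U of the complete decomposition.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert U.shape == (100, 5) and s.shape == (5,) and Vt.shape == (5, 5)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```

For large tall matrices the thin SVD is the one you want: it avoids storing the $m \times m$ matrix $U$ when only the first $n$ columns carry information about $A$.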

Definition

Singular Values and Eigenvalues Relationship

The singular values of $A$ are the square roots of the eigenvalues of $A^\top A$ (or equivalently, of $A A^\top$):

$$\sigma_i = \sqrt{\lambda_i(A^\top A)}$$

The right singular vectors $v_i$ are eigenvectors of $A^\top A$. The left singular vectors $u_i$ are eigenvectors of $A A^\top$.

Why this works: $A^\top A$ is symmetric and positive semidefinite, so by the spectral theorem it has non-negative real eigenvalues and orthogonal eigenvectors. The SVD inherits this structure.
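The relationship $\sigma_i^2 = \lambda_i(A^\top A)$ is easy to verify numerically; a sketch with an arbitrary small matrix:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

s = np.linalg.svd(A, compute_uv=False)   # singular values only, descending

# eigvalsh handles the symmetric matrix A^T A; it returns ascending eigenvalues.
lam = np.linalg.eigvalsh(A.T @ A)

# Squared singular values match the eigenvalues of A^T A.
assert np.allclose(np.sort(s**2), lam)
```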

Main Theorems

Theorem

Existence and Uniqueness of the SVD

Statement

Every matrix $A \in \mathbb{R}^{m \times n}$ has a singular value decomposition $A = U\Sigma V^\top$. The singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ are uniquely determined. If all singular values are distinct, the left and right singular vectors are unique up to sign.

Intuition

The SVD always exists because $A^\top A$ is always symmetric positive semidefinite, so the spectral theorem always applies to it. The SVD generalizes eigendecomposition to non-square, non-symmetric matrices by factoring through the symmetric matrices $A^\top A$ and $A A^\top$.

Proof Sketch

$A^\top A$ is symmetric positive semidefinite ($x^\top A^\top A x = \|Ax\|^2 \geq 0$). By the spectral theorem, $A^\top A = V \Lambda V^\top$ with orthogonal $V$ and $\Lambda = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ where $\sigma_i \geq 0$.

For each $\sigma_i > 0$, define $u_i = A v_i / \sigma_i$. Then $u_i$ is a unit vector: $\|u_i\|^2 = v_i^\top A^\top A v_i / \sigma_i^2 = \sigma_i^2 / \sigma_i^2 = 1$. The $u_i$ are orthonormal: $u_i^\top u_j = v_i^\top A^\top A v_j / (\sigma_i \sigma_j) = \sigma_j^2 v_i^\top v_j / (\sigma_i \sigma_j) = 0$ for $i \neq j$.

Extend $\{u_1, \ldots, u_r\}$ to an orthonormal basis of $\mathbb{R}^m$. Then $A = U\Sigma V^\top$ by construction, since $A v_i = \sigma_i u_i$ for $i = 1, \ldots, r$ and $A v_i = 0$ for $i > r$.
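The construction in the proof can be carried out directly in code. A sketch that builds the thin SVD from the eigendecomposition of $A^\top A$, assuming a full-column-rank matrix so every $u_i = A v_i / \sigma_i$ is defined:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # a random matrix; full column rank a.s.

# Spectral decomposition of A^T A, sorted by descending eigenvalue.
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(lam)

# u_i = A v_i / sigma_i, as in the proof sketch (column-wise broadcasting).
U = (A @ V) / sigma

assert np.allclose(U.T @ U, np.eye(3))   # the u_i come out orthonormal
assert np.allclose((U * sigma) @ V.T, A) # thin SVD reconstructs A
```

This is a proof-of-concept, not how libraries compute the SVD in practice: forming $A^\top A$ squares the condition number, so production algorithms work on $A$ directly.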

Why It Matters

The SVD is universal --- it works for every matrix, every shape, every rank. This makes it the default tool for understanding the geometry of linear maps. The eigendecomposition only works for square matrices and is only orthogonal for symmetric matrices. The SVD has no such limitations.

Failure Mode

When singular values are repeated ($\sigma_i = \sigma_{i+1}$), the corresponding singular vectors are not unique --- any orthonormal basis of the associated singular subspace works. This non-uniqueness can cause numerical instability or inconsistency across runs. In PCA, repeated eigenvalues mean the corresponding principal components can be any rotation within the eigenspace.

Theorem

Eckart-Young-Mirsky Theorem (Best Low-Rank Approximation)

Statement

Let $A = U\Sigma V^\top$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$. The best rank-$k$ approximation to $A$ (in both the Frobenius norm and the spectral norm) is the truncated SVD:

$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$$

The approximation errors are:

$$\|A - A_k\|_F = \sqrt{\sum_{i=k+1}^r \sigma_i^2}, \qquad \|A - A_k\|_2 = \sigma_{k+1}$$

No rank-$k$ matrix achieves a smaller error.

Intuition

The SVD sorts the "importance" of each direction by singular value. The best rank-$k$ approximation keeps the $k$ most important directions and discards the rest. The discarded singular values exactly quantify the approximation error. This is optimal --- you cannot do better by choosing any other rank-$k$ matrix.

Proof Sketch

For the spectral norm: let $B$ be any rank-$k$ matrix. Then $\operatorname{null}(B)$ has dimension $\geq n - k$. The subspace $V_k = \operatorname{span}(v_1, \ldots, v_{k+1})$ has dimension $k+1$. By dimension counting, $\operatorname{null}(B) \cap V_k$ contains a nonzero vector $w$ with $\|w\| = 1$. Then $\|A - B\|_2 \geq \|(A - B)w\| = \|Aw\| \geq \sigma_{k+1}$ (since $Bw = 0$ and $w$ lives in the span of the first $k+1$ right singular vectors). The truncated SVD $A_k$ achieves this bound with equality.

For the Frobenius norm, use the fact that $\|A\|_F^2 = \sum_i \sigma_i^2$ and apply a similar subspace argument.
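Both error formulas can be checked numerically. A sketch on a random test matrix, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Truncated SVD: keep the top-k singular triplets.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Errors predicted by Eckart-Young-Mirsky:
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```

Note the index convention: with 0-based arrays, the spectral-norm error $\sigma_{k+1}$ of the theorem is `s[k]`.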

Why It Matters

Eckart-Young is why dimensionality reduction works. PCA keeps the top $k$ principal components, which is the truncated SVD of the centered data matrix. Latent semantic analysis uses truncated SVD of the term-document matrix. Matrix completion (the Netflix problem) assumes the true rating matrix is approximately low-rank. In all these cases, Eckart-Young guarantees that truncated SVD is the optimal low-rank approximation.

The theorem also explains the "scree plot" in PCA: the singular values $\sigma_i$ tell you how much information you lose by truncating at rank $k$. A sharp drop after $\sigma_k$ means the rank-$k$ approximation is good.
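The quantity behind the scree plot is the fraction of the squared Frobenius norm ("energy") captured by each truncation rank. A sketch with made-up singular values:

```python
import numpy as np

# Hypothetical singular values with a sharp drop after the second one.
s = np.array([10.0, 5.0, 1.0, 0.5])

# Cumulative energy fraction captured by a rank-k truncation (k = 1, 2, ...).
energy = np.cumsum(s**2) / np.sum(s**2)

assert np.isclose(energy[-1], 1.0)  # keeping everything loses nothing
assert energy[1] > 0.98             # two components capture ~99% here
```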

Failure Mode

Eckart-Young gives the best approximation in a global least-squares sense. It does not guarantee that the approximation preserves specific structure (sparsity, non-negativity, interpretability). For structured approximations, you need methods like non-negative matrix factorization (NMF) or sparse PCA, which solve harder optimization problems and do not have clean closed-form solutions.

Key Applications

Pseudoinverse. The Moore-Penrose pseudoinverse of $A$ is $A^+ = V \Sigma^+ U^\top$, where $\Sigma^+$ replaces each nonzero diagonal entry $\sigma_i$ with $1/\sigma_i$. For the least-squares problem $\min_x \|Ax - b\|$, the minimum-norm solution is $x^+ = A^+ b$.
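The formula $A^+ = V \Sigma^+ U^\top$ can be compared against NumPy's built-in `pinv`; a sketch for the full-rank case, where every $\sigma_i$ is nonzero:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))   # full column rank almost surely
b = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T   # A^+ = V Sigma^+ U^T

# Matches the library pseudoinverse.
assert np.allclose(A_pinv, np.linalg.pinv(A))

# x = A^+ b satisfies the normal equations A^T A x = A^T b,
# i.e. it is the least-squares solution of min ||Ax - b||.
x = A_pinv @ b
assert np.allclose(A.T @ A @ x, A.T @ b)
```

In the rank-deficient case one inverts only the nonzero $\sigma_i$ (in floating point, those above a tolerance), which is exactly what `np.linalg.pinv` does internally.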

Condition number. The condition number of $A$ is:

$$\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_r}$$

A large condition number means $A$ is nearly singular: small perturbations to the input cause large changes in the output. The condition number controls the numerical stability of solving $Ax = b$ and the convergence rate of iterative methods applied to $A^\top A$.

Matrix norms. The spectral norm (operator norm) is $\|A\|_2 = \sigma_1$. The Frobenius norm is $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$. The nuclear norm is $\|A\|_* = \sum_i \sigma_i$.
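All three norms and the condition number fall out of the singular values directly; a sketch on a diagonal matrix where the answers are easy to read off:

```python
import numpy as np

A = np.diag([4.0, 2.0, 1.0])              # singular values are 4, 2, 1
s = np.linalg.svd(A, compute_uv=False)

assert np.isclose(np.linalg.norm(A, 2), s[0])                  # spectral: sigma_1
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))
assert np.isclose(np.linalg.norm(A, 'nuc'), np.sum(s))         # nuclear: sum
assert np.isclose(np.linalg.cond(A, 2), s[0] / s[-1])          # kappa = s1/sr
```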

Canonical Examples

Example

SVD of a 2x2 matrix

Let $A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}$. This is already in SVD form: $U = V = I$, $\Sigma = A$, with $\sigma_1 = 3$ and $\sigma_2 = 1$.

The condition number is $\kappa = 3/1 = 3$. The best rank-1 approximation is $A_1 = \sigma_1 u_1 v_1^\top = 3 e_1 e_1^\top = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix}$ with error $\|A - A_1\|_F = \sigma_2 = 1$.

Example

PCA as truncated SVD of the data matrix

Given a centered data matrix $X \in \mathbb{R}^{n \times d}$ (rows are observations), write $X = U\Sigma V^\top$. Then:

  • The sample covariance is $\hat{\Sigma} = \frac{1}{n} X^\top X = V \frac{\Sigma^2}{n} V^\top$
  • The eigenvectors of $\hat{\Sigma}$ are the columns of $V$ (the right singular vectors of $X$)
  • The eigenvalues of $\hat{\Sigma}$ are $\sigma_i^2 / n$ (squared singular values of $X$, scaled)
  • The principal component scores are $XV = U\Sigma$ (the projections onto the PCs)
  • Truncating to $k$ components uses $X_k = U_k \Sigma_k V_k^\top$ (Eckart-Young optimal)

So PCA and SVD are two views of the same computation. PCA works through the covariance matrix; SVD works directly with the data matrix.
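The equivalence of the two routes can be verified on synthetic data; a sketch assuming NumPy, with made-up anisotropic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 200 observations in 3 dimensions with unequal spread.
X = rng.standard_normal((200, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)               # center: PCA requires a centered matrix

# SVD route.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
n = X.shape[0]

# Covariance route: eigenvalues of (1/n) X^T X, sorted descending.
lam = np.linalg.eigvalsh(X.T @ X / n)[::-1]

assert np.allclose(lam, s**2 / n)    # covariance eigenvalues = sigma_i^2 / n
assert np.allclose(X @ Vt.T, U * s)  # PC scores: XV = U Sigma
```

The SVD route is usually preferred numerically: forming $X^\top X$ squares the condition number of the problem, while the SVD works on $X$ directly.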

Common Confusions

Watch Out

SVD is not the same as eigendecomposition

The eigendecomposition $A = P D P^{-1}$ requires $A$ to be square and diagonalizable; $P$ is generally not orthogonal; and eigenvalues can be negative or complex. The SVD $A = U\Sigma V^\top$ works for any matrix; $U$ and $V$ are always orthogonal; and singular values are always non-negative reals. For symmetric positive semidefinite matrices, the SVD and eigendecomposition coincide ($U = V = Q$ and $\Sigma = \Lambda$).

Watch Out

Truncated SVD is not the SVD of the truncated matrix

The truncated SVD $A_k$ is formed by taking the full SVD and zeroing out the smallest $r - k$ singular values. It is not computed by first truncating $A$ somehow and then computing an SVD. The approximation is optimal precisely because it uses the full SVD to determine which directions to keep.

Watch Out

The condition number involves the smallest nonzero singular value

The condition number $\kappa = \sigma_1 / \sigma_r$ uses the smallest nonzero singular value. If $A$ is rank-deficient (so that $\sigma_{\min(m,n)} = 0$), the condition number with respect to inversion is infinite, reflecting the fact that $A$ is singular and cannot be inverted. Some authors define $\kappa$ using $\sigma_n$ (which might be zero for tall matrices), so always check the convention.

Summary

  • SVD exists for every matrix: $A = U\Sigma V^\top$ (orthogonal, diagonal, orthogonal)
  • Geometric interpretation: rotate, scale, rotate
  • Singular values of $A$ = square roots of eigenvalues of $A^\top A$
  • Eckart-Young: truncated SVD is the best rank-$k$ approximation
  • Condition number $\kappa = \sigma_{\max}/\sigma_{\min}$ measures sensitivity to perturbations
  • PCA = truncated SVD of the centered data matrix
  • Pseudoinverse $A^+ = V \Sigma^+ U^\top$ gives the minimum-norm least-squares solution

Exercises

ExerciseCore

Problem

Compute the SVD of $A = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$. What is its rank? What is the best rank-1 approximation?

ExerciseAdvanced

Problem

Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^\top$. Show that the nuclear norm $\|A\|_* = \sum_i \sigma_i$ equals $\max\{\operatorname{tr}(B^\top A) : \|B\|_2 \leq 1\}$. Why does the nuclear norm arise as the convex relaxation of the rank function in matrix completion problems?

References

Canonical:

  • Golub & Van Loan, Matrix Computations (4th ed., 2013), Chapter 2
  • Horn & Johnson, Matrix Analysis (2nd ed., 2012), Chapter 7
  • Strang, Linear Algebra and Its Applications (4th ed., 2006), Chapter 6

Current:

  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 4-5
  • Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011) --- randomized SVD

Next Topics

Building on the SVD:

  • Principal component analysis: SVD of the data matrix in action
  • Conditioning and condition number: the ratio $\sigma_1/\sigma_r$ and numerical stability

Last reviewed: April 2026
