

Numerical Linear Algebra

Algorithms for solving linear systems, computing eigenvalues, and factoring matrices. Every linear regression, PCA, and SVD computation depends on these methods.


Why This Matters

Linear algebra is computed, not just stated. When you call np.linalg.solve, sklearn's linear regression, or torch.svd, the computer runs one of the algorithms described here. Choosing the right algorithm and understanding when numerical issues arise (ill-conditioning, loss of orthogonality, pivoting failures) is the difference between getting correct results and getting garbage.

Mental Model

There are two families of methods. Direct methods (LU, Cholesky, QR) produce exact answers (up to floating-point roundoff) in a fixed number of operations, typically $O(n^3)$. Iterative methods (conjugate gradient, GMRES, power iteration) produce a sequence of improving approximations and are preferred when $n$ is large and the matrix is sparse or structured.

Solving Linear Systems: $Ax = b$

Definition

LU Factorization

Factor $A \in \mathbb{R}^{n \times n}$ as $A = LU$ where $L$ is lower triangular and $U$ is upper triangular. Then solve $Ly = b$ (forward substitution, $O(n^2)$) followed by $Ux = y$ (back substitution, $O(n^2)$). The factorization itself costs $O(n^3/3)$ operations. In practice, use partial pivoting: $PA = LU$ where $P$ is a permutation matrix. Pivoting prevents division by small numbers and is needed for numerical stability.
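As a sketch of how this is used in practice (the matrix and right-hand side here are arbitrary test data), SciPy exposes the pivoted factorization directly, so the $O(n^3)$ factorization can be reused across many right-hand sides:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

lu, piv = lu_factor(A)        # PA = LU with partial pivoting: O(n^3), done once
x = lu_solve((lu, piv), b)    # forward + back substitution: O(n^2) per rhs
residual = np.linalg.norm(A @ x - b)
```

Calling `lu_solve` again with a different `b` reuses the stored factors, which is why factoring and solving are separated in the API.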

Definition

Cholesky Factorization

If $A$ is symmetric positive definite (SPD), factor $A = LL^\top$ where $L$ is lower triangular with positive diagonal entries. Cost: $O(n^3/6)$, half of LU. The Cholesky factorization exists if and only if $A$ is SPD. It is the preferred method for solving SPD systems (e.g., normal equations in least squares, covariance matrices, kernel matrices).
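A minimal sketch with SciPy (the SPD matrix is constructed for illustration as $MM^\top + 4I$, which is SPD by construction):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)   # SPD by construction
b = rng.standard_normal(4)

c, low = cho_factor(A)          # A = L L^T, roughly half the cost of LU
x = cho_solve((c, low), b)      # two triangular solves
```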

Definition

QR Factorization

Factor $A \in \mathbb{R}^{m \times n}$ (with $m \geq n$) as $A = QR$ where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper triangular. The least squares solution minimizing $\|Ax - b\|_2^2$ is $x = R^{-1}Q^\top b$. QR is more numerically stable than solving the normal equations $A^\top A x = A^\top b$ via Cholesky, because $\kappa(A^\top A) = \kappa(A)^2$.
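The formula $x = R^{-1}Q^\top b$ translates directly into a reduced QR factorization followed by one back substitution (the overdetermined system here is random test data):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 3))   # m > n: overdetermined system
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)             # reduced QR: Q is 20x3, R is 3x3
x = solve_triangular(R, Q.T @ b)   # back substitution, never forms A^T A
```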

Main Theorems

Theorem

Existence and Uniqueness of Cholesky Factorization

Statement

If $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, there exists a unique lower triangular matrix $L$ with positive diagonal entries such that $A = LL^\top$.

Intuition

SPD matrices have all positive eigenvalues. The Cholesky factor $L$ is a "square root" of $A$. Uniqueness comes from requiring positive diagonal entries, which pins down the sign ambiguity in any square root.

Proof Sketch

By induction on $n$. Write $A$ in block form: $A = \begin{bmatrix} a_{11} & v^\top \\ v & A' \end{bmatrix}$ where $a_{11} > 0$ (since $A$ is SPD). Set $l_{11} = \sqrt{a_{11}}$, $l = v/l_{11}$, and recurse on the Schur complement $A' - ll^\top$, which is also SPD. The positive diagonal follows from $a_{11} > 0$ at each step.

Why It Matters

Cholesky is the fastest direct solver for SPD systems, twice as fast as LU and guaranteed to succeed without pivoting. In ML: kernel matrices, precision matrices, and the normal equations matrix $X^\top X$ are all SPD (assuming full rank). Cholesky factorization of $X^\top X$ is the standard method for ordinary least squares in many libraries.

Failure Mode

If $A$ is not positive definite (has a zero or negative eigenvalue), Cholesky will fail: the algorithm attempts to take the square root of a non-positive number. This is actually useful as a diagnostic: a failed Cholesky factorization indicates the matrix is not SPD. Near-singular SPD matrices ($\lambda_{\min} \approx 0$) produce ill-conditioned triangular factors, and the subsequent solves amplify roundoff errors.
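This diagnostic use can be wrapped in a small helper (`is_spd` is a hypothetical name for illustration; NumPy's Cholesky raises `LinAlgError` on non-PD input):

```python
import numpy as np

def is_spd(A):
    """Diagnose symmetric positive definiteness via a Cholesky attempt."""
    if not np.allclose(A, A.T):      # must be symmetric first
        return False
    try:
        np.linalg.cholesky(A)        # succeeds iff A is (numerically) PD
        return True
    except np.linalg.LinAlgError:
        return False
```

This is faster than computing eigenvalues when all you need is a yes/no answer.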

Theorem

Condition Number and Perturbation Bound

Statement

For a nonsingular matrix $A$, if $Ax = b$ and $A(x + \delta x) = b + \delta b$, then:

$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A)\,\frac{\|\delta b\|}{\|b\|}$$

where $\kappa(A) = \|A\| \cdot \|A^{-1}\|$ is the condition number. For the 2-norm, $\kappa_2(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$.

Intuition

The condition number measures the worst-case amplification of relative errors. A small perturbation in $b$ can be amplified by up to $\kappa(A)$ in the solution $x$. If $\kappa(A) = 10^k$, you lose about $k$ digits of accuracy in the solution.

Proof Sketch

$\delta x = A^{-1}\delta b$, so $\|\delta x\| \leq \|A^{-1}\| \|\delta b\|$. Also $\|b\| = \|Ax\| \leq \|A\| \|x\|$, so $1/\|x\| \leq \|A\|/\|b\|$. Combining: $\|\delta x\|/\|x\| \leq \|A^{-1}\| \|A\| \|\delta b\|/\|b\|$.

Why It Matters

This bound explains why some linear systems are "hard" even though $A$ is invertible in exact arithmetic. In ML, the Hessian of the loss function can be ill-conditioned, making Newton's method unstable without regularization. Kernel matrices with small eigenvalues produce ill-conditioned systems. Adding a ridge term $\lambda I$ to an SPD matrix $A$ improves the condition number from $\sigma_{\max}/\sigma_{\min}$ to $(\sigma_{\max} + \lambda)/(\sigma_{\min} + \lambda)$.
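The ridge effect is easy to see numerically (the eigenvalues below are arbitrary, chosen so one is tiny):

```python
import numpy as np

# Illustrative SPD matrix with one tiny eigenvalue
eigvals = np.array([1e4, 1.0, 1e-6])
A = np.diag(eigvals)
lam = 1e-2                                        # ridge strength

kappa = np.linalg.cond(A)                         # sigma_max / sigma_min = 1e10
kappa_ridge = np.linalg.cond(A + lam * np.eye(3)) # (1e4 + 1e-2) / (1e-6 + 1e-2)
```

The shift barely changes the large eigenvalues but lifts the smallest one off zero, which is exactly where the conditioning damage comes from.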

Failure Mode

The bound is tight: there exist perturbations $\delta b$ that achieve equality. But for "typical" perturbations, the actual error may be much smaller than the bound suggests. The condition number is a worst-case measure. Also, the bound applies to the problem, not to any specific algorithm. A backward-stable algorithm (like LU with partial pivoting) solves a nearby problem exactly, but whether the nearby problem has a nearby solution depends on $\kappa(A)$.

Iterative Methods

Conjugate Gradient (CG)

For SPD systems $Ax = b$, CG generates iterates $x_k$ that minimize $\|x - x_k\|_A = \sqrt{(x - x_k)^\top A (x - x_k)}$ over the Krylov subspace $\mathcal{K}_k(A, b) = \operatorname{span}\{b, Ab, \ldots, A^{k-1}b\}$. CG converges in at most $n$ iterations in exact arithmetic. In practice, the convergence rate depends on $\kappa(A)$:

$$\|x - x_k\|_A \leq 2\left(\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}\right)^k \|x - x_0\|_A$$

Preconditioning (replacing $Ax = b$ with $M^{-1}Ax = M^{-1}b$ for a good approximation $M \approx A$) reduces the effective condition number.
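A minimal, unpreconditioned CG sketch (illustrative, not production code; the test system at the bottom is random):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Minimal unpreconditioned CG for SPD A."""
    x = np.zeros_like(b)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)          # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p      # keep directions A-conjugate
        rs = rs_new
    return x

rng = np.random.default_rng(4)
M = rng.standard_normal((30, 30))
A = M @ M.T + 30.0 * np.eye(30)        # well-conditioned SPD test matrix
b = rng.standard_normal(30)
x = conjugate_gradient(A, b)
```

Note that the only access to $A$ is through matrix-vector products, which is why CG extends naturally to sparse and matrix-free settings.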

GMRES

For general (non-symmetric) systems, the Generalized Minimal Residual method (GMRES) minimizes $\|b - Ax_k\|_2$ over the Krylov subspace. Each iteration requires storing one additional basis vector, so memory grows linearly with the iteration count. Restarted GMRES (GMRES($m$)) limits memory by restarting after $m$ iterations.

Eigenvalue and SVD Computation

Power Iteration

To find the dominant eigenvalue of $A$: start with $v_0$ and iterate $v_{k+1} = Av_k / \|Av_k\|$. Converges to the eigenvector for the largest $|\lambda|$ at rate $|\lambda_2/\lambda_1|$ per iteration. Cheap per iteration ($O(n^2)$ for dense, $O(\mathrm{nnz})$ for sparse) but slow if the eigenvalue gap is small.
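The iteration fits in a few lines (a sketch; the diagonal test matrix is chosen so the answer is known):

```python
import numpy as np

def power_iteration(A, iters=200, seed=0):
    """Estimate the dominant eigenpair of A by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)      # renormalize to avoid overflow
    return v @ A @ v, v                # Rayleigh quotient, eigenvector

A = np.diag([5.0, 2.0, 1.0])           # convergence rate |λ2/λ1| = 0.4
lam, v = power_iteration(A)
```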

QR Algorithm

The standard method for computing all eigenvalues of a dense matrix. Apply shifted QR iterations: at step $k$, compute $A_k - \mu_k I = Q_k R_k$, then set $A_{k+1} = R_k Q_k + \mu_k I$. With implicit shifts and deflation, it converges in $O(n^3)$ total work. First reduce to Hessenberg form ($O(n^3)$); then each QR step costs $O(n^2)$.

SVD Computation

The SVD $A = U\Sigma V^\top$ is computed via Golub-Kahan bidiagonalization: reduce $A$ to bidiagonal form, then apply the QR algorithm to the bidiagonal matrix. Total cost: $O(mn^2)$ for an $m \times n$ matrix with $m \geq n$. A truncated SVD (only the top $k$ singular values) can be computed iteratively in $O(mnk)$ using Lanczos or randomized methods.
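For the truncated case, SciPy's `svds` uses a Lanczos-style iterative solver and never forms the full decomposition (the matrix here is random test data):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 40))

k = 5
U, s, Vt = svds(A, k=k)                      # top-k triplets, ascending order
s_full = np.linalg.svd(A, compute_uv=False)  # full SVD for comparison
```

On large sparse matrices this is the difference between feasible and infeasible: `svds` only touches `A` through matrix-vector products.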

Numerical Stability

Why Naive Gram-Schmidt Fails

Classical Gram-Schmidt orthogonalization computes $q_k = a_k - \sum_{j<k} (q_j^\top a_k) q_j$ and normalizes. In floating point, the computed $q_k$ can lose orthogonality badly: for classical Gram-Schmidt the loss scales like $\kappa(A)^2 \epsilon_{\text{mach}}$, so with $\kappa(A) \approx 10^8$ and machine precision $10^{-16}$, orthogonality can be lost entirely. Modified Gram-Schmidt applies the projections sequentially, updating the working vector immediately after each subtraction, and achieves $\|Q^\top Q - I\| = O(\kappa(A)\epsilon_{\text{mach}})$ (about $10^{-8}$ in this example). Householder QR achieves $O(\epsilon_{\text{mach}})$ regardless of conditioning.
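The gap between the two variants can be demonstrated directly. The sketch below runs both on a Hilbert matrix, a standard ill-conditioned example (function names are illustrative):

```python
import numpy as np

def classical_gs(A):
    """Classical Gram-Schmidt: all projections computed from the original a_k."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for k in range(n):
        q = A[:, k] - Q[:, :k] @ (Q[:, :k].T @ A[:, k])
        Q[:, k] = q / np.linalg.norm(q)
    return Q

def modified_gs(A):
    """Modified Gram-Schmidt: working vectors updated after each projection."""
    m, n = A.shape
    Q = A.astype(float).copy()
    for k in range(n):
        Q[:, k] /= np.linalg.norm(Q[:, k])
        for j in range(k + 1, n):
            Q[:, j] -= (Q[:, k] @ Q[:, j]) * Q[:, k]
    return Q

# Hilbert matrix: kappa ~ 1e13 for n = 10
n = 10
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
err_cgs = np.linalg.norm(classical_gs(A).T @ classical_gs(A) - np.eye(n))
err_mgs = np.linalg.norm(modified_gs(A).T @ modified_gs(A) - np.eye(n))
```

In this kind of experiment the classical variant typically loses orthogonality by many orders of magnitude more than the modified variant.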

Pivoting in LU

Without pivoting, LU factorization can amplify roundoff by a factor exponential in $n$. Partial pivoting (swapping rows to put the largest entry in the pivot position) bounds the growth factor by $2^{n-1}$ in theory, but in practice keeps it near 1 for all but pathological matrices.

Canonical Examples

Example

Least squares: QR vs normal equations

To solve $\min_x \|Ax - b\|^2$ with $A \in \mathbb{R}^{m \times n}$, $m > n$:

Normal equations: Form $A^\top A$ and solve $A^\top A x = A^\top b$ via Cholesky. Cost: $O(mn^2 + n^3/6)$. Problem: $\kappa(A^\top A) = \kappa(A)^2$. If $\kappa(A) = 10^8$, solving the normal equations loses all 16 digits of double precision.

QR: Factor $A = QR$ and solve $Rx = Q^\top b$. Cost: $O(2mn^2)$. The condition number of $R$ is $\kappa(A)$, not $\kappa(A)^2$. This is the numerically stable choice when $A$ is ill-conditioned.
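Both routes can be compared on an ill-conditioned problem. The sketch below uses a Vandermonde matrix, chosen here purely as a convenient ill-conditioned example, with an SVD-based solve as the accuracy reference:

```python
import numpy as np
from scipy.linalg import solve_triangular

# Ill-conditioned Vandermonde least-squares problem
m, n = 50, 10
t = np.linspace(0.0, 1.0, m)
A = np.vander(t, n)
b = np.sin(4.0 * t)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]    # SVD-based reference solution

# Route 1: normal equations via Cholesky (condition number squared)
L = np.linalg.cholesky(A.T @ A)
y = solve_triangular(L, A.T @ b, lower=True)
x_ne = solve_triangular(L.T, y)

# Route 2: Householder QR (condition number not squared)
Q, R = np.linalg.qr(A)
x_qr = solve_triangular(R, Q.T @ b)

err_ne = np.linalg.norm(x_ne - x_ref)
err_qr = np.linalg.norm(x_qr - x_ref)
```

With a well-conditioned $A$, both routes agree; the gap opens up exactly as $\kappa(A)^2$ approaches $1/\epsilon_{\text{mach}}$.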

Common Confusions

Watch Out

Direct methods are not always slower than iterative methods

For dense $n \times n$ systems, Cholesky costs $O(n^3/6)$ and is hard to beat. Iterative methods shine when $n$ is large and the matrix is sparse (so $Av$ costs $O(\mathrm{nnz})$ instead of $O(n^2)$) or when you only need an approximate solution. For a dense $1000 \times 1000$ SPD system, Cholesky is faster than CG.

Watch Out

Condition number is a property of the problem, not the algorithm

A backward-stable algorithm solves a slightly perturbed problem exactly. Whether this perturbed problem has a nearby solution depends on $\kappa(A)$. No algorithm can overcome ill-conditioning; regularization (adding $\lambda I$ to $A$) changes the problem to a better-conditioned one.

Watch Out

SVD and eigendecomposition are different computations

SVD applies to any $m \times n$ matrix and always exists. Eigendecomposition requires a square matrix and may not exist (defective matrices). For symmetric matrices, SVD and eigendecomposition coincide (singular values are the absolute values of the eigenvalues). For non-symmetric or rectangular matrices, they are different.

Summary

  • LU: $O(n^3/3)$, needs pivoting, works for general square systems
  • Cholesky: $O(n^3/6)$, SPD only, no pivoting needed, fastest direct method
  • QR: $O(2mn^2)$, best for least squares, avoids squaring the condition number
  • CG: iterative, SPD only, convergence depends on $\sqrt{\kappa(A)}$
  • $\kappa(A) = \sigma_{\max}/\sigma_{\min}$: large means the problem amplifies errors
  • Regularization ($A + \lambda I$) reduces the condition number

Exercises

ExerciseCore

Problem

A $3 \times 3$ SPD matrix has eigenvalues $\lambda_1 = 100$, $\lambda_2 = 10$, $\lambda_3 = 0.01$. What is $\kappa_2(A)$? If the right-hand side $b$ has a relative error of $10^{-10}$, what is the worst-case relative error in the solution $x$?

ExerciseAdvanced

Problem

Show that the convergence rate of conjugate gradient on an SPD system $Ax = b$ satisfies the bound $\|x - x_k\|_A \leq 2\left((\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)\right)^k \|x - x_0\|_A$. How many CG iterations are needed to reduce the error by a factor of $10^6$ when $\kappa(A) = 10^4$?

References

Canonical:

  • Trefethen & Bau, Numerical Linear Algebra (1997), Chapters 10-40
  • Golub & Van Loan, Matrix Computations (2013), Chapters 3-8

Current:

  • Nocedal & Wright, Numerical Optimization (2006), Chapter 5 (CG for optimization)
  • Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011)

Next Topics

  • Conjugate gradient methods: CG for optimization (not just linear systems)
  • Randomized linear algebra: sketching and randomized SVD for large-scale problems

Last reviewed: April 2026
