

Numerical Linear Algebra

Algorithms for solving linear systems, computing eigenvalues, and factoring matrices. Every linear regression, PCA, and SVD computation depends on these methods.


Why This Matters

Linear algebra is computed, not just stated. When you call np.linalg.solve, sklearn's linear regression, or torch.svd, the computer runs one of the algorithms described here. Choosing the right algorithm and understanding when numerical issues arise (ill-conditioning, loss of orthogonality, pivoting failures) is the difference between getting correct results and getting garbage.

Mental Model

There are two families of methods. Direct methods (LU, Cholesky, QR) produce exact answers (up to floating-point roundoff) in a fixed number of operations, typically $O(n^3)$. Iterative methods (conjugate gradient, GMRES, power iteration) produce a sequence of improving approximations and are preferred when $n$ is large and the matrix is sparse or structured.

Solving Linear Systems: $Ax = b$

Definition

LU Factorization

Factor $A \in \mathbb{R}^{n \times n}$ as $A = LU$ where $L$ is lower triangular and $U$ is upper triangular. Then solve $Ly = b$ (forward substitution, $O(n^2)$) followed by $Ux = y$ (back substitution, $O(n^2)$). The factorization itself costs $O(n^3/3)$ operations. In practice, use partial pivoting: $PA = LU$ where $P$ is a permutation matrix. Pivoting prevents division by small numbers and is needed for numerical stability.
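As a sketch of how this is used in practice (the matrix and right-hand side here are arbitrary test data), SciPy exposes the pivoted factorization directly, so the $O(n^3)$ factorization can be reused across many right-hand sides:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

lu, piv = lu_factor(A)        # PA = LU with partial pivoting: O(n^3), done once
x = lu_solve((lu, piv), b)    # forward + back substitution: O(n^2) per rhs
residual = np.linalg.norm(A @ x - b)
```

Calling `lu_solve` again with a different `b` reuses the stored factors, which is why factoring and solving are separated in the API.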

Definition

Cholesky Factorization

If $A$ is symmetric positive definite (SPD), factor $A = LL^\top$ where $L$ is lower triangular with positive diagonal entries. Cost: $O(n^3/6)$, half of LU. The Cholesky factorization exists if and only if $A$ is SPD. It is the preferred method for solving SPD systems (e.g., normal equations in least squares, covariance matrices, kernel matrices).
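A minimal sketch with SciPy (the SPD matrix is constructed for illustration as $MM^\top + 4I$, which is SPD by construction):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)   # SPD by construction
b = rng.standard_normal(4)

c, low = cho_factor(A)          # A = L L^T, roughly half the cost of LU
x = cho_solve((c, low), b)      # two triangular solves
```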

Definition

QR Factorization

Factor $A \in \mathbb{R}^{m \times n}$ (with $m \geq n$) as $A = QR$ where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper triangular. The least squares solution minimizing $\|Ax - b\|_2^2$ is $x = R^{-1}Q^\top b$. QR is more numerically stable than solving the normal equations $A^\top A x = A^\top b$ via Cholesky, because $\kappa(A^\top A) = \kappa(A)^2$.
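The formula $x = R^{-1}Q^\top b$ translates directly into a reduced QR factorization followed by one back substitution (the overdetermined system here is random test data):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 3))   # m > n: overdetermined system
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)             # reduced QR: Q is 20x3, R is 3x3
x = solve_triangular(R, Q.T @ b)   # back substitution, never forms A^T A
```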

Main Theorems

Theorem

Existence and Uniqueness of Cholesky Factorization

Statement

If $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, there exists a unique lower triangular matrix $L$ with positive diagonal entries such that $A = LL^\top$.

Intuition

SPD matrices have all positive eigenvalues. The Cholesky factor $L$ is a "square root" of $A$. Uniqueness comes from requiring positive diagonal entries, which pins down the sign ambiguity in any square root.

Proof Sketch

By induction on $n$. Write $A$ in block form: $A = \begin{bmatrix} a_{11} & v^\top \\ v & A' \end{bmatrix}$ where $a_{11} > 0$ (since $A$ is SPD). Set $l_{11} = \sqrt{a_{11}}$, $l = v/l_{11}$, and recurse on the Schur complement $A' - ll^\top$, which is also SPD. The positive diagonal follows from $a_{11} > 0$ at each step.

Why It Matters

Cholesky is the fastest direct solver for SPD systems, twice as fast as LU and guaranteed to succeed without pivoting. In ML: kernel matrices, precision matrices, and the normal equations matrix $X^\top X$ are all SPD (assuming full rank). Cholesky factorization of $X^\top X$ is the standard method for ordinary least squares in many libraries.

Failure Mode

If $A$ is not positive definite (has a zero or negative eigenvalue), Cholesky will fail: the algorithm attempts to take the square root of a non-positive number. This is actually useful as a diagnostic: a failed Cholesky factorization indicates the matrix is not SPD. Near-singular SPD matrices ($\lambda_{\min} \approx 0$) produce ill-conditioned triangular factors, and the subsequent solves amplify roundoff errors.
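This diagnostic use can be wrapped in a small helper (`is_spd` is a hypothetical name for illustration; NumPy's Cholesky raises `LinAlgError` on non-PD input):

```python
import numpy as np

def is_spd(A):
    """Diagnose symmetric positive definiteness via a Cholesky attempt."""
    if not np.allclose(A, A.T):      # must be symmetric first
        return False
    try:
        np.linalg.cholesky(A)        # succeeds iff A is (numerically) PD
        return True
    except np.linalg.LinAlgError:
        return False
```

This is faster than computing eigenvalues when all you need is a yes/no answer.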

Theorem

Condition Number and Perturbation Bound

Statement

For a nonsingular matrix $A$, if $Ax = b$ and $A(x + \delta x) = b + \delta b$, then:

$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A)\,\frac{\|\delta b\|}{\|b\|}$$

where $\kappa(A) = \|A\| \cdot \|A^{-1}\|$ is the condition number. For the 2-norm, $\kappa_2(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$.

Intuition

The condition number measures the worst-case amplification of relative errors. A small perturbation in $b$ can be amplified by up to $\kappa(A)$ in the solution $x$. If $\kappa(A) = 10^k$, you lose about $k$ digits of accuracy in the solution.

Proof Sketch

$\delta x = A^{-1}\delta b$, so $\|\delta x\| \leq \|A^{-1}\| \|\delta b\|$. Also $\|b\| = \|Ax\| \leq \|A\| \|x\|$, so $1/\|x\| \leq \|A\|/\|b\|$. Combining: $\|\delta x\|/\|x\| \leq \|A^{-1}\| \|A\| \|\delta b\|/\|b\|$.

Why It Matters

This bound explains why some linear systems are "hard" even though $A$ is invertible in exact arithmetic. In ML, the Hessian of the loss function can be ill-conditioned, making Newton's method unstable without regularization. Kernel matrices with small eigenvalues produce ill-conditioned systems. Adding a ridge term $\lambda I$ to an SPD matrix $A$ improves the condition number from $\sigma_{\max}/\sigma_{\min}$ to $(\sigma_{\max} + \lambda)/(\sigma_{\min} + \lambda)$.
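The ridge effect is easy to see numerically (the eigenvalues below are arbitrary, chosen so one is tiny):

```python
import numpy as np

# Illustrative SPD matrix with one tiny eigenvalue
eigvals = np.array([1e4, 1.0, 1e-6])
A = np.diag(eigvals)
lam = 1e-2                                        # ridge strength

kappa = np.linalg.cond(A)                         # sigma_max / sigma_min = 1e10
kappa_ridge = np.linalg.cond(A + lam * np.eye(3)) # (1e4 + 1e-2) / (1e-6 + 1e-2)
```

The shift barely changes the large eigenvalues but lifts the smallest one off zero, which is exactly where the conditioning damage comes from.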

Failure Mode

The bound is tight: there exist perturbations $\delta b$ that achieve equality. But for "typical" perturbations, the actual error may be much smaller than the bound suggests. The condition number is a worst-case measure. Also, the bound applies to the problem, not to any specific algorithm. A backward-stable algorithm (like LU with partial pivoting) solves a nearby problem exactly, but whether the nearby problem has a nearby solution depends on $\kappa(A)$.

Iterative Methods

Conjugate Gradient (CG)

For SPD systems $Ax = b$, CG generates iterates $x_k$ that minimize $\|x - x_k\|_A = \sqrt{(x - x_k)^\top A (x - x_k)}$ over the Krylov subspace $\mathcal{K}_k(A, b) = \operatorname{span}\{b, Ab, \ldots, A^{k-1}b\}$. CG converges in at most $n$ iterations in exact arithmetic. In practice, the convergence rate depends on $\kappa(A)$:

$$\|x - x_k\|_A \leq 2\left(\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}\right)^k \|x - x_0\|_A$$

Preconditioning (replacing $Ax = b$ with $M^{-1}Ax = M^{-1}b$ for a good approximation $M \approx A$) reduces the effective condition number.
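A minimal, unpreconditioned CG sketch (illustrative, not production code; the test system at the bottom is random):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Minimal unpreconditioned CG for SPD A."""
    x = np.zeros_like(b)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)          # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p      # keep directions A-conjugate
        rs = rs_new
    return x

rng = np.random.default_rng(4)
M = rng.standard_normal((30, 30))
A = M @ M.T + 30.0 * np.eye(30)        # well-conditioned SPD test matrix
b = rng.standard_normal(30)
x = conjugate_gradient(A, b)
```

Note that the only access to $A$ is through matrix-vector products, which is why CG extends naturally to sparse and matrix-free settings.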

GMRES

For general (non-symmetric) systems, the Generalized Minimal Residual method (GMRES) minimizes $\|b - Ax_k\|_2$ over the Krylov subspace. Each iteration requires storing one additional basis vector, so memory grows linearly with the iteration count. Restarted GMRES (GMRES($m$)) limits memory by restarting after $m$ iterations.

Eigenvalue and SVD Computation

Power Iteration

To find the dominant eigenvalue of $A$: start with $v_0$ and iterate $v_{k+1} = Av_k / \|Av_k\|$. Converges to the eigenvector for the largest $|\lambda|$ at rate $|\lambda_2/\lambda_1|$ per iteration. Cheap per iteration ($O(n^2)$ for dense, $O(\mathrm{nnz})$ for sparse) but slow if the eigenvalue gap is small.
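The iteration fits in a few lines (a sketch; the diagonal test matrix is chosen so the answer is known):

```python
import numpy as np

def power_iteration(A, iters=200, seed=0):
    """Estimate the dominant eigenpair of A by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)      # renormalize to avoid overflow
    return v @ A @ v, v                # Rayleigh quotient, eigenvector

A = np.diag([5.0, 2.0, 1.0])           # convergence rate |λ2/λ1| = 0.4
lam, v = power_iteration(A)
```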

QR Algorithm

The standard method for computing all eigenvalues of a dense matrix. Apply shifted QR iterations: at step $k$, compute $A_k - \mu_k I = Q_k R_k$, then set $A_{k+1} = R_k Q_k + \mu_k I$. With implicit shifts and deflation, it converges in $O(n^3)$ total work. First reduce to Hessenberg form ($O(n^3)$); then each QR step costs $O(n^2)$.

SVD Computation

The SVD $A = U\Sigma V^\top$ is computed via Golub-Kahan bidiagonalization: reduce $A$ to bidiagonal form, then apply the QR algorithm to the bidiagonal matrix. Total cost: $O(mn^2)$ for an $m \times n$ matrix with $m \geq n$. A truncated SVD (only the top $k$ singular values) can be computed iteratively in $O(mnk)$ using Lanczos or randomized methods.
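For the truncated case, SciPy's `svds` uses a Lanczos-style iterative solver and never forms the full decomposition (the matrix here is random test data):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 40))

k = 5
U, s, Vt = svds(A, k=k)                      # top-k triplets, ascending order
s_full = np.linalg.svd(A, compute_uv=False)  # full SVD for comparison
```

On large sparse matrices this is the difference between feasible and infeasible: `svds` only touches `A` through matrix-vector products.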

Numerical Stability

Why Naive Gram-Schmidt Fails

Classical Gram-Schmidt orthogonalization computes $q_k = a_k - \sum_{j<k} (q_j^\top a_k) q_j$ and normalizes. In floating point, the computed $q_k$ can lose orthogonality badly: for classical Gram-Schmidt the loss scales like $\kappa(A)^2 \epsilon_{\text{mach}}$, so with $\kappa(A) \approx 10^8$ and machine precision $10^{-16}$, orthogonality can be lost entirely. Modified Gram-Schmidt applies the projections sequentially, updating the working vector immediately after each subtraction, and achieves $\|Q^\top Q - I\| = O(\kappa(A)\epsilon_{\text{mach}})$ (about $10^{-8}$ in this example). Householder QR achieves $O(\epsilon_{\text{mach}})$ regardless of conditioning.
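The gap between the two variants can be demonstrated directly. The sketch below runs both on a Hilbert matrix, a standard ill-conditioned example (function names are illustrative):

```python
import numpy as np

def classical_gs(A):
    """Classical Gram-Schmidt: all projections computed from the original a_k."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for k in range(n):
        q = A[:, k] - Q[:, :k] @ (Q[:, :k].T @ A[:, k])
        Q[:, k] = q / np.linalg.norm(q)
    return Q

def modified_gs(A):
    """Modified Gram-Schmidt: working vectors updated after each projection."""
    m, n = A.shape
    Q = A.astype(float).copy()
    for k in range(n):
        Q[:, k] /= np.linalg.norm(Q[:, k])
        for j in range(k + 1, n):
            Q[:, j] -= (Q[:, k] @ Q[:, j]) * Q[:, k]
    return Q

# Hilbert matrix: kappa ~ 1e13 for n = 10
n = 10
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
err_cgs = np.linalg.norm(classical_gs(A).T @ classical_gs(A) - np.eye(n))
err_mgs = np.linalg.norm(modified_gs(A).T @ modified_gs(A) - np.eye(n))
```

In this kind of experiment the classical variant typically loses orthogonality by many orders of magnitude more than the modified variant.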

Pivoting in LU

Without pivoting, LU factorization can amplify roundoff by a factor exponential in $n$. Partial pivoting (swapping rows to put the largest entry in the pivot position) bounds the growth factor by $2^{n-1}$ in theory, but in practice keeps it near 1 for all but pathological matrices.

Canonical Examples

Example

Least squares: QR vs normal equations

To solve $\min_x \|Ax - b\|^2$ with $A \in \mathbb{R}^{m \times n}$, $m > n$:

Normal equations: Form $A^\top A$ and solve $A^\top A x = A^\top b$ via Cholesky. Cost: $O(mn^2 + n^3/6)$. Problem: $\kappa(A^\top A) = \kappa(A)^2$. If $\kappa(A) = 10^8$, solving the normal equations loses all 16 digits of double precision.

QR: Factor $A = QR$ and solve $Rx = Q^\top b$. Cost: $O(2mn^2)$. The condition number of $R$ is $\kappa(A)$, not $\kappa(A)^2$. This is the numerically stable choice when $A$ is ill-conditioned.
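Both routes can be compared on an ill-conditioned problem. The sketch below uses a Vandermonde matrix, chosen here purely as a convenient ill-conditioned example, with an SVD-based solve as the accuracy reference:

```python
import numpy as np
from scipy.linalg import solve_triangular

# Ill-conditioned Vandermonde least-squares problem
m, n = 50, 10
t = np.linspace(0.0, 1.0, m)
A = np.vander(t, n)
b = np.sin(4.0 * t)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]    # SVD-based reference solution

# Route 1: normal equations via Cholesky (condition number squared)
L = np.linalg.cholesky(A.T @ A)
y = solve_triangular(L, A.T @ b, lower=True)
x_ne = solve_triangular(L.T, y)

# Route 2: Householder QR (condition number not squared)
Q, R = np.linalg.qr(A)
x_qr = solve_triangular(R, Q.T @ b)

err_ne = np.linalg.norm(x_ne - x_ref)
err_qr = np.linalg.norm(x_qr - x_ref)
```

With a well-conditioned $A$, both routes agree; the gap opens up exactly as $\kappa(A)^2$ approaches $1/\epsilon_{\text{mach}}$.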

Common Confusions

Watch Out

Direct methods are not always slower than iterative methods

For dense $n \times n$ systems, Cholesky costs $O(n^3/6)$ and is hard to beat. Iterative methods shine when $n$ is large and the matrix is sparse (so $Av$ costs $O(\mathrm{nnz})$ instead of $O(n^2)$) or when you only need an approximate solution. For a dense $1000 \times 1000$ SPD system, Cholesky is faster than CG.

Watch Out

Condition number is a property of the problem, not the algorithm

A backward-stable algorithm solves a slightly perturbed problem exactly. Whether this perturbed problem has a nearby solution depends on $\kappa(A)$. No algorithm can overcome ill-conditioning; regularization (adding $\lambda I$ to $A$) changes the problem to a better-conditioned one.

Watch Out

SVD and eigendecomposition are different computations

SVD applies to any $m \times n$ matrix and always exists. Eigendecomposition requires a square matrix and may not exist (defective matrices). For symmetric matrices, SVD and eigendecomposition coincide (singular values are the absolute values of the eigenvalues). For non-symmetric or rectangular matrices, they are different.

Summary

  • LU: $O(n^3/3)$, needs pivoting, works for general square systems
  • Cholesky: $O(n^3/6)$, SPD only, no pivoting needed, fastest direct method
  • QR: $O(2mn^2)$, best for least squares, avoids squaring the condition number
  • CG: iterative, SPD only, convergence depends on $\sqrt{\kappa(A)}$
  • $\kappa(A) = \sigma_{\max}/\sigma_{\min}$: large means the problem amplifies errors
  • Regularization ($A + \lambda I$) reduces the condition number

Exercises

ExerciseCore

Problem

A $3 \times 3$ SPD matrix has eigenvalues $\lambda_1 = 100$, $\lambda_2 = 10$, $\lambda_3 = 0.01$. What is $\kappa_2(A)$? If the right-hand side $b$ has a relative error of $10^{-10}$, what is the worst-case relative error in the solution $x$?

ExerciseAdvanced

Problem

Show that the convergence rate of conjugate gradient on an SPD system $Ax = b$ satisfies the bound $\|x - x_k\|_A \leq 2\left((\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)\right)^k \|x - x_0\|_A$. How many CG iterations are needed to reduce the error by a factor of $10^6$ when $\kappa(A) = 10^4$?

References

Canonical:

  • Trefethen & Bau, Numerical Linear Algebra (1997), Chapters 10-40
  • Golub & Van Loan, Matrix Computations (2013), Chapters 3-8

Current:

  • Nocedal & Wright, Numerical Optimization (2006), Chapter 5 (CG for optimization)
  • Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011)

Next Topics

  • Conjugate gradient methods: CG for optimization (not just linear systems)
  • Randomized linear algebra: sketching and randomized SVD for large-scale problems

Last reviewed: April 2026
