Numerical Linear Algebra
Algorithms for solving linear systems, computing eigenvalues, and factoring matrices. Every linear regression, PCA, and SVD computation depends on these methods.
Why This Matters
Linear algebra is computed, not just stated. When you call np.linalg.solve, sklearn's linear regression, or torch.svd, the computer runs one of the algorithms described here. Choosing the right algorithm and understanding when numerical issues arise (ill-conditioning, loss of orthogonality, pivoting failures) is the difference between getting correct results and getting garbage.
Mental Model
There are two families of methods. Direct methods (LU, Cholesky, QR) produce exact answers (up to floating-point roundoff) in a fixed number of operations, typically $O(n^3)$. Iterative methods (conjugate gradient, GMRES, power iteration) produce a sequence of improving approximations and are preferred when $n$ is large and the matrix is sparse or structured.
Solving Linear Systems: $Ax = b$
LU Factorization
Factor $A = LU$ where $L$ is lower triangular and $U$ is upper triangular. Then solve $Ly = b$ (forward substitution, $O(n^2)$) followed by $Ux = y$ (back substitution, $O(n^2)$). The factorization itself costs $\frac{2}{3}n^3$ operations. In practice, use partial pivoting: $PA = LU$ where $P$ is a permutation matrix. Pivoting prevents division by small numbers and is needed for numerical stability.
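As a sketch, the factor-once, solve-many pattern looks like this with SciPy's `lu_factor`/`lu_solve`, which implement LU with partial pivoting:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

# Factor once (PA = LU with partial pivoting), then reuse for many right-hand sides.
lu, piv = lu_factor(A)       # O(n^3) factorization
x = lu_solve((lu, piv), b)   # O(n^2) triangular solves per right-hand side
```

This is essentially what `np.linalg.solve` does internally, except `solve` discards the factorization after one use.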
Cholesky Factorization
If $A$ is symmetric positive definite (SPD), factor $A = LL^T$ where $L$ is lower triangular with positive diagonal entries. Cost: $\frac{1}{3}n^3$, half of LU. The Cholesky factorization exists if and only if $A$ is SPD. It is the preferred method for solving SPD systems (e.g., normal equations in least squares, covariance matrices, kernel matrices).
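A minimal sketch with SciPy's `cho_factor`/`cho_solve`, building an SPD test matrix as $MM^T + cI$:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)   # SPD by construction
b = rng.standard_normal(4)

c, low = cho_factor(A)        # A = L L^T, about (1/3) n^3 flops
x = cho_solve((c, low), b)    # two triangular solves
```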
QR Factorization
Factor $A \in \mathbb{R}^{m \times n}$ (with $m \ge n$) as $A = QR$ where $Q$ has orthonormal columns and $R$ is upper triangular. The least squares solution to $\min_x \|Ax - b\|_2$ is $x = R^{-1}Q^T b$. QR is more numerically stable than solving the normal equations via Cholesky, because $\kappa(A^T A) = \kappa(A)^2$.
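A sketch of the QR route to least squares, using NumPy's thin QR and a triangular solve:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 3))   # tall matrix, m >= n
b = rng.standard_normal(50)

Q, R = np.linalg.qr(A)             # thin QR: Q is 50x3, R is 3x3
x = solve_triangular(R, Q.T @ b)   # x = R^{-1} Q^T b
```

This matches what `np.linalg.lstsq` returns for a full-rank problem (lstsq uses an SVD-based routine internally).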
Main Theorems
Existence and Uniqueness of Cholesky Factorization
Statement
If $A$ is symmetric positive definite, there exists a unique lower triangular matrix $L$ with positive diagonal entries such that $A = LL^T$.
Intuition
SPD matrices have all positive eigenvalues. The Cholesky factor is a "square root" of $A$. Uniqueness comes from requiring positive diagonal entries, which pins down the sign ambiguity in any square root.
Proof Sketch
By induction on $n$. Write $A$ in block form: $A = \begin{pmatrix} a_{11} & b^T \\ b & B \end{pmatrix}$ where $a_{11} > 0$ (since $A$ is SPD). Set $\ell_{11} = \sqrt{a_{11}}$, $\ell_{21} = b/\ell_{11}$, and recurse on the Schur complement $B - bb^T/a_{11}$, which is also SPD. The positive diagonal follows from $\ell_{11} = \sqrt{a_{11}} > 0$ at each step.
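The induction translates directly into a recursive algorithm. A minimal, unoptimized sketch (production implementations work in place and use blocking for cache efficiency):

```python
import numpy as np

def cholesky_recursive(A):
    """Cholesky via the block recursion from the proof sketch (illustrative only)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    a11 = A[0, 0]
    if a11 <= 0:
        raise ValueError("matrix is not positive definite")
    L[0, 0] = np.sqrt(a11)
    if n == 1:
        return L
    b = A[1:, 0]
    L[1:, 0] = b / L[0, 0]
    # Schur complement B - b b^T / a11 is again SPD; recurse on it.
    schur = A[1:, 1:] - np.outer(b, b) / a11
    L[1:, 1:] = cholesky_recursive(schur)
    return L
```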
Why It Matters
Cholesky is the fastest direct solver for SPD systems, twice as fast as LU and guaranteed to succeed without pivoting. In ML: kernel matrices, precision matrices, and the normal equations matrix $X^TX$ are all SPD (assuming full rank). Cholesky factorization of $X^TX$ is the standard method for ordinary least squares in many libraries.
Failure Mode
If $A$ is not positive definite (has a zero or negative eigenvalue), Cholesky will fail: the algorithm attempts to take the square root of a non-positive number. This is actually useful as a diagnostic: failed Cholesky factorization indicates the matrix is not SPD. Near-singular SPD matrices ($\lambda_{\min} \approx 0$) will produce tiny diagonal entries in $L$, and the subsequent triangular solves amplify roundoff errors.
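The diagnostic use is common enough to be worth a small helper; a sketch using NumPy's `LinAlgError`:

```python
import numpy as np

def is_spd(A):
    """Check symmetric positive definiteness by attempting a Cholesky factorization."""
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)   # raises LinAlgError if A is not positive definite
        return True
    except np.linalg.LinAlgError:
        return False
```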
Condition Number and Perturbation Bound
Statement
For a nonsingular matrix $A$, if $Ax = b$ and $A(x + \delta x) = b + \delta b$, then:

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A) \, \frac{\|\delta b\|}{\|b\|}$$

where $\kappa(A) = \|A\| \, \|A^{-1}\|$ is the condition number. For the 2-norm, $\kappa_2(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$.
Intuition
The condition number measures the worst-case amplification of relative errors. A small perturbation in $b$ can be amplified by up to $\kappa(A)$ in the solution $x$. If $\kappa(A) \approx 10^k$, you lose about $k$ digits of accuracy in the solution.
Proof Sketch
$A\,\delta x = \delta b$, so $\|\delta x\| \le \|A^{-1}\| \, \|\delta b\|$. Also $\|b\| = \|Ax\| \le \|A\| \, \|x\|$, so $1/\|x\| \le \|A\|/\|b\|$. Combining: $\|\delta x\|/\|x\| \le \|A\| \, \|A^{-1}\| \cdot \|\delta b\|/\|b\| = \kappa(A) \, \|\delta b\|/\|b\|$.
Why It Matters
This bound explains why some linear systems are "hard" even though $A$ is invertible in exact arithmetic. In ML, the Hessian of the loss function can be ill-conditioned, making Newton's method unstable without regularization. Kernel matrices with small eigenvalues produce ill-conditioned systems. Adding a ridge term $\lambda I$ to $X^TX$ reduces $\kappa$ from $\sigma_{\max}^2/\sigma_{\min}^2$ to $(\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$.
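A quick numerical check of the ridge effect, using the famously ill-conditioned Hilbert matrix from SciPy as a stand-in for an SPD normal-equations matrix:

```python
import numpy as np
from scipy.linalg import hilbert

A = hilbert(8)                     # SPD, condition number around 1e10
lam = 1e-3
kappa_raw = np.linalg.cond(A)
kappa_ridge = np.linalg.cond(A + lam * np.eye(8))
# The ridge term lifts the smallest eigenvalue to at least lam,
# collapsing the condition number by many orders of magnitude.
```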
Failure Mode
The bound is tight: there exist perturbations that achieve equality. But for "typical" perturbations, the actual error may be much smaller than the bound suggests. The condition number is a worst-case measure. Also, the bound applies to the problem, not to any specific algorithm. A backward-stable algorithm (like LU with partial pivoting) solves a nearby problem exactly, but whether the nearby problem has a nearby solution depends on $\kappa(A)$.
Iterative Methods
Conjugate Gradient (CG)
For SPD systems $Ax = b$, CG generates iterates $x_k$ that minimize the error $\|x - x_k\|_A$ over the Krylov subspace $\mathcal{K}_k(A, b) = \mathrm{span}\{b, Ab, \dots, A^{k-1}b\}$. CG converges in at most $n$ iterations in exact arithmetic. In practice, convergence rate depends on $\kappa(A)$:

$$\|x - x_k\|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \|x - x_0\|_A$$
Preconditioning (replacing $Ax = b$ with $M^{-1}Ax = M^{-1}b$ for a good approximation $M \approx A$) reduces the effective condition number.
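A minimal, unpreconditioned CG sketch; library versions add preconditioning and more careful stopping criteria:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Minimal CG for SPD A: each step minimizes the A-norm error over the Krylov subspace."""
    n = b.shape[0]
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b.copy()        # residual b - A x (x starts at zero)
    p = r.copy()        # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                 # the only matrix-vector product per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # new direction, A-conjugate to the previous ones
        rs = rs_new
    return x
```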
GMRES
For general (non-symmetric) systems, the Generalized Minimal Residual method (GMRES) minimizes the residual $\|b - Ax_k\|_2$ over the Krylov subspace. Each iteration requires storing one additional basis vector, so memory grows linearly with iteration count. Restarted GMRES (GMRES($m$)) limits memory by restarting after $m$ iterations.
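In practice one calls a library implementation; a sketch with `scipy.sparse.linalg.gmres`, where the `restart` parameter is the $m$ in GMRES($m$):

```python
import numpy as np
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(0)
n = 30
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))  # well-conditioned, non-symmetric
b = rng.standard_normal(n)

# Restart after 20 Krylov basis vectors to cap memory use.
x, info = gmres(A, b, restart=20)
# info == 0 signals convergence to the default tolerance.
```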
Eigenvalue and SVD Computation
Power Iteration
To find the dominant eigenvalue of $A$: start with a random $x_0$, iterate $x_{k+1} = Ax_k / \|Ax_k\|$. Converges to the eigenvector for the largest $|\lambda_i|$ at rate $|\lambda_2/\lambda_1|$ per iteration. Cheap per iteration ($O(n^2)$ for dense, $O(\mathrm{nnz})$ for sparse) but slow if the eigenvalue gap is small.
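A minimal sketch; the Rayleigh quotient at the end assumes $A$ is symmetric (for non-symmetric $A$ one would estimate the eigenvalue from $\|Ax_k\|$ instead):

```python
import numpy as np

def power_iteration(A, num_iters=500, seed=0):
    """Estimate the dominant eigenpair by repeated multiplication and normalization."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)   # normalize to prevent overflow/underflow
    lam = x @ A @ x              # Rayleigh quotient estimate (symmetric A)
    return lam, x
```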
QR Algorithm
The standard method for computing all eigenvalues of a dense matrix. Apply shifted QR iterations: at step $k$, compute the QR factorization $A_k - \mu_k I = Q_k R_k$, then set $A_{k+1} = R_k Q_k + \mu_k I$. With implicit shifts and deflation, converges in $O(n^3)$ total. First reduce $A$ to Hessenberg form ($O(n^3)$), then each QR step costs $O(n^2)$.
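A bare-bones shifted QR iteration to illustrate the mechanics; this omits the Hessenberg reduction, deflation, and Wilkinson shifts that make the real algorithm fast and robust:

```python
import numpy as np

def qr_algorithm(A, num_iters=200):
    """Plain shifted QR iteration on a symmetric matrix (illustration only)."""
    Ak = np.array(A, dtype=float)
    n = Ak.shape[0]
    for _ in range(num_iters):
        mu = Ak[-1, -1]                       # simple bottom-corner shift
        Q, R = np.linalg.qr(Ak - mu * np.eye(n))
        Ak = R @ Q + mu * np.eye(n)           # similarity transform: same eigenvalues
    return np.sort(np.diag(Ak))               # Ak converges toward a diagonal matrix
```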
SVD Computation
The SVD is computed via the Golub-Kahan bidiagonalization: reduce $A$ to bidiagonal form, then apply the QR algorithm to the bidiagonal matrix. Total cost: $O(mn^2)$ for $A \in \mathbb{R}^{m \times n}$ with $m \ge n$. Truncated SVD (only the top $k$ singular values) can be computed iteratively in $O(mnk)$ using Lanczos or randomized methods.
Numerical Stability
Why Naive Gram-Schmidt Fails
Classical Gram-Schmidt orthogonalization computes each $q_k$ by subtracting from $a_k$ its projections onto $q_1, \dots, q_{k-1}$ and normalizing. In floating point, the computed $Q$ can lose orthogonality badly: if $A$ is ill-conditioned and machine precision is $\epsilon$, the computed $Q$ may have $\|Q^TQ - I\| = O(\epsilon \, \kappa(A)^2)$. Modified Gram-Schmidt subtracts projections one at a time (projecting the partially orthogonalized vector, not the original column) and achieves $O(\epsilon \, \kappa(A))$. Householder QR achieves $\|Q^TQ - I\| = O(\epsilon)$ regardless of conditioning.
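The classic demonstration uses a matrix with nearly parallel columns (a Läuchli-type matrix). The sketch below implements both variants; the only difference is whether each projection coefficient is computed against the original column or the partially orthogonalized vector:

```python
import numpy as np

def gram_schmidt(A, modified=True):
    """Classical vs modified Gram-Schmidt, to illustrate orthogonality loss."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    for k in range(n):
        v = A[:, k].copy()
        for j in range(k):
            if modified:
                v -= (Q[:, j] @ v) * Q[:, j]        # project the *updated* vector
            else:
                v -= (Q[:, j] @ A[:, k]) * Q[:, j]  # project the original column
        Q[:, k] = v / np.linalg.norm(v)
    return Q

# Nearly parallel columns: kappa(A) ~ 1/eps.
eps = 1e-8
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])
Q_cgs = gram_schmidt(A, modified=False)
Q_mgs = gram_schmidt(A, modified=True)
err_cgs = np.linalg.norm(Q_cgs.T @ Q_cgs - np.eye(3))
err_mgs = np.linalg.norm(Q_mgs.T @ Q_mgs - np.eye(3))
# err_cgs is O(1): classical GS has completely lost orthogonality;
# err_mgs is tiny by comparison.
```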
Pivoting in LU
Without pivoting, LU factorization can amplify roundoff by a factor exponential in $n$. Partial pivoting (swapping rows to put the largest entry in the pivot position) reduces the growth factor to at most $2^{n-1}$ in theory, but in practice keeps it near 1 for all but pathological matrices.
Canonical Examples
Least squares: QR vs normal equations
To solve $\min_x \|Ax - b\|_2$ with $A \in \mathbb{R}^{m \times n}$, $m \gg n$:
Normal equations: Form $A^TA$ and solve $A^TAx = A^Tb$ via Cholesky. Cost: $mn^2 + \frac{1}{3}n^3$. Problem: $\kappa(A^TA) = \kappa(A)^2$. If $\kappa(A) = 10^8$, then $\kappa(A^TA) = 10^{16}$ and solving the normal equations loses all 16 digits of double precision.
QR: Factor $A = QR$ and solve $Rx = Q^Tb$. Cost: $2mn^2$. The condition number governing the solve is $\kappa(A)$, not $\kappa(A)^2$. This is the numerically stable choice when $A$ is ill-conditioned.
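A sketch comparing the two routes on a synthetic ill-conditioned problem, with singular values decaying geometrically so that $\kappa(A) \approx 10^6$ and $\kappa(A^TA) \approx 10^{12}$:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

rng = np.random.default_rng(3)
m, n = 100, 10
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)          # singular values from 1 down to 1e-6
A = (U * s) @ V.T                  # A = U diag(s) V^T
x_true = rng.standard_normal(n)
b = A @ x_true

# Normal equations via Cholesky: effective condition number kappa(A)^2 ~ 1e12.
x_ne = cho_solve(cho_factor(A.T @ A), A.T @ b)

# QR: effective condition number kappa(A) ~ 1e6.
Q, R = np.linalg.qr(A)
x_qr = solve_triangular(R, Q.T @ b)

err_ne = np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true)
err_qr = np.linalg.norm(x_qr - x_true) / np.linalg.norm(x_true)
# err_ne is orders of magnitude larger than err_qr.
```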
Common Confusions
Direct methods are not always slower than iterative methods
For dense $n \times n$ systems, Cholesky costs $\frac{1}{3}n^3$ and is hard to beat. Iterative methods shine when $n$ is large and the matrix is sparse (so a matrix-vector product costs $O(\mathrm{nnz})$ instead of $O(n^2)$) or when you only need an approximate solution. For a moderate-size dense SPD system, Cholesky is faster than CG.
Condition number is a property of the problem, not the algorithm
A backward-stable algorithm solves a slightly perturbed problem exactly. Whether this perturbed problem has a nearby solution depends on $\kappa(A)$. No algorithm can overcome ill-conditioning; regularization (adding $\lambda I$ to $A$) changes the problem to a better-conditioned one.
SVD and eigendecomposition are different computations
SVD applies to any matrix and always exists. Eigendecomposition requires a square matrix and may not exist (defective matrices). For symmetric matrices, SVD and eigendecomposition coincide (singular values = absolute eigenvalues). For non-symmetric or rectangular matrices, they are different.
Summary
- LU: $\frac{2}{3}n^3$, needs pivoting, works for general square systems
- Cholesky: $\frac{1}{3}n^3$, SPD only, no pivoting needed, fastest direct method
- QR: $2mn^2$, best for least squares, avoids squaring the condition number
- CG: iterative, SPD only, convergence depends on $\sqrt{\kappa(A)}$
- $\kappa(A) = \|A\| \, \|A^{-1}\|$: large $\kappa$ means the problem amplifies errors
- Regularization (adding $\lambda I$) reduces the condition number
Exercises
Problem
A $3 \times 3$ SPD matrix $A$ has eigenvalues $\lambda_1 \ge \lambda_2 \ge \lambda_3 > 0$. What is $\kappa_2(A)$? If the right-hand side $b$ has a relative error of $\epsilon$, what is the worst-case relative error in the solution $x$?
Problem
Show that the error of conjugate gradient on an SPD system satisfies the bound $\|x - x_k\|_A \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x - x_0\|_A$. Roughly how many CG iterations are needed to reduce the error by a fixed factor, as a function of $\kappa$?
References
Canonical:
- Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 10-40
- Golub & Van Loan, Matrix Computations (2013), Chapters 3-8
Current:
- Nocedal & Wright, Numerical Optimization (2006), Chapter 5 (CG for optimization)
- Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011)
Next Topics
- Conjugate gradient methods: CG for optimization (not just linear systems)
- Randomized linear algebra: sketching and randomized SVD for large-scale problems
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)