Numerical Optimization
Conjugate Gradient Methods
Solving Ax = b when A is symmetric positive definite by choosing A-conjugate search directions. Exact convergence in n steps, condition-number-dependent rates, nonlinear CG variants, and the connection to Krylov subspaces.
Why This Matters
Many problems in ML reduce to solving large linear systems $Ax = b$ where $A$ is symmetric positive definite (SPD). Examples: solving the normal equations in least squares, computing natural gradient steps, solving preconditioned systems in second-order optimization, and inverting kernel matrices.
Direct methods (Cholesky factorization) cost $O(n^3)$ and require forming $A$ explicitly. Conjugate gradient solves $Ax = b$ using only matrix-vector products $Av$, never forming $A^{-1}$. For sparse or structured matrices where $Av$ is cheap, CG is dramatically faster. It converges in at most $n$ iterations (exact arithmetic) but typically reaches a good approximate solution in far fewer.
Mental Model
Steepest descent moves in the direction of the negative gradient at each step. The problem: it revisits the same directions repeatedly, zigzagging toward the solution. CG eliminates this waste by choosing directions that are $A$-conjugate: each new direction is "orthogonal" with respect to the $A$-inner product to all previous directions. This means each direction extracts maximum progress without undoing work from previous steps.
Formal Setup
A-Conjugate Directions
Vectors $d_0, d_1, \dots, d_k$ are A-conjugate (or conjugate with respect to $A$) if $d_i^T A d_j = 0$ for all $i \neq j$. When $A = I$, this reduces to ordinary orthogonality.
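One way to make this concrete: given two linearly independent vectors, a single Gram-Schmidt step in the $A$-inner product produces an A-conjugate pair. A minimal numpy sketch; the specific SPD matrix and vectors are illustrative, not from the text:

```python
import numpy as np

# Illustrative 3x3 SPD matrix (symmetric, diagonally dominant).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

# One Gram-Schmidt step in the A-inner product: remove u's
# A-component from v.
v = v - (u @ A @ v) / (u @ A @ u) * u

# u and v are now A-conjugate (u^T A v = 0) even though they are
# not orthogonal in the ordinary sense (u^T v != 0).
```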
Krylov Subspace
The Krylov subspace of order $k$ is:

$$\mathcal{K}_k(A, r_0) = \mathrm{span}\{r_0, A r_0, A^2 r_0, \dots, A^{k-1} r_0\}$$

where $r_0 = b - A x_0$ is the initial residual. After $k$ iterations, CG finds the solution in $x_0 + \mathcal{K}_k(A, r_0)$ that minimizes the $A$-norm of the error.
The CG Algorithm
Given SPD matrix $A$, right-hand side $b$, initial guess $x_0$:
- Set $r_0 = b - A x_0$ (initial residual) and $d_0 = r_0$ (initial search direction).
- For $k = 0, 1, 2, \dots$ until the residual is sufficiently small:
  - Step length: $\alpha_k = \frac{r_k^T r_k}{d_k^T A d_k}$
  - Update solution: $x_{k+1} = x_k + \alpha_k d_k$
  - Update residual: $r_{k+1} = r_k - \alpha_k A d_k$
  - Conjugate direction coefficient: $\beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$
  - New search direction: $d_{k+1} = r_{k+1} + \beta_{k+1} d_k$
Each iteration requires one matrix-vector product ($A d_k$), two inner products, and three vector updates. Storage: only $x$, $r$, $d$, and the product $A d$.
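The update rules above translate directly into code. A minimal numpy sketch (the function name and the small test system are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve Ax = b for SPD A via conjugate gradient.

    Uses only matrix-vector products with A; A is never factored.
    """
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                        # initial residual
    d = r.copy()                         # initial search direction
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ad = A @ d                       # the one matrix-vector product
        alpha = rs_old / (d @ Ad)        # step length
        x = x + alpha * d                # update solution
        r = r - alpha * Ad               # update residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:        # converged
            break
        beta = rs_new / rs_old           # conjugate direction coefficient
        d = r + beta * d                 # new search direction
        rs_old = rs_new
    return x

# Small SPD system (illustrative values).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Note that only `rs_old`, `rs_new`, and `d @ Ad` are computed as scalars, so each iteration costs one matrix-vector product plus a handful of vector operations.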
Main Theorems
CG Finite Termination
Statement
The conjugate gradient algorithm terminates in at most $n$ iterations with the exact solution $x^* = A^{-1} b$. Equivalently, $x_n = x^*$.
Intuition
The directions $d_0, \dots, d_{n-1}$ are A-conjugate and span $\mathbb{R}^n$. Each iteration solves the problem exactly in one new dimension. After $n$ dimensions, the entire space is covered.
Proof Sketch
The A-conjugacy of the search directions can be proved by induction using the three-term recurrence. Since $A$ is $n \times n$, at most $n$ mutually A-conjugate directions can exist. The minimization property ensures $x_k$ is the minimizer of $\|x - x^*\|_A$ over $x_0 + \mathcal{K}_k(A, r_0)$. At $k = n$, $\mathcal{K}_n(A, r_0) = \mathbb{R}^n$ (for generic $r_0$), so $x_n = x^*$.
Why It Matters
Finite termination distinguishes CG from generic iterative methods (which converge asymptotically but never reach the exact solution). In practice, finite termination is destroyed by floating-point errors, but it explains why CG converges so much faster than steepest descent.
Failure Mode
In floating-point arithmetic, rounding errors cause loss of A-conjugacy among the search directions. After enough iterations, the directions are no longer conjugate and the residual may stagnate. Preconditioning and reorthogonalization are standard remedies.
CG Convergence Rate
Statement
After $k$ iterations:

$$\|x_k - x^*\|_A \leq 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \|x_0 - x^*\|_A$$

where $\|v\|_A = \sqrt{v^T A v}$ is the $A$-norm and $\kappa = \lambda_{\max}(A) / \lambda_{\min}(A)$ is the condition number.
Intuition
The convergence rate depends on $\sqrt{\kappa}$, not $\kappa$ itself. This is a quadratic improvement over steepest descent, whose rate depends on $\kappa$. For a matrix with condition number $\kappa$, steepest descent reduces the error by a factor of roughly $\frac{\kappa - 1}{\kappa + 1}$ per step, while CG reduces it by roughly $\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ per step.
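To see the gap concretely, one can compare how many iterations each per-step factor implies for a fixed error reduction. A sketch assuming an illustrative condition number of $10^4$ and a target reduction of $10^{-6}$ (both values are assumptions, not from the text):

```python
import math

# Illustrative values: condition number kappa and target reduction.
kappa = 1.0e4
target = 1.0e-6

sd_rate = (kappa - 1) / (kappa + 1)                        # steepest descent
cg_rate = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)  # conjugate gradient

# Smallest k with rate**k <= target for each method.
sd_iters = math.ceil(math.log(target) / math.log(sd_rate))
cg_iters = math.ceil(math.log(target) / math.log(cg_rate))
print(sd_iters, cg_iters)   # CG needs roughly sqrt(kappa) times fewer steps
```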
Proof Sketch
CG minimizes $\|x_k - x^*\|_A$ over all $x_k \in x_0 + \mathcal{K}_k(A, r_0)$. This is equivalent to minimizing $\max_i |p(\lambda_i)|$ over all polynomials $p$ of degree $k$ with $p(0) = 1$, where $\lambda_i$ are the eigenvalues of $A$. The Chebyshev polynomials solve this min-max problem on $[\lambda_{\min}, \lambda_{\max}]$, giving the stated bound.
Why It Matters
This bound explains why preconditioning is so important: replacing $Ax = b$ by the preconditioned system $M^{-1} A x = M^{-1} b$ (where $M \approx A$ is cheap to invert) reduces $\kappa$ and accelerates convergence. It also explains why CG is fast for matrices with clustered eigenvalues, even if the condition number is large.
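A minimal sketch of preconditioned CG with the simplest choice, the Jacobi (diagonal) preconditioner $M = \mathrm{diag}(A)$; the badly scaled test matrix is illustrative. The preconditioner is applied by solving $Mz = r$, never by forming $M^{-1}A$ explicitly:

```python
import numpy as np

def pcg_jacobi(A, b, tol=1e-10, max_iter=None):
    """Conjugate gradient preconditioned with M = diag(A) (Jacobi).

    M is applied by multiplying with 1/diag(A); in practice stronger
    preconditioners (incomplete Cholesky, problem-specific M) are used.
    """
    n = b.shape[0]
    Minv = 1.0 / np.diag(A)              # inverse of the diagonal of A
    x = np.zeros(n)
    r = b - A @ x
    z = Minv * r                         # preconditioned residual
    d = z.copy()
    rz_old = r @ z
    for _ in range(max_iter or n):
        Ad = A @ d
        alpha = rz_old / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z
        d = z + (rz_new / rz_old) * d
        rz_old = rz_new
    return x

# Badly scaled SPD system where diagonal scaling helps (illustrative).
A = np.diag([1.0, 100.0, 10000.0]) + 0.1 * np.ones((3, 3))
b = np.array([1.0, 1.0, 1.0])
x = pcg_jacobi(A, b)
```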
Failure Mode
The bound is pessimistic when eigenvalues are clustered. If $A$ has only $m$ distinct eigenvalues, CG converges in at most $m$ steps (exact arithmetic). The Chebyshev bound captures worst-case behavior over $[\lambda_{\min}, \lambda_{\max}]$ and does not exploit clustering.
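This can be checked numerically: build an SPD matrix with only a few distinct eigenvalues and count CG iterations. The construction below (3 distinct eigenvalues in a 50-dimensional space) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# SPD matrix with exactly 3 distinct eigenvalues: 1, 5, and 10.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.repeat([1.0, 5.0, 10.0], [20, 20, 10])
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)

# Plain CG loop, counting iterations to a tight tolerance.
x = np.zeros(n)
r = b.copy()
d = r.copy()
rs = r @ r
iters = 0
while np.sqrt(rs) > 1e-10 and iters < n:
    Ad = A @ d
    alpha = rs / (d @ Ad)
    x += alpha * d
    r -= alpha * Ad
    rs_new = r @ r
    d = r + (rs_new / rs) * d
    rs = rs_new
    iters += 1
print(iters)   # far fewer than n = 50, despite the 50-dimensional system
```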
Nonlinear Conjugate Gradient
CG can be extended to minimize a general smooth function $f(x)$, not just the quadratic $\frac{1}{2} x^T A x - b^T x$. Replace the residual $r_k$ with the negative gradient $-\nabla f(x_k)$ and compute the step length $\alpha_k$ by line search.
The key difference is the choice of $\beta_k$:
- Fletcher-Reeves: $\beta_{k+1}^{FR} = \frac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}$
- Polak-Ribière: $\beta_{k+1}^{PR} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{\nabla f_k^T \nabla f_k}$
Polak-Ribière is generally preferred because it automatically resets toward steepest descent when progress stalls ($\beta^{PR} \approx 0$ when consecutive gradients are similar). Fletcher-Reeves can generate poor search directions after an imprecise line search.
Neither variant retains the $n$-step finite termination property for nonquadratic objectives. Periodic restarts (setting $\beta_k = 0$, e.g. every $n$ iterations) are common.
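A minimal sketch of nonlinear CG with the Polak-Ribière coefficient (the PR+ variant, which clips negative $\beta$ to zero) and a backtracking Armijo line search; the test objective and all constants are illustrative assumptions:

```python
import numpy as np

def nonlinear_cg_pr(f, grad, x0, tol=1e-8, max_iter=200):
    """Nonlinear CG with the PR+ coefficient and a backtracking
    (Armijo) line search. A sketch; production codes use strong-Wolfe
    line searches and periodic restarts."""
    x = x0.astype(float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        # Backtracking line search: halve alpha until sufficient decrease.
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x = x + alpha * d
        g_new = grad(x)
        if np.linalg.norm(g_new) < tol:
            break
        # Polak-Ribiere beta, clipped at zero (PR+): acts as an
        # automatic restart when consecutive gradients are similar.
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta * d
        if g_new @ d >= 0:               # safeguard: ensure a descent direction
            d = -g_new
        g = g_new
    return x

# Illustrative smooth nonquadratic objective with minimizer (1, -2).
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + 0.1 * (x[0] - 1.0) ** 4
grad = lambda x: np.array([2.0 * (x[0] - 1.0) + 0.4 * (x[0] - 1.0) ** 3,
                           2.0 * (x[1] + 2.0)])
x_star = nonlinear_cg_pr(f, grad, np.array([0.0, 0.0]))
```

The `max(0.0, ...)` clip and the descent safeguard are the practical embodiment of the reset behavior described above.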
Common Confusions
CG is not gradient descent with momentum
Both use information from previous iterations, but the mechanisms differ. CG chooses A-conjugate directions (which are optimal for quadratics). Momentum methods use an exponential moving average of past gradients. For a quadratic objective, CG converges in at most $n$ steps; gradient descent with momentum does not have this property.
CG requires symmetry and positive definiteness
The standard CG algorithm solves $Ax = b$ only when $A$ is SPD. For non-symmetric systems, use GMRES or BiCGSTAB; for symmetric indefinite systems, MINRES is the appropriate Krylov method.
Summary
- CG solves $Ax = b$ for SPD $A$ using only matrix-vector products $Av$
- Converges in at most $n$ iterations (exact arithmetic), often much fewer
- Error bound depends on $\sqrt{\kappa}$: quadratically better than steepest descent
- Preconditioning reduces $\kappa$ and accelerates convergence
- Nonlinear CG (Fletcher-Reeves, Polak-Ribière) extends to general smooth optimization
Exercises
Problem
An SPD matrix $A$ has condition number $\kappa$. Using the CG convergence bound, determine how many iterations are needed to reduce the $A$-norm error by a factor of $\epsilon$, in terms of $\kappa$ and $\epsilon$.
Problem
Prove that if $A$ has exactly $m$ distinct eigenvalues, CG converges in at most $m$ iterations (exact arithmetic).
References
Canonical:
- Hestenes & Stiefel, Methods of Conjugate Gradients for Solving Linear Systems (1952)
- Trefethen & Bau, Numerical Linear Algebra, Lectures 38-39
- Nocedal & Wright, Numerical Optimization, Chapter 5
Current:
- Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain (1994), a widely used tutorial
Next Topics
Understanding CG provides the foundation for preconditioned iterative methods and second-order optimization in large-scale ML.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Line Search Methods (Layer 2)