
Numerical Optimization

Conjugate Gradient Methods

Solving Ax = b when A is symmetric positive definite by choosing A-conjugate search directions. Exact convergence in n steps, condition-number-dependent rates, nonlinear CG variants, and the connection to Krylov subspaces.


Why This Matters

Many problems in ML reduce to solving large linear systems Ax = b where A is symmetric positive definite (SPD). Examples: solving the normal equations in least squares, computing natural gradient steps, solving preconditioned systems in second-order optimization, and inverting kernel matrices.

Direct methods (Cholesky factorization) cost O(n^3) and require forming A explicitly. Conjugate gradient solves Ax = b using only matrix-vector products Av, never forming A. For sparse or structured matrices where Av is cheap, CG is dramatically faster. It converges in at most n iterations (exact arithmetic) but typically reaches a good approximate solution in far fewer.

Mental Model

Steepest descent moves in the direction of the negative gradient at each step. The problem: it revisits the same directions repeatedly, zigzagging toward the solution. CG eliminates this waste by choosing directions that are A-conjugate: each new direction is "orthogonal" with respect to the A-inner product to all previous directions. This means each direction extracts maximum progress without undoing work from previous steps.

Formal Setup

Definition

A-Conjugate Directions

Vectors d_0, d_1, …, d_k are A-conjugate (or conjugate with respect to A) if d_i^T A d_j = 0 for all i ≠ j. When A = I, this reduces to ordinary orthogonality.
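A quick numerical check of the definition: the sketch below (NumPy, all names ad hoc) runs Gram-Schmidt in the A-inner product on a random basis and verifies that the resulting directions are pairwise A-conjugate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD by construction

# Gram-Schmidt in the A-inner product: conjugate a random basis
V = rng.standard_normal((n, n))
D = []
for v in V.T:
    d = v.copy()
    for dj in D:
        # remove the A-component of d along each previous direction
        d -= (dj @ A @ d) / (dj @ A @ dj) * dj
    D.append(d)
D = np.array(D)

# Gram matrix in the A-inner product: off-diagonal entries should vanish
G = D @ A @ D.T
print(np.allclose(G, np.diag(np.diag(G))))  # True
```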

Definition

Krylov Subspace

The Krylov subspace of order k is:

K_k(A, r_0) = span{r_0, A r_0, A^2 r_0, …, A^{k-1} r_0}

where r_0 = b - A x_0 is the initial residual. CG finds the solution in x_0 + K_k(A, r_0) that minimizes the A-norm of the error.
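The Krylov characterization can be checked directly: run a few textbook CG steps and confirm that the update x_k - x_0 lies in span{r_0, A r_0, A^2 r_0}. A small NumPy sketch (construction details are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD
b = rng.standard_normal(n)
x0 = np.zeros(n)

# Run k steps of CG using the textbook recurrences
x, r = x0.copy(), b - A @ x0
d = r.copy()
for _ in range(k):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    x = x + alpha * d
    r_new = r - alpha * Ad
    d = r_new + (r_new @ r_new) / (r @ r) * d
    r = r_new

# Build the Krylov basis {r0, A r0, A^2 r0} and check x_k - x0 is in its span
r0 = b - A @ x0
K = np.column_stack([np.linalg.matrix_power(A, j) @ r0 for j in range(k)])
coef, *_ = np.linalg.lstsq(K, x - x0, rcond=None)
print(np.linalg.norm(K @ coef - (x - x0)))  # ~0: x_k - x0 lies in K_k(A, r0)
```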

The CG Algorithm

Given SPD matrix A, right-hand side b, initial guess x_0:

  1. Set r_0 = b - A x_0 (initial residual) and d_0 = r_0 (initial search direction).
  2. For k = 0, 1, 2, …:
    • Step length: α_k = r_k^T r_k / (d_k^T A d_k)
    • Update solution: x_{k+1} = x_k + α_k d_k
    • Update residual: r_{k+1} = r_k - α_k A d_k
    • Conjugate direction coefficient: β_{k+1} = r_{k+1}^T r_{k+1} / (r_k^T r_k)
    • New search direction: d_{k+1} = r_{k+1} + β_{k+1} d_k

Each iteration requires one matrix-vector product (A d_k), two inner products, and three vector updates. Storage: only x_k, r_k, d_k, and the product A d_k.
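The updates above translate almost line-for-line into code. A minimal NumPy sketch (the function name and defaults are ours, not from any library):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve Ax = b for SPD A using only matrix-vector products with A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = b - A @ x                      # initial residual r_0
    d = r.copy()                       # initial search direction d_0 = r_0
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ad = A @ d                     # the one matrix-vector product per iteration
        alpha = rs_old / (d @ Ad)      # step length
        x += alpha * d                 # update solution
        r -= alpha * Ad                # update residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # residual small enough: done
            break
        beta = rs_new / rs_old         # conjugate direction coefficient
        d = r + beta * d               # new search direction
        rs_old = rs_new
    return x

# Small SPD test system
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)          # SPD by construction
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))       # residual norm: tiny once CG has converged
```

Note that `A` only appears through the product `A @ d`, so any object supporting that operation (a sparse matrix, a linear operator) works unchanged.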

Main Theorems

Theorem

CG Finite Termination

Statement

The conjugate gradient algorithm terminates in at most n iterations with the exact solution x* = A^{-1} b. Equivalently, r_n = 0.

Intuition

The directions d_0, …, d_{n-1} are A-conjugate and span R^n. Each iteration solves the problem exactly in one new dimension. After n dimensions, the entire space is covered.

Proof Sketch

The A-conjugacy of the search directions d_0, …, d_{k-1} can be proved by induction using the three-term recurrence. Since A is n × n, at most n A-conjugate directions can exist. The minimization property ensures x_k is the minimizer of (1/2) x^T A x - b^T x over x_0 + K_k(A, r_0). At k = n, K_n(A, r_0) = R^n (for generic r_0), so x_n = x*.

Why It Matters

Finite termination distinguishes CG from generic iterative methods (which converge asymptotically but never reach the exact solution). In practice, finite termination is destroyed by floating-point errors, but it explains why CG converges so much faster than steepest descent.

Failure Mode

In floating-point arithmetic, rounding errors cause loss of A-conjugacy among the search directions. After about n iterations, the directions are no longer conjugate and the residual may stagnate. Preconditioning and reorthogonalization are standard remedies.

Theorem

CG Convergence Rate

Statement

After k iterations:

‖x_k - x*‖_A ≤ 2 ((√κ - 1)/(√κ + 1))^k ‖x_0 - x*‖_A

where ‖v‖_A = √(v^T A v) is the A-norm and κ = λ_max(A)/λ_min(A).

Intuition

The convergence rate depends on √κ, not κ itself. This is a quadratic improvement over steepest descent, which depends on (κ - 1)/(κ + 1). For a matrix with condition number κ = 10,000, steepest descent reduces the error by a factor of about 0.9998 per step; CG reduces it by about 0.98 per step.
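The arithmetic behind those per-step factors, and the iteration counts they imply for a 10^{-6} error reduction (a quick check, assuming the bounds are tight):

```python
import math

kappa = 10_000

# Per-step error-reduction factors implied by the two bounds
sd_factor = (kappa - 1) / (kappa + 1)                        # steepest descent
cg_factor = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)  # conjugate gradient
print(round(sd_factor, 4), round(cg_factor, 4))              # 0.9998 0.9802

# Iterations needed to shrink the error by a factor of 1e-6 at each rate
iters_sd = math.log(1e-6) / math.log(sd_factor)
iters_cg = math.log(1e-6) / math.log(cg_factor)
print(round(iters_sd), round(iters_cg))                      # ~69078 vs ~691
```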

Proof Sketch

CG minimizes ‖x_k - x*‖_A over all of x_0 + K_k(A, r_0). This is equivalent to minimizing ‖p_k(A)(x_0 - x*)‖_A over all polynomials p_k of degree k with p_k(0) = 1. The Chebyshev polynomials solve this min-max problem on [λ_min, λ_max], giving the stated bound.

Why It Matters

This bound explains why preconditioning is so important: replacing A by M^{-1}A (where M ≈ A is cheap to invert) reduces κ and accelerates convergence. It also explains why CG is fast for matrices with clustered eigenvalues, even if the condition number is large.

Failure Mode

The bound is pessimistic when eigenvalues are clustered. If A has only r distinct eigenvalues, CG converges in at most r steps (exact arithmetic). The Chebyshev bound captures worst-case behavior over [λ_min, λ_max] and does not exploit clustering.
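The clustering effect is easy to demonstrate: build an SPD matrix with only three distinct eigenvalues and watch the CG residual collapse after three iterations, far earlier than the κ = 100 Chebyshev bound suggests. A NumPy sketch (the construction details are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
eigs = np.array([1.0, 10.0, 100.0])          # only 3 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.choice(eigs, size=n)) @ Q.T   # SPD with kappa = 100
b = rng.standard_normal(n)

# Plain CG, tracking the residual norm at each iteration
x = np.zeros(n)
r = b - A @ x
d = r.copy()
norms = [np.linalg.norm(r)]
for _ in range(4):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    x += alpha * d
    r_new = r - alpha * Ad
    d = r_new + (r_new @ r_new) / (r @ r) * d
    r = r_new
    norms.append(np.linalg.norm(r))
print(norms)  # residual collapses to roundoff level by iteration 3
```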

Nonlinear Conjugate Gradient

CG can be extended to minimize a general smooth function f(x), not just the quadratic (1/2) x^T A x - b^T x. Replace the residual r_k with -∇f(x_k) and compute α_k by line search.

The key difference is the choice of β_k:

  • Fletcher-Reeves: β_k = ‖∇f_k‖^2 / ‖∇f_{k-1}‖^2
  • Polak-Ribière: β_k = ∇f_k^T (∇f_k - ∇f_{k-1}) / ‖∇f_{k-1}‖^2

Polak-Ribière is generally preferred because it automatically resets to steepest descent when progress stalls (β_k ≈ 0 when consecutive gradients are similar). Fletcher-Reeves can generate poor search directions after an imprecise line search.

Neither variant retains the n-step finite termination property for nonquadratic objectives. Periodic restarts (every n or √n steps) are common.
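A minimal nonlinear CG sketch using the PR+ rule (Polak-Ribière β clipped at zero, one common safeguard) with a simple Armijo backtracking line search, tried on the Rosenbrock function. Every name here is illustrative; production codes use stronger (Wolfe) line searches and periodic restarts.

```python
import numpy as np

def pr_cg(f, grad, x0, iters=5000, tol=1e-8):
    """Nonlinear CG with the PR+ beta and Armijo backtracking (a sketch)."""
    x = np.asarray(x0, dtype=float).copy()
    g = grad(x)
    d = -g                                   # first step: steepest descent
    for _ in range(iters):
        if g @ d >= 0:                       # safeguard: restart if not a descent direction
            d = -g
        t = 1.0                              # backtracking (Armijo) line search
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d) and t > 1e-14:
            t *= 0.5
        x_new = x + t * d
        g_new = grad(x_new)
        if np.linalg.norm(g_new) < tol:
            return x_new
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))  # PR+, clipped at zero
        d = -g_new + beta * d
        x, g = x_new, g_new
    return x

# Minimize the nonquadratic Rosenbrock function from a standard start point
f = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0]**2),
                           200 * (z[1] - z[0]**2)])
x = pr_cg(f, grad, np.array([-1.2, 1.0]))
print(x, f(x))  # should land near the minimizer (1, 1)
```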

Common Confusions

Watch Out

CG is not gradient descent with momentum

Both use information from previous iterations, but the mechanisms differ. CG chooses A-conjugate directions (which are optimal for quadratics). Momentum methods use an exponential moving average of past gradients. For a quadratic objective, CG converges in at most n steps; gradient descent with momentum does not have this property.

Watch Out

CG requires symmetry and positive definiteness

The standard CG algorithm solves Ax = b only when A is SPD. For nonsymmetric systems, use GMRES or BiCGSTAB; for symmetric indefinite systems, MINRES is the appropriate Krylov method.

Summary

  • CG solves Ax = b for SPD A using only matrix-vector products
  • Converges in at most n iterations (exact arithmetic), often much fewer
  • Error bound depends on √κ: quadratically better than steepest descent
  • Preconditioning reduces κ and accelerates convergence
  • Nonlinear CG (Fletcher-Reeves, Polak-Ribière) extends to general smooth optimization

Exercises

ExerciseCore

Problem

A 1000 × 1000 SPD matrix has condition number κ = 100. Using the CG convergence bound, how many iterations are needed to reduce the A-norm error by a factor of 10^{-6}?

ExerciseAdvanced

Problem

Prove that if A has exactly r distinct eigenvalues, CG converges in at most r iterations (exact arithmetic).

References

Canonical:

  • Hestenes & Stiefel, Methods of Conjugate Gradients for Solving Linear Systems (1952)
  • Trefethen & Bau, Numerical Linear Algebra, Lectures 38-39
  • Nocedal & Wright, Numerical Optimization, Chapter 5

Current:

  • Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain (1994), a widely used tutorial

Next Topics

Understanding CG provides the foundation for preconditioned iterative methods and second-order optimization in large-scale ML.

Last reviewed: April 2026
