Numerical Optimization
Nonlinear Gauss-Seidel
Block coordinate descent with Newton-like updates within each block. Converges under contraction conditions for structured optimization problems. The EM algorithm is a special case.
Why This Matters
Many optimization problems in ML have natural block structure. In matrix factorization, you alternate between updating the factor matrices $W$ and $H$. In EM, you alternate between the E-step (computing posteriors) and the M-step (updating parameters). In ADMM, you alternate between primal variables. Nonlinear Gauss-Seidel formalizes this pattern: cycle through blocks, solving each subproblem exactly or approximately (e.g., with a few Newton steps), while holding the other blocks fixed.
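As a concrete instance of this pattern, here is a minimal alternating least squares sketch for matrix factorization; the matrix sizes, rank, and iteration count are illustrative assumptions, not from the text:

```python
import numpy as np

# Illustrative alternating least squares (ALS) for X ~ W @ H:
# each block update is an exact least-squares solve with the other
# block held fixed. Sizes, rank, and iteration count are arbitrary.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 15))
k = 3
W = rng.standard_normal((20, k))
H = rng.standard_normal((k, 15))

def frob_loss(W, H):
    return np.linalg.norm(X - W @ H) ** 2

loss0 = frob_loss(W, H)
for _ in range(50):
    # Block 1: minimize over W with H fixed (solve H.T @ W.T = X.T)
    W = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T
    # Block 2: minimize over H using the freshly updated W (Gauss-Seidel order)
    H = np.linalg.lstsq(W, X, rcond=None)[0]
```

Because each block update is an exact minimization, the objective can only decrease from one update to the next.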
Mental Model
Standard coordinate descent updates one scalar variable at a time. Block coordinate descent updates a group of variables (a "block") at a time. Nonlinear Gauss-Seidel is block coordinate descent where each block update solves a nonlinear subproblem, typically using Newton's method or a similar second-order method. The "Gauss-Seidel" name comes from linear algebra: the classic Gauss-Seidel method for solving a linear system $Ax = b$ updates each variable using the most recent values of all others.
Formal Setup
Block Decomposition
Partition the variable $x \in \mathbb{R}^n$ into $m$ blocks: $x = (x_1, x_2, \dots, x_m)$, where $x_i \in \mathbb{R}^{n_i}$ and $\sum_{i=1}^m n_i = n$. The objective is written as $f(x_1, x_2, \dots, x_m)$.
Nonlinear Gauss-Seidel Iteration
Given the current iterate $x^k = (x_1^k, \dots, x_m^k)$, the nonlinear Gauss-Seidel update cycles through blocks $i = 1, \dots, m$:

$$x_i^{k+1} = \arg\min_{\xi \in \mathbb{R}^{n_i}} f(x_1^{k+1}, \dots, x_{i-1}^{k+1}, \xi, x_{i+1}^k, \dots, x_m^k).$$

Each block update uses the most recently computed values of all other blocks (Gauss-Seidel ordering), and the inner minimization is solved by Newton's method or another local solver.
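A minimal sketch of this iteration, on a hypothetical two-block objective $f(x, y) = (x-1)^4 + (x-1)^2 + xy + y^2$ (chosen so the $x$-subproblem is genuinely nonlinear while every derivative is easy to check by hand; the objective and step counts are illustrative, not from the text):

```python
# Illustrative two-block objective (not from the text):
#   f(x, y) = (x - 1)**4 + (x - 1)**2 + x*y + y**2
# Each block subproblem has positive curvature, so Newton steps are safe.

def update_x(x, y, newton_steps=3):
    # A few Newton iterations on the x-subproblem with y held fixed
    for _ in range(newton_steps):
        grad = 4 * (x - 1) ** 3 + 2 * (x - 1) + y   # df/dx
        hess = 12 * (x - 1) ** 2 + 2                # d2f/dx2 >= 2 > 0
        x -= grad / hess
    return x

def update_y(x):
    # The y-subproblem is quadratic, so one Newton step is exact: y = -x/2
    return -x / 2

x, y = 5.0, 5.0
for _ in range(30):
    x = update_x(x, y)   # uses the current y
    y = update_y(x)      # uses the freshly updated x (Gauss-Seidel ordering)
```

At the fixed point both partial derivatives vanish, which is easy to verify numerically.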
Main Theorems
Convergence of Nonlinear Gauss-Seidel
Statement
Let $f: \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable, and suppose each block subproblem has a unique solution $\hat{x}_i(x_{-i}) = \arg\min_{\xi} f(\xi; x_{-i})$, where $x_{-i}$ denotes all blocks except $i$. Define the Gauss-Seidel map $T: \mathbb{R}^n \to \mathbb{R}^n$ as the composition of the block updates applied sequentially. If $T$ is a contraction mapping with respect to some norm $\|\cdot\|$ (i.e., $\|T(x) - T(y)\| \le \gamma \|x - y\|$ for some $\gamma \in [0, 1)$), then the iterates $x^{k+1} = T(x^k)$ converge to the unique fixed point $x^*$ at a linear rate:

$$\|x^k - x^*\| \le \gamma^k \|x^0 - x^*\|.$$
Intuition
Each block update makes progress by minimizing over its own variables. The contraction condition ensures that the combined effect of all block updates brings the iterate closer to the fixed point. The rate depends on how strongly coupled the blocks are: weakly coupled problems (nearly separable objectives) have small $\gamma$ and fast convergence.
Proof Sketch
This is a direct application of the Banach fixed-point theorem. The Gauss-Seidel map $T$ is a composition of the individual block minimization maps. If $T$ is a contraction on a complete metric space, Banach guarantees a unique fixed point and geometric convergence to it.
Why It Matters
The contraction framework provides a clean sufficient condition for convergence that applies broadly. In practice, verifying the contraction condition reduces to checking that the cross-block coupling (measured by off-diagonal blocks of the Hessian) is small relative to the within-block curvature (measured by diagonal blocks of the Hessian).
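For a quadratic objective this check can be done exactly: block minimization reduces to classic Gauss-Seidel on the corresponding linear system, and the contraction factor is the spectral radius of the Gauss-Seidel iteration matrix. A small sketch (the matrix values are an illustrative example):

```python
import numpy as np

# For f(x) = 0.5 * x.T @ A @ x - b.T @ x, exact coordinate minimization is
# classic Gauss-Seidel on A x = b. Splitting A = (D + L) + U into
# diagonal-plus-lower and strictly-upper parts, the iteration matrix is
#   M = -(D + L)^{-1} U,
# and spectral_radius(M) < 1 is exactly the contraction condition.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # strong diagonal curvature, weak coupling
D_plus_L = np.tril(A)           # diagonal + strictly lower triangle
U = np.triu(A, k=1)             # strictly upper triangle
M = -np.linalg.solve(D_plus_L, U)
gamma = max(abs(np.linalg.eigvals(M)))   # contraction factor
```

Here the off-diagonal coupling (1) is small relative to the diagonal curvature (4 and 3), so $\gamma = 1/12 \ll 1$ and convergence is fast.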
Failure Mode
The method can cycle or diverge if the contraction condition fails. This happens when blocks are strongly coupled: updating one block substantially changes the optimal value of another block. For non-convex problems, even if each subproblem is convex, the overall map may not be a contraction, and the method may converge to different points depending on initialization or block ordering.
Connection to EM Algorithm
The expectation-maximization (EM) algorithm is a form of block coordinate ascent on the evidence lower bound (ELBO):

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)}[\log p(x, z \mid \theta)] - \mathbb{E}_{q(z)}[\log q(z)].$$

- E-step: maximize $\mathcal{L}$ over $q$ (the posterior approximation) with $\theta$ fixed. Solution: $q(z) = p(z \mid x, \theta)$.
- M-step: maximize $\mathcal{L}$ over $\theta$ with $q$ fixed.
This is exactly nonlinear Gauss-Seidel with two blocks: $(q, \theta)$. The E-step is a block update over $q$; the M-step is a block update over $\theta$ (each maximizes the ELBO, equivalently minimizes its negative). Convergence of EM follows from the general convergence theory when the EM map is a contraction.
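A minimal numerical sketch of this two-block view, for a two-component 1D Gaussian mixture with known unit variances and equal weights (the data, initialization, and simplifications here are all illustrative assumptions):

```python
import numpy as np

# Two-component 1D Gaussian mixture with known unit variances and equal
# weights (simplifying assumptions); only the means are unknown.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
mu = np.array([-1.0, 1.0])                   # initial means (block theta)

for _ in range(50):
    # E-step (block q): responsibilities r = p(z | x, mu), the exact
    # maximizer of the ELBO over q with the means fixed
    log_r = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step (block theta): maximize the ELBO over the means with r fixed
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
```

The alternation recovers means near the true values -2 and 3; each half-step is an exact block maximization, so the ELBO never decreases.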
Connection to ADMM
The alternating direction method of multipliers (ADMM) also alternates between blocks but adds an augmented-Lagrangian penalty term to couple them. ADMM can be viewed as a nonlinear Gauss-Seidel pass over the primal variables of the augmented Lagrangian, followed by a gradient ascent step on the dual variable.
Common Confusions
Block coordinate descent is not the same as gradient descent on blocks
Block coordinate descent exactly minimizes over each block. Block coordinate gradient descent takes a gradient step on each block. The former has stronger per-iteration progress but requires solving a subproblem. The latter is cheaper per iteration but may require more iterations. Nonlinear Gauss-Seidel refers to exact block minimization (or a close approximation via Newton steps).
Gauss-Seidel ordering matters
In Gauss-Seidel, block uses the updated value of block from the current iteration. In Jacobi ordering, all blocks update simultaneously using values from the previous iteration. Gauss-Seidel typically converges faster because it uses the most recent information, but Jacobi is easier to parallelize.
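The difference between the two orderings can be seen on a small diagonally dominant linear system (the matrix and sweep count are illustrative):

```python
import numpy as np

# Illustrative diagonally dominant system A x = b with solution (1, 1, 1).
A = np.array([[4.0, 1.0, 1.0],
              [1.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
b = np.array([6.0, 8.0, 9.0])
x_true = np.linalg.solve(A, b)   # reference solution

def sweep_gauss_seidel(x):
    x = x.copy()
    for i in range(3):
        # Entries j < i were already updated within this same sweep
        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

def sweep_jacobi(x):
    x_new = np.empty_like(x)
    for i in range(3):
        # Uses only values from the previous sweep
        x_new[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x_new

x_gs, x_j = np.zeros(3), np.zeros(3)
for _ in range(20):
    x_gs = sweep_gauss_seidel(x_gs)
    x_j = sweep_jacobi(x_j)
```

After the same number of sweeps, the Gauss-Seidel iterate is closer to the solution than the Jacobi iterate, while the Jacobi sweep's independent updates are trivially parallelizable.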
Exercises
Problem
Consider minimizing . Write out one full Gauss-Seidel iteration starting from .
Problem
For the EM algorithm applied to a Gaussian mixture model, identify the two "blocks" and explain why the EM map is guaranteed to increase the log-likelihood at each step but is not guaranteed to be a contraction.
References
Canonical:
- Bertsekas, Nonlinear Programming, 3rd ed., Section 2.7
- Ortega & Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Chapter 10
Current:
- Wright, "Coordinate Descent Algorithms," Mathematical Programming (2015)
- Xu & Yin, "A Block Coordinate Descent Method for Regularized Multi-Convex Optimization," SIAM J. Optimization (2013)
Next Topics
- Augmented Lagrangian and ADMM: extends block methods with dual variable updates for constrained problems
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Coordinate Descent (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Newton's Method (Layer 1)