
Numerical Optimization

Nonlinear Gauss-Seidel

Block coordinate descent with Newton-like updates within each block. Converges under contraction conditions for structured optimization problems. The EM algorithm is a special case.


Why This Matters

Many optimization problems in ML have natural block structure. In matrix factorization, you alternate between updating $W$ and $H$. In EM, you alternate between the E-step (computing posteriors) and the M-step (updating parameters). In ADMM, you alternate between the primal blocks, followed by a dual update. Nonlinear Gauss-Seidel formalizes this pattern: cycle through the blocks, solving each subproblem exactly or approximately (e.g., with a few Newton steps) while holding the other blocks fixed.

Mental Model

Standard coordinate descent updates one scalar variable at a time. Block coordinate descent updates a group of variables (a "block") at a time. Nonlinear Gauss-Seidel is block coordinate descent where each block update solves a nonlinear subproblem, typically using Newton's method or a similar second-order method. The "Gauss-Seidel" name comes from linear algebra: the classic Gauss-Seidel method for $Ax = b$ updates each variable using the most recent values of all the others.

Formal Setup

Definition

Block Decomposition

Partition the variable $x \in \mathbb{R}^n$ into $m$ blocks: $x = (x_1, x_2, \ldots, x_m)$, where $x_i \in \mathbb{R}^{n_i}$ and $\sum_i n_i = n$. The objective $f(x)$ is written as $f(x_1, x_2, \ldots, x_m)$.

Definition

Nonlinear Gauss-Seidel Iteration

Given the current iterate $x^{(k)} = (x_1^{(k)}, \ldots, x_m^{(k)})$, the nonlinear Gauss-Seidel update cycles through the blocks $i = 1, \ldots, m$:

x_i^{(k+1)} = \arg\min_{x_i} f(x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, x_i, x_{i+1}^{(k)}, \ldots, x_m^{(k)})

Each block update uses the most recently computed values of all other blocks (Gauss-Seidel ordering), and the inner minimization is solved by Newton's method or another local solver.
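As a concrete sketch, the loop below runs this iteration on a made-up two-block objective with scalar blocks (the objective and all numbers are illustrative, not from the text): block 1 has no closed-form minimizer and uses Newton steps, while block 2 is quadratic and is minimized exactly.

```python
# Nonlinear Gauss-Seidel on a made-up two-block objective (scalar blocks):
#   f(x1, x2) = (x1 - 1)^4 + x1^2 + (x2 + 2)^2 + 0.5 * x1 * x2

def f(x1, x2):
    return (x1 - 1)**4 + x1**2 + (x2 + 2)**2 + 0.5 * x1 * x2

def minimize_block1(x1, x2, steps=20):
    """Newton's method on x1 -> f(x1, x2) with x2 held fixed."""
    for _ in range(steps):
        grad = 4 * (x1 - 1)**3 + 2 * x1 + 0.5 * x2
        hess = 12 * (x1 - 1)**2 + 2       # bounded below by 2, so the step is safe
        x1 -= grad / hess
    return x1

def minimize_block2(x1):
    """The x2-subproblem is quadratic: solve d/dx2 = 2(x2 + 2) + 0.5*x1 = 0."""
    return -2.0 - 0.25 * x1

x1, x2 = 0.0, 0.0
for sweep in range(50):
    x1 = minimize_block1(x1, x2)   # block 1 uses the most recent x2
    x2 = minimize_block2(x1)       # block 2 uses the freshly updated x1 (Gauss-Seidel)
print(x1, x2)                      # the blocks settle at a joint stationary point
```

Note how the second update reads the value of $x_1$ produced moments earlier in the same sweep; that is the Gauss-Seidel ordering.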

Main Theorems

Theorem

Convergence of Nonlinear Gauss-Seidel

Statement

Let $f: \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable, and suppose each block subproblem $\min_{x_i} f(x_1, \ldots, x_i, \ldots, x_m)$ has a unique solution $x_i^* = g_i(x_{-i})$, where $x_{-i}$ denotes all blocks except $i$. Define the Gauss-Seidel map $G(x) = (g_1, g_2, \ldots, g_m)$ applied sequentially. If $G$ is a contraction mapping with respect to some norm (i.e., $\|G(x) - G(y)\| \leq \rho \|x - y\|$ for some $\rho < 1$), then the iterates converge to the unique fixed point $x^*$ at a linear rate:

\|x^{(k)} - x^*\| \leq \rho^k \|x^{(0)} - x^*\|

Intuition

Each block update makes progress by minimizing over its own variables. The contraction condition ensures that the combined effect of all block updates brings the iterate closer to the fixed point. The rate $\rho$ depends on how strongly coupled the blocks are: weakly coupled problems (nearly separable objectives) have small $\rho$ and fast convergence.

Proof Sketch

This is a direct application of the Banach fixed-point theorem. The Gauss-Seidel map GG is a composition of the individual block minimization maps. If GG is a contraction on a complete metric space, Banach guarantees a unique fixed point and geometric convergence to it.

Why It Matters

The contraction framework provides a clean sufficient condition for convergence that applies broadly. In practice, verifying the contraction condition reduces to checking that the cross-block coupling (measured by off-diagonal blocks of the Hessian) is small relative to the within-block curvature (measured by diagonal blocks of the Hessian).
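For a quadratic objective this check is exact. With $f(x) = \frac{1}{2} x^\top A x$ and scalar blocks, the Gauss-Seidel map is affine, and its contraction factor is the spectral radius of $T = -(D + L)^{-1} U$, where $A = L + D + U$ is the strictly-lower / diagonal / strictly-upper split (a standard fact from linear iterative methods; the matrix below is a made-up example):

```python
import numpy as np

# Made-up quadratic example: f(x) = 0.5 * x^T A x with scalar blocks.
A = np.array([[2.0, 0.5],
              [0.5, 2.0]])
D = np.diag(np.diag(A))
L = np.tril(A, k=-1)   # strictly lower part
U = np.triu(A, k=1)    # strictly upper part

T = -np.linalg.solve(D + L, U)          # Gauss-Seidel iteration matrix
rho = max(abs(np.linalg.eigvals(T)))    # contraction factor of the linear map
print(rho)
```

Here the within-block curvature (diagonal entries 2) dominates the cross-block coupling (off-diagonal 0.5), so $\rho = 0.0625 \ll 1$ and the sweeps contract rapidly.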

Failure Mode

The method can cycle or diverge if the contraction condition fails. This happens when blocks are strongly coupled: updating one block substantially changes the optimal value of another block. For non-convex problems, even if each subproblem is convex, the overall map may not be a contraction, and the method may converge to different points depending on initialization or block ordering.

Connection to EM Algorithm

The expectation-maximization (EM) algorithm is a form of block coordinate ascent on the evidence lower bound (ELBO):

\mathcal{L}(q, \theta) = \mathbb{E}_q[\log p(x, z \mid \theta)] + H(q)

  • E-step: maximize $\mathcal{L}$ over $q$ (the posterior approximation) with $\theta$ fixed. Solution: $q(z) = p(z \mid x, \theta)$.
  • M-step: maximize $\mathcal{L}$ over $\theta$ with $q$ fixed.

This is exactly nonlinear Gauss-Seidel with two blocks, $(q, \theta)$: the E-step is a block maximization over $q$, and the M-step is a block maximization over $\theta$. Linear convergence of EM follows from the general convergence theory when the EM map is a contraction.
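A minimal numerical sketch makes the two blocks explicit (the model is a made-up 1-D mixture of two unit-variance Gaussians with equal weights, where only the means are unknown):

```python
import numpy as np

# Made-up example: EM for a 1-D mixture of two unit-variance Gaussians
# with equal weights; only the means are unknown. The two Gauss-Seidel
# blocks are q (responsibilities, E-step) and theta = (mu1, mu2) (M-step).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3.0, 1.0, 200),
                       rng.normal(3.0, 1.0, 200)])

mu = np.array([-1.0, 1.0])               # initial means (the theta block)
for _ in range(50):
    # E-step: q(z) = p(z | x, theta), the exact posterior responsibilities
    log_lik = -0.5 * (data[:, None] - mu[None, :])**2   # up to shared constants
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means maximize the expected
    # complete-data log-likelihood with q held fixed
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
print(mu)
```

With well-separated components the estimated means land near the true values $(-3, 3)$; each outer loop iteration is one full Gauss-Seidel sweep over the two blocks.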

Connection to ADMM

The alternating direction method of multipliers (ADMM) also alternates between blocks but adds a penalty term to couple the blocks. ADMM can be viewed as a nonlinear Gauss-Seidel method on the augmented Lagrangian, with blocks being the primal variables and the dual variable.
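As an illustration (a made-up scalar consensus problem, not a general ADMM implementation), the sweeps below cycle Gauss-Seidel style through the two primal blocks and the dual variable of the scaled augmented Lagrangian:

```python
# Made-up scalar consensus problem for viewing ADMM as block sweeps:
#   minimize (x - a)^2 + (z - b)^2   subject to  x = z,
# whose solution is x = z = (a + b) / 2. Each block update below is the
# exact minimizer of the scaled augmented Lagrangian over that block.
a, b, rho = 1.0, 5.0, 1.0
x = z = u = 0.0
for _ in range(100):
    x = (2 * a + rho * (z - u)) / (2 + rho)   # x-block: uses latest z and u
    z = (2 * b + rho * (x + u)) / (2 + rho)   # z-block: uses the freshly updated x
    u = u + (x - z)                           # dual update on the multiplier
print(x, z)   # both approach (a + b) / 2 = 3
```

The penalty parameter `rho` controls how strongly the quadratic term couples the blocks; the dual update is what distinguishes ADMM from plain block minimization of the penalized objective.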

Common Confusions

Watch Out

Block coordinate descent is not the same as gradient descent on blocks

Block coordinate descent exactly minimizes over each block. Block coordinate gradient descent takes a gradient step on each block. The former has stronger per-iteration progress but requires solving a subproblem. The latter is cheaper per iteration but may require more iterations. Nonlinear Gauss-Seidel refers to exact block minimization (or a close approximation via Newton steps).

Watch Out

Gauss-Seidel ordering matters

In Gauss-Seidel, block $i+1$ uses the updated value of block $i$ from the current iteration. In Jacobi ordering, all blocks update simultaneously using values from the previous iteration. Gauss-Seidel typically converges faster because it uses the most recent information, but Jacobi is easier to parallelize.
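The difference is easy to see numerically. The sketch below (a made-up 3×3 symmetric positive definite system, i.e., a quadratic objective with scalar blocks) counts the sweeps each ordering needs to reach a fixed tolerance:

```python
import numpy as np

# Made-up SPD system Ax = b, equivalently the quadratic
# f(x) = 0.5 x^T A x - b^T x with scalar blocks.
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
b = np.array([1.0, 0.0, 1.0])
x_star = np.linalg.solve(A, b)           # exact solution, here (1, 1, 1)

def sweeps_needed(ordering, tol=1e-8, max_sweeps=1000):
    x = np.zeros(3)
    for k in range(1, max_sweeps + 1):
        x_prev = x.copy()                # Jacobi reads only previous-sweep values
        for i in range(3):
            src = x if ordering == "gauss-seidel" else x_prev
            x[i] = (b[i] - A[i, :i] @ src[:i] - A[i, i+1:] @ src[i+1:]) / A[i, i]
        if np.linalg.norm(x - x_star) < tol:
            return k
    return max_sweeps

gs = sweeps_needed("gauss-seidel")
jac = sweeps_needed("jacobi")
print(gs, jac)   # Gauss-Seidel needs fewer sweeps than Jacobi here
```

The only difference between the two orderings is which vector the inner loop reads from: the live iterate (Gauss-Seidel) or a frozen copy from the previous sweep (Jacobi).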

Exercises

ExerciseCore

Problem

Consider minimizing $f(x_1, x_2) = x_1^2 + x_2^2 + 0.5 x_1 x_2$. Write out one full Gauss-Seidel iteration starting from $(x_1^{(0)}, x_2^{(0)}) = (2, 2)$.

ExerciseAdvanced

Problem

For the EM algorithm applied to a Gaussian mixture model, identify the two "blocks" and explain why the EM map is guaranteed to increase the log-likelihood at each step but is not guaranteed to be a contraction.

References

Canonical:

  • Bertsekas, Nonlinear Programming, 3rd ed., Section 2.7
  • Ortega & Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Chapter 10

Current:

  • Wright, "Coordinate Descent Algorithms," Mathematical Programming (2015)
  • Xu & Yin, "A Block Coordinate Descent Method for Regularized Multi-Convex Optimization," SIAM J. Optimization (2013)


Last reviewed: April 2026
