
Numerical Optimization

Coordinate Descent

Optimize by updating one coordinate (or block) at a time while holding others fixed. The default solver for Lasso because each coordinate update has a closed-form solution.


Why This Matters

Coordinate descent is the workhorse behind the most widely used Lasso solvers (glmnet, scikit-learn's Lasso). When the nonsmooth penalty is separable across coordinates, each coordinate update has a closed-form solution. This makes coordinate descent extremely fast for high-dimensional sparse problems, often faster than proximal gradient methods.

Mental Model

Instead of moving in all $d$ directions at once (full gradient step), pick one coordinate axis and slide along it to the optimal point. Then pick another coordinate and repeat. Each step is cheap (a one-dimensional optimization) and for many problems it has an exact solution.

Think of it like tuning a guitar: you adjust one string at a time, cycling through all strings, and eventually the whole instrument is in tune.

Formal Setup and Notation

We want to minimize:

$$\min_{x \in \mathbb{R}^d} F(x) = f(x) + \sum_{j=1}^{d} g_j(x_j)$$

where $f$ is convex and smooth, and each $g_j$ is convex and acts on a single coordinate. The separability of $g$ is what makes coordinate descent efficient.

Definition

Coordinate Update Rule

At iteration $k$, pick coordinate $j$ and update:

$$x_j^{k+1} = \arg\min_{z} \; F(x_1^{k+1}, \ldots, x_{j-1}^{k+1}, z, x_{j+1}^k, \ldots, x_d^k)$$

All other coordinates stay fixed. This is a one-dimensional optimization problem in $z$.

Definition

Cyclic Coordinate Descent

In cyclic (or Gauss-Seidel) coordinate descent, you cycle through coordinates in a fixed order $j = 1, 2, \ldots, d, 1, 2, \ldots$ One full pass through all $d$ coordinates is called an epoch.

Definition

Randomized Coordinate Descent

In randomized coordinate descent, at each step you pick coordinate $j$ uniformly at random from $\{1, \ldots, d\}$. This version has cleaner convergence theory because each step is an unbiased estimator of progress.
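A minimal sketch of randomized coordinate descent on a smooth quadratic (the matrix, vector, and iteration count below are illustrative choices, not from the text); for $f(x) = \frac{1}{2}x^TQx - c^Tx$, the coordinate-wise Lipschitz constants are $L_j = Q_{jj}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smooth objective: f(x) = 1/2 x^T Q x - c^T x with Q symmetric PD.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
L = np.diag(Q)  # coordinate-wise Lipschitz constants L_j = Q_jj

x = np.zeros(2)
for _ in range(2000):
    j = rng.integers(2)       # pick a coordinate uniformly at random
    grad_j = Q[j] @ x - c[j]  # j-th partial derivative of f
    x[j] -= grad_j / L[j]     # step 1/L_j along coordinate j

x_star = np.linalg.solve(Q, c)  # exact minimizer, for comparison
```

For a quadratic, the step $1/L_j$ is in fact the exact coordinate minimizer, so each update solves the one-dimensional subproblem exactly.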

Core Definitions

For the Lasso problem $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$, the coordinate update for coordinate $j$ is:

$$x_j \leftarrow \frac{1}{a_j^T a_j}\, S_\lambda\!\left(a_j^T(b - A_{-j} x_{-j})\right)$$

where $a_j$ is the $j$-th column of $A$, $A_{-j}$ and $x_{-j}$ denote all other columns and coordinates, and $S_\lambda(z) = \operatorname{sign}(z)\max(|z| - \lambda, 0)$ is the soft-thresholding operator.

This closed-form update is why coordinate descent is the default for Lasso. No step size tuning is needed.
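The update above can be sketched as a residual-maintaining loop (the random problem sizes and the `lasso_cd` name are illustrative assumptions, not from the text); keeping $r = b - Ax$ up to date makes each coordinate update $O(n)$:

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(A, b, lam, epochs=100):
    """Cyclic coordinate descent for 1/2 ||Ax - b||^2 + lam * ||x||_1.

    Maintains the residual r = b - A x so each coordinate update
    costs O(n) rather than recomputing A x from scratch.
    """
    n, d = A.shape
    x = np.zeros(d)
    r = b.copy()                   # residual for x = 0
    col_sq = (A ** 2).sum(axis=0)  # a_j^T a_j, precomputed once
    for _ in range(epochs):
        for j in range(d):
            r += A[:, j] * x[j]    # remove coordinate j's contribution
            x[j] = soft_threshold(A[:, j] @ r, lam) / col_sq[j]
            r -= A[:, j] * x[j]    # restore it with the new value
    return x

# Quick check on a synthetic sparse problem (noiseless, 3 nonzeros).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true
x_hat = lasso_cd(A, b, lam=0.1)
```

With a small $\lambda$ and noiseless data, the recovered `x_hat` sits close to `x_true`, with the remaining coordinates shrunk to (or near) zero.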

Block coordinate descent generalizes this: instead of updating a single coordinate, update a block of coordinates $x_B$ at a time. This is useful when the objective has natural block structure (e.g., group lasso).

Gauss-Seidel for solving $Ax = b$ is cyclic coordinate descent applied to $\min_x \frac{1}{2}x^TAx - b^Tx$ (when $A$ is positive definite). Each coordinate update is $x_j \leftarrow (b_j - \sum_{i \neq j} A_{ij}x_i)/A_{jj}$.
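A minimal sketch of this correspondence (the 3×3 system below is an illustrative choice):

```python
import numpy as np

def gauss_seidel(A, b, sweeps=50):
    """Solve A x = b for symmetric positive definite A by cyclic
    coordinate descent on 1/2 x^T A x - b^T x."""
    d = len(b)
    x = np.zeros(d)
    for _ in range(sweeps):
        for j in range(d):
            # Exact coordinate minimizer:
            # x_j = (b_j - sum_{i != j} A_ij x_i) / A_jj
            x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])  # symmetric, diagonally dominant
b = np.array([1.0, 2.0, 3.0])
x = gauss_seidel(A, b)
```

Each sweep uses the freshly updated coordinates immediately, which is exactly the Gauss-Seidel ordering.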

Main Theorems

Theorem

Convergence of Cyclic Coordinate Descent

Statement

Let $F(x) = f(x) + \sum_j g_j(x_j)$ where $f$ is convex with coordinate-wise Lipschitz gradients $L_j$. Cyclic coordinate descent produces iterates satisfying:

$$F(x^k) \to F^* \quad \text{as } k \to \infty$$

For randomized coordinate descent with step size $1/L_j$ for coordinate $j$:

$$\mathbb{E}[F(x^k)] - F^* \leq \frac{2d \cdot \max_j L_j \cdot \|x^0 - x^*\|^2}{k}$$

Intuition

Each coordinate update can only decrease the objective (or leave it unchanged). The separability of $g$ ensures no coupling between coordinates in the nonsmooth part. The smooth part $f$ provides enough curvature that progress on individual coordinates translates to global progress. The $d$ factor in the rate reflects that each epoch touches $d$ coordinates.

Proof Sketch

For the randomized case: at each step, the expected decrease from updating a random coordinate is $\frac{1}{d}$ of the decrease you would get from a full gradient step. Apply the standard descent lemma coordinate-wise and take expectations. Telescope over $k$ steps to get the $O(d/k)$ rate per iteration, which is $O(1/k)$ per epoch.

Why It Matters

The per-epoch rate matches gradient descent, but each epoch of coordinate descent can be cheaper if the coordinate updates are fast. For Lasso, each coordinate update is $O(n)$ and an epoch is $O(nd)$, the same as one gradient step but with a much smaller constant because it avoids matrix multiplications.

Failure Mode

Cyclic coordinate descent can stall at a non-optimal point when the nonsmooth part $g$ is not separable. For example, for $F(x) = x_1^2 + x_2^2 + 2|x_1 - x_2|$, every point with $x_1 = x_2 = a$ and $0 < |a| \leq 1$ is coordinate-wise optimal (neither coordinate can improve while the other is fixed), yet the global minimizer is the origin. For non-separable penalties, use proximal gradient or ADMM instead.
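A small numeric check makes the stall concrete. The function below couples a separable smooth part with a non-separable penalty $2|x_1 - x_2|$ (an illustrative choice); exact cyclic coordinate minimization never leaves the starting point:

```python
# F(x1, x2) = x1^2 + x2^2 + 2|x1 - x2|: convex, smooth part separable,
# but the nonsmooth part couples the two coordinates.
def F(x1, x2):
    return x1 ** 2 + x2 ** 2 + 2.0 * abs(x1 - x2)

def coord_min(a):
    # Exact minimizer of z -> z^2 + 2|z - a| over z.
    # For z < a the derivative 2z - 2 vanishes at z = 1; for z > a it is
    # 2z + 2, vanishing at z = -1; otherwise the kink at z = a wins.
    # Net effect: clip a to [-1, 1].
    return min(max(a, -1.0), 1.0)

x1, x2 = 0.5, 0.5
for _ in range(100):
    x1 = coord_min(x2)  # exact minimization over x1 with x2 fixed
    x2 = coord_min(x1)  # exact minimization over x2 with x1 fixed
```

The iterates stay at $(0.5, 0.5)$ with $F = 0.5$, even though the global minimum is $F(0, 0) = 0$: the subgradient kink along the diagonal blocks progress in either axis-aligned direction.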

Canonical Examples

Example

Lasso coordinate update

For $A \in \mathbb{R}^{100 \times 1000}$, $b \in \mathbb{R}^{100}$, and $\lambda = 0.1$: precompute $A^TA$ and $A^Tb$. Each coordinate update reads one row of $A^TA$ and one entry of $A^Tb$. A full epoch scans all 1000 coordinates. In practice, active set strategies skip coordinates already at zero, making each epoch even cheaper.
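The precomputation can be sketched as follows (dimensions shrunk from the example's $100 \times 1000$ for a quick check, and the active-set trick omitted); once $G = A^TA$ and $q = A^Tb$ are cached, each coordinate update reads one row of $G$ and never touches the data matrix again:

```python
import numpy as np

def lasso_cd_gram(G, q, lam, epochs=50):
    """Cyclic CD on 1/2 x^T G x - q^T x + lam * ||x||_1,
    with G = A^T A and q = A^T b precomputed. Each coordinate
    update reads one row of G: O(d) work, independent of n."""
    d = len(q)
    x = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            # rho = q_j - sum_{i != j} G_ij x_i
            rho = q[j] - G[j] @ x + G[j, j] * x[j]
            x[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[j, j]
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
x_true = np.concatenate([[1.5, -2.0], np.zeros(18)])
b = A @ x_true
x = lasso_cd_gram(A.T @ A, A.T @ b, lam=0.1)
```

This "Gram" (or covariance-update) form pays $O(nd^2)$ once up front and is the variant that wins when many epochs or many $\lambda$ values are run on the same data.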

Example

Gauss-Seidel for linear systems

To solve $Ax = b$ with $A$ positive definite, iterate: $x_j \leftarrow (b_j - \sum_{i \neq j} A_{ij} x_i) / A_{jj}$ for $j = 1, \ldots, d$. This converges if $A$ is strictly diagonally dominant or symmetric positive definite.

Common Confusions

Watch Out

Coordinate descent is not gradient descent with a random mask

Coordinate descent solves the one-dimensional subproblem exactly (or nearly so). Simply zeroing out all but one component of the gradient and taking a step is a different algorithm with potentially worse behavior. The exact minimization along each coordinate is what gives coordinate descent its good properties.

Watch Out

Cyclic vs randomized: theory vs practice

Randomized coordinate descent has cleaner theory ($O(1/k)$ rate with nice constants). But cyclic coordinate descent often works better in practice because it ensures every coordinate gets updated each epoch. The theory for cyclic is harder and was settled only recently.

Summary

  • Update one coordinate at a time; each step is a one-dimensional problem
  • Separable penalties (like $\ell_1$) give closed-form coordinate updates
  • Cyclic CD is the default Lasso solver; no step size to tune
  • Randomized CD has cleaner theory: $O(d/k)$ per iteration, $O(1/k)$ per epoch
  • Block CD generalizes to groups of coordinates
  • Gauss-Seidel is CD applied to quadratic objectives

Exercises

ExerciseCore

Problem

Write out the coordinate descent update for coordinate $j$ of the Lasso problem $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$ in terms of the columns of $A$ and the current residual $r = b - Ax$.

ExerciseAdvanced

Problem

Give a two-dimensional example where cyclic coordinate descent on a nonsmooth, non-separable function fails to find the minimum. What goes wrong geometrically?

References

Canonical:

  • Tseng, "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization" (2001)
  • Wright, "Coordinate Descent Algorithms," Mathematical Programming (2015)

Current:

  • Friedman, Hastie, Tibshirani, "Regularization Paths for GLMs via Coordinate Descent" (2010)

Next Topics

The natural next steps from coordinate descent:

  • Proximal gradient methods: for non-separable penalties
  • Stochastic gradient descent: randomness in samples rather than coordinates

Last reviewed: April 2026
