
Numerical Optimization

Coordinate Descent

Optimize by updating one coordinate (or block) at a time while holding others fixed. The default solver for Lasso because each coordinate update has a closed-form solution.


Why This Matters

Coordinate descent is the workhorse behind the most widely used Lasso solvers (glmnet, scikit-learn's Lasso). When the nonsmooth penalty is separable across coordinates, each coordinate update has a closed-form solution. This makes coordinate descent extremely fast for high-dimensional sparse problems, often faster than proximal gradient methods.

Mental Model

Instead of moving in all $d$ directions at once (full gradient step), pick one coordinate axis and slide along it to the optimal point. Then pick another coordinate and repeat. Each step is cheap (a one-dimensional optimization) and for many problems it has an exact solution.

Think of it like tuning a guitar: you adjust one string at a time, cycling through all strings, and eventually the whole instrument is in tune.

Formal Setup and Notation

We want to minimize:

$$\min_{x \in \mathbb{R}^d} F(x) = f(x) + \sum_{j=1}^{d} g_j(x_j)$$

where $f$ is convex and smooth, and each $g_j$ is convex and acts on a single coordinate. The separability of $g$ is what makes coordinate descent efficient.

Definition

Coordinate Update Rule

At iteration $k$, pick coordinate $j$ and update:

$$x_j^{k+1} = \arg\min_{z} \; F(x_1^{k+1}, \ldots, x_{j-1}^{k+1}, z, x_{j+1}^k, \ldots, x_d^k)$$

All other coordinates stay fixed. This is a one-dimensional optimization problem in $z$.

Definition

Cyclic Coordinate Descent

In cyclic (or Gauss-Seidel) coordinate descent, you cycle through coordinates in a fixed order $j = 1, 2, \ldots, d, 1, 2, \ldots$ One full pass through all $d$ coordinates is called an epoch.

Definition

Randomized Coordinate Descent

In randomized coordinate descent, at each step you pick coordinate $j$ uniformly at random from $\{1, \ldots, d\}$. This version has cleaner convergence theory because each step is an unbiased estimator of progress.
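A minimal sketch of randomized coordinate descent on a smooth quadratic (the matrix, vector, and iteration count below are illustrative choices, not from the text); for $f(x) = \frac{1}{2}x^TQx - c^Tx$, the coordinate-wise Lipschitz constants are $L_j = Q_{jj}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smooth objective: f(x) = 1/2 x^T Q x - c^T x with Q symmetric PD.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
L = np.diag(Q)  # coordinate-wise Lipschitz constants L_j = Q_jj

x = np.zeros(2)
for _ in range(2000):
    j = rng.integers(2)       # pick a coordinate uniformly at random
    grad_j = Q[j] @ x - c[j]  # j-th partial derivative of f
    x[j] -= grad_j / L[j]     # step 1/L_j along coordinate j

x_star = np.linalg.solve(Q, c)  # exact minimizer, for comparison
```

For a quadratic, the step $1/L_j$ is in fact the exact coordinate minimizer, so each update solves the one-dimensional subproblem exactly.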

Core Definitions

For the Lasso problem $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$, the coordinate update for coordinate $j$ is:

$$x_j \leftarrow \frac{1}{a_j^T a_j}\, S_\lambda\!\left(a_j^T(b - A_{-j} x_{-j})\right)$$

where $a_j$ is the $j$-th column of $A$, $A_{-j}$ and $x_{-j}$ denote all other columns and coordinates, and $S_\lambda(z) = \operatorname{sign}(z)\max(|z| - \lambda, 0)$ is the soft-thresholding operator.

This closed-form update is why coordinate descent is the default for Lasso. No step size tuning is needed.
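The update above can be sketched as a residual-maintaining loop (the random problem sizes and the `lasso_cd` name are illustrative assumptions, not from the text); keeping $r = b - Ax$ up to date makes each coordinate update $O(n)$:

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(A, b, lam, epochs=100):
    """Cyclic coordinate descent for 1/2 ||Ax - b||^2 + lam * ||x||_1.

    Maintains the residual r = b - A x so each coordinate update
    costs O(n) rather than recomputing A x from scratch.
    """
    n, d = A.shape
    x = np.zeros(d)
    r = b.copy()                   # residual for x = 0
    col_sq = (A ** 2).sum(axis=0)  # a_j^T a_j, precomputed once
    for _ in range(epochs):
        for j in range(d):
            r += A[:, j] * x[j]    # remove coordinate j's contribution
            x[j] = soft_threshold(A[:, j] @ r, lam) / col_sq[j]
            r -= A[:, j] * x[j]    # restore it with the new value
    return x

# Quick check on a synthetic sparse problem (noiseless, 3 nonzeros).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true
x_hat = lasso_cd(A, b, lam=0.1)
```

With a small $\lambda$ and noiseless data, the recovered `x_hat` sits close to `x_true`, with the remaining coordinates shrunk to (or near) zero.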

Block coordinate descent generalizes this: instead of updating a single coordinate, update a block of coordinates $x_B$ at a time. This is useful when the objective has natural block structure (e.g., group lasso).

Gauss-Seidel for solving $Ax = b$ is cyclic coordinate descent applied to $\min_x \frac{1}{2}x^TAx - b^Tx$ (when $A$ is positive definite). Each coordinate update is $x_j \leftarrow (b_j - \sum_{i \neq j} A_{ij}x_i)/A_{jj}$.
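A minimal sketch of this correspondence (the 3×3 system below is an illustrative choice):

```python
import numpy as np

def gauss_seidel(A, b, sweeps=50):
    """Solve A x = b for symmetric positive definite A by cyclic
    coordinate descent on 1/2 x^T A x - b^T x."""
    d = len(b)
    x = np.zeros(d)
    for _ in range(sweeps):
        for j in range(d):
            # Exact coordinate minimizer:
            # x_j = (b_j - sum_{i != j} A_ij x_i) / A_jj
            x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])  # symmetric, diagonally dominant
b = np.array([1.0, 2.0, 3.0])
x = gauss_seidel(A, b)
```

Each sweep uses the freshly updated coordinates immediately, which is exactly the Gauss-Seidel ordering.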

Main Theorems

Theorem

Convergence of Cyclic Coordinate Descent

Statement

Let $F(x) = f(x) + \sum_j g_j(x_j)$ where $f$ is convex with coordinate-wise Lipschitz gradients $L_j$. Cyclic coordinate descent produces iterates satisfying:

$$F(x^k) \to F^* \quad \text{as } k \to \infty$$

For randomized coordinate descent with step size $1/L_j$ for coordinate $j$:

$$\mathbb{E}[F(x^k)] - F^* \leq \frac{2d \cdot \max_j L_j \cdot \|x^0 - x^*\|^2}{k}$$

Intuition

Each coordinate update can only decrease the objective (or leave it unchanged). The separability of $g$ ensures no coupling between coordinates in the nonsmooth part. The smooth part $f$ provides enough curvature that progress on individual coordinates translates to global progress. The $d$ factor in the rate reflects that each epoch touches $d$ coordinates.

Proof Sketch

For the randomized case: at each step, the expected decrease from updating a random coordinate is $\frac{1}{d}$ of the decrease you would get from a full gradient step. Apply the standard descent lemma coordinate-wise and take expectations. Telescope over $k$ steps to get the $O(d/k)$ rate per iteration, which is $O(1/k)$ per epoch.

Why It Matters

The per-epoch rate matches gradient descent, but each epoch of coordinate descent can be cheaper if the coordinate updates are fast. For Lasso, each coordinate update is $O(n)$ and an epoch is $O(nd)$, the same as one gradient step but with a much smaller constant because it avoids matrix multiplications.

Failure Mode

Cyclic coordinate descent can stall at a non-optimal point when the nonsmooth part $g$ is not separable. For example, for $F(x) = x_1^2 + x_2^2 + 2|x_1 - x_2|$, every point with $x_1 = x_2 = a$ and $0 < |a| \leq 1$ is coordinate-wise optimal (neither coordinate can improve while the other is fixed), yet the global minimizer is the origin. For non-separable penalties, use proximal gradient or ADMM instead.
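A small numeric check makes the stall concrete. The function below couples a separable smooth part with a non-separable penalty $2|x_1 - x_2|$ (an illustrative choice); exact cyclic coordinate minimization never leaves the starting point:

```python
# F(x1, x2) = x1^2 + x2^2 + 2|x1 - x2|: convex, smooth part separable,
# but the nonsmooth part couples the two coordinates.
def F(x1, x2):
    return x1 ** 2 + x2 ** 2 + 2.0 * abs(x1 - x2)

def coord_min(a):
    # Exact minimizer of z -> z^2 + 2|z - a| over z.
    # For z < a the derivative 2z - 2 vanishes at z = 1; for z > a it is
    # 2z + 2, vanishing at z = -1; otherwise the kink at z = a wins.
    # Net effect: clip a to [-1, 1].
    return min(max(a, -1.0), 1.0)

x1, x2 = 0.5, 0.5
for _ in range(100):
    x1 = coord_min(x2)  # exact minimization over x1 with x2 fixed
    x2 = coord_min(x1)  # exact minimization over x2 with x1 fixed
```

The iterates stay at $(0.5, 0.5)$ with $F = 0.5$, even though the global minimum is $F(0, 0) = 0$: the subgradient kink along the diagonal blocks progress in either axis-aligned direction.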

Canonical Examples

Example

Lasso coordinate update

For $A \in \mathbb{R}^{100 \times 1000}$, $b \in \mathbb{R}^{100}$, and $\lambda = 0.1$: precompute $A^TA$ and $A^Tb$. Each coordinate update reads one row of $A^TA$ and one entry of $A^Tb$. A full epoch scans all 1000 coordinates. In practice, active set strategies skip coordinates already at zero, making each epoch even cheaper.
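The precomputation can be sketched as follows (dimensions shrunk from the example's $100 \times 1000$ for a quick check, and the active-set trick omitted); once $G = A^TA$ and $q = A^Tb$ are cached, each coordinate update reads one row of $G$ and never touches the data matrix again:

```python
import numpy as np

def lasso_cd_gram(G, q, lam, epochs=50):
    """Cyclic CD on 1/2 x^T G x - q^T x + lam * ||x||_1,
    with G = A^T A and q = A^T b precomputed. Each coordinate
    update reads one row of G: O(d) work, independent of n."""
    d = len(q)
    x = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            # rho = q_j - sum_{i != j} G_ij x_i
            rho = q[j] - G[j] @ x + G[j, j] * x[j]
            x[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[j, j]
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
x_true = np.concatenate([[1.5, -2.0], np.zeros(18)])
b = A @ x_true
x = lasso_cd_gram(A.T @ A, A.T @ b, lam=0.1)
```

This "Gram" (or covariance-update) form pays $O(nd^2)$ once up front and is the variant that wins when many epochs or many $\lambda$ values are run on the same data.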

Example

Gauss-Seidel for linear systems

To solve $Ax = b$ with $A$ positive definite, iterate: $x_j \leftarrow (b_j - \sum_{i \neq j} A_{ij} x_i) / A_{jj}$ for $j = 1, \ldots, d$. This converges if $A$ is strictly diagonally dominant or symmetric positive definite.

Common Confusions

Watch Out

Coordinate descent is not gradient descent with a random mask

Coordinate descent solves the one-dimensional subproblem exactly (or nearly so). Simply zeroing out all but one component of the gradient and taking a step is a different algorithm with potentially worse behavior. The exact minimization along each coordinate is what gives coordinate descent its good properties.

Watch Out

Cyclic vs randomized: theory vs practice

Randomized coordinate descent has cleaner theory ($O(1/k)$ rate with nice constants). But cyclic coordinate descent often works better in practice because it ensures every coordinate gets updated each epoch. The theory for cyclic is harder and was settled only recently.

Summary

  • Update one coordinate at a time; each step is a one-dimensional problem
  • Separable penalties (like $\ell_1$) give closed-form coordinate updates
  • Cyclic CD is the default Lasso solver; no step size to tune
  • Randomized CD has cleaner theory: $O(d/k)$ per iteration, $O(1/k)$ per epoch
  • Block CD generalizes to groups of coordinates
  • Gauss-Seidel is CD applied to quadratic objectives

Exercises

ExerciseCore

Problem

Write out the coordinate descent update for coordinate $j$ of the Lasso problem $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$ in terms of the columns of $A$ and the current residual $r = b - Ax$.

ExerciseAdvanced

Problem

Give a two-dimensional example where cyclic coordinate descent on a nonsmooth, non-separable function fails to find the minimum. What goes wrong geometrically?

References

Canonical:

  • Tseng, "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization" (2001)
  • Wright, "Coordinate Descent Algorithms," Mathematical Programming (2015)

Current:

  • Friedman, Hastie, Tibshirani, "Regularization Paths for GLMs via Coordinate Descent" (2010)

Next Topics

The natural next steps from coordinate descent:

  • Proximal gradient methods: for non-separable penalties
  • Stochastic gradient descent: randomness in samples rather than coordinates

Last reviewed: April 2026
