Numerical Optimization
Coordinate Descent
Optimize by updating one coordinate (or block) at a time while holding the others fixed. It is the default solver for the Lasso because each coordinate update has a closed-form solution.
Why This Matters
Coordinate descent is the workhorse behind the most widely used Lasso solvers (glmnet, scikit-learn's Lasso). When the nonsmooth penalty is separable across coordinates, each coordinate update has a closed-form solution. This makes coordinate descent extremely fast for high-dimensional sparse problems, often faster than proximal gradient methods.
Mental Model
Instead of moving in all directions at once (full gradient step), pick one coordinate axis and slide along it to the optimal point. Then pick another coordinate and repeat. Each step is cheap (a one-dimensional optimization) and for many problems it has an exact solution.
Think of it like tuning a guitar: you adjust one string at a time, cycling through all strings, and eventually the whole instrument is in tune.
Formal Setup and Notation
We want to minimize:

$$F(x) = f(x) + \sum_{i=1}^{n} g_i(x_i)$$

where $f$ is convex and smooth, and each $g_i$ is convex and acts on a single coordinate. The separability of $\sum_i g_i(x_i)$ is what makes coordinate descent efficient.
Coordinate Update Rule
At iteration $k$, pick coordinate $i$ and update:

$$x_i^{k+1} \in \arg\min_{t \in \mathbb{R}} \; f(x_1^k, \dots, x_{i-1}^k, t, x_{i+1}^k, \dots, x_n^k) + g_i(t)$$

All other coordinates stay fixed. This is a one-dimensional optimization problem in $t$.
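This update rule can be sketched in code. The following is an illustrative implementation (not from the source): a ternary search stands in for the exact 1-D solve, which is valid for convex objectives; the function names and the test function are ours.

```python
def argmin_1d(h, lo=-10.0, hi=10.0, iters=80):
    """Ternary search: minimizer of a convex 1-D function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if h(m1) < h(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def cyclic_cd(f, x, epochs=50):
    """Cyclic coordinate descent: minimize along one axis at a time."""
    x = list(x)
    for _ in range(epochs):
        for i in range(len(x)):
            # exact 1-D minimization over coordinate i, others held fixed
            x[i] = argmin_1d(lambda t: f(x[:i] + [t] + x[i + 1:]))
    return x

# smooth, strictly convex test function (illustrative choice)
f = lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2 + v[0] * v[1]
x = cyclic_cd(f, [0.0, 0.0])   # converges to (8/3, -10/3)
```

Each epoch here slides along both axes once; for this quadratic the error contracts by a constant factor per sweep.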
Cyclic Coordinate Descent
In cyclic (or Gauss-Seidel) coordinate descent, you cycle through coordinates in a fixed order $1, 2, \dots, n, 1, 2, \dots$. One full pass through all $n$ coordinates is called an epoch.
Randomized Coordinate Descent
In randomized coordinate descent, at each step you pick coordinate $i$ uniformly at random from $\{1, \dots, n\}$. This version has cleaner convergence theory because each step is an unbiased estimator of progress.
Core Definitions
For the Lasso problem $\min_x \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$, the coordinate update for coordinate $j$ is:

$$x_j \leftarrow \frac{S_\lambda\!\big(a_j^\top (b - A_{-j} x_{-j})\big)}{\|a_j\|_2^2}$$

where $a_j$ is the $j$-th column of $A$, $A_{-j}$ and $x_{-j}$ denote all other columns and coordinates, and $S_\lambda(z) = \operatorname{sign}(z)\max(|z| - \lambda, 0)$ is the soft-thresholding operator.
This closed-form update is why coordinate descent is the default for Lasso. No step size tuning is needed.
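A minimal sketch of this update loop (illustrative, not a reference implementation): it uses the usual residual-caching trick so each coordinate update costs $O(n)$; function names are ours.

```python
import numpy as np

def soft_threshold(z, t):
    """S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, b, lam, epochs=100):
    """Cyclic coordinate descent for (1/2)||Ax - b||^2 + lam * ||x||_1.
    Maintains the residual r = b - A x so each update costs O(n)."""
    n, p = A.shape
    x = np.zeros(p)
    r = b.copy()                      # residual b - A x
    col_sq = (A ** 2).sum(axis=0)     # ||a_j||^2, precomputed once
    for _ in range(epochs):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # a_j^T (b - A_{-j} x_{-j}) = a_j^T r + ||a_j||^2 x_j
            rho = A[:, j] @ r + col_sq[j] * x[j]
            x_new = soft_threshold(rho, lam) / col_sq[j]
            r += A[:, j] * (x[j] - x_new)   # update residual in O(n)
            x[j] = x_new
    return x

# sanity check: with A = I the update reduces to soft-thresholding b
x = lasso_cd(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
```

Note there is no step size anywhere: each coordinate subproblem is solved exactly in closed form.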
Block coordinate descent generalizes this: instead of updating a single coordinate, update a block of coordinates at a time. This is useful when the objective has natural block structure (e.g., group lasso).
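As one concrete block update: under the additional assumption that each group's columns are orthonormal ($A_g^\top A_g = I$), the group-lasso block subproblem has a closed-form "block soft-thresholding" solution $x_g = (1 - \lambda/\|\rho_g\|_2)_+ \, \rho_g$, where $\rho_g = A_g^\top(b - A_{-g} x_{-g})$. A sketch (names and the orthonormality assumption are ours):

```python
import numpy as np

def block_soft_threshold(rho, lam):
    """Closed-form group-lasso block update, assuming the group's
    columns are orthonormal: shrink rho toward 0 by lam in norm."""
    norm = np.linalg.norm(rho)
    if norm <= lam:
        return np.zeros_like(rho)   # whole block set exactly to zero
    return (1.0 - lam / norm) * rho

# ||rho|| = 5, so the block is shrunk by the factor 1 - 1/5
out = block_soft_threshold(np.array([3.0, 4.0]), lam=1.0)   # -> [2.4, 3.2]
```

Like soft-thresholding in the scalar case, this zeroes out entire groups at once, which is exactly the sparsity pattern group lasso is designed to produce.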
Gauss-Seidel for solving $Ax = b$ is cyclic coordinate descent applied to $\tfrac{1}{2} x^\top A x - b^\top x$ (when $A$ is positive definite). Each coordinate update is $x_i \leftarrow \big(b_i - \sum_{j \neq i} A_{ij} x_j\big) / A_{ii}$.
Main Theorems
Convergence of Cyclic Coordinate Descent
Statement
Let $F(x) = f(x) + \sum_i g_i(x_i)$ where $f$ is convex with coordinate-wise Lipschitz gradient constants $L_1, \dots, L_n$. Cyclic coordinate descent produces iterates satisfying:

$$F(x_k) - F^\star = O\!\left(\frac{n L_{\max} R^2}{k}\right)$$

where $R$ bounds the distance from the iterates to a minimizer. For randomized coordinate descent with step size $1/L_i$ for coordinate $i$:

$$\mathbb{E}\big[F(x_k)\big] - F^\star = O\!\left(\frac{n \bar{L} R^2}{k}\right)$$

where $\bar{L}$ is the average of the $L_i$ and the expectation is over the random coordinate choices.
Intuition
Each coordinate update can only decrease the objective (or leave it unchanged). The separability of $g$ ensures no coupling between coordinates in the nonsmooth part. The smooth part $f$ provides enough curvature that progress on individual coordinates translates to global progress. The factor $n$ in the rate reflects that each epoch touches $n$ coordinates.
Proof Sketch
For the randomized case: at each step, the expected decrease from updating a random coordinate is $1/n$ of the decrease you would get from a full gradient step. Apply the standard descent lemma coordinate-wise and take expectations. Telescope over $k$ steps to get the $O(n/k)$ rate per iteration, which is $O(1/k)$ per epoch.
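For the smooth case ($g_i \equiv 0$) the expected-decrease claim can be written out explicitly; the following is a standard computation, included here for concreteness. The coordinate-wise descent lemma with step $1/L_i$ gives

$$f\!\left(x - \tfrac{1}{L_i}\nabla_i f(x)\, e_i\right) \le f(x) - \frac{|\nabla_i f(x)|^2}{2 L_i},$$

so choosing $i$ uniformly at random,

$$\mathbb{E}\big[f(x^{+})\big] \le f(x) - \frac{1}{2n}\sum_{i=1}^{n} \frac{|\nabla_i f(x)|^2}{L_i} \le f(x) - \frac{\|\nabla f(x)\|^2}{2 n L_{\max}},$$

which is, up to the ratio of Lipschitz constants, a $1/n$ fraction of the guaranteed decrease $\|\nabla f(x)\|^2 / (2L)$ of a full gradient step.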
Why It Matters
The per-epoch rate matches gradient descent, but each epoch of coordinate descent can be cheaper if the coordinate updates are fast. For Lasso with $A \in \mathbb{R}^{n \times p}$, each coordinate update is $O(n)$ and an epoch is $O(np)$, the same as one gradient step but with a much smaller constant because it avoids full matrix multiplications.
Failure Mode
Cyclic coordinate descent can fail to converge for non-separable nonsmooth $g$. A simple counterexample is minimizing $f(x_1, x_2) = |x_1 + x_2| + 2|x_1 - x_2|$ starting from $(-1, -1)$: each coordinate is already optimal with the other held fixed, so cyclic updates leave the point unchanged, yet the global minimum is at the origin. For non-separable penalties, use proximal gradient or ADMM instead.
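This stalling behavior can be checked numerically. The sketch below uses $f(x, y) = |x + y| + 2|x - y|$ as one illustrative non-separable nonsmooth objective (our choice, not from the source); each coordinate subproblem is convex piecewise linear, so its exact minimum sits at one of its kinks.

```python
# non-separable nonsmooth objective; global minimum is f(0, 0) = 0
f = lambda x, y: abs(x + y) + 2 * abs(x - y)

def exact_1d_min(h, kinks):
    """A convex piecewise-linear 1-D function attains its minimum
    at a kink, so exact minimization reduces to comparing kinks."""
    return min(kinks, key=h)

x, y = -1.0, -1.0
for _ in range(10):
    # update x with y fixed: the kinks of |x + y| + 2|x - y| in x
    # are at x = -y and x = y
    x = exact_1d_min(lambda t: f(t, y), [-y, y])
    y = exact_1d_min(lambda t: f(x, t), [-x, x])

# stuck at (-1, -1) with f = 2, even though f(0, 0) = 0
```

Geometrically, neither axis direction is a descent direction at $(-1, -1)$: the only way down is the diagonal, which single-coordinate moves can never take.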
Canonical Examples
Lasso coordinate update
For $A \in \mathbb{R}^{n \times p}$ with $p = 1000$: precompute the Gram matrix $A^\top A$ and $A^\top b$. Each coordinate update reads one row of $A^\top A$ and one entry of $A^\top b$. A full epoch scans all 1000 coordinates. In practice, active set strategies skip coordinates already at zero, making each epoch even cheaper.
Gauss-Seidel for linear systems
To solve $Ax = b$ with $A$ positive definite, iterate $x_i \leftarrow \big(b_i - \sum_{j \neq i} A_{ij} x_j\big) / A_{ii}$ for $i = 1, \dots, n$. This converges if $A$ is strictly diagonally dominant or symmetric positive definite.
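A direct translation of this iteration (illustrative; the example matrix is ours and is both symmetric positive definite and diagonally dominant):

```python
import numpy as np

def gauss_seidel(A, b, sweeps=100):
    """Gauss-Seidel: cyclic coordinate descent on (1/2) x^T A x - b^T x.
    Each update uses the already-updated values of earlier coordinates."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            # exact minimization over x_i with the others fixed
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = gauss_seidel(A, b)   # solves A x = b
```

Unlike Jacobi iteration, each coordinate update immediately uses the freshly updated earlier coordinates, which is exactly the cyclic coordinate descent pattern.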
Common Confusions
Coordinate descent is not gradient descent with a random mask
Coordinate descent solves the one-dimensional subproblem exactly (or nearly so). Simply zeroing out all but one component of the gradient and taking a step is a different algorithm with potentially worse behavior. The exact minimization along each coordinate is what gives coordinate descent its good properties.
Cyclic vs randomized: theory vs practice
Randomized coordinate descent has cleaner theory (an $O(n/k)$ rate with clean constants). But cyclic coordinate descent often works better in practice because it ensures every coordinate gets updated each epoch. The theory for cyclic is harder and was settled only recently.
Summary
- Update one coordinate at a time; each step is a one-dimensional problem
- Separable penalties (like $\ell_1$) give closed-form coordinate updates
- Cyclic CD is the default Lasso solver; no step size to tune
- Randomized CD has cleaner theory: $O(n/k)$ per iteration, $O(1/k)$ per epoch
- Block CD generalizes to groups of coordinates
- Gauss-Seidel is CD applied to quadratic objectives
Exercises
Problem
Write out the coordinate descent update for coordinate $j$ of the Lasso problem in terms of the columns of $A$ and the current residual $r = b - Ax$.
Problem
Give a two-dimensional example where cyclic coordinate descent on a nonsmooth, non-separable function fails to find the minimum. What goes wrong geometrically?
References
Canonical:
- Tseng, "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization" (2001)
- Wright, "Coordinate Descent Algorithms," Mathematical Programming (2015)
Current:
- Friedman, Hastie, Tibshirani, "Regularization Paths for GLMs via Coordinate Descent" (2010)
Next Topics
The natural next steps from coordinate descent:
- Proximal gradient methods: for non-separable penalties
- Stochastic gradient descent: randomness in samples rather than coordinates
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Nonlinear Gauss-Seidel (Layer 3)