Numerical Optimization
Proximal Gradient Methods
Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.
Why This Matters
Many important ML objectives have the form "smooth loss plus nonsmooth regularizer." Lasso is the canonical example: squared error (smooth) plus an $\ell_1$ penalty (nonsmooth). You cannot just take gradients of the full objective because the $\ell_1$ norm is not differentiable everywhere.
Proximal gradient methods solve this cleanly. They handle the smooth and nonsmooth parts separately: a gradient step on the smooth part, then a proximal step on the nonsmooth part. This is the algorithmic backbone of sparse optimization in ML.
Mental Model
Imagine sliding down a smooth hill (gradient descent on the smooth part), then snapping to the nearest point that respects some structural constraint (the proximal step). The proximal operator acts like a "project and shrink" operation. For the $\ell_1$ norm, it is soft-thresholding: it pushes small coordinates exactly to zero, producing sparsity.
Formal Setup and Notation
We want to solve the composite minimization problem:

$$\min_{x} \; F(x) = f(x) + g(x)$$

where $f$ is convex and smooth (has $L$-Lipschitz continuous gradient) and $g$ is convex but possibly nonsmooth (e.g., $g(x) = \lambda \|x\|_1$).
Proximal Operator
The proximal operator of a convex function $g$ at point $v$ is:

$$\operatorname{prox}_{\lambda g}(v) = \arg\min_{x} \left( g(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right)$$

It finds the point that balances being close to $v$ (the quadratic term) with having small $g$ value. The minimizer always exists and is unique when $g$ is convex, lower semicontinuous, and proper.
Moreau Envelope
The Moreau envelope of $g$ with parameter $\lambda > 0$ is:

$$M_{\lambda g}(v) = \min_{x} \left( g(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right)$$

It is a smooth approximation of $g$. Even when $g$ is nonsmooth, the Moreau envelope is differentiable, with gradient $\nabla M_{\lambda g}(v) = \frac{1}{\lambda}\left(v - \operatorname{prox}_{\lambda g}(v)\right)$. As $\lambda \to 0$, the Moreau envelope converges to $g$ itself.
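The gradient formula can be sanity-checked numerically. The sketch below (an illustration, not part of the original text) takes $g = \|\cdot\|_1$, whose proximal operator is soft-thresholding, evaluates the envelope through its minimizer, and compares the claimed gradient against a finite difference:

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1: the soft-thresholding operator
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def moreau_l1(v, lam):
    # M_{lam g}(v) = g(p) + ||p - v||^2 / (2 lam) at p = prox_{lam g}(v), g = ||.||_1
    p = soft_threshold(v, lam)
    return np.abs(p).sum() + np.sum((p - v) ** 2) / (2.0 * lam)

v = np.array([1.5, -0.3, 0.0, 2.0])   # illustrative test point
lam = 0.5
grad = (v - soft_threshold(v, lam)) / lam   # claimed gradient of the envelope

eps = 1e-6
e0 = np.array([eps, 0.0, 0.0, 0.0])
fd = (moreau_l1(v + e0, lam) - moreau_l1(v - e0, lam)) / (2.0 * eps)
print(abs(grad[0] - fd) < 1e-5)  # True: formula matches finite differences
```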
Core Definitions
For the $\ell_1$ norm $g(x) = \lambda \|x\|_1$, the proximal operator is soft-thresholding:

$$\left[\operatorname{prox}_{\lambda \|\cdot\|_1}(v)\right]_i = \operatorname{sign}(v_i) \max\left(|v_i| - \lambda,\, 0\right)$$

This shrinks each coordinate toward zero and sets coordinates with $|v_i| \le \lambda$ exactly to zero. This is how proximal methods produce sparse solutions.
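In code, soft-thresholding is one line with NumPy (a minimal sketch; the test vector is an illustrative assumption):

```python
import numpy as np

def soft_threshold(v, lam):
    # elementwise: sign(v_i) * max(|v_i| - lam, 0)
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, -0.4, 0.1, -2.5])
x = soft_threshold(v, 0.5)
print(x)  # coordinates with |v_i| <= 0.5 are set exactly to zero
```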
For the indicator function of a convex set $C$, $\iota_C(x)$ (zero if $x \in C$, infinity otherwise), the proximal operator is the Euclidean projection onto $C$.
The ISTA Algorithm
The Iterative Shrinkage-Thresholding Algorithm (ISTA) repeats:

$$x_{k+1} = \operatorname{prox}_{t g}\left(x_k - t \nabla f(x_k)\right)$$

where $t \le 1/L$ is the step size and $L$ is the Lipschitz constant of $\nabla f$.
Step by step: (1) take a gradient step on the smooth part $f$, producing an intermediate point $y_k = x_k - t \nabla f(x_k)$; (2) apply the proximal operator of $t g$ to $y_k$, handling the nonsmooth part.
The FISTA Algorithm
Fast ISTA (FISTA) adds Nesterov momentum. Initialize $y_1 = x_0$, $t_1 = 1$. Then repeat:

$$x_k = \operatorname{prox}_{g/L}\left(y_k - \tfrac{1}{L} \nabla f(y_k)\right)$$
$$t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$$
$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\left(x_k - x_{k-1}\right)$$
The extrapolation step looks ahead using momentum, which is what gives the acceleration.
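The three updates translate into a short generic routine. This is a sketch assuming `prox(v, t)` computes $\operatorname{prox}_{t g}(v)$; the demo problem, with its known closed-form answer, is an illustrative assumption:

```python
import numpy as np

def fista(grad_f, prox, L, x0, iters=100):
    # FISTA with fixed step 1/L; prox(v, t) must compute prox_{t*g}(v)
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = prox(y - grad_f(y) / L, 1.0 / L)          # prox-gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)    # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

# demo: min 0.5*||x - c||^2 + lam*||x||_1, whose solution is soft(c, lam)
c = np.array([2.0, -0.2, 1.0])
lam = 0.5
soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
x = fista(grad_f=lambda x: x - c, prox=lambda v, t: soft(v, t * lam),
          L=1.0, x0=np.zeros(3))
print(x)  # [1.5 0.  0.5]
```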
Main Theorems
ISTA Convergence Rate
Statement
Let $F = f + g$. ISTA with constant step size $t = 1/L$ satisfies:

$$F(x_k) - F(x^*) \le \frac{L \|x_0 - x^*\|_2^2}{2k}$$

where $x^*$ is a minimizer of $F$.
Intuition
The convergence rate is $O(1/k)$. After $k$ iterations, the suboptimality shrinks like $1/k$. This matches gradient descent on smooth functions. The proximal step handles nonsmoothness without slowing convergence.
Proof Sketch
Show that the proximal gradient step guarantees sufficient decrease: for all $x$,

$$F(x_{k+1}) \le F(x) + \frac{L}{2}\left(\|x - x_k\|_2^2 - \|x - x_{k+1}\|_2^2\right).$$

Set $x = x^*$ and telescope the resulting inequality over $k$ iterations.
Why It Matters
This shows proximal gradient descent is as fast as gradient descent, even though the objective is nonsmooth. The proximal operator absorbs the nonsmoothness at no cost to the convergence rate.
Failure Mode
The $O(1/k)$ rate is tight for ISTA. You cannot do better without modification. Also, if $L$ is unknown you need a line search or adaptive step size, which adds cost per iteration.
FISTA Accelerated Convergence Rate
Statement
FISTA satisfies:

$$F(x_k) - F(x^*) \le \frac{2 L \|x_0 - x^*\|_2^2}{(k+1)^2}$$
Intuition
Acceleration improves the rate from $O(1/k)$ to $O(1/k^2)$. After 100 iterations, ISTA reduces the gap by a factor on the order of $100$; FISTA by a factor on the order of $10{,}000$. The momentum term lets the algorithm "look ahead" using the trajectory, extracting more information per step.
Proof Sketch
Define a Lyapunov function $E_k = \frac{2}{L} t_k^2 \left(F(x_k) - F(x^*)\right) + \|u_k - x^*\|_2^2$, where $u_k = t_k x_k - (t_k - 1) x_{k-1}$ is an auxiliary sequence. Show $E_{k+1} \le E_k$ by using the descent property and the specific choice of momentum parameters. Since $t_k \ge (k+1)/2$, this gives the $O(1/k^2)$ rate.
Why It Matters
$O(1/k^2)$ is the optimal rate for first-order methods on convex problems with Lipschitz gradients (Nesterov 1983). FISTA achieves this optimal rate for composite objectives. The improvement from $O(1/k)$ to $O(1/k^2)$ is substantial in practice.
Failure Mode
FISTA's iterates can oscillate because of the momentum term. It is not a monotone algorithm: $F(x_{k+1})$ can be larger than $F(x_k)$. Monotone variants exist but may sacrifice some speed. Also, for strongly convex $f$, you can get linear convergence with a restarting strategy.
Canonical Examples
Lasso via ISTA
The Lasso problem is $\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$. Here $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ with $\nabla f(x) = A^\top (Ax - b)$ and $L = \lambda_{\max}(A^\top A)$. The proximal step is soft-thresholding. Each ISTA iteration: compute $y_k = x_k - t\, A^\top (A x_k - b)$, then set $x_{k+1} = \operatorname{sign}(y_k) \max(|y_k| - t\lambda,\, 0)$ elementwise.
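A minimal ISTA loop for the Lasso on synthetic data. The problem sizes, random seed, and $\lambda$ below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.0, 0.5]           # sparse ground truth
b = A @ x_true
lam = 0.5

L = np.linalg.norm(A.T @ A, 2)          # largest eigenvalue of A^T A
t = 1.0 / L
obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(20)
start = obj(x)
for _ in range(500):
    y = x - t * (A.T @ (A @ x - b))                        # gradient step on f
    x = np.sign(y) * np.maximum(np.abs(y) - t * lam, 0.0)  # soft-threshold (prox on g)

print(start, "->", obj(x))  # the composite objective drops sharply
```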
Projected gradient descent as a special case
When $g$ is the indicator of a convex set $C$, the proximal operator is projection onto $C$. So ISTA becomes projected gradient descent: $x_{k+1} = \Pi_C\left(x_k - t \nabla f(x_k)\right)$. Constrained smooth optimization is a special case of proximal gradient methods.
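For instance, with $C = [0,1]^n$ the projection is a clip, and the proximal iteration becomes projected gradient descent. A sketch, where the quadratic objective is an illustrative assumption:

```python
import numpy as np

project = lambda v: np.clip(v, 0.0, 1.0)   # prox of the indicator of [0,1]^n

c = np.array([1.7, -0.3, 0.4])   # f(x) = 0.5*||x - c||^2, grad f(x) = x - c, L = 1
x = np.zeros(3)
for _ in range(50):
    x = project(x - (x - c))     # gradient step with t = 1/L = 1, then project
print(x)  # the box clips c: [1.  0.  0.4]
```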
Common Confusions
Proximal operator is not projection
The proximal operator for a general $g$ is not a projection onto a set. It is a projection-like operation that balances proximity to the input with minimizing $g$. Projection is the special case where $g$ is an indicator function. For the $\ell_1$ norm, the proximal operator is soft-thresholding, which shrinks coordinates rather than projecting.
FISTA does not always beat ISTA in practice
FISTA has a better worst-case rate, but its oscillatory behavior can make it slower on some problems. Restarting FISTA (resetting momentum when the objective increases) often works better in practice. The theoretical guarantee is about worst-case, not every-case.
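Function-value restarting is easy to add to the FISTA loop: whenever the objective increases, kill the momentum and continue from the last good iterate. A sketch (the separable demo problem and its closed-form answer are illustrative assumptions):

```python
import numpy as np

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_restart(grad_f, prox, L, F, x0, iters=200):
    # FISTA, but momentum resets to zero whenever F goes up
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x = prox(y - grad_f(y) / L, 1.0 / L)
        if F(x) > F(x_prev):           # monotonicity violated: restart
            y, t = x_prev.copy(), 1.0
            continue
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# demo: min 0.5*||x - c||^2 + lam*||x||_1, solution soft(c, lam)
c = np.array([2.0, -0.2, 1.0]); lam = 0.5
F = lambda x: 0.5 * np.sum((x - c) ** 2) + lam * np.sum(np.abs(x))
x = fista_restart(lambda x: x - c, lambda v, t: soft(v, t * lam),
                  1.0, F, np.zeros(3))
print(x)  # [1.5 0.  0.5]
```

The extra objective evaluation per iteration is the price of monotonicity; on well-behaved problems the restart branch rarely fires.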
Summary
- Proximal operator: $\operatorname{prox}_{\lambda g}(v) = \arg\min_x \left( g(x) + \frac{1}{2\lambda}\|x - v\|_2^2 \right)$
- ISTA: gradient step on the smooth part $f$, then proximal step on the nonsmooth part $g$
- ISTA converges at $O(1/k)$; FISTA with momentum at $O(1/k^2)$
- For $\ell_1$ regularization, the proximal operator is soft-thresholding
- The Moreau envelope smooths a nonsmooth function while preserving minimizers
Exercises
Problem
Compute $\operatorname{prox}_{\lambda \|\cdot\|_1}(v)$ for a concrete vector $v$ and threshold $\lambda$ of your choice. Which coordinates become zero?
Problem
Show that the proximal operator of the indicator function $\iota_C$ of a closed convex set $C$ is the Euclidean projection $\Pi_C$.
Problem
Prove that the Moreau envelope $M_{\lambda g}$ is differentiable even when $g$ is not, and show that $\nabla M_{\lambda g}(v) = \frac{1}{\lambda}\left(v - \operatorname{prox}_{\lambda g}(v)\right)$.
References
Canonical:
- Beck & Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems" (2009)
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2
Current:
- Parikh & Boyd, "Proximal Algorithms," Foundations and Trends in Optimization (2014)
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)
Next Topics
The natural next steps from proximal gradient methods:
- Coordinate descent: an alternative for separable penalties
- Stochastic gradient descent: scaling to large datasets
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)