
Numerical Optimization

Proximal Gradient Methods

Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.


Why This Matters

Many important ML objectives have the form "smooth loss plus nonsmooth regularizer." Lasso is the canonical example: squared error (smooth) plus an $\ell_1$ penalty (nonsmooth). You cannot simply take gradients of the full objective because the $\ell_1$ norm is not differentiable everywhere.

Proximal gradient methods solve this cleanly. They handle the smooth and nonsmooth parts separately: a gradient step on the smooth part, then a proximal step on the nonsmooth part. This is the algorithmic backbone of sparse optimization in ML.

Mental Model

Imagine sliding down a smooth hill (gradient descent on the smooth part), then snapping to the nearest point that respects some structural constraint (the proximal step). The proximal operator acts like a "project and shrink" operation. For the $\ell_1$ norm, it is soft-thresholding: it pushes small coordinates exactly to zero, producing sparsity.

Formal Setup and Notation

We want to solve the composite minimization problem:

$$\min_{x \in \mathbb{R}^d} \; F(x) = f(x) + g(x)$$

where $f$ is convex and smooth (has an $L$-Lipschitz continuous gradient) and $g$ is convex but possibly nonsmooth (e.g., $g(x) = \lambda \|x\|_1$).

Definition

Proximal Operator

The proximal operator of a convex function $g$ at a point $x$ is:

$$\mathrm{prox}_g(x) = \arg\min_{y} \left( g(y) + \frac{1}{2}\|y - x\|^2 \right)$$

It finds the point that balances being close to $x$ (the quadratic term) with having a small value of $g$. When $g$ is convex, lower semicontinuous, and proper, the minimizer always exists and is unique, because the objective inside the $\arg\min$ is strongly convex in $y$.

Definition

Moreau Envelope

The Moreau envelope of $g$ with parameter $\gamma > 0$ is:

$$M_g^\gamma(x) = \min_{y} \left( g(y) + \frac{1}{2\gamma}\|y - x\|^2 \right)$$

It is a smooth approximation of $g$. Even when $g$ is nonsmooth, the Moreau envelope $M_g^\gamma$ is differentiable, with gradient $\nabla M_g^\gamma(x) = \frac{1}{\gamma}\left(x - \mathrm{prox}_{\gamma g}(x)\right)$. As $\gamma \to 0$, the Moreau envelope converges to $g$.
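The gradient formula can be checked numerically. A minimal sketch using NumPy, with $g(y) = \|y\|_1$ (whose prox is soft-thresholding); the test point and $\gamma$ are arbitrary illustrative choices:

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise prox of tau * |.|: shrink magnitudes by tau, zero out small entries."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def moreau_envelope(x, gamma):
    """Moreau envelope of g(y) = ||y||_1, evaluated through its prox."""
    y = soft_threshold(x, gamma)    # the minimizer in the envelope's definition
    return np.sum(np.abs(y)) + np.sum((y - x) ** 2) / (2.0 * gamma)

x = np.array([2.0, -0.3, 0.7])
gamma = 0.5

# Gradient from the formula: (x - prox_{gamma g}(x)) / gamma
grad = (x - soft_threshold(x, gamma)) / gamma

# Finite-difference gradient of the envelope for comparison
eps = 1e-6
fd = np.array([(moreau_envelope(x + eps * e, gamma) -
                moreau_envelope(x - eps * e, gamma)) / (2 * eps) for e in np.eye(3)])
print(grad, fd)   # the two gradients should agree to high precision
```

For the $\ell_1$ norm the envelope is exactly the Huber function, which is why the computed gradient is $\pm 1$ on large coordinates and $x_i/\gamma$ on small ones.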

Core Definitions

For the $\ell_1$ norm $g(x) = \lambda \|x\|_1$, the proximal operator is soft-thresholding:

$$[\mathrm{prox}_{\lambda \|\cdot\|_1}(x)]_i = \mathrm{sign}(x_i)\,\max(|x_i| - \lambda, 0)$$

This shrinks each coordinate toward zero and sets coordinates with $|x_i| \leq \lambda$ exactly to zero. This is how proximal methods produce sparse solutions.
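Soft-thresholding is one line of NumPy. A minimal sketch (the function name and test vector are illustrative):

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise prox of tau * ||.||_1: shrink magnitudes by tau, zero out small entries."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Entries with |x_i| <= tau are mapped exactly to zero; the rest shrink by tau.
out = soft_threshold(np.array([3.0, -1.0, 0.5]), 1.0)
print(out)
```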

For the indicator function of a convex set $C$, $g(x) = I_C(x)$ (zero if $x \in C$, infinity otherwise), the proximal operator is the Euclidean projection onto $C$.

The ISTA Algorithm

The Iterative Shrinkage-Thresholding Algorithm (ISTA) repeats:

$$x_{k+1} = \mathrm{prox}_{t_k g}\!\left(x_k - t_k \nabla f(x_k)\right)$$

where $t_k \leq 1/L$ is the step size and $L$ is the Lipschitz constant of $\nabla f$.

Step by step: (1) take a gradient step on the smooth part $f$, producing an intermediate point $z = x_k - t_k \nabla f(x_k)$; (2) apply the proximal operator of $g$ to $z$, handling the nonsmooth part.
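The two steps translate into a short generic loop. A sketch, assuming NumPy and a known step size; `grad_f` and `prox_g` (the prox of $t \cdot g$) are supplied by the caller:

```python
import numpy as np

def ista(grad_f, prox_g, x0, step, n_iters=500):
    """Proximal gradient: gradient step on f, then prox step on g."""
    x = x0.copy()
    for _ in range(n_iters):
        z = x - step * grad_f(x)    # (1) forward step on the smooth part
        x = prox_g(z, step)         # (2) prox of step * g on the nonsmooth part
    return x

# Tiny check: minimize 0.5*(x - 3)^2 + |x|.  Optimality requires x - 3 + sign(x) = 0,
# so the minimizer is x = 2.  The prox of t*|.| is soft-thresholding with threshold t.
grad_f = lambda x: x - 3.0
prox_g = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
x_star = ista(grad_f, prox_g, np.zeros(1), step=1.0)
print(x_star)   # converges to 2.0
```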

The FISTA Algorithm

Fast ISTA (FISTA) adds Nesterov momentum:

Initialize $y_1 = x_0$, $t_1 = 1$, and fix a step size $s \leq 1/L$. Then repeat:

$$x_k = \mathrm{prox}_{s g}\!\left(y_k - s\,\nabla f(y_k)\right)$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$$

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$$

The extrapolation step $y_{k+1}$ looks ahead using momentum, which is what gives the acceleration.
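The three updates fit in a few lines. A sketch assuming NumPy; the function name, toy problem, and iteration count are illustrative:

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, n_iters=200):
    """FISTA: prox-gradient step at an extrapolated point, plus momentum update."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)               # prox-gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0    # momentum schedule
        y = x + ((t - 1.0) / t_next) * (x - x_prev)          # extrapolation
        x_prev, t = x, t_next
    return x_prev

# Tiny check: minimize 0.5 * ||x - b||^2 + ||x||_1.
# The exact solution is soft-thresholding of b with threshold 1.
b = np.array([3.0, 0.2])
grad_f = lambda x: x - b
prox_g = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
x_hat = fista(grad_f, prox_g, np.zeros(2), step=1.0)
print(x_hat)   # [2. 0.]
```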

Main Theorems

Theorem

ISTA Convergence Rate

Statement

Let $F^* = \min_x F(x)$. ISTA with constant step size $t = 1/L$ satisfies:

$$F(x_k) - F^* \leq \frac{L \|x_0 - x^*\|^2}{2k}$$

where $x^*$ is a minimizer of $F$.

Intuition

The convergence rate is $O(1/k)$: after $k$ iterations, the suboptimality shrinks like $1/k$. This matches gradient descent on smooth functions. The proximal step handles nonsmoothness without slowing convergence.

Proof Sketch

Combine the $L$-smoothness of $f$ with convexity of $f$ and $g$ to show that the proximal gradient step guarantees, for all $x$: $F(x_{k+1}) \leq F(x) + \nabla f(x_k)^\top(x_{k+1} - x) + \frac{L}{2}\|x_{k+1} - x_k\|^2 + g(x_{k+1}) - g(x)$. The optimality condition of the proximal step turns this into $F(x_{k+1}) \leq F(x) + \frac{L}{2}\left(\|x - x_k\|^2 - \|x - x_{k+1}\|^2\right)$; set $x = x^*$ and telescope the resulting inequality.

Why It Matters

This shows proximal gradient descent is as fast as gradient descent, even though the objective is nonsmooth. The proximal operator absorbs the nonsmoothness at no cost to the convergence rate.

Failure Mode

The $O(1/k)$ rate is tight for ISTA; you cannot do better without modification. Also, if $L$ is unknown you need a line search or adaptive step size, which adds cost per iteration.
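A standard backtracking rule handles unknown $L$: try a step, and halve it until the candidate point satisfies the quadratic upper bound from the descent lemma. A sketch assuming NumPy; the function name and toy problem are illustrative:

```python
import numpy as np

def prox_grad_step_backtracking(f, grad_f, prox_g, x, t0=1.0, beta=0.5):
    """One proximal gradient step; shrink t until the quadratic upper bound holds."""
    t = t0
    g = grad_f(x)
    while True:
        z = prox_g(x - t * g, t)
        # Sufficient decrease: f(z) <= f(x) + <g, z - x> + ||z - x||^2 / (2t)
        if f(z) <= f(x) + g @ (z - x) + np.sum((z - x) ** 2) / (2.0 * t):
            return z, t
        t *= beta   # step too large: shrink and retry

# Demo on 0.5*(x - 3)^2 + |x|, where L = 1 but we pretend not to know it:
# starting from t0 = 4, the search backtracks to an accepted step t = 1.
b = np.array([3.0])
f = lambda x: 0.5 * np.sum((x - b) ** 2)
grad_f = lambda x: x - b
prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # prox of t * ||.||_1
z, t = prox_grad_step_backtracking(f, grad_f, prox_g, np.zeros(1), t0=4.0)
print(z, t)
```

Each rejected step costs one extra evaluation of $f$ and one prox, which is the per-iteration overhead mentioned above.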

Theorem

FISTA Accelerated Convergence Rate

Statement

FISTA satisfies:

$$F(x_k) - F^* \leq \frac{2L \|x_0 - x^*\|^2}{(k+1)^2}$$

Intuition

Acceleration improves the rate from $O(1/k)$ to $O(1/k^2)$. After 100 iterations, ISTA reduces the gap by a factor of $100$; FISTA reduces it by a factor of $10{,}000$. The momentum term lets the algorithm "look ahead" along the trajectory, extracting more information per step.

Proof Sketch

Define a Lyapunov function $E_k = t_k^2\left(F(x_k) - F^*\right) + \frac{L}{2}\|v_k - x^*\|^2$, where $v_k$ is an auxiliary sequence. Show $E_{k+1} \leq E_k$ by using the descent property and the specific choice of momentum parameters. Since $t_k = \Theta(k)$, this gives the $O(1/k^2)$ rate.

Why It Matters

$O(1/k^2)$ is the optimal rate for first-order methods on convex problems with Lipschitz gradients (Nesterov, 1983). FISTA achieves this optimal rate for composite objectives. The improvement from $O(1/k)$ to $O(1/k^2)$ is substantial in practice.

Failure Mode

FISTA's iterates can oscillate because of the momentum term. It is not a monotone algorithm: $F(x_{k+1})$ can be larger than $F(x_k)$. Monotone variants exist but may sacrifice some speed. Also, for strongly convex $f$, you can get linear convergence with a restarting strategy.

Canonical Examples

Example

Lasso via ISTA

The Lasso problem is $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$. Here $f(x) = \frac{1}{2}\|Ax - b\|^2$ with $\nabla f(x) = A^\top(Ax - b)$ and $L = \|A^\top A\|$ (the largest eigenvalue of $A^\top A$). The proximal step is soft-thresholding. Each ISTA iteration: compute $z = x_k - \frac{1}{L} A^\top(Ax_k - b)$, then $x_{k+1} = \mathrm{sign}(z) \odot \max(|z| - \lambda/L, 0)$.
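The iteration is a few lines of NumPy. A sketch on synthetic data; the problem sizes, seed, $\lambda$, and iteration count are illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]                    # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)

lam = 1.0
L = np.linalg.norm(A.T @ A, 2)                   # spectral norm = Lipschitz constant of grad f

x = np.zeros(20)
for _ in range(500):
    z = x - (A.T @ (A @ x - b)) / L                          # gradient step on 0.5*||Ax - b||^2
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)    # soft-threshold with lam / L

print(np.count_nonzero(x), "nonzero coordinates")   # most coordinates end up exactly zero
```

Note that sparsity comes from the thresholding itself: inactive coordinates are set exactly to zero at every iteration, not merely driven small.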

Example

Projected gradient descent as a special case

When $g = I_C$ is the indicator of a convex set $C$, the proximal operator is projection onto $C$, so ISTA becomes projected gradient descent: $x_{k+1} = \Pi_C(x_k - t \nabla f(x_k))$. Constrained smooth optimization is a special case of proximal gradient methods.
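For a box constraint the projection is just clipping, so the whole method is two NumPy calls per iteration. A minimal sketch with an illustrative target point $c$:

```python
import numpy as np

# Minimize 0.5 * ||x - c||^2 over the box [-1, 1]^2: g is the box indicator,
# so the prox step is Euclidean projection, i.e. coordinate-wise clipping.
c = np.array([2.0, -0.3])
proj_box = lambda z: np.clip(z, -1.0, 1.0)

x = np.zeros(2)
for _ in range(50):
    x = proj_box(x - 1.0 * (x - c))   # gradient step (L = 1), then project

print(x)   # the projection of c onto the box: [1.0, -0.3]
```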

Common Confusions

Watch Out

Proximal operator is not projection

The proximal operator of a general $g$ is not a projection onto a set. It is a projection-like operation that balances proximity to the input with minimizing $g$. Projection is the special case where $g$ is an indicator function. For the $\ell_1$ norm, the proximal operator is soft-thresholding, which shrinks coordinates rather than projecting.

Watch Out

FISTA does not always beat ISTA in practice

FISTA has a better worst-case rate, but its oscillatory behavior can make it slower on some problems. Restarting FISTA (resetting momentum when the objective increases) often works better in practice. The theoretical guarantee is about worst-case, not every-case.
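Function-value restart is a small change to the FISTA update: when the objective increases, reset the momentum counter and redo the step without extrapolation. A sketch assuming NumPy; the function name and toy problem are illustrative:

```python
import numpy as np

def fista_restart(F, grad_f, prox_g, x0, step, n_iters=300):
    """FISTA with function-value adaptive restart."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)
        if F(x) > F(x_prev):                  # objective went up: momentum overshot
            t, y = 1.0, x_prev.copy()         # reset momentum, redo the step from x_prev
            continue
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# Toy composite objective: 0.5 * ||x - b||^2 + ||x||_1, minimized at soft_threshold(b, 1).
b = np.array([3.0, 0.2])
F = lambda x: 0.5 * np.sum((x - b) ** 2) + np.sum(np.abs(x))
grad_f = lambda x: x - b
prox_g = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
x_hat = fista_restart(F, grad_f, prox_g, np.zeros(2), step=1.0)
print(x_hat)   # [2. 0.]
```

On this toy problem the restart never fires; its value shows up on ill-conditioned problems where plain FISTA oscillates.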

Summary

  • Proximal operator: $\mathrm{prox}_g(x) = \arg\min_y \left( g(y) + \tfrac{1}{2}\|y - x\|^2 \right)$
  • ISTA: gradient step on the smooth $f$, then proximal step on the nonsmooth $g$
  • ISTA converges at $O(1/k)$; FISTA with momentum at $O(1/k^2)$
  • For $\ell_1$ regularization, the proximal operator is soft-thresholding
  • The Moreau envelope smooths a nonsmooth function while preserving its minimizers

Exercises

ExerciseCore

Problem

Compute $\mathrm{prox}_{\lambda \|\cdot\|_1}(x)$ for $x = (3, -1, 0.5)$ and $\lambda = 1$. Which coordinates become zero?

ExerciseAdvanced

Problem

Show that the proximal operator of the indicator function $I_C$ of a closed convex set $C$ is the Euclidean projection $\Pi_C$.

ExerciseResearch

Problem

Prove that the Moreau envelope $M_g^\gamma$ is differentiable even when $g$ is not, and show that $\nabla M_g^\gamma(x) = \frac{1}{\gamma}\left(x - \mathrm{prox}_{\gamma g}(x)\right)$.

References

Canonical:

  • Beck & Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems" (2009)
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2

Current:

  • Parikh & Boyd, "Proximal Algorithms," Foundations and Trends in Optimization (2014)
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)

Next Topics

The natural next steps from proximal gradient methods:

  • Coordinate descent: an alternative for separable penalties
  • Stochastic gradient descent: scaling to large datasets

Last reviewed: April 2026
