
Numerical Optimization

Mirror Descent and Frank-Wolfe

Mirror descent generalizes gradient descent via Bregman divergences, recovering multiplicative weights and exponentiated gradient as special cases. Frank-Wolfe replaces projections with linear minimization, making it projection-free.


Why This Matters

Standard gradient descent uses the Euclidean distance to measure how far you move at each step. This is natural for unconstrained optimization in $\mathbb{R}^d$, but it is often the wrong geometry for constrained problems. Optimizing over the simplex (probability distributions), positive definite matrices, or combinatorial polytopes calls for a distance measure that respects the constraint geometry.

Mirror descent replaces Euclidean distance with a Bregman divergence, adapting the algorithm to the problem geometry. Frank-Wolfe goes further: it avoids projection entirely by solving a linear subproblem over the constraint set at each step.

Bregman Divergences

Definition

Bregman Divergence

Given a strictly convex, differentiable function $\phi: \mathcal{X} \to \mathbb{R}$ (the mirror map or distance-generating function), the Bregman divergence is:

$$D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle$$

This measures the gap between $\phi(x)$ and its first-order Taylor approximation at $y$. It is always non-negative by convexity of $\phi$, but it is generally not symmetric: $D_\phi(x, y) \neq D_\phi(y, x)$.

Key examples:

  • $\phi(x) = \frac{1}{2}\|x\|_2^2$: Bregman divergence is $D_\phi(x, y) = \frac{1}{2}\|x - y\|_2^2$ (squared Euclidean distance). Recovers standard gradient descent.
  • $\phi(x) = \sum_i x_i \log x_i$ (negative entropy): Bregman divergence is the KL divergence $D_{\text{KL}}(x \| y) = \sum_i x_i \log(x_i / y_i)$. Natural for the simplex.
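As a quick sanity check, the general formula reproduces both special cases numerically. This is a minimal NumPy sketch; the function names are ours, and the KL identity relies on both points lying on the simplex (so the linear terms cancel):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Euclidean mirror map: phi(x) = 0.5 ||x||_2^2.
sq = lambda x: 0.5 * np.dot(x, x)
grad_sq = lambda x: x

# Negative entropy (on the simplex): phi(x) = sum_i x_i log x_i.
negent = lambda x: np.sum(x * np.log(x))
grad_negent = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(sq, grad_sq, x, y)       # squared Euclidean distance
d_kl = bregman(negent, grad_negent, x, y)   # KL divergence

assert np.isclose(d_euclid, 0.5 * np.dot(x - y, x - y))
assert np.isclose(d_kl, np.sum(x * np.log(x / y)))
```

Both divergences come out non-negative, as convexity guarantees, and neither is symmetric in $x$ and $y$.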

Mirror Descent

Definition

Mirror Descent Update

Given a convex function $f$ to minimize over a convex set $\mathcal{X}$, the mirror descent update with step size $\eta$ is:

$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta \langle \nabla f(x_t), x \rangle + D_\phi(x, x_t) \right\}$$

Equivalently, in the "dual space": compute $\theta_{t+1} = \nabla \phi(x_t) - \eta \nabla f(x_t)$, then "mirror back" via $x_{t+1} = \arg\min_{x \in \mathcal{X}} D_\phi(x, (\nabla \phi)^{-1}(\theta_{t+1}))$.

The update finds the point in X\mathcal{X} that best balances two objectives: move in the direction of the negative gradient (minimize the linear term) while staying close to xtx_t (minimize the Bregman divergence term).

Special cases:

  • Euclidean mirror map ($\phi = \frac{1}{2}\|\cdot\|_2^2$): recovers projected gradient descent.
  • Entropic mirror map ($\phi = \sum_i x_i \log x_i$) over the simplex: recovers the exponentiated gradient / multiplicative weights update $x_{t+1,i} \propto x_{t,i} \exp(-\eta \nabla_i f(x_t))$.
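To make the two special cases concrete, here is one mirror descent step under each mirror map (a NumPy sketch with our own naming; for the entropic map on the simplex, the argmin has the closed multiplicative-weights form):

```python
import numpy as np

def euclidean_md_step(x, grad, eta):
    """Euclidean mirror map: a plain gradient step (project onto X afterwards if constrained)."""
    return x - eta * grad

def entropic_md_step(x, grad, eta):
    """Entropic mirror map on the simplex: exponentiated gradient / multiplicative weights."""
    w = x * np.exp(-eta * grad)
    return w / w.sum()   # normalization is the Bregman projection onto the simplex

x = np.ones(4) / 4                      # uniform point on the simplex
g = np.array([1.0, 0.0, 0.0, 0.0])      # gradient penalizing coordinate 0

x_new = entropic_md_step(x, g, eta=0.5)
assert np.isclose(x_new.sum(), 1.0)     # update stays on the simplex automatically
assert x_new[0] < x[0]                  # the penalized coordinate shrinks multiplicatively
```

Note that the entropic update never needs an explicit projection step: the multiplicative form keeps iterates strictly positive and the normalization keeps them on the simplex.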
Theorem

Mirror Descent Convergence

Statement

After $T$ iterations of mirror descent with step size $\eta = \frac{R\sqrt{2}}{L\sqrt{T}}$, where $R^2 = \max_{x \in \mathcal{X}} D_\phi(x, x_1)$ and $\|\nabla f(x)\|_* \leq L$ for all $x$:

$$f\left(\frac{1}{T}\sum_{t=1}^T x_t\right) - f(x^*) \leq \frac{RL\sqrt{2}}{\sqrt{T}}$$

This is the standard $O(1/\sqrt{T})$ rate for convex Lipschitz optimization.

Intuition

The rate depends on the product $RL$, where $R$ is the "diameter" of $\mathcal{X}$ measured in Bregman divergence and $L$ is the gradient bound in the dual norm. Choosing $\phi$ to match the geometry of the constraint set can make $R$ much smaller than the Euclidean diameter, giving faster effective convergence.

Proof Sketch

The standard mirror descent analysis telescopes the Bregman divergence to the optimum: $D_\phi(x^*, x_{t+1}) \leq D_\phi(x^*, x_t) - \eta(f(x_t) - f(x^*)) + \frac{\eta^2}{2\sigma}\|\nabla f(x_t)\|_*^2$, where $\sigma$ is the strong convexity parameter of $\phi$ with respect to the chosen norm. Sum from $t = 1$ to $T$, use $D_\phi(x^*, x_1) \leq R^2$ and $D_\phi(x^*, x_{T+1}) \geq 0$, then optimize $\eta$.
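Spelling out the summation step (our reconstruction of the standard argument, in the notation above):

```latex
\sum_{t=1}^{T} \big(f(x_t) - f(x^*)\big)
  \;\le\; \frac{D_\phi(x^*, x_1) - D_\phi(x^*, x_{T+1})}{\eta}
        + \frac{\eta}{2\sigma} \sum_{t=1}^{T} \|\nabla f(x_t)\|_*^2
  \;\le\; \frac{R^2}{\eta} + \frac{\eta\, T L^2}{2\sigma}.
```

Minimizing the right-hand side over $\eta$ gives $\eta = \frac{R\sqrt{2\sigma}}{L\sqrt{T}}$ and the bound $RL\sqrt{2T/\sigma}$; dividing by $T$ and applying Jensen's inequality to the averaged iterate recovers the stated rate (with $\sigma = 1$).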

Why It Matters

Mirror descent with the entropic mirror map on the simplex achieves $O(\sqrt{\log d / T})$ regret, compared to $O(\sqrt{d / T})$ for projected gradient descent. The logarithmic dependence on the dimension $d$ is a huge advantage in high-dimensional settings like online learning with exponentially many experts.

Failure Mode

The choice of mirror map $\phi$ is crucial. A poorly chosen $\phi$ gives a Bregman diameter $R$ that is no better than, or even worse than, the Euclidean one. The analysis also requires that the mirror map be strongly convex with respect to the chosen norm, which may be difficult to verify. For non-Lipschitz objectives, the rate degrades.

Frank-Wolfe (Conditional Gradient)

Definition

Frank-Wolfe Update

The Frank-Wolfe (conditional gradient) update for minimizing a smooth convex function $f$ over a compact convex set $\mathcal{X}$:

  1. Solve the linear minimization oracle (LMO): $v_t = \arg\min_{v \in \mathcal{X}} \langle \nabla f(x_t), v \rangle$
  2. Update: $x_{t+1} = x_t + \gamma_t (v_t - x_t)$ where $\gamma_t \in [0, 1]$

The step direction $v_t - x_t$ always points into the feasible set, so no projection is needed.

The LMO is often much cheaper than projection. Over the nuclear norm ball, the LMO requires a single top singular vector computation ($O(d^2)$), while projection requires a full SVD ($O(d^3)$). Over polytopes, the LMO is a linear program.
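The $\ell_1$ ball is the simplest illustration of a cheap LMO: the minimizer of a linear function over it is always a signed standard basis vector. Here is a minimal NumPy sketch (our own naming, and a toy quadratic objective chosen so the optimum lies on the boundary):

```python
import numpy as np

def lmo_l1(grad):
    """LMO over the unit l1 ball: signed basis vector at the largest |grad_i|."""
    i = np.argmax(np.abs(grad))
    v = np.zeros_like(grad)
    v[i] = -np.sign(grad[i])
    return v

def frank_wolfe(grad_f, x0, steps):
    x = x0.copy()
    for t in range(steps):
        v = lmo_l1(grad_f(x))
        gamma = 2.0 / (t + 2.0)        # the standard open-loop step size
        x = x + gamma * (v - x)        # convex combination: stays feasible
    return x

# f(x) = 0.5 ||x - b||^2 with ||b||_1 > 1, so the constraint binds at the optimum.
b = np.array([2.0, -1.0, 0.5, 0.0])
grad_f = lambda x: x - b

x = frank_wolfe(grad_f, np.zeros(4), steps=200)
assert np.abs(x).sum() <= 1.0 + 1e-9   # never left the l1 ball: no projection used
```

Each iterate is a convex combination of the signed basis vectors picked so far, so after $T$ steps it has at most $T$ nonzero coordinates; this is the sparsity property discussed below the convergence theorem.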

Theorem

Frank-Wolfe Convergence Rate

Statement

With step sizes $\gamma_t = 2/(t+2)$, the Frank-Wolfe algorithm satisfies:

$$f(x_T) - f(x^*) \leq \frac{2L \cdot \operatorname{diam}(\mathcal{X})^2}{T + 2}$$

where $L$ is the smoothness constant of $f$ and $\operatorname{diam}(\mathcal{X}) = \max_{x, y \in \mathcal{X}} \|x - y\|$.

Intuition

The $O(1/T)$ rate is slower than the $O(1/T^2)$ of accelerated gradient descent, but each iteration only requires a linear minimization, not a projection. The trade-off is worth it when the LMO is cheap and projection is expensive.

Proof Sketch

By smoothness, $f(x_{t+1}) \leq f(x_t) + \gamma_t \langle \nabla f(x_t), v_t - x_t \rangle + \frac{L \gamma_t^2}{2}\|v_t - x_t\|^2$. Since $v_t$ minimizes the linear term over $\mathcal{X}$, we have $\langle \nabla f(x_t), v_t - x_t \rangle \leq \langle \nabla f(x_t), x^* - x_t \rangle \leq -(f(x_t) - f(x^*))$ by convexity. Substituting and setting $\gamma_t = 2/(t+2)$ gives the result by induction.
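In symbols, writing $h_t = f(x_t) - f(x^*)$ and $D = \operatorname{diam}(\mathcal{X})$, the recursion and induction step read (our reconstruction of the standard argument):

```latex
h_{t+1} \;\le\; (1 - \gamma_t)\, h_t + \frac{L \gamma_t^2 D^2}{2},
\qquad \gamma_t = \frac{2}{t+2}
\;\Longrightarrow\;
h_t \;\le\; \frac{2 L D^2}{t + 2}.
```

The induction closes because $(t+1)(t+3) \leq (t+2)^2$, so the combined bound $\frac{2LD^2(t+1)}{(t+2)^2}$ is at most $\frac{2LD^2}{t+3}$.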

Why It Matters

Frank-Wolfe iterates are sparse: each iterate is a convex combination of at most $T$ vertices of $\mathcal{X}$. This makes it natural for problems where sparse solutions are desirable (e.g., sparse optimization, low-rank matrix completion). The method also never leaves the feasible set, which matters when feasibility has physical meaning.

Failure Mode

The $O(1/T)$ rate cannot be improved in general for Frank-Wolfe without modifications (away steps, pairwise steps). If the optimum lies on the boundary of $\mathcal{X}$ rather than in the interior, convergence slows due to the "zig-zagging" phenomenon. For strongly convex objectives, vanilla Frank-Wolfe does not achieve the linear rate that projected gradient descent achieves.

Common Confusions

Watch Out

Mirror descent is not just projected gradient descent with a different norm

While projected gradient descent in the $\ell_1$ norm exists, it is not the same as mirror descent with the entropic mirror map. Mirror descent changes the geometry of the update step itself (using Bregman divergence instead of squared distance), not just the metric used in the projection. The two coincide only when $\phi = \frac{1}{2}\|\cdot\|_2^2$.

Watch Out

Frank-Wolfe and mirror descent solve different bottlenecks

Mirror descent replaces Euclidean distance with a better geometry but still requires a Bregman projection. Frank-Wolfe eliminates projection entirely by using a linear oracle. If projection is cheap (e.g., onto a ball), use projected gradient descent or mirror descent. If projection is expensive but linear minimization is cheap (e.g., over a polytope or nuclear norm ball), use Frank-Wolfe.

Canonical Examples

Example

Multiplicative weights from mirror descent

On the probability simplex $\Delta_d = \{x \geq 0 : \sum_i x_i = 1\}$, use the negative entropy $\phi(x) = \sum_i x_i \log x_i$. The mirror descent update for a linear loss $f(x) = \langle g_t, x \rangle$ gives:

$$x_{t+1,i} = \frac{x_{t,i} \exp(-\eta g_{t,i})}{\sum_j x_{t,j} \exp(-\eta g_{t,j})}$$

This is the multiplicative weights update, with $O(\sqrt{\log d / T})$ regret. The $\log d$ (instead of $d$) dependence on dimension is the key advantage over Euclidean methods.
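A small simulation of this update on random linear losses, checking that average regret against the best fixed expert shrinks. This is a NumPy sketch; the problem setup (uniform losses in $[0,1]$, the particular step size) is our own choice, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 50, 2000
eta = np.sqrt(2 * np.log(d) / T)   # a standard tuning for losses in [0, 1]

x = np.ones(d) / d                 # start uniform on the simplex
losses = rng.uniform(0.0, 1.0, size=(T, d))
incurred = 0.0

for g in losses:
    incurred += g @ x              # expected loss of the current distribution
    x = x * np.exp(-eta * g)       # multiplicative weights update
    x /= x.sum()

best_fixed = losses.sum(axis=0).min()   # best single expert in hindsight
regret = incurred - best_fixed
print(regret / T)                  # average regret: O(sqrt(log d / T))
```

With $d = 50$ and $T = 2000$ the per-round regret lands well below the worst-case $\sqrt{2 \log d / T} \approx 0.06$ envelope, illustrating the mild $\log d$ dependence.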

Exercises

ExerciseCore

Problem

Verify that the Bregman divergence for $\phi(x) = \frac{1}{2}\|x\|_2^2$ is the squared Euclidean distance $\frac{1}{2}\|x - y\|_2^2$, and that the mirror descent update reduces to the projected gradient descent update.

ExerciseAdvanced

Problem

Consider Frank-Wolfe on the $\ell_1$ ball $\{x : \|x\|_1 \leq 1\}$ in $\mathbb{R}^d$. What does the linear minimization oracle compute? What is the sparsity of the iterate $x_T$ after $T$ steps if $x_0 = 0$?

References

Canonical:

  • Nemirovski & Yudin, Problem Complexity and Method Efficiency in Optimization (1983), Chapter 5
  • Frank & Wolfe, "An algorithm for quadratic programming" (1956)
  • Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Sections 4.2, 4.3

Current:

  • Jaggi, "Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization" (2013)
  • Hazan, Introduction to Online Convex Optimization (2016), Chapter 5

Next Topics

  • Projected gradient descent: the standard projection-based approach that mirror descent generalizes
  • Coordinate descent: another projection-free approach for structured problems

Last reviewed: April 2026
