
Numerical Optimization

Projected Gradient Descent

Constrained convex optimization by alternating gradient steps with projections onto the feasible set. Same convergence rates as unconstrained gradient descent when projections are cheap.


Why This Matters

Many optimization problems in ML have constraints: probability distributions live on the simplex, norms are bounded, parameters must be non-negative. Projected gradient descent (PGD) is the simplest extension of gradient descent to constrained settings. Take a gradient step, then project back onto the feasible set. If the projection is cheap, you get the same convergence rates as unconstrained gradient descent.

Understanding PGD also provides the baseline against which more sophisticated methods (mirror descent, Frank-Wolfe, ADMM) are compared.

The Algorithm

Definition

Projected Gradient Descent

Given a convex set $\mathcal{X} \subseteq \mathbb{R}^d$, a convex function $f: \mathcal{X} \to \mathbb{R}$ with $L$-Lipschitz gradient, and step size $\eta$:

$$x_{t+1} = \Pi_\mathcal{X}(x_t - \eta \nabla f(x_t))$$

where $\Pi_\mathcal{X}(z) = \arg\min_{x \in \mathcal{X}} \|x - z\|_2^2$ is the Euclidean projection onto $\mathcal{X}$.

The algorithm alternates two operations: a gradient step (which may leave the feasible set) and a projection (which maps back to the nearest feasible point).
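These two alternating operations can be sketched in a few lines of NumPy. The function and parameter names here are illustrative, not from the source; the usage example minimizes a simple quadratic over the non-negative orthant, where the projection is a coordinate-wise clamp.

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, eta, num_steps):
    """PGD: a gradient step (which may leave X) followed by projection back onto X."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(num_steps):
        x = project(x - eta * grad_f(x))  # step, then map to the nearest feasible point
    return x

# Usage: minimize f(x) = ||x - c||^2 / 2 over {x : x_i >= 0}.
c = np.array([1.0, -2.0, 0.5])
grad_f = lambda x: x - c
project = lambda z: np.maximum(z, 0.0)  # projection onto the non-negative orthant
x_star = projected_gradient_descent(grad_f, project, np.zeros(3), eta=0.5, num_steps=100)
# x_star approaches (1.0, 0.0, 0.5), the projection of c onto the orthant.
```

Note that the minimizer is exactly the projection of the unconstrained optimum $c$ onto the feasible set, which is what makes this a convenient sanity check.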

Definition

Euclidean Projection

The Euclidean projection of $z$ onto a closed convex set $\mathcal{X}$ is:

$$\Pi_\mathcal{X}(z) = \arg\min_{x \in \mathcal{X}} \frac{1}{2}\|x - z\|_2^2$$

This always exists and is unique for closed convex sets. Key property: $\langle z - \Pi_\mathcal{X}(z), x - \Pi_\mathcal{X}(z) \rangle \leq 0$ for all $x \in \mathcal{X}$ (the projection points "inward").

Projection costs for common sets:

  • Box constraints $[a_i, b_i]$: $O(d)$ by clipping each coordinate.
  • Simplex $\Delta_d$: $O(d \log d)$ by sorting and finding a threshold.
  • $\ell_2$ ball: $O(d)$ by rescaling.
  • Semidefinite cone: $O(d^3)$ by eigendecomposition (expensive).
  • Polytopes (general): solving a QP (can be expensive).
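The first three projections in the list are short enough to write out. This is one possible implementation, with function names of my choosing; the simplex routine is the standard sort-and-threshold method referenced above.

```python
import numpy as np

def project_box(z, lo, hi):
    """Box constraints [lo_i, hi_i]: clip each coordinate -- O(d)."""
    return np.clip(z, lo, hi)

def project_l2_ball(z, radius=1.0):
    """l2 ball of given radius: rescale if outside -- O(d)."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def project_simplex(z):
    """Probability simplex: sort, find the threshold tau, shift and clip -- O(d log d)."""
    u = np.sort(z)[::-1]               # coordinates in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(z) + 1)
    rho = ks[u - (css - 1.0) / ks > 0][-1]   # largest k with a positive residual
    tau = (css[rho - 1] - 1.0) / rho
    return np.maximum(z - tau, 0.0)
```

For example, `project_simplex(np.array([2.0, 0.0]))` returns `(1.0, 0.0)`: the threshold $\tau = 1$ is subtracted and the result is clipped at zero.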

Convergence Theory

Theorem

PGD Convergence for Convex Functions

Statement

With step size $\eta = 1/L$ and initial point $x_0 \in \mathcal{X}$, projected gradient descent satisfies:

$$f\left(\frac{1}{T}\sum_{t=0}^{T-1} x_t\right) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2T}$$

This is the $O(1/T)$ rate for convex smooth optimization.

Intuition

The projection never increases the distance to the optimum (since $x^*$ is already in $\mathcal{X}$). So each gradient step makes progress, and the projection does not undo it. The rate is identical to unconstrained gradient descent.

Proof Sketch

By smoothness, $f(x_{t+1}) \leq f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2$. By the projection property and convexity, $\langle \nabla f(x_t), x_t - x^* \rangle \geq f(x_t) - f(x^*)$. Using the non-expansiveness of projection ($\|\Pi(z_1) - \Pi(z_2)\| \leq \|z_1 - z_2\|$), telescope the squared distances $\|x_t - x^*\|^2$ and sum to get the result.

Why It Matters

This establishes that constraints do not slow down gradient descent, as long as the projection is computationally cheap. The $O(1/T)$ rate matches unconstrained GD, and the per-iteration cost is just one gradient computation plus one projection.

Failure Mode

If projection onto $\mathcal{X}$ is expensive (e.g., $O(d^3)$ for the PSD cone), then each iteration of PGD is dominated by the projection cost. In such cases, Frank-Wolfe (which replaces projection with linear minimization) or ADMM (which handles constraints via an augmented Lagrangian) may be preferable.

Theorem

PGD Linear Convergence for Strongly Convex Functions

Statement

If $f$ is $\mu$-strongly convex and has $L$-Lipschitz gradient, then with step size $\eta = 1/L$:

$$\|x_T - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|x_0 - x^*\|^2$$

The convergence is linear with rate $1 - 1/\kappa$, where $\kappa = L/\mu$ is the condition number.

Intuition

Strong convexity means the function curves upward at least as fast as $\frac{\mu}{2}\|x - x^*\|^2$. This curvature ensures that each projected gradient step contracts the distance to the optimum by a constant factor, giving exponential convergence.

Proof Sketch

The key inequality is $\|x_{t+1} - x^*\|^2 \leq \|x_t - \eta \nabla f(x_t) - x^*\|^2$ (by non-expansiveness of projection, since $x^* \in \mathcal{X}$). Expand and use strong convexity plus smoothness to bound $\langle \nabla f(x_t), x_t - x^* \rangle$ from below. This yields $\|x_{t+1} - x^*\|^2 \leq (1 - \mu/L)\|x_t - x^*\|^2$.

Why It Matters

Linear convergence means that to reach $\epsilon$ accuracy, you need $O(\kappa \log(1/\epsilon))$ iterations. This is exponentially faster than the $O(1/\epsilon)$ rate for plain convex functions. Many ML objectives (regularized problems, strongly convex losses) enjoy this rate.

Failure Mode

The condition number $\kappa = L/\mu$ controls the rate. Ill-conditioned problems (large $\kappa$) converge slowly. Preconditioning or acceleration (Nesterov momentum applied to PGD) can help, reducing the iteration count to $O(\sqrt{\kappa} \log(1/\epsilon))$.
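The contraction in the theorem can be observed numerically. This is a small illustration of my own construction: a separable quadratic with $\mu = 1$ and $L = 10$ on the box $[0,1]^2$, where the constrained optimum is just the clipped unconstrained solution (valid here because $A$ is diagonal, so the problem separates coordinate-wise).

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x on the box [0, 1]^2, with mu = 1, L = 10 (kappa = 10).
A = np.diag([1.0, 10.0])
b = np.array([2.0, 5.0])
L = 10.0

# Diagonal A => the constrained optimum is the clipped unconstrained solution.
x_star = np.clip(np.linalg.solve(A, b), 0.0, 1.0)  # = (1.0, 0.5)

x = np.zeros(2)
errors = []
for _ in range(50):
    x = np.clip(x - (1.0 / L) * (A @ x - b), 0.0, 1.0)  # PGD step with eta = 1/L
    errors.append(np.linalg.norm(x - x_star))
# The theorem guarantees the squared error shrinks by at least (1 - mu/L) = 0.9
# per step; in this run the error is monotonically decreasing and hits zero.
```

Re-running with `A = np.diag([1.0, 100.0])` (so $\kappa = 100$) makes the decay visibly slower, matching the $O(\kappa \log(1/\epsilon))$ iteration count.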

Common Confusions

Watch Out

Projection is not the same as clipping gradients

Gradient clipping truncates the gradient vector to have bounded norm. Projection maps the iterate back to the feasible set. These are different operations with different purposes. Gradient clipping addresses exploding gradients; projection enforces constraints on the solution.
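The distinction is easy to see side by side. In this sketch (values chosen for illustration), clipping rescales the gradient vector, while projection acts on the iterate after the step:

```python
import numpy as np

g = np.array([30.0, 40.0])  # a large gradient, norm 50
x = np.array([0.8, 0.9])    # current iterate, inside the box [0, 1]^2

# Gradient clipping: shrink the *gradient* to norm at most 1; direction preserved.
g_clipped = g / max(1.0, np.linalg.norm(g))  # -> (0.6, 0.8)

# Projection: take the step, then map the *iterate* back onto the box [0, 1]^2.
x_next = np.clip(x - 0.05 * g, 0.0, 1.0)     # raw step lands at (-0.7, -1.1) -> (0.0, 0.0)
```

Clipping changes the update direction's magnitude but ignores the feasible set; projection ignores the gradient's size but guarantees $x_{t+1} \in \mathcal{X}$.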

Watch Out

Non-expansiveness does not mean projection is free

The projection operator is non-expansive ($\|\Pi(x) - \Pi(y)\| \leq \|x - y\|$), which is a mathematical property used in convergence proofs. It says nothing about computational cost. Projection onto the simplex costs $O(d \log d)$; projection onto the PSD cone costs $O(d^3)$.
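The mathematical property itself is easy to spot-check numerically. Here is a quick sanity check, using the $\ell_2$-ball projection on random pairs of points (the seed and dimensions are arbitrary):

```python
import numpy as np

# Spot-check non-expansiveness for the unit l2-ball projection on random pairs.
project = lambda z: z / max(1.0, np.linalg.norm(z))
rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(1000):
    z1, z2 = rng.normal(size=5), rng.normal(size=5)
    ratio = np.linalg.norm(project(z1) - project(z2)) / np.linalg.norm(z1 - z2)
    worst_ratio = max(worst_ratio, ratio)
# worst_ratio never exceeds 1: the projection shrinks (or preserves) distances.
```

A check like this is no substitute for the proof in the exercises below, but it is a useful way to catch a buggy projection routine, since an implementation that expands distances cannot be a Euclidean projection.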

Canonical Examples

Example

Constrained least squares on a box

Minimize $f(x) = \frac{1}{2}\|Ax - b\|^2$ subject to $0 \leq x_i \leq 1$. The gradient is $\nabla f(x) = A^T(Ax - b)$. Each PGD step computes $z = x_t - \eta A^T(Ax_t - b)$ and projects by clipping: $x_{t+1,i} = \min(1, \max(0, z_i))$. With $\eta = 1/\|A\|_2^2$ (the reciprocal of the smoothness constant), this converges at rate $O(1/T)$.
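This worked example translates directly into code. The problem data below is randomly generated for illustration; the convergence check verifies the fixed-point condition $x = \Pi_\mathcal{X}(x - \eta \nabla f(x))$, which characterizes optimality for PGD.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

eta = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, with L = ||A||_2^2 (squared spectral norm)
x = np.zeros(5)
for _ in range(10000):
    z = x - eta * A.T @ (A @ x - b)    # gradient step
    x = np.clip(z, 0.0, 1.0)           # projection onto the box [0, 1]^5

# At the solution, one more projected step leaves x unchanged (fixed point).
residual = np.linalg.norm(x - np.clip(x - eta * A.T @ (A @ x - b), 0.0, 1.0))
```

Since random Gaussian $A$ makes $A^T A$ positive definite with high probability, this instance is actually strongly convex, so the iterates converge linearly rather than at the worst-case $O(1/T)$ rate.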

Exercises

ExerciseCore

Problem

Compute the Euclidean projection of $z = (0.5, 1.5, -0.3)$ onto the non-negative orthant $\mathcal{X} = \{x \in \mathbb{R}^3 : x_i \geq 0\}$. Then compute the projection onto the $\ell_2$ ball $\{x : \|x\| \leq 1\}$.

ExerciseAdvanced

Problem

Prove the non-expansiveness of the Euclidean projection: for any closed convex set $\mathcal{X}$ and any $z_1, z_2 \in \mathbb{R}^d$, $\|\Pi_\mathcal{X}(z_1) - \Pi_\mathcal{X}(z_2)\| \leq \|z_1 - z_2\|$.

References

Canonical:

  • Bertsekas, Nonlinear Programming (1999), Section 2.3
  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.3

Current:

  • Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Section 3.1
  • Beck, First-Order Methods in Optimization (2017), Chapter 9

Last reviewed: April 2026
