
Numerical Optimization

Projected Gradient Descent

Constrained convex optimization by alternating gradient steps with projections onto the feasible set. Same convergence rates as unconstrained gradient descent when projections are cheap.


Why This Matters

Many optimization problems in ML have constraints: probability distributions live on the simplex, norms are bounded, parameters must be non-negative. Projected gradient descent (PGD) is the simplest extension of gradient descent to constrained settings. Take a gradient step, then project back onto the feasible set. If the projection is cheap, you get the same convergence rates as unconstrained gradient descent.

Understanding PGD also provides the baseline against which more sophisticated methods (mirror descent, Frank-Wolfe, ADMM) are compared.

The Algorithm

Definition

Projected Gradient Descent

Given a convex set $\mathcal{X} \subseteq \mathbb{R}^d$, a convex function $f: \mathcal{X} \to \mathbb{R}$ with $L$-Lipschitz gradient, and step size $\eta$:

$$x_{t+1} = \Pi_\mathcal{X}(x_t - \eta \nabla f(x_t))$$

where $\Pi_\mathcal{X}(z) = \arg\min_{x \in \mathcal{X}} \|x - z\|_2^2$ is the Euclidean projection onto $\mathcal{X}$.

The algorithm alternates two operations: a gradient step (which may leave the feasible set) and a projection (which maps back to the nearest feasible point).
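These two alternating operations can be sketched in a few lines of NumPy. The function and parameter names here are illustrative, not from the source; the usage example minimizes a simple quadratic over the non-negative orthant, where the projection is a coordinate-wise clamp.

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, eta, num_steps):
    """PGD: a gradient step (which may leave X) followed by projection back onto X."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(num_steps):
        x = project(x - eta * grad_f(x))  # step, then map to the nearest feasible point
    return x

# Usage: minimize f(x) = ||x - c||^2 / 2 over {x : x_i >= 0}.
c = np.array([1.0, -2.0, 0.5])
grad_f = lambda x: x - c
project = lambda z: np.maximum(z, 0.0)  # projection onto the non-negative orthant
x_star = projected_gradient_descent(grad_f, project, np.zeros(3), eta=0.5, num_steps=100)
# x_star approaches (1.0, 0.0, 0.5), the projection of c onto the orthant.
```

Note that the minimizer is exactly the projection of the unconstrained optimum $c$ onto the feasible set, which is what makes this a convenient sanity check.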

Definition

Euclidean Projection

The Euclidean projection of $z$ onto a closed convex set $\mathcal{X}$ is:

$$\Pi_\mathcal{X}(z) = \arg\min_{x \in \mathcal{X}} \frac{1}{2}\|x - z\|_2^2$$

This always exists and is unique for closed convex sets. Key property: $\langle z - \Pi_\mathcal{X}(z), x - \Pi_\mathcal{X}(z) \rangle \leq 0$ for all $x \in \mathcal{X}$ (the projection points "inward").

Projection costs for common sets:

  • Box constraints $[a_i, b_i]$: $O(d)$ by clipping each coordinate.
  • Simplex $\Delta_d$: $O(d \log d)$ by sorting and finding a threshold.
  • $\ell_2$ ball: $O(d)$ by rescaling.
  • Semidefinite cone: $O(d^3)$ by eigendecomposition (expensive).
  • Polytopes (general): solving a QP (can be expensive).
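The first three projections in the list are short enough to write out. This is one possible implementation, with function names of my choosing; the simplex routine is the standard sort-and-threshold method referenced above.

```python
import numpy as np

def project_box(z, lo, hi):
    """Box constraints [lo_i, hi_i]: clip each coordinate -- O(d)."""
    return np.clip(z, lo, hi)

def project_l2_ball(z, radius=1.0):
    """l2 ball of given radius: rescale if outside -- O(d)."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def project_simplex(z):
    """Probability simplex: sort, find the threshold tau, shift and clip -- O(d log d)."""
    u = np.sort(z)[::-1]               # coordinates in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(z) + 1)
    rho = ks[u - (css - 1.0) / ks > 0][-1]   # largest k with a positive residual
    tau = (css[rho - 1] - 1.0) / rho
    return np.maximum(z - tau, 0.0)
```

For example, `project_simplex(np.array([2.0, 0.0]))` returns `(1.0, 0.0)`: the threshold $\tau = 1$ is subtracted and the result is clipped at zero.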

Convergence Theory

Theorem

PGD Convergence for Convex Functions

Statement

With step size $\eta = 1/L$ and initial point $x_0 \in \mathcal{X}$, projected gradient descent satisfies:

$$f\left(\frac{1}{T}\sum_{t=0}^{T-1} x_t\right) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2T}$$

This is the $O(1/T)$ rate for convex smooth optimization.

Intuition

The projection never increases the distance to the optimum (since $x^*$ is already in $\mathcal{X}$). So each gradient step makes progress, and the projection does not undo it. The rate is identical to unconstrained gradient descent.

Proof Sketch

By smoothness, $f(x_{t+1}) \leq f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2$. By the projection property and convexity, $\langle \nabla f(x_t), x_t - x^* \rangle \geq f(x_t) - f(x^*)$. Using the non-expansiveness of projection ($\|\Pi(z_1) - \Pi(z_2)\| \leq \|z_1 - z_2\|$), telescope the squared distances $\|x_t - x^*\|^2$ and sum to get the result.

Why It Matters

This establishes that constraints do not slow down gradient descent, as long as the projection is computationally cheap. The $O(1/T)$ rate matches unconstrained GD, and the per-iteration cost is just one gradient computation plus one projection.

Failure Mode

If projection onto $\mathcal{X}$ is expensive (e.g., $O(d^3)$ for the PSD cone), then each iteration of PGD is dominated by the projection cost. In such cases, Frank-Wolfe (which replaces projection with linear minimization) or ADMM (which handles constraints via an augmented Lagrangian) may be preferable.

Theorem

PGD Linear Convergence for Strongly Convex Functions

Statement

If $f$ is $\mu$-strongly convex and has $L$-Lipschitz gradient, then with step size $\eta = 1/L$:

$$\|x_T - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|x_0 - x^*\|^2$$

The convergence is linear with rate $1 - 1/\kappa$, where $\kappa = L/\mu$ is the condition number.

Intuition

Strong convexity means the function curves upward at least as fast as $\frac{\mu}{2}\|x - x^*\|^2$. This curvature ensures that each projected gradient step contracts the distance to the optimum by a constant factor, giving exponential convergence.

Proof Sketch

The key inequality is $\|x_{t+1} - x^*\|^2 \leq \|x_t - \eta \nabla f(x_t) - x^*\|^2$ (by non-expansiveness of projection, since $x^* \in \mathcal{X}$). Expand and use strong convexity plus smoothness to bound $\langle \nabla f(x_t), x_t - x^* \rangle$ from below. This yields $\|x_{t+1} - x^*\|^2 \leq (1 - \mu/L)\|x_t - x^*\|^2$.

Why It Matters

Linear convergence means that to reach $\epsilon$ accuracy, you need $O(\kappa \log(1/\epsilon))$ iterations. This is exponentially faster than the $O(1/\epsilon)$ rate for plain convex functions. Many ML objectives (regularized problems, strongly convex losses) enjoy this rate.

Failure Mode

The condition number $\kappa = L/\mu$ controls the rate. Ill-conditioned problems (large $\kappa$) converge slowly. Preconditioning or acceleration (Nesterov momentum applied to PGD) can help, reducing the iteration count to $O(\sqrt{\kappa} \log(1/\epsilon))$.
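The contraction in the theorem can be observed numerically. This is a small illustration of my own construction: a separable quadratic with $\mu = 1$ and $L = 10$ on the box $[0,1]^2$, where the constrained optimum is just the clipped unconstrained solution (valid here because $A$ is diagonal, so the problem separates coordinate-wise).

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x on the box [0, 1]^2, with mu = 1, L = 10 (kappa = 10).
A = np.diag([1.0, 10.0])
b = np.array([2.0, 5.0])
L = 10.0

# Diagonal A => the constrained optimum is the clipped unconstrained solution.
x_star = np.clip(np.linalg.solve(A, b), 0.0, 1.0)  # = (1.0, 0.5)

x = np.zeros(2)
errors = []
for _ in range(50):
    x = np.clip(x - (1.0 / L) * (A @ x - b), 0.0, 1.0)  # PGD step with eta = 1/L
    errors.append(np.linalg.norm(x - x_star))
# The theorem guarantees the squared error shrinks by at least (1 - mu/L) = 0.9
# per step; in this run the error is monotonically decreasing and hits zero.
```

Re-running with `A = np.diag([1.0, 100.0])` (so $\kappa = 100$) makes the decay visibly slower, matching the $O(\kappa \log(1/\epsilon))$ iteration count.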

Common Confusions

Watch Out

Projection is not the same as clipping gradients

Gradient clipping truncates the gradient vector to have bounded norm. Projection maps the iterate back to the feasible set. These are different operations with different purposes. Gradient clipping addresses exploding gradients; projection enforces constraints on the solution.
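The distinction is easy to see side by side. In this sketch (values chosen for illustration), clipping rescales the gradient vector, while projection acts on the iterate after the step:

```python
import numpy as np

g = np.array([30.0, 40.0])  # a large gradient, norm 50
x = np.array([0.8, 0.9])    # current iterate, inside the box [0, 1]^2

# Gradient clipping: shrink the *gradient* to norm at most 1; direction preserved.
g_clipped = g / max(1.0, np.linalg.norm(g))  # -> (0.6, 0.8)

# Projection: take the step, then map the *iterate* back onto the box [0, 1]^2.
x_next = np.clip(x - 0.05 * g, 0.0, 1.0)     # raw step lands at (-0.7, -1.1) -> (0.0, 0.0)
```

Clipping changes the update direction's magnitude but ignores the feasible set; projection ignores the gradient's size but guarantees $x_{t+1} \in \mathcal{X}$.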

Watch Out

Non-expansiveness does not mean projection is free

The projection operator is non-expansive ($\|\Pi(x) - \Pi(y)\| \leq \|x - y\|$), which is a mathematical property used in convergence proofs. It says nothing about computational cost. Projection onto the simplex costs $O(d \log d)$; projection onto the PSD cone costs $O(d^3)$.
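The mathematical property itself is easy to spot-check numerically. Here is a quick sanity check, using the $\ell_2$-ball projection on random pairs of points (the seed and dimensions are arbitrary):

```python
import numpy as np

# Spot-check non-expansiveness for the unit l2-ball projection on random pairs.
project = lambda z: z / max(1.0, np.linalg.norm(z))
rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(1000):
    z1, z2 = rng.normal(size=5), rng.normal(size=5)
    ratio = np.linalg.norm(project(z1) - project(z2)) / np.linalg.norm(z1 - z2)
    worst_ratio = max(worst_ratio, ratio)
# worst_ratio never exceeds 1: the projection shrinks (or preserves) distances.
```

A check like this is no substitute for the proof in the exercises below, but it is a useful way to catch a buggy projection routine, since an implementation that expands distances cannot be a Euclidean projection.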

Canonical Examples

Example

Constrained least squares on a box

Minimize $f(x) = \frac{1}{2}\|Ax - b\|^2$ subject to $0 \leq x_i \leq 1$. The gradient is $\nabla f(x) = A^T(Ax - b)$. Each PGD step computes $z = x_t - \eta A^T(Ax_t - b)$ and projects by clipping: $x_{t+1,i} = \min(1, \max(0, z_i))$. With $\eta = 1/\|A\|_2^2$ (the reciprocal of the smoothness constant), this converges at rate $O(1/T)$.
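This worked example translates directly into code. The problem data below is randomly generated for illustration; the convergence check verifies the fixed-point condition $x = \Pi_\mathcal{X}(x - \eta \nabla f(x))$, which characterizes optimality for PGD.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

eta = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, with L = ||A||_2^2 (squared spectral norm)
x = np.zeros(5)
for _ in range(10000):
    z = x - eta * A.T @ (A @ x - b)    # gradient step
    x = np.clip(z, 0.0, 1.0)           # projection onto the box [0, 1]^5

# At the solution, one more projected step leaves x unchanged (fixed point).
residual = np.linalg.norm(x - np.clip(x - eta * A.T @ (A @ x - b), 0.0, 1.0))
```

Since random Gaussian $A$ makes $A^T A$ positive definite with high probability, this instance is actually strongly convex, so the iterates converge linearly rather than at the worst-case $O(1/T)$ rate.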

Exercises

ExerciseCore

Problem

Compute the Euclidean projection of $z = (0.5, 1.5, -0.3)$ onto the non-negative orthant $\mathcal{X} = \{x \in \mathbb{R}^3 : x_i \geq 0\}$. Then compute the projection onto the $\ell_2$ ball $\{x : \|x\| \leq 1\}$.

ExerciseAdvanced

Problem

Prove the non-expansiveness of the Euclidean projection: for any closed convex set $\mathcal{X}$ and any $z_1, z_2 \in \mathbb{R}^d$, $\|\Pi_\mathcal{X}(z_1) - \Pi_\mathcal{X}(z_2)\| \leq \|z_1 - z_2\|$.

References

Canonical:

  • Bertsekas, Nonlinear Programming (1999), Section 2.3
  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.3

Current:

  • Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Section 3.1
  • Beck, First-Order Methods in Optimization (2017), Chapter 9

Last reviewed: April 2026
