Numerical Optimization
Projected Gradient Descent
Constrained convex optimization by alternating gradient steps with projections onto the feasible set. Same convergence rates as unconstrained gradient descent when projections are cheap.
Why This Matters
Many optimization problems in ML have constraints: probability distributions live on the simplex, norms are bounded, parameters must be non-negative. Projected gradient descent (PGD) is the simplest extension of gradient descent to constrained settings. Take a gradient step, then project back onto the feasible set. If the projection is cheap, you get the same convergence rates as unconstrained gradient descent.
Understanding PGD also provides the baseline against which more sophisticated methods (mirror descent, Frank-Wolfe, ADMM) are compared.
The Algorithm
Projected Gradient Descent
Given a closed convex set $C \subseteq \mathbb{R}^n$, a convex function $f$ with $L$-Lipschitz gradient, and step size $\eta > 0$:

$$x_{t+1} = \Pi_C\big(x_t - \eta \nabla f(x_t)\big)$$

where $\Pi_C$ is the Euclidean projection onto $C$.
The algorithm alternates two operations: a gradient step (which may leave the feasible set) and a projection (which maps back to the nearest feasible point).
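These two alternating operations can be sketched in a few lines of NumPy. The function names (`projected_gradient_descent`, `grad`, `project`) are illustrative, not from any library:

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, step_size, num_iters):
    """Alternate a gradient step with a projection back onto the feasible set."""
    x = x0
    for _ in range(num_iters):
        x = project(x - step_size * grad(x))  # the step may leave C; projection maps back
    return x

# Example: minimize f(x) = ||x - c||^2 over the unit l2 ball.
# Projection onto the ball is just rescaling.
c = np.array([3.0, 4.0])
grad = lambda x: 2.0 * (x - c)
project = lambda y: y / max(1.0, np.linalg.norm(y))
x_T = projected_gradient_descent(grad, project, np.zeros(2), 0.25, 100)
# converges to c rescaled to the ball boundary: [0.6, 0.8]
```

Here the unconstrained minimizer $c$ lies outside the ball, so the iterates converge to its projection onto the boundary.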
Euclidean Projection
The Euclidean projection of a point $y$ onto a closed convex set $C$ is:

$$\Pi_C(y) = \arg\min_{x \in C} \|x - y\|_2$$

This always exists and is unique for closed convex sets. Key property: $\langle y - \Pi_C(y),\, x - \Pi_C(y) \rangle \le 0$ for all $x \in C$ (the projection points "inward").
Projection costs for common sets:
- Box constraints $[a, b]^n$: $O(n)$ by clipping each coordinate.
- Simplex $\Delta_n = \{x \ge 0 : \sum_i x_i = 1\}$: $O(n \log n)$ by sorting and finding a threshold.
- $\ell_2$ ball $\{x : \|x\|_2 \le r\}$: $O(n)$ by rescaling.
- Semidefinite cone: $O(n^3)$ by eigendecomposition (expensive).
- Polytopes (general): solving a QP (can be expensive).
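The sort-and-threshold simplex projection mentioned above can be sketched as follows; this is the standard $O(n \log n)$ algorithm, with `project_simplex` as an illustrative name:

```python
import numpy as np

def project_simplex(y):
    """Project y onto the probability simplex {x >= 0, sum(x) = 1}.

    Sort descending, find the threshold tau, then shift and clip: O(n log n).
    """
    u = np.sort(y)[::-1]                                  # sorted descending
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]    # last index where condition holds
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(y - tau, 0.0)

p = project_simplex(np.array([0.5, 1.2, -0.3]))
# p = [0.15, 0.85, 0.0]: non-negative and sums to 1
```

Negative coordinates are zeroed out and the remaining mass is shifted so the result sums to one.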
Convergence Theory
PGD Convergence for Convex Functions
Statement
With step size $\eta = 1/L$ and initial point $x_0 \in C$, projected gradient descent satisfies:

$$f(x_T) - f(x^*) \le \frac{L \|x_0 - x^*\|^2}{2T}$$

This is the $O(1/T)$ rate for convex smooth optimization.
Intuition
The projection never increases the distance to the optimum (since $x^*$ is already in $C$). So each gradient step makes progress, and the projection does not undo it. The rate is identical to unconstrained gradient descent.
Proof Sketch
By smoothness, $f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2$. By the projection property and convexity, $f(x_{t+1}) - f(x^*) \le \frac{1}{2\eta}\big(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\big)$. Using the non-expansiveness of projection ($\|\Pi_C(y) - \Pi_C(z)\| \le \|y - z\|$), telescope the squared distances and sum over $t$ to get the result.
Why It Matters
This establishes that constraints do not slow down gradient descent, as long as the projection is computationally cheap. The rate matches unconstrained GD, and the per-iteration cost is just one gradient computation plus one projection.
Failure Mode
If projection onto is expensive (e.g., for the PSD cone), then each iteration of PGD is dominated by the projection cost. In such cases, Frank-Wolfe (which replaces projection with linear minimization) or ADMM (which handles constraints via augmented Lagrangian) may be preferable.
PGD Linear Convergence for Strongly Convex Functions
Statement
If $f$ is $\mu$-strongly convex and has $L$-Lipschitz gradient, then with step size $\eta = 1/L$:

$$\|x_{t+1} - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2$$

The convergence is linear with rate $1 - 1/\kappa$, where $\kappa = L/\mu$ is the condition number.
Intuition
Strong convexity means the function curves upward at least as fast as the quadratic $\frac{\mu}{2}\|x - x^*\|^2$. This curvature ensures that each projected gradient step contracts the distance to the optimum by a constant factor, giving exponential convergence.
Proof Sketch
The key inequality is $\|x_{t+1} - x^*\| \le \|(x_t - \eta \nabla f(x_t)) - (x^* - \eta \nabla f(x^*))\|$ (by non-expansiveness of projection, since $x^* = \Pi_C(x^* - \eta \nabla f(x^*))$ is a fixed point). Expand the square and use strong convexity plus smoothness to bound $\langle \nabla f(x_t) - \nabla f(x^*),\, x_t - x^* \rangle$ from below. This yields $\|x_{t+1} - x^*\|^2 \le (1 - \mu/L)\|x_t - x^*\|^2$.
Why It Matters
Linear convergence means that to reach accuracy $\epsilon$, you need $O(\kappa \log(1/\epsilon))$ iterations. This is exponentially faster than the $O(1/\epsilon)$ iterations required at the $O(1/T)$ rate for plain convex functions. Many ML objectives (regularized problems, strongly convex losses) enjoy this rate.
Failure Mode
The condition number $\kappa = L/\mu$ controls the rate. Ill-conditioned problems (large $\kappa$) converge slowly. Preconditioning or acceleration (Nesterov momentum applied to PGD) can help, reducing the iteration count to $O(\sqrt{\kappa} \log(1/\epsilon))$.
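The linear rate can be seen numerically. Below is a sketch on an assumed two-dimensional quadratic with $\mu = 1$, $L = 10$ (so $\kappa = 10$), constrained to a box that excludes the unconstrained minimizer; the specific matrix and bounds are illustrative:

```python
import numpy as np

# f(x) = 0.5 * x^T Q x with eigenvalues mu = 1 and L = 10 (kappa = 10).
Q = np.diag([1.0, 10.0])
lo, hi = np.array([0.5, -1.0]), np.array([2.0, 1.0])  # box excludes the origin
project = lambda y: np.clip(y, lo, hi)
x_star = np.array([0.5, 0.0])                          # constrained optimum

x = np.array([2.0, 1.0])
eta = 1.0 / 10.0                                       # step size 1/L
errs = []
for _ in range(50):
    x = project(x - eta * Q @ x)
    errs.append(np.linalg.norm(x - x_star))
# each step shrinks the error by a factor of at most (1 - mu/L) = 0.9,
# so 50 iterations drive the error essentially to zero
```

The error decays geometrically until the active box constraint is identified, after which the iterate lands exactly on the constrained optimum.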
Common Confusions
Projection is not the same as clipping gradients
Gradient clipping truncates the gradient vector to have bounded norm. Projection maps the iterate back to the feasible set. These are different operations with different purposes. Gradient clipping addresses exploding gradients; projection enforces constraints on the solution.
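A small numerical sketch of the difference, using the unit $\ell_2$ ball as the feasible set (the specific numbers are illustrative):

```python
import numpy as np

# Gradient clipping bounds the *step*; projection constrains the *iterate*.
x = np.array([0.9, 0.9])
g = np.array([30.0, -40.0])             # a large gradient (norm 50)

# Clipping: rescale the gradient to norm <= 1, then step. The result
# can still leave the feasible set.
g_clipped = g / max(1.0, np.linalg.norm(g))
x_clip = x - 0.1 * g_clipped

# Projection: take the full gradient step, then map back onto the unit ball.
y = x - 0.1 * g
x_proj = y / max(1.0, np.linalg.norm(y))

# x_proj is guaranteed feasible (||x_proj|| <= 1); x_clip is not.
```

With these numbers, the clipped iterate ends up outside the unit ball while the projected iterate is feasible by construction.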
Non-expansiveness does not mean projection is free
The projection operator is non-expansive ($\|\Pi_C(y) - \Pi_C(z)\| \le \|y - z\|$), which is a mathematical property used in convergence proofs. It says nothing about computational cost. Projection onto a simplex costs $O(n \log n)$; projection onto a PSD cone costs $O(n^3)$.
Canonical Examples
Constrained least squares on a box
Minimize $f(x) = \frac{1}{2}\|Ax - b\|^2$ subject to $l \le x \le u$. The gradient is $\nabla f(x) = A^\top(Ax - b)$. Each PGD step computes $y = x_t - \eta A^\top(Ax_t - b)$ and projects by clipping each coordinate: $x_{t+1} = \min(\max(y, l), u)$. With $\eta = 1/\lambda_{\max}(A^\top A)$ (the reciprocal of the smoothness constant), this converges at rate $O(1/T)$.
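A runnable sketch of this example with assumed random data (the dimensions, seed, and bounds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lo, hi = -0.5, 0.5                        # box constraints l <= x <= u

L = np.linalg.norm(A, 2) ** 2             # smoothness constant: lambda_max(A^T A)
eta = 1.0 / L

x = np.zeros(5)
for _ in range(500):
    x = np.clip(x - eta * A.T @ (A @ x - b), lo, hi)   # gradient step, then clip

objective = 0.5 * np.linalg.norm(A @ x - b) ** 2
# x stays in the box and the objective is lower than at the starting point
```

Note that `np.linalg.norm(A, 2)` is the spectral norm, so `L` is the largest eigenvalue of $A^\top A$.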
Exercises
Problem
Compute the Euclidean projection of a point $y \in \mathbb{R}^n$ onto the non-negative orthant $\{x : x \ge 0\}$. Then compute the projection onto the $\ell_2$ ball $\{x : \|x\|_2 \le 1\}$.
Problem
Prove the non-expansiveness of Euclidean projection: for any closed convex set $C$ and any $y, z \in \mathbb{R}^n$, $\|\Pi_C(y) - \Pi_C(z)\| \le \|y - z\|$.
References
Canonical:
- Bertsekas, Nonlinear Programming (1999), Section 2.3
- Boyd & Vandenberghe, Convex Optimization (2004), Section 9.3
Current:
- Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Section 3.1
- Beck, First-Order Methods in Optimization (2017), Chapter 9
Next Topics
- Mirror descent and Frank-Wolfe: generalizations for when Euclidean geometry or projections are suboptimal
- Augmented Lagrangian and ADMM: alternative approaches to constrained optimization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)