Subgradients and Subdifferentials
The non-smooth generalization of the gradient for convex functions. Subgradients enable optimality conditions, calculus rules, and convergence guarantees for L1-regularized problems, hinge loss SVMs, and proximal algorithms where the objective is not differentiable.
Why This Matters
Most modern ML objectives are not smooth. The L1 penalty in lasso regression is not differentiable at zero. The hinge loss in support vector machines kinks at the margin. The ReLU activation has no derivative at zero. The total-variation regularizer has gradient discontinuities at every level set.
Standard gradient-based optimization assumes the gradient exists, which is exactly what fails on these problems. The right replacement is the subgradient, which generalizes the gradient to convex but non-smooth functions. Every convex function has a non-empty subdifferential at every interior point of its domain, even where the gradient does not exist. This is what makes proximal gradient methods, the lasso, hinge-loss SVMs, and ADMM work as theory rather than heuristics.
This page is the foundational reference. Convex-optimization-basics covers the smooth case; this page covers the non-smooth case that nearly every sparsity-inducing or regularized ML method actually uses.
Definition
Subgradient
Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be convex. A vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x$ if

$$f(y) \ge f(x) + g^\top (y - x) \quad \text{for all } y.$$

The set of all subgradients of $f$ at $x$ is the subdifferential:

$$\partial f(x) = \{\, g \in \mathbb{R}^n : f(y) \ge f(x) + g^\top (y - x) \text{ for all } y \,\}.$$
A subgradient is any vector $g$ such that the affine function $y \mapsto f(x) + g^\top (y - x)$ stays below $f$ everywhere and touches it at $x$. This is the convex-analytic generalization of the tangent line: at a point where a convex function is smooth, only one such affine support exists (the tangent), and $\partial f(x) = \{\nabla f(x)\}$. At a kink, an entire family of supporting affine functions exists, and $\partial f(x)$ is the set of their slopes.
Examples
Absolute value $f(x) = |x|$ on $\mathbb{R}$.

$$\partial |x| = \begin{cases} \{-1\} & x < 0, \\ [-1, 1] & x = 0, \\ \{+1\} & x > 0. \end{cases}$$

Away from zero, $f$ is differentiable and the subdifferential is the singleton containing the derivative. At zero, every slope $g \in [-1, 1]$ gives a supporting line $y \mapsto g y$ that stays below $|x|$. The subdifferential at the kink is the closed interval $[-1, 1]$ of all such slopes.
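The supporting-line picture can be checked numerically. The snippet below is an illustrative sketch (the helper `is_subgradient` is our own name, not from the text): it tests the subgradient inequality for $|x|$ against a grid of points.

```python
import numpy as np

def is_subgradient(f, g, x, ys, tol=1e-12):
    """Check the subgradient inequality f(y) >= f(x) + g*(y - x) on test points ys."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in ys)

ys = np.linspace(-2.0, 2.0, 401)

# At x = 0, every slope in [-1, 1] supports |x| from below...
assert all(is_subgradient(abs, g, 0.0, ys) for g in (-1.0, -0.5, 0.0, 0.5, 1.0))
# ...while slopes outside [-1, 1] cut through the graph.
assert not is_subgradient(abs, 1.5, 0.0, ys)
# Away from zero the only subgradient is the derivative, sign(x).
assert is_subgradient(abs, 1.0, 0.7, ys) and not is_subgradient(abs, 0.9, 0.7, ys)
```

The check is only over sampled points, so it can certify failure but only suggest membership; the exact statement is the interval computed above.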
$\ell_1$ norm $f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|$ on $\mathbb{R}^n$.

$$\partial \|x\|_1 = \{\, g \in \mathbb{R}^n : g_i = \operatorname{sign}(x_i) \text{ if } x_i \ne 0,\ g_i \in [-1, 1] \text{ if } x_i = 0 \,\}.$$

Coordinate-wise: each $g_i$ is the sign of $x_i$ at non-zero coordinates and any value in $[-1, 1]$ at zero coordinates. This is the subdifferential that drives soft-thresholding in proximal methods.
Hinge loss $f(z) = \max(0,\, 1 - z)$.

$$\partial f(z) = \begin{cases} \{-1\} & z < 1, \\ [-1, 0] & z = 1, \\ \{0\} & z > 1. \end{cases}$$

The slope is $-1$ in the loss region, $0$ in the no-loss region, and the entire interval $[-1, 0]$ at the kink. The kink at $z = 1$ is exactly the SVM margin.
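As a sketch, one valid subgradient selection for the hinge loss can be written as follows; picking $0$ at the kink is a convention of ours here (any value in $[-1, 0]$ would be equally valid):

```python
def hinge_subgradient(z):
    """Return one subgradient of z -> max(0, 1 - z).

    Any value in [-1, 0] is valid at the kink z = 1; we choose 0 there.
    """
    if z < 1.0:
        return -1.0   # loss region: slope of 1 - z
    return 0.0        # no-loss region, and our choice at the kink

# For a labeled example (x, y) with margin z = y * <w, x>, the chain rule through
# the affine map gives hinge_subgradient(z) * y * x as a subgradient w.r.t. w.
```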
Indicator function $\iota_C$ of a closed convex set $C$.

$$\partial \iota_C(x) = N_C(x) = \{\, g : g^\top (y - x) \le 0 \text{ for all } y \in C \,\}.$$

This is the normal cone to $C$ at $x$: the set of outward-pointing directions that do not enter $C$. At interior points, $N_C(x) = \{0\}$; on the boundary, the normal cone is non-trivial. This is how convex constraints enter optimality conditions.
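For a concrete instance, the normal cone of the box $C = [0, 1]^n$ decomposes coordinate-wise. The membership test below is an illustrative helper of our own, not a library routine:

```python
def in_normal_cone_box(g, x, tol=1e-12):
    """Check g in N_C(x) for the box C = [0, 1]^n, coordinate by coordinate:
    interior coordinate -> {0}; lower face -> (-inf, 0]; upper face -> [0, inf)."""
    ok = True
    for gi, xi in zip(g, x):
        if tol < xi < 1.0 - tol:      # interior coordinate: only 0 allowed
            ok = ok and abs(gi) <= tol
        elif xi <= tol:               # lower face: outward direction is negative
            ok = ok and gi <= tol
        else:                         # upper face: outward direction is positive
            ok = ok and gi >= -tol
    return ok
```

At a corner such as $x = (0, 0)$ the cone is the whole negative orthant; at interior points only $g = 0$ passes, matching $N_C(x) = \{0\}$.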
Smooth convex $f$: $\partial f(x) = \{\nabla f(x)\}$ at every point of differentiability.
Existence
Existence of Subgradients
Statement
For any proper convex $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and any point $x$ in the relative interior of $\operatorname{dom} f$, the subdifferential $\partial f(x)$ is non-empty, closed, convex, and bounded.
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.
If $x$ is on the boundary of $\operatorname{dom} f$ or $f$ is not lower semicontinuous at $x$, the subdifferential may be empty.
Intuition
Geometrically: a convex function has supporting hyperplanes to its epigraph at every interior point of its domain (this is the supporting hyperplane theorem applied to the epigraph). Each supporting hyperplane has a normal vector whose first $n$ coordinates form a subgradient. The set of such normals forms a closed convex bounded set, which is exactly the subdifferential.
Proof Sketch
Apply the supporting hyperplane theorem to the epigraph $\operatorname{epi} f$ at the boundary point $(x, f(x))$. The supporting hyperplane has normal $(g, -\beta)$ for some $\beta \ge 0$ (the sign of $\beta$ is forced because the epigraph extends upward in the last coordinate); on the interior of the domain one can normalize $\beta = 1$. The supporting condition becomes $f(y) \ge f(x) + g^\top (y - x)$ for all $y$, which is exactly the subgradient inequality. Closure and convexity of $\partial f(x)$ follow because it is an intersection of half-spaces, one per $y$. Boundedness on the interior follows from $f$ being locally Lipschitz on the relative interior of its domain.
Why It Matters
Existence on the interior is what licenses every algorithm that picks "any subgradient" at each step. Without existence you could not even define a subgradient method, let alone analyze it. The boundary caveat matters for indicator functions of constraints, where the subdifferential at constraint boundaries is the normal cone (which can be unbounded).
Failure Mode
The subdifferential is empty at points outside $\operatorname{dom} f$, and can be empty at boundary points where $f$ jumps from finite to $+\infty$. Example: $f(x) = -\sqrt{x}$ for $x \ge 0$, $f(x) = +\infty$ for $x < 0$. At $x = 0$, every "subgradient" $g$ would need $-\sqrt{y} \ge g\,y$ for all $y > 0$, which forces $g \le -1/\sqrt{y} \to -\infty$ as $y \to 0^+$. No finite $g$ works, so $\partial f(0) = \emptyset$.
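A quick numerical illustration of why no finite slope works, assuming the $f(x) = -\sqrt{x}$ construction above:

```python
import math

# Any subgradient g of f(x) = -sqrt(x) at 0 must satisfy -sqrt(y) >= g*y for y > 0,
# i.e. g <= -1/sqrt(y). The required bound diverges as y -> 0+.
bounds = {y: -1.0 / math.sqrt(y) for y in (1.0, 1e-2, 1e-4, 1e-6)}
for y, b in bounds.items():
    print(f"y = {y:g}: need g <= {b:g}")
# The bounds (-1, -10, -100, -1000, ...) are unbounded below, so no finite g qualifies.
```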
Optimality Condition
The single most important consequence of the subdifferential framework: it gives a clean optimality condition for non-smooth convex minimization.
Subdifferential Optimality Condition
Statement
$x^\star$ is a global minimizer of $f$ if and only if

$$0 \in \partial f(x^\star).$$

This is the non-smooth analogue of $\nabla f(x^\star) = 0$ for smooth convex functions.
Intuition
The subgradient inequality with $g = 0$ becomes $f(y) \ge f(x^\star)$ for all $y$, which is exactly the definition of $x^\star$ being a global minimizer. Conversely, if $x^\star$ is a global minimizer, the zero vector satisfies the defining inequality, so $0 \in \partial f(x^\star)$.
Proof Sketch
If $0 \in \partial f(x^\star)$: by the subgradient inequality at $x^\star$, $f(y) \ge f(x^\star) + 0^\top (y - x^\star) = f(x^\star)$ for all $y$. So $x^\star$ is a global minimizer.
If $x^\star$ is a global minimizer: take $g = 0$. Then $f(y) \ge f(x^\star) = f(x^\star) + g^\top (y - x^\star)$ holds for all $y$, so $0 \in \partial f(x^\star)$.
Why It Matters
This is the optimality condition used to derive the soft-thresholding operator (the proximal operator of the $\ell_1$ norm), to prove KKT conditions for constrained problems, and to characterize lasso solutions. For lasso $\min_\beta \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$, the optimality condition becomes $X^\top (y - X\beta) \in \lambda\, \partial \|\beta\|_1$, which gives the explicit characterization that $|x_j^\top (y - X\beta)| \le \lambda$ for zero coordinates of $\beta$ and equality with the right sign, $x_j^\top (y - X\beta) = \lambda \operatorname{sign}(\beta_j)$, for non-zero coordinates (here $x_j$ is the $j$-th column of $X$).
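This characterization is checkable numerically. The sketch below (the helper name `lasso_kkt_violation` and the random data are our own) measures how far a candidate $\beta$ is from satisfying the condition; it also verifies the well-known fact that $\beta = 0$ is optimal exactly when $\lambda \ge \|X^\top y\|_\infty$:

```python
import numpy as np

def lasso_kkt_violation(X, y, beta, lam, tol=1e-10):
    """Max violation of the lasso optimality condition
    X^T (y - X beta) in lam * subdiff(||beta||_1)."""
    c = X.T @ (y - X @ beta)               # correlation of residual with each column
    viol = 0.0
    for j, bj in enumerate(beta):
        if abs(bj) > tol:                  # active coordinate: c_j = lam * sign(beta_j)
            viol = max(viol, abs(c[j] - lam * np.sign(bj)))
        else:                              # zero coordinate: |c_j| <= lam
            viol = max(viol, max(0.0, abs(c[j]) - lam))
    return viol

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.normal(size=20)
lam_max = np.abs(X.T @ y).max()
assert lasso_kkt_violation(X, y, np.zeros(5), lam_max) == 0.0   # beta = 0 is optimal
assert lasso_kkt_violation(X, y, np.zeros(5), 0.5 * lam_max) > 0.0
```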
Failure Mode
This optimality condition characterizes global minimizers; for non-convex $f$, $0 \in \partial f(x)$ (with $\partial$ generalized to the Clarke subdifferential) is necessary but not sufficient for local minimality, and saddle points satisfy it. The full strength of "iff global minimum" requires convexity.
Subgradient Calculus
The subdifferential satisfies calculus rules analogous to gradient rules, with a few caveats around equality versus containment.
Sum rule. For convex $f, g$: $\partial f(x) + \partial g(x) \subseteq \partial (f + g)(x)$, with equality if $\operatorname{relint}(\operatorname{dom} f) \cap \operatorname{relint}(\operatorname{dom} g) \ne \emptyset$.
Scalar multiplication. For $\alpha > 0$: $\partial (\alpha f)(x) = \alpha\, \partial f(x)$.
Affine pre-composition. For convex $f$ and $h(x) = f(Ax + b)$: $\partial h(x) = A^\top \partial f(Ax + b)$.
Pointwise maximum. For convex $f_1, \dots, f_m$ and $f(x) = \max_i f_i(x)$:

$$\partial f(x) = \operatorname{conv} \bigcup_{i \in I(x)} \partial f_i(x), \qquad I(x) = \{\, i : f_i(x) = f(x) \,\},$$

where $I(x)$ is the set of "active" functions at $x$.
The maximum rule is what gives the subdifferential of the hinge loss and the dual norm, and it is the source of the convex hull appearing in many non-smooth optimality conditions.
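The maximum rule can be sanity-checked on a toy example of our own choosing, $f(x) = \max(x_1, x_2)$: at a point where both pieces are active, every convex combination of the two gradients should satisfy the subgradient inequality.

```python
import numpy as np

# f(x) = max(a1.x, a2.x) with a1 = e1, a2 = e2. At x0 = (0.5, 0.5) both pieces
# are active, so theta*a1 + (1 - theta)*a2 should be a subgradient for all theta in [0, 1].
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f = lambda x: max(a1 @ x, a2 @ x)
x0 = np.array([0.5, 0.5])

ys = np.random.default_rng(1).normal(size=(500, 2))
for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    g = theta * a1 + (1 - theta) * a2
    assert all(f(y) >= f(x0) + g @ (y - x0) - 1e-12 for y in ys)
```

The assertion is just $\max(y_1, y_2) \ge \theta y_1 + (1 - \theta) y_2$, the fact that a maximum dominates every convex combination.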
The Subgradient Method
The simplest non-smooth optimization algorithm replaces the gradient in gradient descent with any subgradient.
Algorithm. Choose step sizes $\alpha_k > 0$. At iteration $k$:
- Pick any $g_k \in \partial f(x_k)$.
- Update $x_{k+1} = x_k - \alpha_k g_k$.
Unlike gradient descent on smooth functions, the subgradient method is not a descent method: $f(x_{k+1})$ can exceed $f(x_k)$ even with optimal step sizes. The standard guarantee is on the best iterate.
Convergence. For convex $f$ with subgradients of bounded norm $\|g_k\|_2 \le G$, initial distance $\|x_0 - x^\star\|_2 \le R$, and step sizes $\alpha_k = \tfrac{R}{G\sqrt{K}}$:

$$\min_{k \le K} f(x_k) - f^\star \le \frac{RG}{\sqrt{K}} = O(1/\sqrt{K}).$$
This $O(1/\sqrt{K})$ rate is provably slower than the $O(1/K)$ rate of gradient descent on smooth convex functions, and the gap cannot be closed without exploiting structure. The slower rate is what motivates proximal methods: when the non-smooth term is a "simple" function (like the $\ell_1$ norm) whose proximal operator has a closed form, proximal-gradient methods recover the $O(1/K)$ rate by handling the smooth part with gradient descent and the non-smooth part exactly.
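A minimal sketch of the method on $f(x) = \|x - c\|_1$ (the target $c$ and the diminishing step schedule $\alpha_k = 1/\sqrt{k + 1}$ are our own illustrative choices), using $\operatorname{sign}(x - c)$ as the subgradient oracle:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps):
    """Run x_{k+1} = x_k - alpha_k * g_k and track the best objective seen.
    Not a descent method: individual f(x_k) values can go up between iterations."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    best_f = history[0]
    for alpha in steps:
        x = x - alpha * subgrad(x)
        history.append(f(x))
        best_f = min(best_f, history[-1])
    return best_f, history

c = np.array([1.0, -2.0, 3.0])                 # minimizer of f, with f(c) = 0
f = lambda x: float(np.abs(x - c).sum())
subgrad = lambda x: np.sign(x - c)             # one valid subgradient (0 at kinks)
steps = [1.0 / np.sqrt(k + 1) for k in range(2000)]
best_f, hist = subgradient_method(f, subgrad, np.zeros(3), steps)
# best_f shrinks toward 0 at roughly the 1/sqrt(K) rate; hist is not monotone.
```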
Common Confusions
Subgradient method is not a descent method
For smooth convex functions with gradient descent, every step reduces the objective. For non-smooth convex functions with the subgradient method, the objective can increase between iterations. The convergence guarantee is on the best iterate seen so far, $\min_{i \le k} f(x_i)$, not on $f(x_k)$. This is why averaging schemes (Polyak averaging, Nemirovski–Yudin) are common in subgradient analyses.
Subgradients are only defined for convex functions
For non-convex functions, the convex-analytic subgradient does not exist in general. The Clarke subdifferential and the limiting subdifferential extend the concept to locally Lipschitz non-convex functions and provide necessary conditions for minimality, but the clean "iff global min" of the convex case is lost. ReLU networks are not convex, so the "subgradients" used in their training are Clarke subgradients (any element of the convex hull of limit-of-gradient sequences), not the subgradients defined here.
The subdifferential of a sum can be larger than the sum of subdifferentials
$\partial f(x) + \partial g(x) \subseteq \partial (f + g)(x)$, with equality only under a constraint qualification (the relative interiors of the domains intersect). Without it, the containment can be strict. Example: take $f(x) = -\sqrt{x}$ on $[0, \infty)$ extended by $+\infty$, and $g(x) = -\sqrt{-x}$ on $(-\infty, 0]$ extended by $+\infty$. Then $f + g$ is the indicator of $\{0\}$, and $\partial (f + g)(0) = \mathbb{R}$. But $\partial f(0) + \partial g(0)$ is computed at the boundary of each domain, where both subdifferentials are empty, so the sum is empty. The qualification "relint domains intersect" rules this out.
Summary
- A subgradient of a convex $f$ at $x$ is any $g$ such that $f(y) \ge f(x) + g^\top (y - x)$ for all $y$. The set of subgradients is the subdifferential $\partial f(x)$.
- Subdifferentials exist (non-empty, closed, convex, bounded) at every point in the relative interior of $\operatorname{dom} f$.
- $x^\star$ is a global minimizer of convex $f$ iff $0 \in \partial f(x^\star)$. This is the non-smooth analogue of $\nabla f(x^\star) = 0$.
- Subgradient calculus rules (sum, chain, pointwise max) follow gradient rules with constraint-qualification caveats.
- The subgradient method runs at $O(1/\sqrt{K})$, slower than gradient descent on smooth functions; proximal methods recover $O(1/K)$ when the non-smooth term has a tractable proximal operator.
Exercises
Problem
Compute the subdifferential of $f(x) = \max(0, x)$ (the ReLU function viewed as a univariate convex function) at every point. Then verify the optimality condition $0 \in \partial f(x^\star)$ for the global minimizers.
Problem
Derive the soft-thresholding operator from the subdifferential optimality condition. Specifically, show that the unique solution to

$$\min_{x \in \mathbb{R}} \ \tfrac{1}{2}(x - z)^2 + \lambda |x|$$

for $\lambda > 0$ is given by

$$x^\star = \operatorname{sign}(z)\, \max(|z| - \lambda,\ 0).$$

This is the proximal operator of $\lambda |\cdot|$ at the point $z$.
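A sketch of the resulting operator, cross-checked against brute-force minimization on a fine grid (the test values $z = 1.3$, $\lambda = 0.5$ are arbitrary inputs of ours):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam*|.|: the unique minimizer of 0.5*(x - z)^2 + lam*|x|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Brute-force sanity check: grid-minimize the objective and compare.
z, lam = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 600001)
objective = 0.5 * (grid - z) ** 2 + lam * np.abs(grid)
assert abs(grid[objective.argmin()] - soft_threshold(z, lam)) < 1e-4
```

Inputs with $|z| \le \lambda$ are mapped exactly to zero, which is the mechanism by which the $\ell_1$ penalty produces sparse solutions.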
References
Canonical:
- Rockafellar, "Convex Analysis" (Princeton, 1970), Sections 23-25
- Hiriart-Urruty and Lemaréchal, "Fundamentals of Convex Analysis" (Springer, 2001), Chapter D
- Boyd and Vandenberghe, "Convex Optimization" (Cambridge, 2004), Section 3.1.5 (gradient and subgradient overlap), Appendix C
Convergence theory:
- Nesterov, "Introductory Lectures on Convex Optimization" (Springer, 2004), Section 3.2 (subgradient method)
- Bubeck, "Convex Optimization: Algorithms and Complexity" (Foundations and Trends in ML, 2015), Sections 3.1-3.2
ML applications:
- Beck, "First-Order Methods in Optimization" (SIAM, 2017), Chapters 3, 6, 10 (proximal operators, subgradient computations for ML losses)
- Parikh and Boyd, "Proximal Algorithms" (Foundations and Trends in Optimization, 2014)
Next Topics
- Proximal gradient methods: how to use subgradient structure efficiently when the non-smooth term is "simple"
- Lasso regression: the canonical application via soft-thresholding
- Convex duality: KKT conditions are subgradient optimality conditions for the Lagrangian
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)