
Optimization Function Classes

Subgradients and Subdifferentials

The non-smooth generalization of the gradient for convex functions. Subgradients enable optimality conditions, calculus rules, and convergence guarantees for L1-regularized problems, hinge loss SVMs, and proximal algorithms where the objective is not differentiable.

Core · Tier 1 · Stable · ~35 min

Why This Matters

Most modern ML objectives are not smooth. The L1 penalty in lasso regression is not differentiable at zero. The hinge loss in support vector machines kinks at the margin. The ReLU activation has no derivative at zero. The total-variation regularizer is non-differentiable wherever adjacent differences vanish.

Standard gradient-based optimization assumes the gradient exists, which is exactly what fails on these problems. The right replacement is the subgradient, which generalizes the gradient to convex but non-smooth functions. Every convex function has a non-empty subdifferential at every interior point of its domain, even where the gradient does not exist. This is what makes proximal gradient methods, the lasso, hinge-loss SVMs, and ADMM work as theory rather than heuristics.

This page is the foundational reference. Convex-optimization-basics covers the smooth case; this page covers the non-smooth case that nearly every sparsity-inducing or regularized ML method actually uses.

Definition

Definition

Subgradient

Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be convex. A vector $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x$ if
$$f(y) \geq f(x) + \langle g, y - x \rangle \quad \text{for all } y \in \mathbb{R}^d.$$
The set of all subgradients of $f$ at $x$ is the subdifferential:
$$\partial f(x) = \{g \in \mathbb{R}^d : f(y) \geq f(x) + \langle g, y - x\rangle \ \forall y\}.$$

A subgradient is any vector $g$ such that the affine function $y \mapsto f(x) + \langle g, y - x\rangle$ stays below $f$ everywhere and touches it at $x$. This is the convex-analytic generalization of the tangent line: for a smooth convex function, only one such affine support exists at each point (the tangent), and $g = \nabla f(x)$. At a kink, an entire family of supporting affine functions exists, and $\partial f(x)$ is the set of their slopes.

Examples

Absolute value $f(x) = |x|$ on $\mathbb{R}$.

$$\partial f(x) = \begin{cases} \{1\} & x > 0 \\ [-1, 1] & x = 0 \\ \{-1\} & x < 0 \end{cases}$$

Away from zero, $f$ is differentiable and the subdifferential is the singleton containing the derivative. At zero, every slope in $[-1, 1]$ gives a supporting line that stays below $|x|$. The subdifferential at the kink is the closed interval of all such slopes.
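The defining inequality can be checked numerically: $g$ is a subgradient of $|x|$ at $x$ exactly when the line $|x| + g(y - x)$ stays below $|y|$ everywhere. A minimal sketch (the test grid and tolerance are arbitrary choices):

```python
import numpy as np

def is_subgradient_abs(g, x):
    """Numerically test the subgradient inequality |y| >= |x| + g*(y - x)
    for f = abs over a grid of test points (with a small float tolerance)."""
    y = np.linspace(-5.0, 5.0, 1001)
    return bool(np.all(np.abs(y) >= np.abs(x) + g * (y - x) - 1e-12))

# At the kink x = 0, every slope in [-1, 1] works; slopes outside do not.
assert is_subgradient_abs(0.5, 0.0)
assert not is_subgradient_abs(1.1, 0.0)
# Away from zero the subdifferential is the singleton {sign(x)}.
assert is_subgradient_abs(1.0, 2.0)
assert not is_subgradient_abs(0.9, 2.0)
```

The grid check is only a finite-sample proxy for "for all $y$", but for piecewise-linear $f$ it already catches every violating slope.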

$\ell_1$ norm $f(x) = \|x\|_1 = \sum_i |x_i|$ on $\mathbb{R}^d$.

$$\partial f(x) = \{g \in \mathbb{R}^d : g_i = \operatorname{sign}(x_i) \text{ if } x_i \neq 0, \ g_i \in [-1, 1] \text{ if } x_i = 0\}.$$

Coordinate-wise: each $g_i$ is the sign of $x_i$ at non-zero coordinates and any value in $[-1, 1]$ at zero coordinates. This is the subdifferential that drives soft-thresholding in proximal methods.
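As a concrete illustration of how that $[-1, 1]$ interval produces exact zeros, here is the standard soft-thresholding operator, the proximal operator of $\lambda \|\cdot\|_1$ (its derivation from the optimality condition is the advanced exercise at the end of this page; the code just applies the closed form):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_1, applied coordinate-wise:
    sign(z_i) * max(|z_i| - lam, 0). Coordinates with |z_i| <= lam are
    mapped exactly to zero -- the sparsity mechanism behind the lasso."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_threshold(z, 1.0))  # large entries shrink by 1; |z_i| <= 1 become 0
```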

Hinge loss $f(x) = \max(0, 1 - x)$.

$$\partial f(x) = \begin{cases} \{-1\} & x < 1 \\ [-1, 0] & x = 1 \\ \{0\} & x > 1 \end{cases}$$

The slope is $-1$ in the loss region, $0$ in the no-loss region, and the entire interval $[-1, 0]$ at the kink. The kink at $x = 1$ is exactly the SVM margin.

Indicator function $\delta_C$ of a closed convex set $C$.

$$\partial \delta_C(x) = \{g : \langle g, y - x\rangle \leq 0 \ \forall y \in C\} = N_C(x)$$

This is the normal cone to $C$ at $x$: the set of outward-pointing directions that do not enter $C$. At interior points, $N_C(x) = \{0\}$; on the boundary, the normal cone is non-trivial. This is how convex constraints enter optimality conditions.

Smooth convex $f$. $\partial f(x) = \{\nabla f(x)\}$ at every point of differentiability.

Existence

Theorem

Existence of Subgradients

Statement

For any proper convex $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ and any point $x$ in the relative interior of $\operatorname{dom} f$, the subdifferential $\partial f(x)$ is non-empty, closed, and convex; if $x$ lies in the interior of $\operatorname{dom} f$, it is also bounded.

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

If $x$ is on the boundary of $\operatorname{dom} f$, or $f$ is not lower semicontinuous at $x$, the subdifferential may be empty.

Intuition

Geometrically: a convex function has a supporting hyperplane to its epigraph at every interior point of its domain (the supporting hyperplane theorem applied to $\operatorname{epi} f$). Normalizing the hyperplane's normal vector to have last coordinate $-1$, its first $d$ coordinates form a subgradient. The set of all such normals is closed and convex, and it is exactly the subdifferential.

Proof Sketch

Apply the supporting hyperplane theorem to the epigraph $\operatorname{epi} f = \{(x, t) : t \geq f(x)\}$ at the boundary point $(x, f(x))$. The supporting hyperplane has normal $(g, -1)$ for some $g \in \mathbb{R}^d$ (the last coordinate can be normalized to $-1$ because the epigraph extends upward in $t$, and a vertical hyperplane is ruled out at interior points). The supporting condition becomes $f(y) - f(x) \geq \langle g, y - x\rangle$ for all $y$, which is exactly the subgradient inequality. Closedness and convexity of $\partial f(x)$ follow because it is an intersection of half-spaces. Boundedness on the interior follows from $f$ being locally Lipschitz there.

Why It Matters

Existence on the interior is what licenses every algorithm that picks "any subgradient" at each step. Without existence you could not even define a subgradient method, let alone analyze it. The boundary caveat matters for indicator functions of constraints, where the subdifferential at constraint boundaries is the normal cone (which can be unbounded).

Failure Mode

The subdifferential is empty at points outside $\operatorname{dom} f$, and can be empty at boundary points of the domain even where $f$ is finite. Example: $f(x) = -\sqrt{x}$ for $x \geq 0$, $f(x) = +\infty$ for $x < 0$. At $x = 0$, every candidate subgradient $g$ would need $-\sqrt{y} \geq g \cdot y$ for all $y \geq 0$, which forces $g \leq -1/\sqrt{y} \to -\infty$ as $y \to 0^+$. No finite $g$ works, so $\partial f(0) = \emptyset$.
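A quick numerical illustration of why no finite subgradient survives at $0$: the required bound $g \leq -1/\sqrt{y}$ blows up as $y \to 0^+$ (the sample points below are arbitrary):

```python
import numpy as np

# f(x) = -sqrt(x) on [0, inf): a subgradient g at x = 0 would need
# f(y) >= f(0) + g*y, i.e. -sqrt(y) >= g*y, i.e. g <= -1/sqrt(y) for y > 0.
# The bound diverges as y -> 0+, so no finite g can work.
for y in [1e-2, 1e-4, 1e-6, 1e-8]:
    print(f"y = {y:.0e}: any subgradient must satisfy g <= {-1 / np.sqrt(y):.0f}")
```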

Optimality Condition

The single most important consequence of the subdifferential framework: it gives a clean optimality condition for non-smooth convex minimization.

Theorem

Subdifferential Optimality Condition

Statement

$x^*$ is a global minimizer of $f$ if and only if $0 \in \partial f(x^*)$. This is the non-smooth analogue of $\nabla f(x^*) = 0$ for smooth convex functions.

Intuition

The subgradient inequality $f(y) \geq f(x^*) + \langle g, y - x^*\rangle$ with $g = 0$ becomes $f(y) \geq f(x^*)$ for all $y$, which is exactly the definition of $x^*$ being a global minimizer. Conversely, if $x^*$ is a global minimizer, the choice $g = 0$ satisfies the defining inequality.

Proof Sketch

If $0 \in \partial f(x^*)$: by the subgradient inequality with $g = 0$, $f(y) \geq f(x^*) + 0 = f(x^*)$ for all $y$, so $x^*$ is a global minimizer.

If $x^*$ is a global minimizer: take $g = 0$. Then $f(y) \geq f(x^*) = f(x^*) + \langle 0, y - x^*\rangle$ holds for all $y$, so $0 \in \partial f(x^*)$.

Why It Matters

This is the optimality condition used to derive the soft-thresholding operator (the proximal operator of the $\ell_1$ norm), to prove KKT conditions for constrained problems, and to characterize lasso solutions. For the lasso $\min_w \tfrac{1}{2} \|y - Xw\|_2^2 + \lambda \|w\|_1$, the condition $0 \in \partial f(w^*)$ becomes $X^\top(y - Xw^*) \in \lambda \, \partial \|w^*\|_1$, which gives the explicit characterization that $|X_j^\top(y - Xw^*)| \leq \lambda$ for zero coordinates of $w^*$, with equality and the matching sign at non-zero coordinates.
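The lasso characterization can be verified numerically. The sketch below (synthetic data; the dimensions, $\lambda$, and iteration count are arbitrary choices) solves a small lasso by proximal gradient (ISTA) and then checks the subdifferential condition coordinate by coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=50)
lam = 5.0

# Solve min_w 0.5*||y - Xw||^2 + lam*||w||_1 by proximal gradient (ISTA):
# a gradient step on the smooth part, then soft-thresholding (the prox
# of lam*||.||_1) on the result.
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = largest eigenvalue of X^T X
w = np.zeros(10)
for _ in range(5000):
    z = w + step * (X.T @ (y - X @ w))
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

# Optimality: X^T(y - Xw) must lie in lam * subdifferential of ||w||_1,
# i.e. equal lam*sign(w_j) at non-zero coordinates, within [-lam, lam] at zeros.
corr = X.T @ (y - X @ w)
for j in range(10):
    if w[j] != 0:
        assert np.isclose(corr[j], lam * np.sign(w[j]), atol=1e-4)
    else:
        assert abs(corr[j]) <= lam + 1e-6
print("lasso optimality condition verified")
```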

Failure Mode

This optimality condition characterizes global minimizers; for non-convex $f$, $0 \in \partial f(x^*)$ (with $\partial$ generalized to the Clarke subdifferential) is necessary but not sufficient for local minimality, and saddle points satisfy it. The full strength of "iff global minimum" requires convexity.

Subgradient Calculus

The subdifferential satisfies calculus rules analogous to gradient rules, with a few caveats around equality versus containment.

Sum rule. For convex $f, g$: $\partial f(x) + \partial g(x) \subseteq \partial(f + g)(x)$, with equality if $\operatorname{relint}(\operatorname{dom} f) \cap \operatorname{relint}(\operatorname{dom} g) \neq \emptyset$.

Scalar multiplication. For $\alpha > 0$: $\partial(\alpha f)(x) = \alpha \, \partial f(x)$.

Affine pre-composition. For $f$ convex, $A \in \mathbb{R}^{m \times d}$, and $b \in \mathbb{R}^m$: $\partial(f(Ax + b))(x) = A^\top \partial f(Ax + b)$.

Pointwise maximum. For convex $f_1, \ldots, f_m$ and $f(x) = \max_i f_i(x)$:
$$\partial f(x) = \operatorname{conv}\!\left(\bigcup_{i \in I(x)} \partial f_i(x)\right)$$
where $I(x) = \{i : f_i(x) = f(x)\}$ is the set of "active" functions at $x$.

The maximum rule is what gives the subdifferential of the hinge loss and of the dual norm, and it is the source of the convex hull appearing in many non-smooth optimality conditions.
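As a check on the maximum rule, the hinge loss is $\max(f_1, f_2)$ with $f_1 = 0$ and $f_2(x) = 1 - x$; at $x = 1$ both are active, and the rule predicts $\partial f(1) = \operatorname{conv}\{0, -1\} = [-1, 0]$. A numerical verification sketch (grid and tolerance are arbitrary):

```python
import numpy as np

# Hinge loss on a grid of test points.
grid = np.linspace(-3.0, 4.0, 701)
hinge = np.maximum(0.0, 1.0 - grid)

def ok(g, x=1.0):
    """Does slope g satisfy the subgradient inequality for the hinge at x?"""
    fx = max(0.0, 1.0 - x)
    return bool(np.all(hinge >= fx + g * (grid - x) - 1e-12))

# Every slope in conv({0, -1}) = [-1, 0] is a subgradient at the kink x = 1 ...
assert all(ok(g) for g in np.linspace(-1.0, 0.0, 11))
# ... and slopes just outside the interval are not.
assert not ok(0.1) and not ok(-1.1)
```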

The Subgradient Method

The simplest non-smooth optimization algorithm replaces the gradient in gradient descent with any subgradient.

Algorithm. Choose step sizes $\eta_k > 0$. At iteration $k$:

  1. Pick any $g_k \in \partial f(x_k)$.
  2. Update $x_{k+1} = x_k - \eta_k \, g_k$.

Unlike gradient descent on smooth functions, the subgradient method is not a descent method: $f(x_{k+1})$ can exceed $f(x_k)$ even with well-chosen step sizes. The standard guarantee is therefore stated for the best iterate.

Convergence. For $f$ convex with subgradients of bounded norm $\|g\| \leq G$, and step sizes $\eta_k = c / \sqrt{k}$:
$$\min_{0 \leq j \leq k} f(x_j) - f^* = O\!\left(\frac{1}{\sqrt{k}}\right).$$

This $O(1/\sqrt{k})$ rate is provably slower than the $O(1/k)$ rate of gradient descent on smooth convex functions, and the gap cannot be closed without exploiting structure. The slower rate is what motivates proximal methods: when the non-smooth term is a "simple" function (like the $\ell_1$ norm) whose proximal operator has a closed form, proximal gradient methods recover the $O(1/k)$ rate by handling the smooth part with gradient steps and the non-smooth part exactly.
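A minimal sketch of the method on a toy non-smooth objective (the specific function, starting point, and step-size constant are arbitrary choices), tracking both the best iterate and the number of steps on which the objective went up:

```python
import numpy as np

def f(x):
    # Toy non-smooth convex objective with minimum 0 at (1, -1).
    return abs(x[0] - 1) + 2 * abs(x[1] + 1)

def subgrad(x):
    # One valid subgradient selection: sign-based, choosing 0 at the kinks
    # (0 lies in the subdifferential interval there, so this is legitimate).
    return np.array([np.sign(x[0] - 1), 2 * np.sign(x[1] + 1)])

x = np.zeros(2)
vals = [f(x)]
for k in range(1, 2001):
    x = x - (0.5 / np.sqrt(k)) * subgrad(x)   # eta_k = c / sqrt(k)
    vals.append(f(x))
vals = np.array(vals)

print("best iterate value:", vals.min())                      # approaches f* = 0
print("steps where f went up:", int((np.diff(vals) > 0).sum()))  # nonzero: not a descent method
```

The second printout is the point of the example: the raw objective oscillates around the kinks even as the best iterate converges.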

Common Confusions

Watch Out

Subgradient method is not a descent method

For smooth convex functions, gradient descent with a suitable step size reduces the objective at every step. For non-smooth convex functions, the subgradient method's objective can increase between iterations. The convergence guarantee is on the best iterate seen so far, $\min_{j \leq k} f(x_j)$, not on $f(x_k)$. This is why averaging schemes (Polyak averaging, Nemirovski-Yudin) are common in subgradient analyses.

Watch Out

Subgradients are only defined for convex functions

For non-convex functions, the convex-analytic subgradient does not exist in general. The Clarke subdifferential and the limiting subdifferential extend the concept to locally Lipschitz non-convex functions and provide necessary conditions for minimality, but the clean "iff global min" of the convex case is lost. ReLU networks are not convex in their parameters, so the "subgradients" used in their training are Clarke subgradients (elements of the convex hull of limits of gradients at nearby differentiable points), not the convex subgradients defined here.

Watch Out

The subdifferential of a sum can be larger than the sum of subdifferentials

$\partial f + \partial g \subseteq \partial (f + g)$, with equality only under a constraint qualification (the relative interiors of the domains intersect). Without it, the containment can be strict. Example: take $f(x) = -\sqrt{x}$ for $x \geq 0$ (extended by $+\infty$) and $g = \delta_{(-\infty, 0]}$, the indicator of the left half-line. Then $f + g$ is the indicator of $\{0\}$, so $\partial(f + g)(0) = \mathbb{R}$. But $\partial f(0) = \emptyset$ (the failure mode from the existence theorem), so $\partial f(0) + \partial g(0) = \emptyset$: a strict containment. The qualification "relative interiors of the domains intersect" rules this out.

Summary

  • A subgradient of a convex $f$ at $x$ is any $g$ such that $f(y) \geq f(x) + \langle g, y - x\rangle$ for all $y$. The set of all such $g$ is the subdifferential $\partial f(x)$.
  • Subdifferentials are non-empty, closed, and convex at every point in the relative interior of $\operatorname{dom} f$ (and bounded on the interior).
  • $x^*$ is a global minimizer of convex $f$ iff $0 \in \partial f(x^*)$. This is the non-smooth analogue of $\nabla f(x^*) = 0$.
  • Subgradient calculus rules (sum, affine chain, pointwise max) mirror gradient rules, with constraint-qualification caveats.
  • The subgradient method converges at rate $O(1/\sqrt{k})$, slower than gradient descent on smooth functions; proximal methods recover $O(1/k)$ when the non-smooth term has a tractable proximal operator.

Exercises

ExerciseCore

Problem

Compute the subdifferential of $f(x) = \max(0, x)$ (the ReLU function viewed as a univariate convex function) at every point. Then verify the optimality condition $0 \in \partial f(x^*)$ at a global minimizer.

ExerciseAdvanced

Problem

Derive the soft-thresholding operator from the subdifferential optimality condition. Specifically, show that the unique solution to $\min_x \tfrac{1}{2}(x - z)^2 + \lambda |x|$ for $\lambda > 0$ is given by
$$x^* = \operatorname{sign}(z) \max(|z| - \lambda, 0).$$
This is the proximal operator of $\lambda |\cdot|$ evaluated at the point $z$.

References

Canonical:

  • Rockafellar, "Convex Analysis" (Princeton, 1970), Sections 23-25
  • Hiriart-Urruty and Lemarechal, "Fundamentals of Convex Analysis" (Springer, 2001), Chapter D
  • Boyd and Vandenberghe, "Convex Optimization" (Cambridge, 2004), Section 3.1.5 (gradient and subgradient overlap), Appendix C

Convergence theory:

  • Nesterov, "Introductory Lectures on Convex Optimization" (Springer, 2004), Section 3.2 (subgradient method)
  • Bubeck, "Convex Optimization: Algorithms and Complexity" (Foundations and Trends in ML, 2015), Sections 3.1-3.2

ML applications:

  • Beck, "First-Order Methods in Optimization" (SIAM, 2017), Chapters 3, 6, 10 (proximal operators, subgradient computations for ML losses)
  • Parikh and Boyd, "Proximal Algorithms" (Foundations and Trends in Optimization, 2014)


Last reviewed: April 18, 2026
