
Optimization Function Classes

Subgradients and Subdifferentials

The non-smooth generalization of the gradient for convex functions. Subgradients enable optimality conditions, calculus rules, and convergence guarantees for L1-regularized problems, hinge loss SVMs, and proximal algorithms where the objective is not differentiable.

Core · Tier 1 · Stable · ~35 min

Why This Matters

Most modern ML objectives are not smooth. The L1 penalty in lasso regression is not differentiable at zero. The hinge loss in support vector machines kinks at the margin. The ReLU activation has no derivative at zero. The total-variation regularizer is non-differentiable wherever adjacent differences vanish.

Standard gradient-based optimization assumes the gradient exists, which is exactly what fails on these problems. The right replacement is the subgradient, which generalizes the gradient to convex but non-smooth functions. Every convex function has a non-empty subdifferential at every interior point of its domain, even where the gradient does not exist. This is what makes proximal gradient methods, the lasso, hinge-loss SVMs, and ADMM work as theory rather than heuristics.

This page is the foundational reference. Convex-optimization-basics covers the smooth case; this page covers the non-smooth case that nearly every sparsity-inducing or regularized ML method actually uses.

Definition

Definition

Subgradient

Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be convex. A vector $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x$ if
$$f(y) \geq f(x) + \langle g, y - x \rangle \quad \text{for all } y \in \mathbb{R}^d.$$
The set of all subgradients of $f$ at $x$ is the subdifferential:
$$\partial f(x) = \{g \in \mathbb{R}^d : f(y) \geq f(x) + \langle g, y - x\rangle \ \forall y\}.$$

A subgradient is any vector $g$ such that the affine function $y \mapsto f(x) + \langle g, y - x\rangle$ stays below $f$ everywhere and touches it at $x$. This is the convex-analytic generalization of the tangent line: for a smooth convex function, only one such affine support exists at each point (the tangent), and $g = \nabla f(x)$. At a kink, an entire family of supporting affine functions exists, and $\partial f(x)$ is the set of their slopes.

Examples

Absolute value $f(x) = |x|$ on $\mathbb{R}$.

$$\partial f(x) = \begin{cases} \{1\} & x > 0 \\ [-1, 1] & x = 0 \\ \{-1\} & x < 0 \end{cases}$$

Away from zero, $f$ is differentiable and the subdifferential is the singleton containing the derivative. At zero, every slope in $[-1, 1]$ gives a supporting line that stays below $|x|$. The subdifferential at the kink is the closed interval of all such slopes.
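The defining inequality can be checked numerically: $g$ is a subgradient of $|x|$ at $x$ exactly when the line $|x| + g(y - x)$ stays below $|y|$ everywhere. A minimal sketch (the test grid and tolerance are arbitrary choices):

```python
import numpy as np

def is_subgradient_abs(g, x):
    """Numerically test the subgradient inequality |y| >= |x| + g*(y - x)
    for f = abs over a grid of test points (with a small float tolerance)."""
    y = np.linspace(-5.0, 5.0, 1001)
    return bool(np.all(np.abs(y) >= np.abs(x) + g * (y - x) - 1e-12))

# At the kink x = 0, every slope in [-1, 1] works; slopes outside do not.
assert is_subgradient_abs(0.5, 0.0)
assert not is_subgradient_abs(1.1, 0.0)
# Away from zero the subdifferential is the singleton {sign(x)}.
assert is_subgradient_abs(1.0, 2.0)
assert not is_subgradient_abs(0.9, 2.0)
```

The grid check is only a finite-sample proxy for "for all $y$", but for piecewise-linear $f$ it already catches every violating slope.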

$\ell_1$ norm $f(x) = \|x\|_1 = \sum_i |x_i|$ on $\mathbb{R}^d$.

$$\partial f(x) = \{g \in \mathbb{R}^d : g_i = \operatorname{sign}(x_i) \text{ if } x_i \neq 0, \ g_i \in [-1, 1] \text{ if } x_i = 0\}.$$

Coordinate-wise: each $g_i$ is the sign of $x_i$ at non-zero coordinates and any value in $[-1, 1]$ at zero coordinates. This is the subdifferential that drives soft-thresholding in proximal methods.
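As a concrete illustration of how that $[-1, 1]$ interval produces exact zeros, here is the standard soft-thresholding operator, the proximal operator of $\lambda \|\cdot\|_1$ (its derivation from the optimality condition is the advanced exercise at the end of this page; the code just applies the closed form):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_1, applied coordinate-wise:
    sign(z_i) * max(|z_i| - lam, 0). Coordinates with |z_i| <= lam are
    mapped exactly to zero -- the sparsity mechanism behind the lasso."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_threshold(z, 1.0))  # large entries shrink by 1; |z_i| <= 1 become 0
```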

Hinge loss $f(x) = \max(0, 1 - x)$.

$$\partial f(x) = \begin{cases} \{-1\} & x < 1 \\ [-1, 0] & x = 1 \\ \{0\} & x > 1 \end{cases}$$

The slope is $-1$ in the loss region, $0$ in the no-loss region, and the entire interval $[-1, 0]$ at the kink. The kink at $x = 1$ is exactly the SVM margin.

Indicator function $\delta_C$ of a closed convex set $C$.

$$\partial \delta_C(x) = \{g : \langle g, y - x\rangle \leq 0 \ \forall y \in C\} = N_C(x)$$

This is the normal cone to $C$ at $x$: the set of outward-pointing directions that do not enter $C$. At interior points, $N_C(x) = \{0\}$; on the boundary, the normal cone is non-trivial. This is how convex constraints enter optimality conditions.

Smooth convex $f$. $\partial f(x) = \{\nabla f(x)\}$ at every point of differentiability.

Existence

Theorem

Existence of Subgradients

Statement

For any proper convex $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ and any point $x$ in the relative interior of $\operatorname{dom} f$, the subdifferential $\partial f(x)$ is non-empty, closed, and convex; if $x$ lies in the interior of $\operatorname{dom} f$, it is also bounded.

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

If $x$ is on the boundary of $\operatorname{dom} f$, or $f$ is not lower semicontinuous at $x$, the subdifferential may be empty.

Intuition

Geometrically: a convex function has a supporting hyperplane to its epigraph at every interior point of its domain (the supporting hyperplane theorem applied to $\operatorname{epi} f$). Normalizing the hyperplane's normal vector to have last coordinate $-1$, its first $d$ coordinates form a subgradient. The set of all such normals is closed and convex, and it is exactly the subdifferential.

Proof Sketch

Apply the supporting hyperplane theorem to the epigraph $\operatorname{epi} f = \{(x, t) : t \geq f(x)\}$ at the boundary point $(x, f(x))$. The supporting hyperplane has normal $(g, -1)$ for some $g \in \mathbb{R}^d$ (the last coordinate can be normalized to $-1$ because the epigraph extends upward in $t$, and a vertical hyperplane is ruled out at interior points). The supporting condition becomes $f(y) - f(x) \geq \langle g, y - x\rangle$ for all $y$, which is exactly the subgradient inequality. Closedness and convexity of $\partial f(x)$ follow because it is an intersection of half-spaces. Boundedness on the interior follows from $f$ being locally Lipschitz there.

Why It Matters

Existence on the interior is what licenses every algorithm that picks "any subgradient" at each step. Without existence you could not even define a subgradient method, let alone analyze it. The boundary caveat matters for indicator functions of constraints, where the subdifferential at constraint boundaries is the normal cone (which can be unbounded).

Failure Mode

The subdifferential is empty at points outside $\operatorname{dom} f$, and can be empty at boundary points of the domain even where $f$ is finite. Example: $f(x) = -\sqrt{x}$ for $x \geq 0$, $f(x) = +\infty$ for $x < 0$. At $x = 0$, every candidate subgradient $g$ would need $-\sqrt{y} \geq g \cdot y$ for all $y \geq 0$, which forces $g \leq -1/\sqrt{y} \to -\infty$ as $y \to 0^+$. No finite $g$ works, so $\partial f(0) = \emptyset$.
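A quick numerical illustration of why no finite subgradient survives at $0$: the required bound $g \leq -1/\sqrt{y}$ blows up as $y \to 0^+$ (the sample points below are arbitrary):

```python
import numpy as np

# f(x) = -sqrt(x) on [0, inf): a subgradient g at x = 0 would need
# f(y) >= f(0) + g*y, i.e. -sqrt(y) >= g*y, i.e. g <= -1/sqrt(y) for y > 0.
# The bound diverges as y -> 0+, so no finite g can work.
for y in [1e-2, 1e-4, 1e-6, 1e-8]:
    print(f"y = {y:.0e}: any subgradient must satisfy g <= {-1 / np.sqrt(y):.0f}")
```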

Optimality Condition

The single most important consequence of the subdifferential framework: it gives a clean optimality condition for non-smooth convex minimization.

Theorem

Subdifferential Optimality Condition

Statement

$x^*$ is a global minimizer of $f$ if and only if $0 \in \partial f(x^*)$. This is the non-smooth analogue of $\nabla f(x^*) = 0$ for smooth convex functions.

Intuition

The subgradient inequality $f(y) \geq f(x^*) + \langle g, y - x^*\rangle$ with $g = 0$ becomes $f(y) \geq f(x^*)$ for all $y$, which is exactly the definition of $x^*$ being a global minimizer. Conversely, if $x^*$ is a global minimizer, the choice $g = 0$ satisfies the defining inequality.

Proof Sketch

If $0 \in \partial f(x^*)$: by the subgradient inequality with $g = 0$, $f(y) \geq f(x^*) + 0 = f(x^*)$ for all $y$, so $x^*$ is a global minimizer.

If $x^*$ is a global minimizer: take $g = 0$. Then $f(y) \geq f(x^*) = f(x^*) + \langle 0, y - x^*\rangle$ holds for all $y$, so $0 \in \partial f(x^*)$.

Why It Matters

This is the optimality condition used to derive the soft-thresholding operator (the proximal operator of the $\ell_1$ norm), to prove KKT conditions for constrained problems, and to characterize lasso solutions. For the lasso $\min_w \tfrac{1}{2} \|y - Xw\|_2^2 + \lambda \|w\|_1$, the condition $0 \in \partial f(w^*)$ becomes $X^\top(y - Xw^*) \in \lambda \, \partial \|w^*\|_1$, which gives the explicit characterization that $|X_j^\top(y - Xw^*)| \leq \lambda$ for zero coordinates of $w^*$, with equality and the matching sign at non-zero coordinates.
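The lasso characterization can be verified numerically. The sketch below (synthetic data; the dimensions, $\lambda$, and iteration count are arbitrary choices) solves a small lasso by proximal gradient (ISTA) and then checks the subdifferential condition coordinate by coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=50)
lam = 5.0

# Solve min_w 0.5*||y - Xw||^2 + lam*||w||_1 by proximal gradient (ISTA):
# a gradient step on the smooth part, then soft-thresholding (the prox
# of lam*||.||_1) on the result.
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = largest eigenvalue of X^T X
w = np.zeros(10)
for _ in range(5000):
    z = w + step * (X.T @ (y - X @ w))
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

# Optimality: X^T(y - Xw) must lie in lam * subdifferential of ||w||_1,
# i.e. equal lam*sign(w_j) at non-zero coordinates, within [-lam, lam] at zeros.
corr = X.T @ (y - X @ w)
for j in range(10):
    if w[j] != 0:
        assert np.isclose(corr[j], lam * np.sign(w[j]), atol=1e-4)
    else:
        assert abs(corr[j]) <= lam + 1e-6
print("lasso optimality condition verified")
```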

Failure Mode

This optimality condition characterizes global minimizers; for non-convex $f$, $0 \in \partial f(x^*)$ (with $\partial$ generalized to the Clarke subdifferential) is necessary but not sufficient for local minimality, and saddle points satisfy it. The full strength of "iff global minimum" requires convexity.

Subgradient Calculus

The subdifferential satisfies calculus rules analogous to gradient rules, with a few caveats around equality versus containment.

Sum rule. For convex $f, g$: $\partial f(x) + \partial g(x) \subseteq \partial(f + g)(x)$, with equality if $\operatorname{relint}(\operatorname{dom} f) \cap \operatorname{relint}(\operatorname{dom} g) \neq \emptyset$.

Scalar multiplication. For $\alpha > 0$: $\partial(\alpha f)(x) = \alpha \, \partial f(x)$.

Affine pre-composition. For $f$ convex, $A \in \mathbb{R}^{m \times d}$, and $b \in \mathbb{R}^m$: $\partial(f(Ax + b))(x) = A^\top \partial f(Ax + b)$.

Pointwise maximum. For convex $f_1, \ldots, f_m$ and $f(x) = \max_i f_i(x)$:
$$\partial f(x) = \operatorname{conv}\!\left(\bigcup_{i \in I(x)} \partial f_i(x)\right)$$
where $I(x) = \{i : f_i(x) = f(x)\}$ is the set of "active" functions at $x$.

The maximum rule is what gives the subdifferential of the hinge loss and of the dual norm, and it is the source of the convex hull appearing in many non-smooth optimality conditions.
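As a check on the maximum rule, the hinge loss is $\max(f_1, f_2)$ with $f_1 = 0$ and $f_2(x) = 1 - x$; at $x = 1$ both are active, and the rule predicts $\partial f(1) = \operatorname{conv}\{0, -1\} = [-1, 0]$. A numerical verification sketch (grid and tolerance are arbitrary):

```python
import numpy as np

# Hinge loss on a grid of test points.
grid = np.linspace(-3.0, 4.0, 701)
hinge = np.maximum(0.0, 1.0 - grid)

def ok(g, x=1.0):
    """Does slope g satisfy the subgradient inequality for the hinge at x?"""
    fx = max(0.0, 1.0 - x)
    return bool(np.all(hinge >= fx + g * (grid - x) - 1e-12))

# Every slope in conv({0, -1}) = [-1, 0] is a subgradient at the kink x = 1 ...
assert all(ok(g) for g in np.linspace(-1.0, 0.0, 11))
# ... and slopes just outside the interval are not.
assert not ok(0.1) and not ok(-1.1)
```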

The Subgradient Method

The simplest non-smooth optimization algorithm replaces the gradient in gradient descent with any subgradient.

Algorithm. Choose step sizes $\eta_k > 0$. At iteration $k$:

  1. Pick any $g_k \in \partial f(x_k)$.
  2. Update $x_{k+1} = x_k - \eta_k \, g_k$.

Unlike gradient descent on smooth functions, the subgradient method is not a descent method: $f(x_{k+1})$ can exceed $f(x_k)$ even with well-chosen step sizes. The standard guarantee is therefore stated for the best iterate.

Convergence. For $f$ convex with subgradients of bounded norm $\|g\| \leq G$, and step sizes $\eta_k = c / \sqrt{k}$:
$$\min_{0 \leq j \leq k} f(x_j) - f^* = O\!\left(\frac{1}{\sqrt{k}}\right).$$

This $O(1/\sqrt{k})$ rate is provably slower than the $O(1/k)$ rate of gradient descent on smooth convex functions, and the gap cannot be closed without exploiting structure. The slower rate is what motivates proximal methods: when the non-smooth term is a "simple" function (like the $\ell_1$ norm) whose proximal operator has a closed form, proximal gradient methods recover the $O(1/k)$ rate by handling the smooth part with gradient steps and the non-smooth part exactly.
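A minimal sketch of the method on a toy non-smooth objective (the specific function, starting point, and step-size constant are arbitrary choices), tracking both the best iterate and the number of steps on which the objective went up:

```python
import numpy as np

def f(x):
    # Toy non-smooth convex objective with minimum 0 at (1, -1).
    return abs(x[0] - 1) + 2 * abs(x[1] + 1)

def subgrad(x):
    # One valid subgradient selection: sign-based, choosing 0 at the kinks
    # (0 lies in the subdifferential interval there, so this is legitimate).
    return np.array([np.sign(x[0] - 1), 2 * np.sign(x[1] + 1)])

x = np.zeros(2)
vals = [f(x)]
for k in range(1, 2001):
    x = x - (0.5 / np.sqrt(k)) * subgrad(x)   # eta_k = c / sqrt(k)
    vals.append(f(x))
vals = np.array(vals)

print("best iterate value:", vals.min())                      # approaches f* = 0
print("steps where f went up:", int((np.diff(vals) > 0).sum()))  # nonzero: not a descent method
```

The second printout is the point of the example: the raw objective oscillates around the kinks even as the best iterate converges.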

Common Confusions

Watch Out

Subgradient method is not a descent method

For smooth convex functions, gradient descent with a suitable step size reduces the objective at every step. For non-smooth convex functions, the subgradient method's objective can increase between iterations. The convergence guarantee is on the best iterate seen so far, $\min_{j \leq k} f(x_j)$, not on $f(x_k)$. This is why averaging schemes (Polyak averaging, Nemirovski-Yudin) are common in subgradient analyses.

Watch Out

Subgradients are only defined for convex functions

For non-convex functions, the convex-analytic subgradient does not exist in general. The Clarke subdifferential and the limiting subdifferential extend the concept to locally Lipschitz non-convex functions and provide necessary conditions for minimality, but the clean "iff global min" of the convex case is lost. ReLU networks are not convex in their parameters, so the "subgradients" used in their training are Clarke subgradients (elements of the convex hull of limits of gradients at nearby differentiable points), not the convex subgradients defined here.

Watch Out

The subdifferential of a sum can be larger than the sum of subdifferentials

$\partial f + \partial g \subseteq \partial (f + g)$, with equality only under a constraint qualification (the relative interiors of the domains intersect). Without it, the containment can be strict. Example: take $f(x) = -\sqrt{x}$ for $x \geq 0$ (extended by $+\infty$) and $g = \delta_{(-\infty, 0]}$, the indicator of the left half-line. Then $f + g$ is the indicator of $\{0\}$, so $\partial(f + g)(0) = \mathbb{R}$. But $\partial f(0) = \emptyset$ (the failure mode from the existence theorem), so $\partial f(0) + \partial g(0) = \emptyset$: a strict containment. The qualification "relative interiors of the domains intersect" rules this out.

Summary

  • A subgradient of a convex $f$ at $x$ is any $g$ such that $f(y) \geq f(x) + \langle g, y - x\rangle$ for all $y$. The set of all such $g$ is the subdifferential $\partial f(x)$.
  • Subdifferentials are non-empty, closed, and convex at every point in the relative interior of $\operatorname{dom} f$ (and bounded on the interior).
  • $x^*$ is a global minimizer of convex $f$ iff $0 \in \partial f(x^*)$. This is the non-smooth analogue of $\nabla f(x^*) = 0$.
  • Subgradient calculus rules (sum, affine chain, pointwise max) mirror gradient rules, with constraint-qualification caveats.
  • The subgradient method converges at rate $O(1/\sqrt{k})$, slower than gradient descent on smooth functions; proximal methods recover $O(1/k)$ when the non-smooth term has a tractable proximal operator.

Exercises

ExerciseCore

Problem

Compute the subdifferential of $f(x) = \max(0, x)$ (the ReLU function viewed as a univariate convex function) at every point. Then verify the optimality condition $0 \in \partial f(x^*)$ at a global minimizer.

ExerciseAdvanced

Problem

Derive the soft-thresholding operator from the subdifferential optimality condition. Specifically, show that the unique solution to $\min_x \tfrac{1}{2}(x - z)^2 + \lambda |x|$ for $\lambda > 0$ is given by
$$x^* = \operatorname{sign}(z) \max(|z| - \lambda, 0).$$
This is the proximal operator of $\lambda |\cdot|$ evaluated at the point $z$.

References

Canonical:

  • Rockafellar, "Convex Analysis" (Princeton, 1970), Sections 23-25
  • Hiriart-Urruty and Lemarechal, "Fundamentals of Convex Analysis" (Springer, 2001), Chapter D
  • Boyd and Vandenberghe, "Convex Optimization" (Cambridge, 2004), Section 3.1.5 (gradient and subgradient overlap), Appendix C

Convergence theory:

  • Nesterov, "Introductory Lectures on Convex Optimization" (Springer, 2004), Section 3.2 (subgradient method)
  • Bubeck, "Convex Optimization: Algorithms and Complexity" (Foundations and Trends in ML, 2015), Sections 3.1-3.2

ML applications:

  • Beck, "First-Order Methods in Optimization" (SIAM, 2017), Chapters 3, 6, 10 (proximal operators, subgradient computations for ML losses)
  • Parikh and Boyd, "Proximal Algorithms" (Foundations and Trends in Optimization, 2014)


Last reviewed: April 18, 2026
