
Mathematical Infrastructure

Convex Duality

Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.

Core · Tier 1 · Stable · ~75 min

Why This Matters

Duality is the tool that converts hard optimization problems into equivalent (or approximately equivalent) problems that are easier to solve, analyze, or interpret. In machine learning, duality is not an abstract luxury --- it is the engine behind some of the most important algorithms and insights:

  • SVMs: the dual of the SVM problem reveals the kernel trick, transforming an optimization in parameter space into one in the space of inner products between data points
  • Regularization: constrained ERM (minimize loss subject to a norm constraint) and regularized ERM (minimize loss plus a penalty) are dual to each other. The Lagrange multiplier is the regularization parameter
  • DRO: distributionally robust optimization uses duality to convert a minimax problem (worst-case over distributions) into a tractable regularized problem

If you do not understand duality, you cannot understand why the kernel trick works, why regularization is equivalent to constraining, or how adversarial formulations lead to tractable algorithms.

[Figure: objective value vs. optimization progress; the primal trajectory is an upper bound, the dual trajectory a lower bound, and the duality gap between them shrinks to zero under strong duality.]

Mental Model

Every convex minimization problem has a "shadow" problem --- its dual --- which is a maximization problem. The dual always provides a lower bound on the primal optimum (weak duality). Under mild conditions (strong duality), the two optima are equal. At the optimal point, the primal and dual variables satisfy complementary conditions (KKT).

The Fenchel conjugate f^* is the fundamental building block: it converts a function into its "dual representation." For background on convex sets and functions, see convex optimization basics. The conjugate of the conjugate gives back the original function (for closed convex functions), which is the Fenchel-Moreau theorem.

Formal Setup

Definition

Fenchel Conjugate (Convex Conjugate)

The Fenchel conjugate (or convex conjugate) of a function f: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is:

f^*(y) = \sup_{x \in \mathbb{R}^d} \left\{ y^\top x - f(x) \right\}

The conjugate f^*: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is always convex (as a supremum of affine functions in y), even if f is not convex.

Interpretation: f^*(y) measures the maximum "gap" between the linear function x \mapsto y^\top x and f(x). Geometrically, f^*(y) is related to the supporting hyperplane of the epigraph of f with slope y.
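The supremum in the definition can also be approximated by brute force on a grid. A minimal numerical sketch (grid range and resolution are arbitrary choices) for f(x) = |x|, whose conjugate is the indicator of [-1, 1]:

```python
import numpy as np

# Brute-force Fenchel conjugate: f*(y) = sup_x { y*x - f(x) },
# approximated by maximizing over a fine grid of x values.
xs = np.linspace(-10, 10, 200001)

def conjugate(f, y):
    return np.max(y * xs - f(xs))

# f(x) = |x|: the conjugate is the indicator of the interval [-1, 1]
# (0 inside, +infinity outside; on a bounded grid "+infinity" shows up
# as a value that grows with the grid radius).
print(conjugate(np.abs, 0.5))  # ~0.0
print(conjugate(np.abs, 2.0))  # ~10.0, i.e. (2 - 1) * grid radius
```

The second value is an artifact of truncating the grid: on all of \mathbb{R} the supremum is genuinely +\infty, which is exactly the indicator-of-the-dual-ball behavior listed in the table below.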

Definition

Young-Fenchel Inequality

For any f and its conjugate f^*, and for all x, y:

x^\top y \leq f(x) + f^*(y)

This is immediate from the definition: f^*(y) \geq y^\top x - f(x). Equality holds if and only if y \in \partial f(x) (the subdifferential of f at x).

Key conjugate pairs (you should know these):

  • f(x) = \frac{1}{2}\|x\|^2 \;\longrightarrow\; f^*(y) = \frac{1}{2}\|y\|^2
  • f(x) = \|x\| (any norm) \;\longrightarrow\; f^*(y) = \delta_{\|y\|_* \leq 1} (indicator of the dual norm ball)
  • f(x) = \frac{1}{2}x^\top Ax (A \succ 0) \;\longrightarrow\; f^*(y) = \frac{1}{2}y^\top A^{-1}y
  • f(x) = e^x \;\longrightarrow\; f^*(y) = y\log y - y for y > 0, 0 for y = 0, +\infty for y < 0
  • f(x) = \delta_C(x) (indicator of a convex set) \;\longrightarrow\; f^*(y) = \sup_{x \in C} y^\top x (support function)
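These pairs can be spot-checked numerically. For example, the e^x row (grid bounds chosen so the maximizer x = \log y lies well inside the range):

```python
import numpy as np

# Check the pair f(x) = e^x  <->  f*(y) = y*log(y) - y (for y > 0)
# by brute-force maximization of y*x - e^x over a grid.
xs = np.linspace(-20, 5, 500001)

def conj_exp(y):
    return np.max(y * xs - np.exp(xs))

for y in [0.5, 1.0, 3.0]:
    print(y, conj_exp(y), y * np.log(y) - y)  # grid value vs. closed form
```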

Main Theorems

Theorem

Fenchel-Moreau Biconjugation Theorem

Statement

If f: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is proper (not identically +\infty and never taking the value -\infty), lower-semicontinuous, and convex, then:

f^{**} = f

That is, the conjugate of the conjugate recovers the original function.

Intuition

A closed convex function is completely determined by its supporting hyperplanes. The Fenchel conjugate encodes these hyperplanes. Conjugating twice reconstructs the function from its hyperplane representation. This is loosely analogous to the Fourier transform: the transformed representation loses no information, and the original function can be recovered from it.

If f is not convex or not lower-semicontinuous, then f^{**} is the closed convex envelope of f --- the largest closed convex function that is pointwise \leq f.

Proof Sketch

One direction is easy: f^{**}(x) = \sup_y \{x^\top y - f^*(y)\} \leq f(x) follows from the Young-Fenchel inequality (for each y, x^\top y - f^*(y) \leq f(x)).

The other direction uses the supporting hyperplane theorem: for every x_0 and every \alpha < f(x_0), the point (x_0, \alpha) lies below the epigraph of f. Since \text{epi}(f) is closed and convex, a separating hyperplane gives a y with y^\top x_0 - f^*(y) \geq \alpha. Since \alpha < f(x_0) is arbitrary, f^{**}(x_0) \geq f(x_0).

Why It Matters

Fenchel-Moreau is the theoretical foundation of all convex duality. It says that every closed convex function has a perfect "dual representation" via its conjugate. This enables:

  1. Converting between constrained and penalized formulations: the conjugate of an indicator function is a support function, and vice versa. This is exactly the duality between norm-constrained and norm-penalized optimization.

  2. Deriving dual problems: the Lagrangian dual of a convex problem can be written in terms of Fenchel conjugates. Strong duality (f^{**} = f) ensures no duality gap.

  3. Variational representations: many quantities in information theory (KL divergence, entropy) have dual representations via Fenchel conjugates. The Donsker-Varadhan variational formula for KL divergence is a consequence.

Failure Mode

If f is not closed (lower-semicontinuous) or not convex, then f^{**} \neq f. The biconjugate f^{**} will be the closed convex envelope, which can differ significantly. For example, if f(x) = -|x|, then f is concave, and f^{**}(x) = -\infty for all x. Always verify closedness and convexity before applying Fenchel-Moreau.
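This failure mode is easy to see numerically. A sketch with a hypothetical double-well function (grids and ranges are ad hoc choices), approximating both conjugations by grid maximization:

```python
import numpy as np

# Nonconvex double well f(x) = (x^2 - 1)^2. Its biconjugate is the
# convex envelope: 0 on [-1, 1], equal to f outside that interval.
xs = np.linspace(-3, 3, 1201)
ys = np.linspace(-100, 100, 2001)

f = (xs**2 - 1)**2
f1 = np.max(ys[:, None] * xs[None, :] - f[None, :], axis=1)   # f*(y)
f2 = np.max(xs[:, None] * ys[None, :] - f1[None, :], axis=1)  # f**(x)

i0 = np.argmin(np.abs(xs))       # x = 0: f = 1, but the envelope fills the well
i2 = np.argmin(np.abs(xs - 2))   # x = 2: f is locally convex, envelope = f
print(f[i0], f2[i0])  # 1.0  ~0.0
print(f[i2], f2[i2])  # 9.0  ~9.0
```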

Lagrangian Duality

Definition

Lagrangian and Dual Problem

Consider the primal problem:

\min_x f(x) \quad \text{subject to } g_i(x) \leq 0, \; i = 1, \ldots, m

The Lagrangian is:

L(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x), \quad \lambda_i \geq 0

The dual function is q(\lambda) = \inf_x L(x, \lambda). The dual problem is:

\max_{\lambda \geq 0} q(\lambda)

The dual function q is always concave (as an infimum of affine functions in \lambda), regardless of whether the primal is convex.
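To make the definitions concrete, here is a grid sketch of the dual function for a hypothetical toy problem, \min_x (x-2)^2 subject to x \leq 0 (so g(x) = x, with primal optimum p^* = 4 at x^* = 0):

```python
import numpy as np

# Dual function q(lam) = inf_x [ (x-2)^2 + lam*x ], via grid minimization.
xs = np.linspace(-10, 10, 200001)
lams = np.linspace(0, 8, 81)
q = np.array([np.min((xs - 2)**2 + lam * xs) for lam in lams])

p_star = 4.0                           # primal optimum: x* = 0, f(0) = 4
print(np.all(q <= p_star + 1e-9))      # True: weak duality, q(lam) <= p*
print(np.all(np.diff(q, 2) <= 1e-9))   # True: q is concave in lam
print(q.max())                         # ~4.0: the best lower bound is tight here
```

Note that q here is a pointwise minimum of functions that are affine in \lambda (one per grid point x), which is exactly why concavity holds regardless of the primal's convexity.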

Definition

Weak and Strong Duality

Let p^* be the primal optimal value and d^* the dual optimal value.

Weak duality (always holds): d^* \leq p^*. The dual provides a lower bound on the primal.

Strong duality: d^* = p^*. The duality gap is zero.

Strong duality holds for convex problems under Slater's condition: there exists a strictly feasible point \bar{x} with g_i(\bar{x}) < 0 for all i.

Theorem

Strong Duality via Slater's Condition

Statement

If f and g_1, \ldots, g_m are convex and there exists a strictly feasible point \bar{x} with g_i(\bar{x}) < 0 for all i, then strong duality holds: the primal and dual optimal values are equal, and the dual optimum is attained.

Intuition

Slater's condition says the feasible region has a non-empty interior (the constraints are not "barely" satisfied). This prevents pathological cases where the primal feasible set is "too thin" for duality to work. In practice, Slater's condition holds for almost every convex optimization problem you encounter in ML.

Proof Sketch

Consider the set \mathcal{V} = \{(u, t) : \exists x, \; g_i(x) \leq u_i, \; f(x) \leq t\} \subseteq \mathbb{R}^{m+1}. This set is convex (because f and the g_i are convex). The point (0, p^*) is on the boundary of \mathcal{V}. By the supporting hyperplane theorem, there exists a hyperplane separating (0, p^*) from the interior of \mathcal{V}. This hyperplane defines the optimal dual variables \lambda^*. Slater's condition ensures the hyperplane has the right orientation (the \lambda components are non-negative), which gives q(\lambda^*) = p^*.

Why It Matters

Strong duality is what makes dual methods actually useful:

  • SVM dual: the primal SVM minimizes over w (potentially high-dimensional). The dual maximizes over \alpha_i (one per data point). Under strong duality, both give the same answer. The dual depends on the data only through inner products x_i^\top x_j, enabling the kernel trick: replace inner products with k(x_i, x_j)

  • Regularization duality: minimizing f(x) subject to \|x\| \leq r is equivalent (by strong duality) to minimizing f(x) + \lambda\|x\| for some \lambda \geq 0. The Lagrange multiplier \lambda is the regularization parameter. This is why constraint-based and penalty-based regularization are interchangeable

  • DRO: worst-case expected loss over a ball of distributions can be dualized into a regularized empirical risk problem. This converts an intractable minimax problem into a tractable convex program.

Failure Mode

Without Slater's condition, strong duality can fail, or the dual optimum may not be attained. Example: minimize x subject to x^2 \leq 0. The only feasible point is x = 0 (primal value p^* = 0). The Lagrangian L(x, \lambda) = x + \lambda x^2 gives q(\lambda) = \inf_x (x + \lambda x^2) = -1/(4\lambda) for \lambda > 0, so d^* = \sup_{\lambda \geq 0} q(\lambda) = 0. Here strong duality technically holds, but the supremum is only approached as \lambda \to \infty and never attained. A genuine gap can also appear when Slater's condition fails: minimizing e^{-x} subject to x^2/y \leq 0 over the domain y > 0 has p^* = 1 but d^* = 0.
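The dual function of the first example can be traced numerically (a grid-minimization sketch; range and resolution are arbitrary):

```python
import numpy as np

# min x  s.t.  x^2 <= 0: dual q(lam) = inf_x (x + lam*x^2) = -1/(4*lam).
xs = np.linspace(-50, 50, 1000001)
for lam in [0.1, 1.0, 10.0, 100.0]:
    q = np.min(xs + lam * xs**2)
    print(lam, q)   # q climbs toward 0 as lam grows but never reaches it
```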

KKT Conditions

Definition

Karush-Kuhn-Tucker (KKT) Conditions

For the convex problem \min_x f(x) subject to g_i(x) \leq 0, under strong duality, the primal-dual pair (x^*, \lambda^*) is optimal if and only if:

  1. Primal feasibility: g_i(x^*) \leq 0 for all i
  2. Dual feasibility: \lambda_i^* \geq 0 for all i
  3. Stationarity: 0 \in \partial f(x^*) + \sum_i \lambda_i^* \partial g_i(x^*)
  4. Complementary slackness: \lambda_i^* g_i(x^*) = 0 for all i

Complementary slackness says: either a constraint is tight (g_i(x^*) = 0) or its multiplier is zero (\lambda_i^* = 0). Active constraints "matter"; inactive constraints are irrelevant. In SVMs, the data points with \lambda_i^* > 0 are the support vectors.
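The four conditions can be checked mechanically. A sketch on a hypothetical one-dimensional problem, \min_x (x-2)^2 subject to x \leq 0, whose optimum x^* = 0 and multiplier \lambda^* = 4 are found by hand from stationarity:

```python
# Toy problem: f(x) = (x-2)^2, g(x) = x <= 0. By hand: x* = 0, lam* = 4.
x_star, lam_star = 0.0, 4.0

g = x_star                  # g(x*) = 0: the constraint is active
grad_f = 2 * (x_star - 2)   # f'(x*) = -4
grad_g = 1.0                # g'(x*) = 1

print(g <= 0)                           # True: primal feasibility
print(lam_star >= 0)                    # True: dual feasibility
print(grad_f + lam_star * grad_g == 0)  # True: stationarity
print(lam_star * g == 0)                # True: complementary slackness
```

Here the constraint is active and its multiplier is positive; deleting the constraint would move the optimum, which is exactly what a nonzero multiplier signals.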

Sion Minimax Theorem

Theorem

Sion Minimax Theorem

If \mathcal{X} is convex and compact, \mathcal{Y} is convex, and \phi(x, y) is convex-concave (convex in x for each y, concave in y for each x), lower-semicontinuous in x and upper-semicontinuous in y, then:

\min_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \phi(x, y) = \sup_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \phi(x, y)

The min and sup can be interchanged. This is a generalization of von Neumann's minimax theorem for zero-sum games.

Why it matters for ML: The minimax theorem underlies adversarial formulations. In GANs, DRO, and robust optimization, you want to swap a min over model parameters with a max over adversarial perturbations. The Sion theorem tells you when this swap is valid.
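A grid-based sanity check on a hypothetical convex-concave function over [0,1]^2 (the function and grids are arbitrary choices; on a fine grid the two ordered values coincide up to discretization error):

```python
import numpy as np

# phi(x, y) = (x - 0.3)^2 - (y - 0.6)^2 + x*y: convex in x, concave in y.
xs = np.linspace(0, 1, 1001)
ys = np.linspace(0, 1, 1001)
phi = ((xs[:, None] - 0.3)**2 - (ys[None, :] - 0.6)**2
       + xs[:, None] * ys[None, :])

minimax = phi.max(axis=1).min()   # min over x of max over y
maximin = phi.min(axis=0).max()   # max over y of min over x
print(minimax, maximin)           # both ~0.09: the saddle value at (0, 0.6)
```

For any matrix, maximin \leq minimax always holds; Sion's theorem is what guarantees the gap vanishes for convex-concave \phi.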

Canonical Examples

Example

Conjugate of the squared norm

f(x) = \frac{1}{2}\|x\|^2. Then:

f^*(y) = \sup_x \left\{ y^\top x - \frac{1}{2}\|x\|^2 \right\}

Taking the gradient and setting it to zero: y - x = 0, so x^* = y. Substituting: f^*(y) = y^\top y - \frac{1}{2}\|y\|^2 = \frac{1}{2}\|y\|^2.

The squared norm is its own conjugate. This self-duality is why \ell_2 regularization leads to particularly clean dual problems (e.g., ridge regression).

Example

Duality between constrained and penalized optimization

Primal (constrained): \min_w f(w) subject to \|w\| \leq r

Lagrangian: L(w, \lambda) = f(w) + \lambda(\|w\| - r)

Dual: \max_{\lambda \geq 0} \left\{ \inf_w [f(w) + \lambda\|w\|] - \lambda r \right\}

The inner minimization \inf_w [f(w) + \lambda\|w\|] is exactly the penalized problem with regularization parameter \lambda.

By strong duality, there exists \lambda^* such that the constrained problem with radius r and the penalized problem with parameter \lambda^* have the same solution. This is the rigorous justification for the equivalence between norm-constrained and norm-penalized regularization.
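A one-dimensional numerical sketch of this equivalence (hypothetical f(w) = (w-2)^2 with radius r = 1; the matching multiplier \lambda^* = 2 comes from the stationarity condition 2(w - 2) + \lambda = 0 at w = 1):

```python
import numpy as np

# Constrained: min (w-2)^2  s.t. |w| <= 1   vs.
# penalized:   min (w-2)^2 + lam*|w|  with lam = 2.
ws = np.linspace(-5, 5, 1000001)
f = (ws - 2)**2

mask = np.abs(ws) <= 1
w_con = ws[mask][np.argmin(f[mask])]          # constrained minimizer
w_pen = ws[np.argmin(f + 2 * np.abs(ws))]     # penalized minimizer
print(w_con, w_pen)   # both ~1.0: same solution
```

Changing the radius r changes the matching \lambda^*; the map between them depends on f, which is why the "equivalent" penalty level is data-dependent in practice.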

Common Confusions

Watch Out

Weak duality always holds; strong duality requires conditions

Students sometimes assume that the primal and dual always have the same optimal value. Weak duality (d^* \leq p^*) is trivially true, but strong duality (d^* = p^*) requires assumptions like convexity plus Slater's condition. For non-convex problems, the duality gap can be arbitrarily large. Always check the conditions before claiming strong duality.

Watch Out

The dual problem is always concave, even if the primal is non-convex

The dual function q(\lambda) = \inf_x L(x, \lambda) is concave in \lambda regardless of whether f or the g_i are convex. This is because q is a pointwise infimum of affine functions in \lambda. The dual is always a concave maximization problem, which is computationally tractable. The catch: for non-convex primal problems, d^* < p^* is possible, and the dual only gives a lower bound.

Watch Out

KKT conditions are necessary and sufficient only under strong duality

For general non-convex problems, KKT is only necessary (assuming constraint qualification). For convex problems with strong duality, KKT is both necessary and sufficient. In ML, most problems of interest (SVMs, regularized linear models, DRO) are convex and satisfy Slater's condition, so KKT gives exact optimality conditions.

Summary

  • Fenchel conjugate: f^*(y) = \sup_x \{y^\top x - f(x)\}
  • Young-Fenchel inequality: x^\top y \leq f(x) + f^*(y), with equality iff y \in \partial f(x)
  • Fenchel-Moreau: f^{**} = f for closed convex functions
  • Weak duality always holds (d^* \leq p^*); strong duality requires convexity + Slater's condition
  • KKT conditions: primal feasibility, dual feasibility, stationarity, complementary slackness
  • Duality converts constrained problems into penalized problems (regularization = Lagrangian duality)
  • SVM dual reveals the kernel trick; DRO dual yields tractable regularization

Exercises

ExerciseCore

Problem

Compute the Fenchel conjugate of f(x) = \|x\|_1 (the \ell_1 norm on \mathbb{R}^d). What is the dual norm of \ell_1?

ExerciseAdvanced

Problem

Derive the dual of the soft-margin SVM problem:

\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(w^\top x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0

Show that the dual depends on the data only through inner products x_i^\top x_j, which enables the kernel trick.

References

Canonical:

  • Rockafellar, Convex Analysis (1970), Chapters 12, 26-28, 31
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 5
  • Borwein & Lewis, Convex Analysis and Nonlinear Optimization (2nd ed., 2006)

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 15 (SVM duality)
  • Ben-Tal, El Ghaoui, Nemirovski, Robust Optimization (2009), Chapter 4

Next Topics

Building on convex duality:

Last reviewed: April 2026
