
Mathematical Infrastructure

Convex Duality

Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.

Core · Tier 1 · Stable · ~75 min

Why This Matters

Duality is the tool that converts hard optimization problems into equivalent (or approximately equivalent) problems that are easier to solve, analyze, or interpret. In machine learning, duality is not an abstract luxury --- it is the engine behind some of the most important algorithms and insights:

  • SVMs: the dual of the SVM problem reveals the kernel trick, transforming an optimization in parameter space into one in the space of inner products between data points
  • Regularization: constrained ERM (minimize loss subject to a norm constraint) and regularized ERM (minimize loss plus a penalty) are dual to each other. The Lagrange multiplier is the regularization parameter
  • DRO: distributionally robust optimization uses duality to convert a minimax problem (worst-case over distributions) into a tractable regularized problem

If you do not understand duality, you cannot understand why the kernel trick works, why regularization is equivalent to constraining, or how adversarial formulations lead to tractable algorithms.

[Figure: objective value vs. optimization progress; the primal trajectory is an upper bound, the dual trajectory a lower bound, and the duality gap between them shrinks to zero under strong duality.]

Mental Model

Every convex minimization problem has a "shadow" problem --- its dual --- which is a maximization problem. The dual always provides a lower bound on the primal optimum (weak duality). Under mild conditions (strong duality), the two optima are equal. At the optimal point, the primal and dual variables satisfy complementary conditions (KKT).

The Fenchel conjugate f^* is the fundamental building block: it converts a function into its "dual representation." For background on convex sets and functions, see convex optimization basics. The conjugate of the conjugate gives back the original function (for closed convex functions), which is the Fenchel-Moreau theorem.

Formal Setup

Definition

Fenchel Conjugate (Convex Conjugate)

The Fenchel conjugate (or convex conjugate) of a function f: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is:

f^*(y) = \sup_{x \in \mathbb{R}^d} \left\{ y^\top x - f(x) \right\}

The conjugate f^*: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is always convex (as a supremum of affine functions in y), even if f is not convex.

Interpretation: f^*(y) measures the maximum "gap" between the linear function x \mapsto y^\top x and f(x). Geometrically, f^*(y) is related to the supporting hyperplane of the epigraph of f with slope y.
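The supremum in the definition can also be approximated by brute force on a grid. A minimal numerical sketch (grid range and resolution are arbitrary choices) for f(x) = |x|, whose conjugate is the indicator of [-1, 1]:

```python
import numpy as np

# Brute-force Fenchel conjugate: f*(y) = sup_x { y*x - f(x) },
# approximated by maximizing over a fine grid of x values.
xs = np.linspace(-10, 10, 200001)

def conjugate(f, y):
    return np.max(y * xs - f(xs))

# f(x) = |x|: the conjugate is the indicator of the interval [-1, 1]
# (0 inside, +infinity outside; on a bounded grid "+infinity" shows up
# as a value that grows with the grid radius).
print(conjugate(np.abs, 0.5))  # ~0.0
print(conjugate(np.abs, 2.0))  # ~10.0, i.e. (2 - 1) * grid radius
```

The second value is an artifact of truncating the grid: on all of \mathbb{R} the supremum is genuinely +\infty, which is exactly the indicator-of-the-dual-ball behavior listed in the table below.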

Definition

Young-Fenchel Inequality

For any f and its conjugate f^*, and for all x, y:

x^\top y \leq f(x) + f^*(y)

This is immediate from the definition: f^*(y) \geq y^\top x - f(x). Equality holds if and only if y \in \partial f(x) (the subdifferential of f at x).

Key conjugate pairs (you should know these):

  • f(x) = \frac{1}{2}\|x\|^2 \;\longrightarrow\; f^*(y) = \frac{1}{2}\|y\|^2
  • f(x) = \|x\| (any norm) \;\longrightarrow\; f^*(y) = \delta_{\|y\|_* \leq 1} (indicator of the dual norm ball)
  • f(x) = \frac{1}{2}x^\top Ax (A \succ 0) \;\longrightarrow\; f^*(y) = \frac{1}{2}y^\top A^{-1}y
  • f(x) = e^x \;\longrightarrow\; f^*(y) = y\log y - y for y > 0, 0 for y = 0, +\infty for y < 0
  • f(x) = \delta_C(x) (indicator of a convex set) \;\longrightarrow\; f^*(y) = \sup_{x \in C} y^\top x (support function)
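These pairs can be spot-checked numerically. For example, the e^x row (grid bounds chosen so the maximizer x = \log y lies well inside the range):

```python
import numpy as np

# Check the pair f(x) = e^x  <->  f*(y) = y*log(y) - y (for y > 0)
# by brute-force maximization of y*x - e^x over a grid.
xs = np.linspace(-20, 5, 500001)

def conj_exp(y):
    return np.max(y * xs - np.exp(xs))

for y in [0.5, 1.0, 3.0]:
    print(y, conj_exp(y), y * np.log(y) - y)  # grid value vs. closed form
```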

Main Theorems

Theorem

Fenchel-Moreau Biconjugation Theorem

Statement

If f: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is proper (not identically +\infty and never taking the value -\infty), lower-semicontinuous, and convex, then:

f^{**} = f

That is, the conjugate of the conjugate recovers the original function.

Intuition

A closed convex function is completely determined by its supporting hyperplanes. The Fenchel conjugate encodes these hyperplanes. Conjugating twice reconstructs the function from its hyperplane representation. This is loosely analogous to the Fourier transform: the transformed representation loses no information, and the original function can be recovered from it.

If f is not convex or not lower-semicontinuous, then f^{**} is the closed convex envelope of f --- the largest closed convex function that is pointwise \leq f.

Proof Sketch

One direction is easy: f^{**}(x) = \sup_y \{x^\top y - f^*(y)\} \leq f(x) follows from the Young-Fenchel inequality (for each y, x^\top y - f^*(y) \leq f(x)).

The other direction uses the supporting hyperplane theorem: for every x_0 and every \alpha < f(x_0), the point (x_0, \alpha) lies below the epigraph of f. Since \text{epi}(f) is closed and convex, a separating hyperplane gives a y with y^\top x_0 - f^*(y) \geq \alpha. Since \alpha < f(x_0) is arbitrary, f^{**}(x_0) \geq f(x_0).

Why It Matters

Fenchel-Moreau is the theoretical foundation of all convex duality. It says that every closed convex function has a perfect "dual representation" via its conjugate. This enables:

  1. Converting between constrained and penalized formulations: the conjugate of an indicator function is a support function, and vice versa. This is exactly the duality between norm-constrained and norm-penalized optimization.

  2. Deriving dual problems: the Lagrangian dual of a convex problem can be written in terms of Fenchel conjugates. Strong duality (f^{**} = f) ensures no duality gap.

  3. Variational representations: many quantities in information theory (KL divergence, entropy) have dual representations via Fenchel conjugates. The Donsker-Varadhan variational formula for KL divergence is a consequence.

Failure Mode

If f is not closed (lower-semicontinuous) or not convex, then f^{**} \neq f. The biconjugate f^{**} will be the closed convex envelope, which can differ significantly. For example, if f(x) = -|x|, then f is concave, and f^{**}(x) = -\infty for all x. Always verify closedness and convexity before applying Fenchel-Moreau.
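This failure mode is easy to see numerically. A sketch with a hypothetical double-well function (grids and ranges are ad hoc choices), approximating both conjugations by grid maximization:

```python
import numpy as np

# Nonconvex double well f(x) = (x^2 - 1)^2. Its biconjugate is the
# convex envelope: 0 on [-1, 1], equal to f outside that interval.
xs = np.linspace(-3, 3, 1201)
ys = np.linspace(-100, 100, 2001)

f = (xs**2 - 1)**2
f1 = np.max(ys[:, None] * xs[None, :] - f[None, :], axis=1)   # f*(y)
f2 = np.max(xs[:, None] * ys[None, :] - f1[None, :], axis=1)  # f**(x)

i0 = np.argmin(np.abs(xs))       # x = 0: f = 1, but the envelope fills the well
i2 = np.argmin(np.abs(xs - 2))   # x = 2: f is locally convex, envelope = f
print(f[i0], f2[i0])  # 1.0  ~0.0
print(f[i2], f2[i2])  # 9.0  ~9.0
```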

Lagrangian Duality

Definition

Lagrangian and Dual Problem

Consider the primal problem:

\min_x f(x) \quad \text{subject to } g_i(x) \leq 0, \; i = 1, \ldots, m

The Lagrangian is:

L(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x), \quad \lambda_i \geq 0

The dual function is q(\lambda) = \inf_x L(x, \lambda). The dual problem is:

\max_{\lambda \geq 0} q(\lambda)

The dual function q is always concave (as an infimum of affine functions in \lambda), regardless of whether the primal is convex.
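To make the definitions concrete, here is a grid sketch of the dual function for a hypothetical toy problem, \min_x (x-2)^2 subject to x \leq 0 (so g(x) = x, with primal optimum p^* = 4 at x^* = 0):

```python
import numpy as np

# Dual function q(lam) = inf_x [ (x-2)^2 + lam*x ], via grid minimization.
xs = np.linspace(-10, 10, 200001)
lams = np.linspace(0, 8, 81)
q = np.array([np.min((xs - 2)**2 + lam * xs) for lam in lams])

p_star = 4.0                           # primal optimum: x* = 0, f(0) = 4
print(np.all(q <= p_star + 1e-9))      # True: weak duality, q(lam) <= p*
print(np.all(np.diff(q, 2) <= 1e-9))   # True: q is concave in lam
print(q.max())                         # ~4.0: the best lower bound is tight here
```

Note that q here is a pointwise minimum of functions that are affine in \lambda (one per grid point x), which is exactly why concavity holds regardless of the primal's convexity.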

Definition

Weak and Strong Duality

Let p^* be the primal optimal value and d^* the dual optimal value.

Weak duality (always holds): d^* \leq p^*. The dual provides a lower bound on the primal.

Strong duality: d^* = p^*. The duality gap is zero.

Strong duality holds for convex problems under Slater's condition: there exists a strictly feasible point \bar{x} with g_i(\bar{x}) < 0 for all i.

Theorem

Strong Duality via Slater's Condition

Statement

If f and g_1, \ldots, g_m are convex and there exists a strictly feasible point \bar{x} with g_i(\bar{x}) < 0 for all i, then strong duality holds: the primal and dual optimal values are equal, and the dual optimum is attained.

Intuition

Slater's condition says the feasible region has a non-empty interior (the constraints are not "barely" satisfied). This prevents pathological cases where the primal feasible set is "too thin" for duality to work. In practice, Slater's condition holds for almost every convex optimization problem you encounter in ML.

Proof Sketch

Consider the set \mathcal{V} = \{(u, t) : \exists x, \; g_i(x) \leq u_i, \; f(x) \leq t\} \subseteq \mathbb{R}^{m+1}. This set is convex (because f and the g_i are convex). The point (0, p^*) is on the boundary of \mathcal{V}. By the supporting hyperplane theorem, there exists a hyperplane separating (0, p^*) from the interior of \mathcal{V}. This hyperplane defines the optimal dual variables \lambda^*. Slater's condition ensures the hyperplane has the right orientation (the \lambda components are non-negative), which gives q(\lambda^*) = p^*.

Why It Matters

Strong duality is what makes dual methods actually useful:

  • SVM dual: the primal SVM minimizes over w (potentially high-dimensional). The dual maximizes over \alpha_i (one per data point). Under strong duality, both give the same answer. The dual depends on the data only through inner products x_i^\top x_j, enabling the kernel trick: replace inner products with k(x_i, x_j)

  • Regularization duality: minimizing f(x) subject to \|x\| \leq r is equivalent (by strong duality) to minimizing f(x) + \lambda\|x\| for some \lambda \geq 0. The Lagrange multiplier \lambda is the regularization parameter. This is why constraint-based and penalty-based regularization are interchangeable

  • DRO: worst-case expected loss over a ball of distributions can be dualized into a regularized empirical risk problem. This converts an intractable minimax problem into a tractable convex program.

Failure Mode

Without Slater's condition, strong duality can fail, or the dual optimum may not be attained. Example: minimize x subject to x^2 \leq 0. The only feasible point is x = 0 (primal value p^* = 0). The Lagrangian L(x, \lambda) = x + \lambda x^2 gives q(\lambda) = \inf_x (x + \lambda x^2) = -1/(4\lambda) for \lambda > 0, so d^* = \sup_{\lambda \geq 0} q(\lambda) = 0. Here strong duality technically holds, but the supremum is only approached as \lambda \to \infty and never attained. A genuine gap can also appear when Slater's condition fails: minimizing e^{-x} subject to x^2/y \leq 0 over the domain y > 0 has p^* = 1 but d^* = 0.
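The dual function of the first example can be traced numerically (a grid-minimization sketch; range and resolution are arbitrary):

```python
import numpy as np

# min x  s.t.  x^2 <= 0: dual q(lam) = inf_x (x + lam*x^2) = -1/(4*lam).
xs = np.linspace(-50, 50, 1000001)
for lam in [0.1, 1.0, 10.0, 100.0]:
    q = np.min(xs + lam * xs**2)
    print(lam, q)   # q climbs toward 0 as lam grows but never reaches it
```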

KKT Conditions

Definition

Karush-Kuhn-Tucker (KKT) Conditions

For the convex problem \min_x f(x) subject to g_i(x) \leq 0, under strong duality, the primal-dual pair (x^*, \lambda^*) is optimal if and only if:

  1. Primal feasibility: g_i(x^*) \leq 0 for all i
  2. Dual feasibility: \lambda_i^* \geq 0 for all i
  3. Stationarity: 0 \in \partial f(x^*) + \sum_i \lambda_i^* \partial g_i(x^*)
  4. Complementary slackness: \lambda_i^* g_i(x^*) = 0 for all i

Complementary slackness says: either a constraint is tight (g_i(x^*) = 0) or its multiplier is zero (\lambda_i^* = 0). Active constraints "matter"; inactive constraints are irrelevant. In SVMs, the data points with \lambda_i^* > 0 are the support vectors.
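The four conditions can be checked mechanically. A sketch on a hypothetical one-dimensional problem, \min_x (x-2)^2 subject to x \leq 0, whose optimum x^* = 0 and multiplier \lambda^* = 4 are found by hand from stationarity:

```python
# Toy problem: f(x) = (x-2)^2, g(x) = x <= 0. By hand: x* = 0, lam* = 4.
x_star, lam_star = 0.0, 4.0

g = x_star                  # g(x*) = 0: the constraint is active
grad_f = 2 * (x_star - 2)   # f'(x*) = -4
grad_g = 1.0                # g'(x*) = 1

print(g <= 0)                           # True: primal feasibility
print(lam_star >= 0)                    # True: dual feasibility
print(grad_f + lam_star * grad_g == 0)  # True: stationarity
print(lam_star * g == 0)                # True: complementary slackness
```

Here the constraint is active and its multiplier is positive; deleting the constraint would move the optimum, which is exactly what a nonzero multiplier signals.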

Sion Minimax Theorem

Theorem

Sion Minimax Theorem

If \mathcal{X} is convex and compact, \mathcal{Y} is convex, and \phi(x, y) is convex-concave (convex in x for each y, concave in y for each x), lower-semicontinuous in x and upper-semicontinuous in y, then:

\min_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \phi(x, y) = \sup_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \phi(x, y)

The min and sup can be interchanged. This is a generalization of von Neumann's minimax theorem for zero-sum games.

Why it matters for ML: The minimax theorem underlies adversarial formulations. In GANs, DRO, and robust optimization, you want to swap a min over model parameters with a max over adversarial perturbations. The Sion theorem tells you when this swap is valid.
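A grid-based sanity check on a hypothetical convex-concave function over [0,1]^2 (the function and grids are arbitrary choices; on a fine grid the two ordered values coincide up to discretization error):

```python
import numpy as np

# phi(x, y) = (x - 0.3)^2 - (y - 0.6)^2 + x*y: convex in x, concave in y.
xs = np.linspace(0, 1, 1001)
ys = np.linspace(0, 1, 1001)
phi = ((xs[:, None] - 0.3)**2 - (ys[None, :] - 0.6)**2
       + xs[:, None] * ys[None, :])

minimax = phi.max(axis=1).min()   # min over x of max over y
maximin = phi.min(axis=0).max()   # max over y of min over x
print(minimax, maximin)           # both ~0.09: the saddle value at (0, 0.6)
```

For any matrix, maximin \leq minimax always holds; Sion's theorem is what guarantees the gap vanishes for convex-concave \phi.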

Canonical Examples

Example

Conjugate of the squared norm

f(x) = \frac{1}{2}\|x\|^2. Then:

f^*(y) = \sup_x \left\{ y^\top x - \frac{1}{2}\|x\|^2 \right\}

Taking the gradient and setting it to zero: y - x = 0, so x^* = y. Substituting: f^*(y) = y^\top y - \frac{1}{2}\|y\|^2 = \frac{1}{2}\|y\|^2.

The squared norm is its own conjugate. This self-duality is why \ell_2 regularization leads to particularly clean dual problems (e.g., ridge regression).

Example

Duality between constrained and penalized optimization

Primal (constrained): \min_w f(w) subject to \|w\| \leq r

Lagrangian: L(w, \lambda) = f(w) + \lambda(\|w\| - r)

Dual: \max_{\lambda \geq 0} \left\{ \inf_w [f(w) + \lambda\|w\|] - \lambda r \right\}

The inner minimization \inf_w [f(w) + \lambda\|w\|] is exactly the penalized problem with regularization parameter \lambda.

By strong duality, there exists \lambda^* such that the constrained problem with radius r and the penalized problem with parameter \lambda^* have the same solution. This is the rigorous justification for the equivalence between norm-constrained and norm-penalized regularization.
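A one-dimensional numerical sketch of this equivalence (hypothetical f(w) = (w-2)^2 with radius r = 1; the matching multiplier \lambda^* = 2 comes from the stationarity condition 2(w - 2) + \lambda = 0 at w = 1):

```python
import numpy as np

# Constrained: min (w-2)^2  s.t. |w| <= 1   vs.
# penalized:   min (w-2)^2 + lam*|w|  with lam = 2.
ws = np.linspace(-5, 5, 1000001)
f = (ws - 2)**2

mask = np.abs(ws) <= 1
w_con = ws[mask][np.argmin(f[mask])]          # constrained minimizer
w_pen = ws[np.argmin(f + 2 * np.abs(ws))]     # penalized minimizer
print(w_con, w_pen)   # both ~1.0: same solution
```

Changing the radius r changes the matching \lambda^*; the map between them depends on f, which is why the "equivalent" penalty level is data-dependent in practice.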

Common Confusions

Watch Out

Weak duality always holds; strong duality requires conditions

Students sometimes assume that the primal and dual always have the same optimal value. Weak duality (d^* \leq p^*) is trivially true, but strong duality (d^* = p^*) requires assumptions like convexity plus Slater's condition. For non-convex problems, the duality gap can be arbitrarily large. Always check the conditions before claiming strong duality.

Watch Out

The dual problem is always concave, even if the primal is non-convex

The dual function q(\lambda) = \inf_x L(x, \lambda) is concave in \lambda regardless of whether f or the g_i are convex. This is because q is a pointwise infimum of affine functions in \lambda. The dual is always a concave maximization problem, which is computationally tractable. The catch: for non-convex primal problems, d^* < p^* is possible, and the dual only gives a lower bound.

Watch Out

KKT conditions are necessary and sufficient only under strong duality

For general non-convex problems, KKT is only necessary (assuming constraint qualification). For convex problems with strong duality, KKT is both necessary and sufficient. In ML, most problems of interest (SVMs, regularized linear models, DRO) are convex and satisfy Slater's condition, so KKT gives exact optimality conditions.

Summary

  • Fenchel conjugate: f^*(y) = \sup_x \{y^\top x - f(x)\}
  • Young-Fenchel inequality: x^\top y \leq f(x) + f^*(y), with equality iff y \in \partial f(x)
  • Fenchel-Moreau: f^{**} = f for closed convex functions
  • Weak duality always holds (d^* \leq p^*); strong duality requires convexity + Slater's condition
  • KKT conditions: primal feasibility, dual feasibility, stationarity, complementary slackness
  • Duality converts constrained problems into penalized problems (regularization = Lagrangian duality)
  • SVM dual reveals the kernel trick; DRO dual yields tractable regularization

Exercises

ExerciseCore

Problem

Compute the Fenchel conjugate of f(x) = \|x\|_1 (the \ell_1 norm on \mathbb{R}^d). What is the dual norm of \ell_1?

ExerciseAdvanced

Problem

Derive the dual of the soft-margin SVM problem:

\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(w^\top x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0

Show that the dual depends on the data only through inner products x_i^\top x_j, which enables the kernel trick.

References

Canonical:

  • Rockafellar, Convex Analysis (1970), Chapters 12, 26-28, 31
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 5
  • Borwein & Lewis, Convex Analysis and Nonlinear Optimization (2nd ed., 2006)

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 15 (SVM duality)
  • Ben-Tal, El Ghaoui, Nemirovski, Robust Optimization (2009), Chapter 4

Next Topics

Building on convex duality:

Last reviewed: April 2026
