Mathematical Infrastructure
Convex Duality
Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.
Why This Matters
Duality is the tool that converts hard optimization problems into equivalent (or approximately equivalent) problems that are easier to solve, analyze, or interpret. In machine learning, duality is not an abstract luxury --- it is the engine behind some of the most important algorithms and insights:
- SVMs: the dual of the SVM problem reveals the kernel trick, transforming an optimization in parameter space into one in the space of inner products between data points
- Regularization: constrained ERM (minimize loss subject to a norm constraint) and regularized ERM (minimize loss plus a penalty) are dual to each other. The Lagrange multiplier is the regularization parameter
- DRO: distributionally robust optimization uses duality to convert a minimax problem (worst-case over distributions) into a tractable regularized problem
If you do not understand duality, you cannot understand why the kernel trick works, why regularization is equivalent to constraining, or how adversarial formulations lead to tractable algorithms.
Mental Model
Every convex minimization problem has a "shadow" problem --- its dual --- which is a maximization problem. The dual always provides a lower bound on the primal optimum (weak duality). Under mild conditions (strong duality), the two optima are equal. At the optimal point, the primal and dual variables satisfy complementary conditions (KKT).
The Fenchel conjugate is the fundamental building block: it converts a function into its "dual representation." For background on convex sets and functions, see convex optimization basics. The conjugate of the conjugate gives back the original function (for closed convex functions), which is the Fenchel-Moreau theorem.
Formal Setup
Fenchel Conjugate (Convex Conjugate)
The Fenchel conjugate (or convex conjugate) of a function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is:

$$f^*(y) = \sup_{x} \big( \langle y, x \rangle - f(x) \big)$$

The conjugate $f^*$ is always convex (as a supremum of affine functions in $y$), even if $f$ is not convex.
Interpretation: $f^*(y)$ measures the maximum "gap" between the linear function $\langle y, x \rangle$ and $f(x)$. Geometrically, $f^*(y)$ is related to the supporting hyperplane of the epigraph of $f$ with slope $y$.
Young-Fenchel Inequality
For any $f$ and its conjugate $f^*$, and for all $x, y$:

$$f(x) + f^*(y) \geq \langle x, y \rangle$$

This is immediate from the definition: $f^*(y) \geq \langle y, x \rangle - f(x)$. Equality holds if and only if $y \in \partial f(x)$ (the subdifferential of $f$ at $x$).
Key conjugate pairs (you should know these):

| $f(x)$ | $f^*(y)$ |
| --- | --- |
| $\lVert x \rVert$ (any norm) | $\iota_{\{\lVert y \rVert_* \leq 1\}}$ (indicator of dual norm ball) |
| $\frac{1}{p}\lvert x \rvert^p$ ($p > 1$) | $\frac{1}{q}\lvert y \rvert^q$ with $\frac{1}{p} + \frac{1}{q} = 1$ |
| $x \log x$ for $x > 0$, $0$ for $x = 0$, $+\infty$ for $x < 0$ | $e^{y-1}$ |
| $\iota_C(x)$ (indicator of convex set $C$) | $\sigma_C(y) = \sup_{x \in C} \langle y, x \rangle$ (support function) |
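Conjugate pairs like these can be spot-checked numerically by approximating the supremum on a grid. A minimal one-dimensional sketch (the grid width and spacing are arbitrary choices, not a general conjugation routine):

```python
import numpy as np

# Approximate the Fenchel conjugate f*(y) = sup_x (y*x - f(x)) on a 1-D grid.
xs = np.linspace(-10, 10, 20001)

def conjugate(f, y):
    return np.max(y * xs - f(xs))

# f(x) = x^2/2 is self-conjugate: f*(y) = y^2/2.
f = lambda x: 0.5 * x**2
for y in [-2.0, 0.5, 3.0]:
    assert abs(conjugate(f, y) - 0.5 * y**2) < 1e-6

# f(x) = |x|: the conjugate is 0 inside the dual-norm ball |y| <= 1,
# and +infinity outside (on a finite grid it grows with the grid width).
g = lambda x: np.abs(x)
print(conjugate(g, 0.5))   # -> 0.0
print(conjugate(g, 2.0))   # large: a truncated +infinity
```

The sup for the absolute-value example is genuinely infinite outside the dual ball, which the finite grid can only hint at; widening the grid makes the value grow without bound.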
Main Theorems
Fenchel-Moreau Biconjugation Theorem
Statement
If $f$ is proper (not identically $+\infty$ and never $-\infty$), lower-semicontinuous, and convex, then:

$$f^{**} = f$$

That is, the conjugate of the conjugate recovers the original function.
Intuition
A closed convex function is completely determined by its supporting hyperplanes. The Fenchel conjugate encodes these hyperplanes. Conjugating twice reconstructs the function from its hyperplane representation. This is analogous to the Fourier transform: applying it twice (with appropriate normalization) gives back the original function.
If $f$ is not convex or not lower-semicontinuous, then $f^{**}$ is the closed convex envelope of $f$ --- the largest closed convex function that is pointwise $\leq f$.
Proof Sketch
One direction is easy: $f^{**} \leq f$ follows from the Young-Fenchel inequality (for each $y$, $\langle x, y \rangle - f^*(y) \leq f(x)$).
The other direction uses the supporting hyperplane theorem: for every $x$ and every $t < f(x)$, the point $(x, t)$ lies below the epigraph of $f$. Since the epigraph is closed and convex, a separating hyperplane gives a $y$ with $\langle y, x \rangle - f^*(y) \geq t$. Since $t < f(x)$ is arbitrary, $f^{**}(x) \geq f(x)$.
Why It Matters
Fenchel-Moreau is the theoretical foundation of all convex duality. It says that every closed convex function has a perfect "dual representation" via its conjugate. This enables:
- Converting between constrained and penalized formulations: the conjugate of an indicator function is a support function, and vice versa. This is exactly the duality between norm-constrained and norm-penalized optimization.
- Deriving dual problems: the Lagrangian dual of a convex problem can be written in terms of Fenchel conjugates. Strong duality ($d^\star = p^\star$) ensures no duality gap.
- Variational representations: many quantities in information theory (KL divergence, entropy) have dual representations via Fenchel conjugates. The Donsker-Varadhan variational formula for KL divergence is a consequence.
Failure Mode
If $f$ is not closed (lower-semicontinuous) or not convex, then $f^{**} \neq f$. The biconjugate will be the closed convex envelope, which can differ significantly. For example, if $f(x) = -x^2$, then $f$ is concave, $f^*(y) = +\infty$ for all $y$, and $f^{**}(x) = -\infty$ for all $x$. Always verify closedness and convexity before applying Fenchel-Moreau.
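The envelope behavior can be seen with a discrete Legendre transform: conjugating a non-convex double well twice on a grid yields its convex envelope, not the function itself (a sketch; the grids and ranges are arbitrary choices):

```python
import numpy as np

# Biconjugation on a grid: for non-convex f, f** is the convex envelope.
xs = np.linspace(-2, 2, 2001)
f = (xs**2 - 1)**2                       # double well: non-convex, zero at x = +/-1

ys = np.linspace(-40, 40, 2001)
f_star = np.max(ys[:, None] * xs[None, :] - f[None, :], axis=1)        # f*(y)
f_bicon = np.max(xs[:, None] * ys[None, :] - f_star[None, :], axis=1)  # f**(x)

assert np.all(f_bicon <= f + 1e-8)       # f** <= f everywhere
i0 = np.argmin(np.abs(xs))               # index of x = 0
print(f[i0], f_bicon[i0])                # f(0) = 1, but f**(0) ~ 0: a strict gap
```

Between the two wells the envelope flattens to zero, so $f^{**}$ and $f$ disagree on the whole interval $(-1, 1)$.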
Lagrangian Duality
Lagrangian and Dual Problem
Consider the primal problem:

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \quad i = 1, \dots, m$$

The Lagrangian is:

$$L(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x), \quad \lambda_i \geq 0$$

The dual function is $d(\lambda) = \inf_x L(x, \lambda)$. The dual problem is:

$$\max_{\lambda \geq 0} d(\lambda)$$

The dual function is always concave (as an infimum of affine functions in $\lambda$), regardless of whether the primal is convex.
Weak and Strong Duality
Let $p^\star$ be the primal optimal value and $d^\star$ the dual optimal value.
Weak duality (always holds): $d^\star \leq p^\star$. The dual provides a lower bound on the primal.
Strong duality: $d^\star = p^\star$. The duality gap is zero.
Strong duality holds for convex problems under Slater's condition: there exists a strictly feasible point $\bar{x}$ with $g_i(\bar{x}) < 0$ for all $i$.
Strong Duality via Slater's Condition
Statement
If $f$ and the $g_i$ are convex and there exists a strictly feasible point $\bar{x}$ with $g_i(\bar{x}) < 0$ for all $i$, then strong duality holds: the primal and dual optimal values are equal ($d^\star = p^\star$), and the dual optimum is attained.
Intuition
Slater's condition says the feasible region has a non-empty interior (the constraints are not "barely" satisfied). This prevents pathological cases where the primal feasible set is "too thin" for duality to work. In practice, Slater's condition holds for almost every convex optimization problem you encounter in ML.
Proof Sketch
Consider the set $\mathcal{A} = \{(u, t) : \exists x \text{ with } g_i(x) \leq u_i \text{ for all } i \text{ and } f(x) \leq t\}$. This set is convex (because $f$ and the $g_i$ are convex). The point $(0, p^\star)$ is on the boundary of $\mathcal{A}$. By the supporting hyperplane theorem, there exists a hyperplane separating $(0, p^\star)$ from the interior of $\mathcal{A}$. This hyperplane defines the optimal dual variables $\lambda^\star$. Slater's condition ensures the hyperplane has the right orientation (the $\lambda$ components are non-negative and the hyperplane is not vertical), which gives $d^\star = p^\star$.
Why It Matters
Strong duality is what makes dual methods actually useful:
- SVM dual: the primal SVM minimizes over $w \in \mathbb{R}^d$ (potentially high-dimensional). The dual maximizes over $\alpha \in \mathbb{R}^n$ (one variable per data point). Under strong duality, both give the same answer. The dual depends on the data only through inner products $\langle x_i, x_j \rangle$, enabling the kernel trick: replace inner products with $k(x_i, x_j)$.
- Regularization duality: minimizing $\hat{R}(w)$ subject to $\lVert w \rVert \leq B$ is equivalent (by strong duality) to minimizing $\hat{R}(w) + \lambda \lVert w \rVert$ for some $\lambda \geq 0$. The Lagrange multiplier $\lambda$ is the regularization parameter. This is why constraint-based and penalty-based regularization are interchangeable.
- DRO: the worst-case expected loss over a ball of distributions can be dualized into a regularized empirical risk problem. This converts an intractable minimax problem into a tractable convex program.
Failure Mode
Without Slater's condition, strong duality can fail or the dual optimum may not be attained. Example: minimize $x$ subject to $x^2 \leq 0$. The only feasible point is $x = 0$ (primal value $p^\star = 0$). The Lagrangian is $L(x, \lambda) = x + \lambda x^2$. For $\lambda > 0$: $d(\lambda) = \inf_x (x + \lambda x^2) = -\frac{1}{4\lambda}$. So $d^\star = 0$ (approached as $\lambda \to \infty$ but not attained). In this case strong duality technically holds but the dual optimum is not attained; in modified examples where the constraint is an equality on a convex function, Slater's condition fails and a genuine duality gap can appear.
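For the example above (minimize $x$ subject to $x^2 \leq 0$), the dual function $d(\lambda) = -\frac{1}{4\lambda}$ can be checked by minimizing the Lagrangian on a grid (a sketch; the grid is an arbitrary choice):

```python
import numpy as np

# Primal: minimize x subject to x^2 <= 0, so p* = 0 at the single feasible x = 0.
# Dual function: d(lambda) = inf_x (x + lambda*x^2) = -1/(4*lambda) for lambda > 0.
xs = np.linspace(-100, 100, 2_000_001)

def d(lam):
    return np.min(xs + lam * xs**2)

for lam in [1.0, 10.0, 100.0]:
    assert abs(d(lam) - (-1 / (4 * lam))) < 1e-3

# d(lambda) -> 0 as lambda -> infinity, so d* = 0 = p*, yet analytically
# d(lambda) < 0 for every finite lambda: the supremum is never attained.
print(d(1.0), d(100.0))   # approximately -0.25 and -0.0025
```

The values climb toward $0$ but never reach it, which is exactly the "approached but not attained" behavior described above.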
KKT Conditions
Karush-Kuhn-Tucker (KKT) Conditions
For the convex problem $\min_x f(x)$ subject to $g_i(x) \leq 0$, under strong duality, the primal-dual pair $(x^\star, \lambda^\star)$ is optimal if and only if:
- Primal feasibility: $g_i(x^\star) \leq 0$ for all $i$
- Dual feasibility: $\lambda_i^\star \geq 0$ for all $i$
- Stationarity: $\nabla f(x^\star) + \sum_i \lambda_i^\star \nabla g_i(x^\star) = 0$
- Complementary slackness: $\lambda_i^\star g_i(x^\star) = 0$ for all $i$

Complementary slackness says: either a constraint is tight ($g_i(x^\star) = 0$) or its multiplier is zero ($\lambda_i^\star = 0$). Active constraints "matter"; inactive constraints are irrelevant. In SVMs, the data points with $\alpha_i > 0$ are the support vectors.
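The four conditions can be verified on a toy problem with a closed-form solution. A hypothetical illustrative example (not from the text): minimize $\frac{1}{2}\lVert x \rVert^2$ subject to $a^\top x \geq 1$, whose solution is $x^\star = a/\lVert a \rVert^2$ with multiplier $\lambda^\star = 1/\lVert a \rVert^2$:

```python
import numpy as np

# Toy convex QP: min (1/2)||x||^2  s.t.  g(x) = 1 - a^T x <= 0.
# Closed form (illustrative): x* = a/||a||^2, lambda* = 1/||a||^2.
a = np.array([3.0, 4.0])
x_star = a / np.dot(a, a)
lam_star = 1.0 / np.dot(a, a)

g = 1.0 - a @ x_star
assert g <= 1e-12                              # primal feasibility
assert lam_star >= 0                           # dual feasibility
# stationarity: grad f + lambda* grad g = x* - lambda* a = 0
assert np.allclose(x_star - lam_star * a, 0.0)
assert abs(lam_star * g) < 1e-12               # complementary slackness
print(x_star, lam_star)
```

Here the single constraint is active ($g(x^\star) = 0$) with a strictly positive multiplier, so complementary slackness holds in the "tight constraint" branch.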
Sion Minimax Theorem
Sion Minimax Theorem
If $X$ is convex and compact, $Y$ is convex, and $f(x, y)$ is convex-concave (convex in $x$ for each $y$, concave in $y$ for each $x$) and lower-semicontinuous in $x$, upper-semicontinuous in $y$, then:

$$\min_{x \in X} \sup_{y \in Y} f(x, y) = \sup_{y \in Y} \min_{x \in X} f(x, y)$$

The min and sup can be interchanged. This is a generalization of von Neumann's minimax theorem for zero-sum games.
Why it matters for ML: The minimax theorem underlies adversarial formulations. In GANs, DRO, and robust optimization, you want to swap a min over model parameters with a max over adversarial perturbations. The Sion theorem tells you when this swap is valid.
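A minimal sketch of the swap for a bilinear payoff on compact intervals (grids discretize $X = Y = [-1, 1]$; values are approximate):

```python
import numpy as np

# Convex-concave payoff f(x, y) = x*y on X = Y = [-1, 1] (linear in each argument).
grid = np.linspace(-1.0, 1.0, 201)
F = grid[:, None] * grid[None, :]       # F[i, j] = x_i * y_j

min_max = F.max(axis=1).min()           # min_x max_y x*y = min_x |x| = 0
max_min = F.min(axis=0).max()           # max_y min_x x*y = max_y -|y| = 0
assert abs(min_max - max_min) < 1e-12   # no duality gap: the swap is valid
print(min_max, max_min)                 # both approximately 0
```

Both orders of optimization give the game value $0$, as Sion's theorem guarantees for this convex-concave payoff on compact convex sets.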
Canonical Examples
Conjugate of the squared norm
Let $f(x) = \frac{1}{2}\lVert x \rVert_2^2$. Then:

$$f^*(y) = \sup_x \left( \langle y, x \rangle - \tfrac{1}{2}\lVert x \rVert_2^2 \right)$$

Taking the gradient and setting it to zero: $y - x = 0$, so $x = y$. Substituting: $f^*(y) = \frac{1}{2}\lVert y \rVert_2^2$.
The squared $\ell_2$ norm (scaled by $\frac{1}{2}$) is its own conjugate. This self-duality is why $\ell_2$ regularization leads to particularly clean dual problems (e.g., ridge regression).
Duality between constrained and penalized optimization
Primal (constrained): $\min_w \hat{R}(w)$ subject to $\lVert w \rVert^2 \leq B$

Lagrangian: $L(w, \lambda) = \hat{R}(w) + \lambda (\lVert w \rVert^2 - B)$

Dual: $\max_{\lambda \geq 0} \min_w \left[ \hat{R}(w) + \lambda (\lVert w \rVert^2 - B) \right]$

The inner minimization is exactly the penalized problem $\min_w \hat{R}(w) + \lambda \lVert w \rVert^2$ with regularization parameter $\lambda$.
By strong duality, there exists $\lambda^\star \geq 0$ such that the constrained problem with radius $B$ and the penalized problem with parameter $\lambda^\star$ have the same solution. This is the rigorous justification for the equivalence between norm-constrained and norm-penalized regularization.
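The equivalence can be illustrated with ridge regression, where the penalized solution has a closed form. A sketch on assumed synthetic data: solve the penalized problem for a fixed $\lambda$, set $B = \lVert w_\lambda \rVert^2$, and confirm by random search that no feasible $w$ beats $w_\lambda$ on the unpenalized loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))       # synthetic data (illustrative)
y = rng.standard_normal(50)
lam = 2.0

loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)

# Penalized problem: min_w loss(w) + lam*||w||^2  =>  (X^T X + 2*lam*I) w = X^T y
w_lam = np.linalg.solve(X.T @ X + 2 * lam * np.eye(3), X.T @ y)
B = float(w_lam @ w_lam)

# Constrained problem with radius B = ||w_lam||^2: w_lam should be optimal.
for _ in range(2000):
    w = rng.standard_normal(3)
    w *= np.sqrt(B) * rng.random() / np.linalg.norm(w)  # random point with ||w||^2 <= B
    assert loss(w) >= loss(w_lam) - 1e-9
print(B, loss(w_lam))
```

The argument behind the check: for any feasible $w$, $\text{loss}(w) + \lambda \lVert w \rVert^2 \geq \text{loss}(w_\lambda) + \lambda B$, so $\text{loss}(w) \geq \text{loss}(w_\lambda) + \lambda (B - \lVert w \rVert^2) \geq \text{loss}(w_\lambda)$.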
Common Confusions
Weak duality always holds; strong duality requires conditions
Students sometimes assume that the primal and dual always have the same optimal value. Weak duality ($d^\star \leq p^\star$) is trivially true, but strong duality ($d^\star = p^\star$) requires assumptions like convexity plus Slater's condition. For non-convex problems, the duality gap can be arbitrarily large. Always check the conditions before claiming strong duality.
The dual problem is always concave, even if the primal is non-convex
The dual function $d(\lambda) = \inf_x L(x, \lambda)$ is concave in $\lambda$ regardless of whether $f$ or the $g_i$ are convex. This is because $d$ is a pointwise infimum of affine functions in $\lambda$. The dual is always a concave maximization problem, which is computationally tractable. The catch: for non-convex primal problems, $d^\star < p^\star$ in general, and the dual only gives a lower bound.
KKT conditions are necessary and sufficient only under strong duality
For general non-convex problems, KKT is only necessary (assuming constraint qualification). For convex problems with strong duality, KKT is both necessary and sufficient. In ML, most problems of interest (SVMs, regularized linear models, DRO) are convex and satisfy Slater's condition, so KKT gives exact optimality conditions.
Summary
- Fenchel conjugate: $f^*(y) = \sup_x (\langle y, x \rangle - f(x))$
- Young-Fenchel inequality: $f(x) + f^*(y) \geq \langle x, y \rangle$, with equality iff $y \in \partial f(x)$
- Fenchel-Moreau: $f^{**} = f$ for closed convex functions
- Weak duality always holds ($d^\star \leq p^\star$); strong duality ($d^\star = p^\star$) requires convexity + Slater's condition
- KKT conditions: primal feasibility, dual feasibility, stationarity, complementary slackness
- Duality converts constrained problems into penalized problems (regularization = Lagrangian duality)
- SVM dual reveals the kernel trick; DRO dual yields tractable regularization
Exercises
Problem
Compute the Fenchel conjugate of $f(x) = \lVert x \rVert_1$ (the $\ell_1$ norm on $\mathbb{R}^d$). What is the dual norm of $\lVert \cdot \rVert_1$?
Problem
Derive the dual of the soft-margin SVM problem:

$$\min_{w, b, \xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Show that the dual depends on the data only through inner products $\langle x_i, x_j \rangle$, which enables the kernel trick.
References
Canonical:
- Rockafellar, Convex Analysis (1970), Chapters 12, 26-28, 31
- Boyd & Vandenberghe, Convex Optimization (2004), Chapter 5
- Borwein & Lewis, Convex Analysis and Nonlinear Optimization (2nd ed., 2006)
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 15 (SVM duality)
- Ben-Tal, El Ghaoui, Nemirovski, Robust Optimization (2009), Chapter 4
Next Topics
Building on convex duality:
- Support vector machines: the canonical application of Lagrangian duality in ML
- Regularization theory: constrained and penalized formulations as dual views
- Kernels and RKHS: the kernel trick as a consequence of the SVM dual
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)