Optimization Function Classes
Convex Optimization Basics
Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.
Why This Matters
Every learning algorithm solves an optimization problem. When you minimize empirical risk, train a neural network, or fit a kernel SVM, you are running an optimization algorithm on a loss landscape. Convex optimization is the case where this landscape has no local-minima traps: every local minimum is global.
This is not a "separate" subject from learning theory. The convergence rate of your optimizer determines the computational cost of learning. The properties of your objective (smoothness, strong convexity) determine both how fast you can optimize and how well the result generalizes (via algorithmic stability). Optimization and generalization are entangled.
Mental Model
A convex function is a bowl: any line segment between two points on the graph lies above the graph. This means gradient descent, which always moves "downhill," will reach the bottom. The questions are:
- How fast? (convergence rate)
- How close to the true minimum? (optimization error)
- What properties of the function control the answers?
Formal Setup: Convex Sets and Functions
Convex Set
A set $C \subseteq \mathbb{R}^d$ is convex if for all $x, y \in C$ and all $\lambda \in [0, 1]$:

$$\lambda x + (1 - \lambda) y \in C.$$

Every point on the line segment between $x$ and $y$ stays in $C$. Examples: balls, halfspaces, polyhedra, the probability simplex. Non-examples: the union of two disjoint balls, the unit sphere $\{x : \|x\| = 1\}$.
Convex Function
A function $f : C \to \mathbb{R}$ on a convex set $C$ is convex if for all $x, y \in C$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$$

Equivalently, the graph of $f$ lies below its chords. If the inequality is strict for $x \ne y$ and $\lambda \in (0, 1)$, $f$ is strictly convex.
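The defining chord inequality is easy to probe numerically. A minimal sketch, assuming NumPy (the helper name `chord_check` is mine):

```python
import numpy as np

def chord_check(f, x, y, num=50):
    """Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) on a grid of lam."""
    lams = np.linspace(0.0, 1.0, num)
    lhs = np.array([f(l * x + (1 - l) * y) for l in lams])
    rhs = lams * f(x) + (1 - lams) * f(y)
    return np.all(lhs <= rhs + 1e-12)

convex = lambda x: x**2          # convex: passes for any pair of points
nonconvex = lambda x: np.sin(x)  # not convex on [0, pi]: graph rises above its chord

print(chord_check(convex, -1.0, 2.0))      # True
print(chord_check(nonconvex, 0.0, np.pi))  # False
```

Passing this check on a finite grid of course does not prove convexity; failing it on any pair of points disproves it.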
First-Order Condition for Convexity
If $f$ is differentiable, then $f$ is convex if and only if for all $x, y$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x).$$

This says: $f$ lies above every tangent hyperplane. The tangent at any point is a global lower bound on $f$. This single inequality is the most useful characterization of convexity in practice.
Smoothness (L-smooth)
A differentiable function $f$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|.$$

Equivalently, for all $x, y$:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2.$$

Smoothness says the function does not curve too sharply: $L$ bounds the maximum curvature. This is the key property that makes gradient descent work: it guarantees that a gradient step of size $1/L$ makes sufficient progress.
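For a quadratic $f(x) = \frac{1}{2} x^\top A x$, the tightest smoothness constant is the largest eigenvalue of $A$, and the quadratic upper bound can be verified directly. A sketch assuming NumPy:

```python
import numpy as np

# For f(x) = 0.5 * x^T A x, the gradient is A x and the smoothness
# constant is L = lambda_max(A). Check the quadratic upper bound
# f(y) <= f(x) + grad(x).(y - x) + (L/2)||y - x||^2 at random points.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T                      # symmetric PSD, so f is convex
L = np.linalg.eigvalsh(A).max()  # largest eigenvalue = smoothness constant

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.dot(y - x, y - x)
    assert f(y) <= upper + 1e-10
print("quadratic upper bound holds at all sampled points")
```

For a quadratic the bound is exact in the direction of the top eigenvector, which is why no smaller constant than $\lambda_{\max}(A)$ works.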
Strong Convexity
A function $f$ is $\mu$-strongly convex (with $\mu > 0$) if for all $x, y$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2.$$

This says $f$ curves at least as much as a quadratic with curvature $\mu$. Strong convexity implies a unique minimizer $x^*$ and gives:

$$f(x) - f(x^*) \ge \frac{\mu}{2} \|x - x^*\|^2.$$
Condition Number
The condition number of an $L$-smooth, $\mu$-strongly convex function is $\kappa = L/\mu \ge 1$. It measures how "elongated" the level sets are. $\kappa = 1$ means the function is a perfect isotropic quadratic. Large $\kappa$ means the function is badly conditioned: steep in some directions, flat in others.

The condition number controls the convergence rate of gradient descent: the error contracts by a factor of $(1 - 1/\kappa)$ per iteration, so ill-conditioned problems converge slowly.
Gradient Descent
The gradient descent algorithm with step size $\eta$:

$$x_{t+1} = x_t - \eta \nabla f(x_t).$$

Starting from $x_0$, repeat until convergence. The step size $\eta$ is the single most important hyperparameter. Too large: diverge. Too small: crawl.
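The update rule is a few lines of code. A minimal sketch, assuming NumPy, on a toy quadratic whose minimizer is known:

```python
import numpy as np

def gradient_descent(grad, x0, eta, num_iters):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad(x)
    return x

# Minimize f(x) = 0.5 * ||x - c||^2, which is 1-smooth; the minimizer is c.
c = np.array([3.0, -1.0])
x_hat = gradient_descent(lambda x: x - c, x0=np.zeros(2), eta=0.5, num_iters=60)
print(np.round(x_hat, 6))  # converges to the minimizer c
```

Here the error contracts by $|1 - \eta|$ per step, so $\eta = 0.5$ halves it each iteration, while $\eta = 2.1$ would diverge: the too-large/too-small tradeoff in miniature.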
Main Theorems
GD Convergence for Smooth Convex Functions
Statement
If $f$ is convex and $L$-smooth, then gradient descent with step size $\eta = 1/L$ satisfies:

$$f(x_T) - f(x^*) \le \frac{L \|x_0 - x^*\|^2}{2T}.$$

That is, the optimization error decreases as $O(1/T)$.
Intuition
Each gradient step with $\eta = 1/L$ makes progress proportional to $\|\nabla f(x_t)\|^2$. Even when gradients become small near the optimum, the convexity of $f$ ensures that the iterates steadily approach $x^*$. The $O(1/T)$ rate means you need $O(1/\epsilon)$ iterations to reach $\epsilon$-accuracy.
Proof Sketch
Step 1 (Descent Lemma): By $L$-smoothness, choosing $\eta = 1/L$ gives:

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2.$$

Step 2: By convexity, $f(x_t) - f(x^*) \le \nabla f(x_t)^\top (x_t - x^*)$.

Step 3: Track $\|x_t - x^*\|^2$. By the update rule:

$$\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2.$$

Combine using convexity: $f(x_{t+1}) - f(x^*) \le \frac{L}{2}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right)$. Sum from $t = 0$ to $T - 1$: the distance terms telescope, leaving $\sum_{t=1}^{T} \left(f(x_t) - f(x^*)\right) \le \frac{L}{2}\|x_0 - x^*\|^2$. Since $f(x_t)$ is non-increasing, $f(x_T) - f(x^*)$ is at most the average of the summands; rearrange to get the bound.
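The resulting $O(1/T)$ guarantee can be checked numerically on a convex quadratic with the stated step size $\eta = 1/L$. A sketch assuming NumPy:

```python
import numpy as np

# Empirically check f(x_T) - f(x*) <= L ||x0 - x*||^2 / (2T) on a
# convex quadratic f(x) = 0.5 x^T A x with PSD A and step size 1/L.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T                      # PSD, so the minimizer is x* = 0 and f(x*) = 0
L = np.linalg.eigvalsh(A).max()

x = rng.standard_normal(5)
x0 = x.copy()
f = lambda z: 0.5 * z @ A @ z
for T in range(1, 201):
    x = x - (1.0 / L) * (A @ x)
    bound = L * np.dot(x0, x0) / (2 * T)
    assert f(x) <= bound + 1e-12   # theorem holds at every iteration
print("f(x_T) - f(x*) stayed below L||x0 - x*||^2 / (2T) for all T")
```

On a quadratic the actual decay is much faster than $1/T$; the theorem is a worst-case bound over all smooth convex functions.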
Why It Matters
This is the baseline convergence result. The $O(1/T)$ rate tells you that gradient descent works but is not spectacularly fast for merely convex problems. To do better, you need either (a) strong convexity or (b) acceleration (Nesterov's method gives $O(1/T^2)$, which is optimal for first-order methods on smooth convex problems).
In ML, this rate determines the number of passes over the data needed to approximately solve the ERM problem.
Failure Mode
The bound scales with $\|x_0 - x^*\|^2$: bad initialization hurts. Also, the rate applies to the function value gap $f(x_T) - f(x^*)$, not the parameter distance $\|x_T - x^*\|$. For the parameter distance, you need strong convexity.
GD Convergence for Smooth and Strongly Convex Functions
Statement
If $f$ is $\mu$-strongly convex and $L$-smooth, then gradient descent with $\eta = 1/L$ satisfies:

$$f(x_T) - f(x^*) \le \left(1 - \frac{1}{\kappa}\right)^T \cdot \frac{L}{2} \|x_0 - x^*\|^2,$$

where $\kappa = L/\mu$ is the condition number. The convergence is linear (exponential decrease): the error contracts by a factor of $(1 - 1/\kappa)$ per iteration.
Intuition
Strong convexity ensures that when you are far from $x^*$, the gradient is large, so gradient descent makes large steps. When you are close, the gradient is small, but you do not need large steps. The exponential convergence rate means the number of iterations to reach $\epsilon$-accuracy is $O(\kappa \log(1/\epsilon))$: logarithmic in the target accuracy.
Proof Sketch
By the descent lemma: $f(x_{t+1}) \le f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2$.
By strong convexity: $\|\nabla f(x_t)\|^2 \ge 2\mu \left(f(x_t) - f(x^*)\right)$.
Combining: $f(x_{t+1}) - f(x^*) \le \left(1 - \frac{\mu}{L}\right)\left(f(x_t) - f(x^*)\right)$.
Iterate: $f(x_T) - f(x^*) \le \left(1 - \frac{\mu}{L}\right)^T \left(f(x_0) - f(x^*)\right)$.
Replace $f(x_0) - f(x^*)$ with $\frac{L}{2}\|x_0 - x^*\|^2$ by smoothness.
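The contraction is easy to observe on a quadratic with known $\mu$ and $L$. A sketch assuming NumPy, checking the parameter-distance form of the guarantee, $\|x_T - x^*\|^2 \le (1 - \mu/L)^T \|x_0 - x^*\|^2$:

```python
import numpy as np

# Linear convergence on f(x) = 0.5 x^T A x with A = diag(1, 10):
# mu = 1, L = 10, kappa = 10, step size 1/L, minimizer x* = 0.
A = np.diag([1.0, 10.0])
mu, Lsm = 1.0, 10.0
x = np.array([1.0, 1.0])
dist0 = np.dot(x, x)              # ||x0 - x*||^2
for T in range(1, 101):
    x = x - (1.0 / Lsm) * (A @ x)
    # contraction: ||x_T - x*||^2 <= (1 - mu/L)^T ||x0 - x*||^2
    assert np.dot(x, x) <= (1 - mu / Lsm) ** T * dist0 + 1e-15
print("error contracted by at least (1 - 1/kappa) per step, as predicted")
```

Note how the steep coordinate (curvature 10) is killed immediately by the $1/L$ step while the flat coordinate (curvature 1) shrinks only by $0.9$ per step: the slow mode sets the rate, which is exactly what $\kappa$ measures.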
Why It Matters
This is the workhorse convergence result for regularized ERM. Adding $\frac{\lambda}{2}\|w\|^2$ regularization makes any convex loss $\lambda$-strongly convex with $\mu = \lambda$. The condition number becomes $\kappa = (L + \lambda)/\lambda$, and the optimization converges in $O(\kappa \log(1/\epsilon))$ iterations.
This directly connects to algorithmic stability: the same strong convexity that gives fast optimization also gives a stability parameter scaling as $O(1/(\lambda n))$.
Failure Mode
Large condition number means slow convergence. For ridge regression with small regularization, can be enormous. Preconditioning or second-order methods can help, but at higher per-iteration cost.
Duality Preview
Every convex optimization problem has a dual problem. For the primal:

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad i = 1, \dots, m,$$

The Lagrangian is $\mathcal{L}(x, \alpha) = f(x) + \sum_{i=1}^m \alpha_i g_i(x)$ with $\alpha_i \ge 0$. The dual problem is:

$$\max_{\alpha \ge 0} \; \min_x \; \mathcal{L}(x, \alpha).$$

Weak duality: dual optimal value $\le$ primal optimal value (always). Strong duality: equality holds (under constraint qualifications like Slater's condition).
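A one-dimensional toy problem (my own example, not from the text above) makes both notions concrete: minimize $x^2$ subject to $1 - x \le 0$. Minimizing the Lagrangian $x^2 + \alpha(1 - x)$ over $x$ gives $x = \alpha/2$, so the dual function is $q(\alpha) = \alpha - \alpha^2/4$:

```python
import numpy as np

# Primal: min x^2 subject to 1 - x <= 0.  Optimum: x* = 1, value p* = 1.
# Dual function: q(a) = min_x [x^2 + a(1 - x)] = a - a^2/4 for a >= 0.
q = lambda a: a - a**2 / 4

alphas = np.linspace(0.0, 10.0, 1001)
primal_opt = 1.0
assert np.all(q(alphas) <= primal_opt + 1e-12)  # weak duality: q(a) <= p* always
assert abs(q(2.0) - primal_opt) < 1e-12         # strong duality: equality at a* = 2
print("weak duality holds on the grid; strong duality attained at alpha = 2")
```

Slater's condition holds here (e.g. $x = 2$ is strictly feasible), which is why the duality gap is zero.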
Why does duality matter for ML?
- SVMs: the dual of the SVM problem is a quadratic program in the support vector coefficients, leading to the kernel trick
- Regularization: the dual of constrained ERM relates to regularized ERM (Lagrange multiplier $\leftrightarrow$ regularization parameter)
- Lower bounds: dual certificates provide provable lower bounds on the optimal value, enabling stopping criteria
Canonical Examples
Quadratic (least squares)
$f(x) = \frac{1}{2}\|Ax - b\|^2$ where $A \in \mathbb{R}^{n \times d}$. This is $L$-smooth with $L = \lambda_{\max}(A^\top A)$ and (if $A$ has full column rank) $\mu$-strongly convex with $\mu = \lambda_{\min}(A^\top A)$. The condition number is $\kappa = \lambda_{\max}(A^\top A)/\lambda_{\min}(A^\top A) = \kappa(A)^2$, the squared condition number of $A$.
GD converges in $O(\kappa \log(1/\epsilon))$ steps. For well-conditioned $A$, this is fast. For ill-conditioned $A$ (nearly collinear features), this is slow.
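The conditioning effect can be seen directly by counting GD iterations on well- vs ill-conditioned design matrices. A sketch assuming NumPy (the helper name `gd_iters_to_tol` is mine):

```python
import numpy as np

def gd_iters_to_tol(X, y, tol=1e-6, max_iters=50_000):
    """Iterations of GD with step 1/L until ||w - w*|| < tol (capped at max_iters)."""
    H = X.T @ X
    L = np.linalg.eigvalsh(H).max()
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]
    w = np.zeros(X.shape[1])
    for t in range(1, max_iters + 1):
        w = w - (1.0 / L) * (X.T @ (X @ w - y))
        if np.linalg.norm(w - w_star) < tol:
            return t
    return max_iters

rng = np.random.default_rng(0)
X_good = rng.standard_normal((100, 2))          # near-orthogonal columns: small kappa
X_bad = X_good.copy()
X_bad[:, 1] = X_bad[:, 0] + 1e-2 * X_bad[:, 1]  # nearly collinear columns: huge kappa
y = rng.standard_normal(100)

n_good = gd_iters_to_tol(X_good, y)
n_bad = gd_iters_to_tol(X_bad, y)
print(n_good, "<", n_bad)  # the ill-conditioned problem needs far more iterations
```

The collinear design blows up $\kappa(X^\top X)$ by roughly $10^4$ here, so GD may even hit the iteration cap; that is the "slow" regime the text describes.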
Logistic regression
$f(w) = \frac{1}{n} \sum_{i=1}^n \log\left(1 + e^{-y_i w^\top x_i}\right)$ is convex and smooth (but not strongly convex without regularization). Adding $\frac{\lambda}{2}\|w\|^2$ makes it $\lambda$-strongly convex. The smoothness constant depends on the data: the logistic loss has Lipschitz gradient with constant $L \le \frac{1}{4n} \lambda_{\max}(X^\top X)$.
Non-smooth: Lasso
$\min_w \frac{1}{2}\|Xw - y\|^2 + \lambda \|w\|_1$. The $\ell_1$ penalty is not smooth (not differentiable at $0$). Standard gradient descent does not apply directly. You need proximal gradient descent, which handles the smooth part with a gradient step and the non-smooth part with a "prox" operator. The prox of $\lambda\|\cdot\|_1$ is soft-thresholding, and the resulting algorithm (ISTA) converges at rate $O(1/T)$.
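A minimal ISTA sketch, assuming NumPy (function names are mine): the gradient step handles the quadratic term and soft-thresholding handles the $\ell_1$ term.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, num_iters=500):
    """Proximal gradient (ISTA) for 0.5 * ||Xw - y||^2 + lam * ||w||_1."""
    L = np.linalg.eigvalsh(X.T @ X).max()   # smoothness constant of the quadratic part
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)                   # gradient step on the smooth part
        w = soft_threshold(w - grad / L, lam / L)  # prox step on the non-smooth part
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                 # sparse ground truth
w_hat = ista(X, X @ w_true, lam=1.0)
print(np.round(w_hat, 2))  # large weights on the first two coordinates, rest near zero
```

Note that plain GD with any fixed step would never produce exact zeros; the prox step is what gives Lasso its sparsity.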
Common Confusions
Convexity is about the objective, not the model
A linear model trained with a non-convex loss has a non-convex optimization problem. A neural network trained with cross-entropy has a non-convex problem (because the model is non-linear in parameters, even though cross-entropy is convex in predictions). Convexity of the optimization landscape depends on the composition of loss and model.
Smoothness and strong convexity are separate properties
Smoothness bounds curvature from above ($f$ does not curve too fast). Strong convexity bounds curvature from below ($f$ curves at least a certain amount). A function can be smooth but not strongly convex (e.g., the Huber loss, which is $|x|$ smoothed near $0$), or strongly convex but not smooth (e.g., $f(x) = x^2 + |x|$). You need both for the fast linear rate.
The step size 1/L is not always practical
The theoretical step size $\eta = 1/L$ requires knowing $L$ exactly, which is often unknown. In practice, you use line search or adaptive step sizes (Adam, AdaGrad). The theory with $\eta = 1/L$ gives a benchmark: this is the best plain GD can do, and practical methods should match or beat it.
Why Optimization is Not Separate from Learning Theory
The total error of a learning algorithm decomposes as:

$$\text{total error} = \underbrace{\text{approximation error}}_{\text{model class}} + \underbrace{\text{estimation error}}_{\text{finite data}} + \underbrace{\text{optimization error}}_{\text{algorithm}}.$$

The optimization error is the gap between what the algorithm actually finds and the true ERM minimizer. Convex optimization theory directly controls this third term. If you can solve the ERM problem to $\epsilon$-accuracy in $T(\epsilon)$ iterations, and each iteration takes time $c$ (a full-gradient pass over $n$ examples in $d$ dimensions costs $O(nd)$), the total computational cost of learning is $c \cdot T(\epsilon)$.
Exercises
Problem
Prove that the function $f(x) = \frac{1}{2} x^\top A x$ is $\mu$-strongly convex and $L$-smooth when $A$ is symmetric with eigenvalues in $[\mu, L]$. What is the condition number?
Problem
How many gradient descent iterations are needed to reach accuracy $\epsilon$ on a $\mu$-strongly convex, $L$-smooth function with condition number $\kappa = L/\mu$? Compare with the merely convex case (same $L$, same $\|x_0 - x^*\|$).
Problem
Show that adding $\frac{\lambda}{2}\|w\|^2$ regularization to a convex, $L$-smooth function produces a function that is $\lambda$-strongly convex and $(L + \lambda)$-smooth. What is the new condition number?
References
Canonical:
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-3, 9
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-2
- Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3, 11
- Bertsekas, Convex Optimization Algorithms (2015), Chapters 1-2
Current:
- Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Foundations and Trends in Machine Learning
- Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review
Next Topics
Natural next steps from convex optimization:
- Regularization theory: how regularization connects optimization to generalization
- Kernels and RKHS: convex optimization in infinite-dimensional function spaces
- Online convex optimization: extending convex optimization to sequential, adversarial settings
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Activation Functions (Layer 1)
- Ascent Algorithms and Hill Climbing (Layer 1)
- Augmented Lagrangian and ADMM (Layer 2)
- Bounded Rationality (Layer 2)
- Conjugate Gradient Methods (Layer 2)
- Convex Duality (Layer 0B)
- Coordinate Descent (Layer 2)
- The EM Algorithm (Layer 2)
- Expected Utility Theory (Layer 2)
- Game Theory Foundations (Layer 2)
- Gradient Descent Variants (Layer 1)
- Interior Point Methods (Layer 3)
- K-Means Clustering (Layer 1)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Lasso Regression (Layer 2)
- Line Search Methods (Layer 2)
- Logistic Regression (Layer 1)
- Markov Decision Processes (Layer 2)
- Minimax and Saddle Points (Layer 2)
- Mirror Descent and Frank-Wolfe (Layer 3)
- Nash Equilibrium (Layer 2)
- Newton's Method (Layer 1)
- Online Convex Optimization (Layer 3)
- Optimizer Theory: SGD, Adam, and Muon (Layer 3)
- Policy Gradient Theorem (Layer 3)
- Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient (Layer 3)
- Projected Gradient Descent (Layer 2)
- Proximal Gradient Methods (Layer 2)
- Regularization Theory (Layer 2)
- Ridge Regression (Layer 2)
- Riemannian Optimization and Manifold Constraints (Layer 3)
- Scaling Laws (Layer 4)
- Stability and Optimization Dynamics (Layer 2)
- Stochastic Approximation Theory (Layer 2)
- Support Vector Machines (Layer 2)
- Training Dynamics and Loss Landscapes (Layer 4)
- Trust Region Methods (Layer 2)