
Optimization Function Classes

Convex Optimization Basics

Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.

Core · Tier 1 · Stable · ~60 min

Why This Matters

[Interactive demo: Optimizer Race. SGD and Adam minimize the same loss surface side by side; the panel plots loss vs. step on a log scale.]

Every learning algorithm solves an optimization problem. When you minimize empirical risk, train a neural network, or fit a kernel SVM, you are running an optimization algorithm on a loss landscape. Convex optimization is the case where this landscape has no local-minima traps: every local minimum is global.

This is not a "separate" subject from learning theory. The convergence rate of your optimizer determines the computational cost of learning. The properties of your objective (smoothness, strong convexity) determine both how fast you can optimize and how well the result generalizes (via algorithmic stability). Optimization and generalization are entangled.

Mental Model

A convex function is a bowl: any line segment between two points on the graph lies above the graph. This means gradient descent, which always moves "downhill," will reach the bottom. The questions are:

  • How fast? (convergence rate)
  • How close to the true minimum? (optimization error)
  • What properties of the function control the answers?

Formal Setup: Convex Sets and Functions

Definition

Convex Set

A set \mathcal{C} \subseteq \mathbb{R}^d is convex if for all x, y \in \mathcal{C} and all \theta \in [0, 1]:

\theta x + (1-\theta) y \in \mathcal{C}

Every point on the line segment between x and y stays in \mathcal{C}. Examples: balls, halfspaces, polyhedra, the probability simplex. Non-examples: the union of two disjoint balls, the set \{x : \|x\| \geq 1\}.

Definition

Convex Function

A function f: \mathcal{C} \to \mathbb{R} on a convex set \mathcal{C} is convex if for all x, y \in \mathcal{C} and \theta \in [0, 1]:

f(\theta x + (1-\theta) y) \leq \theta f(x) + (1-\theta) f(y)

Equivalently, f lies below its chords. If the inequality is strict for x \neq y and \theta \in (0, 1), f is strictly convex.
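The chord inequality can be spot-checked numerically. A minimal sketch in plain Python (a grid check, not a proof): it tests the definition for a convex function and for one that is not convex on the chosen interval.

```python
import math

def violates_chord_inequality(f, xs, thetas, tol=1e-12):
    """Return True if some (x, y, theta) violates the convexity inequality
    f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)."""
    for x in xs:
        for y in xs:
            for theta in thetas:
                lhs = f(theta * x + (1 - theta) * y)
                rhs = theta * f(x) + (1 - theta) * f(y)
                if lhs > rhs + tol:
                    return True
    return False

xs = [i / 10 - 3 for i in range(61)]        # grid on [-3, 3]
thetas = [i / 10 for i in range(11)]        # theta in {0, 0.1, ..., 1}

print(violates_chord_inequality(lambda x: x * x, xs, thetas))   # False: x^2 is convex
print(violates_chord_inequality(math.sin, xs, thetas))          # True: sin is not convex on [-3, 3]
```

A grid check can only detect violations, never certify convexity; it is a sanity check on the definition, not a substitute for it.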

Definition

First-Order Condition for Convexity

If f is differentiable, then f is convex if and only if for all x, y:

f(y) \geq f(x) + \nabla f(x)^\top (y - x)

This says: f lies above every tangent hyperplane. The tangent at any point is a global lower bound on f. This single inequality is the most useful characterization of convexity in practice.

Definition

Smoothness (L-smooth)

A differentiable function f is L-smooth if \nabla f is L-Lipschitz:

\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad \forall x, y

Equivalently, for all x, y:

f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|y - x\|^2

Smoothness says the function does not curve too sharply: L is the maximum curvature. This is the key property that makes gradient descent work: it guarantees that a gradient step of size 1/L makes sufficient progress.

Definition

Strong Convexity

A function f is \mu-strongly convex (with \mu > 0) if for all x, y:

f(y) \geq f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|y - x\|^2

This says f curves at least as much as a quadratic with curvature \mu. Strong convexity implies a unique minimizer x^* and gives:

f(x) - f(x^*) \geq \frac{\mu}{2}\|x - x^*\|^2
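Both quadratic bounds (the smoothness upper bound and the strong-convexity lower bound) can be verified on a concrete function. A minimal sketch for the least-squares objective f(w) = \frac{1}{2}\|Xw - y\|^2, assuming NumPy; L and \mu are the squared largest and smallest singular values of X.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

def f(w):
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    return X.T @ (X @ w - y)

svals = np.linalg.svd(X, compute_uv=False)
L, mu = svals[0] ** 2, svals[-1] ** 2      # smoothness / strong-convexity constants

for _ in range(100):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    gap = f(v) - f(u) - grad(u) @ (v - u)  # curvature term between two random points
    d2 = np.sum((v - u) ** 2)
    assert mu / 2 * d2 - 1e-9 <= gap <= L / 2 * d2 + 1e-9
print("both quadratic bounds hold")
```

For a quadratic the gap equals \frac{1}{2}(v-u)^\top X^\top X (v-u) exactly, so it sits between the two bounds with equality attained along the extreme singular directions.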

Definition

Condition Number

The condition number of an L-smooth, \mu-strongly convex function is \kappa = L/\mu. It measures how "elongated" the level sets are. \kappa = 1 means the function is a perfect isotropic quadratic. Large \kappa means the function is badly conditioned: steep in some directions, flat in others.

The condition number controls the convergence rate of gradient descent: O((1 - 1/\kappa)^t), so ill-conditioned problems converge slowly.

Gradient Descent

The gradient descent algorithm with step size \eta > 0:

x_{t+1} = x_t - \eta \nabla f(x_t)

Starting from x_0, repeat until convergence. The step size \eta is the single most important hyperparameter. Too large: diverge. Too small: crawl.
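The update rule is a few lines of code. A minimal sketch, assuming NumPy; the quadratic test problem and its constants are illustrative.

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Quadratic example: f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()            # smoothness constant = largest eigenvalue

x_hat = gradient_descent(grad, np.zeros(2), eta=1.0 / L, steps=500)
x_star = np.linalg.solve(A, b)
print(np.allclose(x_hat, x_star, atol=1e-6))   # True
```

With \eta = 1/L the iterates converge to the solution of the linear system; a larger step size (say \eta = 1) would diverge for this A.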

Main Theorems

Theorem

GD Convergence for Smooth Convex Functions

Statement

If f is convex and L-smooth, then gradient descent with step size \eta = 1/L satisfies:

f(x_T) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2T}

That is, the optimization error decreases as O(1/T).

Intuition

Each gradient step with \eta = 1/L makes progress proportional to \|\nabla f(x_t)\|^2/L. Even when gradients become small near the optimum, the convexity of f ensures that the iterates steadily approach x^*. The 1/T rate means you need T = O(1/\epsilon) iterations to reach \epsilon-accuracy.

Proof Sketch

Step 1 (Descent Lemma): By L-smoothness, choosing \eta = 1/L gives:

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2

Step 2: By convexity, f(x_t) - f(x^*) \leq \nabla f(x_t)^\top(x_t - x^*) \leq \|\nabla f(x_t)\| \cdot \|x_t - x^*\|.

Step 3: Track \|x_t - x^*\|^2. By the update rule:

\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2}{L}\nabla f(x_t)^\top(x_t - x^*) + \frac{1}{L^2}\|\nabla f(x_t)\|^2

Combine using convexity: \nabla f(x_t)^\top(x_t - x^*) \geq f(x_t) - f(x^*). Sum from t = 0 to T-1. The left side telescopes. Rearrange to get the O(1/T) bound.

Why It Matters

This is the baseline convergence result. The O(1/T) rate tells you that gradient descent works but is not spectacularly fast for merely convex problems. To do better, you need either (a) strong convexity or (b) acceleration (Nesterov's method gives O(1/T^2), which is optimal among first-order methods on smooth convex problems).

In ML, this rate determines the number of passes over the data needed to approximately solve the ERM problem.
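The O(1/T) guarantee can be checked empirically. A minimal sketch on a least-squares problem, assuming NumPy; the bound should hold at every iteration (here with room to spare, since least squares is additionally strongly convex).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)

f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y)
L = np.linalg.svd(X, compute_uv=False)[0] ** 2          # smoothness constant
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)          # exact minimizer
f_star = f(w_star)

w = np.zeros(5)
R2 = np.sum(w_star ** 2)                                # ||x_0 - x*||^2 with x_0 = 0
for T in range(1, 201):
    w = w - grad(w) / L                                 # one GD step with eta = 1/L
    assert f(w) - f_star <= L * R2 / (2 * T) + 1e-9     # theorem's bound at step T
print("O(1/T) bound verified for T = 1..200")
```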

Failure Mode

The bound scales with \|x_0 - x^*\|^2: bad initialization hurts. Also, the O(1/T) rate applies to the function-value gap, not the parameter distance \|x_T - x^*\|. For the parameter distance, you need strong convexity.

Theorem

GD Convergence for Smooth and Strongly Convex Functions

Statement

If f is \mu-strongly convex and L-smooth, then gradient descent with \eta = 1/L satisfies:

f(x_T) - f(x^*) \leq \frac{L}{2}\Bigl(1 - \frac{1}{\kappa}\Bigr)^T \|x_0 - x^*\|^2

where \kappa = L/\mu is the condition number. The convergence is linear (exponential decrease): the error contracts by a factor (1 - 1/\kappa) per iteration.

Intuition

Strong convexity ensures that when you are far from x^*, the gradient is large, so gradient descent makes large steps. When you are close, the gradient is small, but you no longer need large steps. The exponential convergence rate means the number of iterations to reach \epsilon-accuracy is O(\kappa \log(1/\epsilon)): logarithmic in the target accuracy.

Proof Sketch

By the descent lemma: f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2.

By strong convexity: \|\nabla f(x_t)\|^2 \geq 2\mu(f(x_t) - f(x^*)).

Combining: f(x_{t+1}) - f(x^*) \leq (1 - \mu/L)(f(x_t) - f(x^*)).

Iterate: f(x_T) - f(x^*) \leq (1 - 1/\kappa)^T (f(x_0) - f(x^*)).

Replace f(x_0) - f(x^*) with \frac{L}{2}\|x_0 - x^*\|^2 by smoothness.
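The per-step contraction in this proof sketch can be observed directly. A minimal check on a diagonal quadratic, assuming NumPy; L and \mu are simply the largest and smallest eigenvalues.

```python
import numpy as np

# Strongly convex quadratic f(x) = 0.5 x^T A x, minimized at x* = 0 with f* = 0.
A = np.diag([10.0, 1.0])                   # eigenvalues: L = 10, mu = 1, kappa = 10
L, mu = 10.0, 1.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
for _ in range(50):
    fx = f(x)
    x = x - (A @ x) / L                    # GD step with eta = 1/L
    assert f(x) <= (1 - mu / L) * fx + 1e-15   # error contracts by (1 - 1/kappa)
print("per-step contraction (1 - 1/kappa) verified")
```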

Why It Matters

This is the workhorse convergence result for regularized ERM. Adding \ell_2 regularization \lambda\|w\|^2 makes any convex loss \mu-strongly convex with \mu = 2\lambda. The condition number becomes \kappa = L/(2\lambda), and the optimization converges in O((L/\lambda)\log(1/\epsilon)) iterations.

This directly connects to algorithmic stability: the same strong convexity that gives fast optimization also gives stability parameter O(1/(\lambda n)).

Failure Mode

Large condition number \kappa means slow convergence. For ridge regression with small regularization, \kappa can be enormous. Preconditioning or second-order methods can help, but at higher per-iteration cost.

Duality Preview

Every convex optimization problem has a dual problem. For the primal:

\min_x f(x) \quad \text{subject to } g_i(x) \leq 0, \; i = 1, \ldots, m

The Lagrangian is L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x) with \lambda_i \geq 0. The dual problem is:

\max_{\lambda \geq 0} \inf_x L(x, \lambda)

Weak duality: the dual optimal value is at most the primal optimal value (always). Strong duality: equality holds (under constraint qualifications such as Slater's condition).
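Both notions show up already in one dimension. A hand-derived sketch for a toy problem; the dual function below is computed analytically for this specific instance, so the code merely evaluates it on a grid.

```python
import numpy as np

# Toy primal: minimize f(x) = (x - 2)^2 subject to x <= 1.
# Lagrangian: L(x, lam) = (x - 2)^2 + lam * (x - 1), lam >= 0.
# For fixed lam, inf_x L is attained at x = 2 - lam/2, which gives the
# dual function q(lam) = lam - lam^2 / 4.

lams = np.linspace(0, 4, 401)
q = lams - lams ** 2 / 4                    # dual function on a grid
dual_opt = q.max()                          # attained at lam = 2
primal_opt = (1 - 2) ** 2                   # x* = 1 (constraint is active)

print(dual_opt <= primal_opt + 1e-12)       # weak duality: True
print(abs(dual_opt - primal_opt) < 1e-12)   # strong duality (Slater holds): True
```

Slater's condition is satisfied here (e.g. x = 0 strictly satisfies the constraint), which is why the duality gap closes exactly.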

Why does duality matter for ML?

  • SVMs: the dual of the SVM problem is a quadratic program in the support vector coefficients, leading to the kernel trick
  • Regularization: the dual of constrained ERM relates to regularized ERM (Lagrange multiplier ↔ regularization parameter)
  • Lower bounds: dual certificates provide provable lower bounds on the optimal value, enabling stopping criteria

Canonical Examples

Example

Quadratic (least squares)

f(w) = \frac{1}{2}\|Xw - y\|^2 where X \in \mathbb{R}^{n \times d}. This is L-smooth with L = \|X^\top X\|_{\text{op}} = \sigma_{\max}(X)^2 and (if X has full column rank) \mu-strongly convex with \mu = \sigma_{\min}(X)^2. The condition number is \kappa = (\sigma_{\max}/\sigma_{\min})^2, the squared condition number of X.

GD converges in O(\kappa \log(1/\epsilon)) steps. For well-conditioned X, this is fast. For ill-conditioned X (nearly collinear features), this is slow.
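The effect of collinearity on \kappa is easy to demonstrate. A minimal sketch, assuming NumPy; the ill-conditioned design is constructed by duplicating a column up to small noise.

```python
import numpy as np

rng = np.random.default_rng(2)
X_good = rng.standard_normal((100, 5))                  # roughly isotropic features
X_bad = X_good.copy()
X_bad[:, 1] = X_bad[:, 0] + 1e-3 * rng.standard_normal(100)   # near-duplicate column

kappas = {}
for name, X in [("well-conditioned", X_good), ("nearly collinear", X_bad)]:
    s = np.linalg.svd(X, compute_uv=False)
    kappas[name] = (s[0] / s[-1]) ** 2       # kappa = (sigma_max / sigma_min)^2
    print(name, "kappa =", kappas[name])
```

The nearly collinear design blows up \kappa by several orders of magnitude, which translates directly into proportionally more GD iterations.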

Example

Logistic regression

f(w) = \frac{1}{n}\sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) is convex and smooth (but not strongly convex without regularization). Adding \lambda\|w\|^2 makes it 2\lambda-strongly convex. The smoothness constant depends on \|x_i\|: the regularized logistic loss has Lipschitz gradient with constant L = \frac{1}{4n}\sum_i \|x_i\|^2 + 2\lambda.

Example

Non-smooth: Lasso

f(w) = \frac{1}{2n}\|Xw - y\|^2 + \lambda\|w\|_1. The \ell_1 penalty is not smooth (not differentiable at 0). Standard gradient descent does not apply directly. You need proximal gradient descent, which handles the smooth part with a gradient step and the non-smooth part with a "prox" operator. The prox of \lambda\|w\|_1 is soft-thresholding, and the resulting algorithm (ISTA) converges at rate O(1/T).
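ISTA is short enough to sketch in full. A minimal implementation, assuming NumPy; the problem instance (sparse ground truth, noise level, choice of \lambda) is illustrative, not tuned.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, steps=2000):
    """Proximal gradient (ISTA) for 0.5/n * ||Xw - y||^2 + lam * ||w||_1."""
    n = X.shape[0]
    L = np.linalg.svd(X, compute_uv=False)[0] ** 2 / n  # smoothness of the quadratic part
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n                       # gradient of the smooth part
        w = soft_threshold(w - g / L, lam / L)          # gradient step, then prox
    return w

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10); w_true[:2] = [3.0, -2.0]         # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(200)

w_hat = ista(X, y, lam=0.1)
print(np.nonzero(np.abs(w_hat) > 1e-6)[0])              # support: mostly the first two coordinates
```

Note that soft-thresholding sets small coordinates exactly to zero, which is how the \ell_1 penalty produces sparse solutions; plain GD on a smoothed surrogate would not.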

Common Confusions

Watch Out

Convexity is about the objective, not the model

A linear model trained with a non-convex loss has a non-convex optimization problem. A neural network trained with cross-entropy has a non-convex problem (because the model is non-linear in parameters, even though cross-entropy is convex in predictions). Convexity of the optimization landscape depends on the composition of loss and model.

Watch Out

Smoothness and strong convexity are separate properties

Smoothness bounds curvature from above (f does not curve too fast). Strong convexity bounds curvature from below (f curves at least a certain amount). A function can be smooth but not strongly convex (e.g., f(x) = |x| smoothed near 0), or strongly convex but not smooth (e.g., f(x) = x^2 + |x|). You need both for the fast O(\kappa \log(1/\epsilon)) rate.

Watch Out

The step size 1/L is not always practical

The theoretical step size \eta = 1/L requires knowing L exactly, which is often unknown. In practice, you use line search or adaptive step sizes (Adam, AdaGrad). The theory with \eta = 1/L gives a benchmark: this is the best GD can do, and practical methods should match or beat it.
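Backtracking line search is the standard way around not knowing L. A minimal sketch, assuming NumPy; the constants eta0, beta, c are conventional defaults, not tuned values.

```python
import numpy as np

def backtracking_gd(f, grad, x0, steps=100, eta0=1.0, beta=0.5, c=0.5):
    """GD with backtracking: shrink eta until the sufficient-decrease
    (Armijo) condition f(x - eta*g) <= f(x) - c*eta*||g||^2 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        eta = eta0
        while f(x - eta * g) > f(x) - c * eta * (g @ g):
            eta *= beta                     # halve the step until decrease is sufficient
        x = x - eta * g
    return x

# No knowledge of L is needed: minimize a quadratic with "unknown" curvature.
A = np.array([[8.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, 1.0])
x_hat = backtracking_gd(lambda x: 0.5 * x @ A @ x - b @ x,
                        lambda x: A @ x - b, np.zeros(2), steps=300)
print(np.allclose(x_hat, np.linalg.solve(A, b), atol=1e-6))   # True
```

For smooth f, the inner loop terminates once eta drops below roughly 1/L, so backtracking recovers the 1/L benchmark automatically, at the cost of extra function evaluations.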

Why Optimization is Not Separate from Learning Theory

The total error of a learning algorithm decomposes as:

\underbrace{R(h_{\text{output}}) - R^*}_{\text{excess risk}} = \underbrace{R(h^*_{\mathcal{H}}) - R^*}_{\text{approximation}} + \underbrace{R(h_{\text{ERM}}) - R(h^*_{\mathcal{H}})}_{\text{estimation}} + \underbrace{R(h_{\text{output}}) - R(h_{\text{ERM}})}_{\text{optimization}}

The optimization error is the gap between what the algorithm actually finds and the true ERM minimizer. Convex optimization theory directly controls this third term. If you can solve the ERM problem to \epsilon-accuracy in T iterations, and each iteration takes O(nd) time, the total computational cost of learning is O(ndT).

Exercises

Exercise · Core

Problem

Prove that the function f(x) = \frac{1}{2}x^\top A x - b^\top x is \mu-strongly convex and L-smooth when A is symmetric with eigenvalues in [\mu, L]. What is the condition number?

Exercise · Core

Problem

How many gradient descent iterations are needed to reach \epsilon = 10^{-6} accuracy on a \mu-strongly convex, L-smooth function with \kappa = 100? Compare with the merely convex case (same L, \|x_0 - x^*\| = 1).

Exercise · Advanced

Problem

Show that adding \ell_2 regularization \lambda\|w\|^2 to a convex, L-smooth function f(w) produces a function that is 2\lambda-strongly convex and (L + 2\lambda)-smooth. What is the new condition number?


References

Canonical:

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-3, 9
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-2
  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3, 11
  • Bertsekas, Convex Optimization Algorithms (2015), Chapters 1-2

Current:

  • Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Found. & Trends in ML

  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review

Next Topics

Natural next steps from convex optimization:

  • Regularization theory: how regularization connects optimization to generalization
  • Kernels and RKHS: convex optimization in infinite-dimensional function spaces
  • Online convex optimization: extending convex optimization to sequential, adversarial settings

Last reviewed: April 2026
