
Optimization Function Classes

Convex Optimization Basics

Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.

Core · Tier 1 · Stable · ~60 min

Why This Matters

[Interactive demo: Optimizer Race. SGD and Adam minimize the same loss surface side by side; the panel plots loss vs. step on a log scale.]

Every learning algorithm solves an optimization problem. When you minimize empirical risk, train a neural network, or fit a kernel SVM, you are running an optimization algorithm on a loss landscape. Convex optimization is the case where this landscape has no local-minima traps: every local minimum is global.

This is not a "separate" subject from learning theory. The convergence rate of your optimizer determines the computational cost of learning. The properties of your objective (smoothness, strong convexity) determine both how fast you can optimize and how well the result generalizes (via algorithmic stability). Optimization and generalization are entangled.

Mental Model

A convex function is a bowl: any line segment between two points on the graph lies above the graph. This means gradient descent, which always moves "downhill," will reach the bottom. The questions are:

  • How fast? (convergence rate)
  • How close to the true minimum? (optimization error)
  • What properties of the function control the answers?

Formal Setup: Convex Sets and Functions

Definition

Convex Set

A set \mathcal{C} \subseteq \mathbb{R}^d is convex if for all x, y \in \mathcal{C} and all \theta \in [0, 1]:

\theta x + (1-\theta) y \in \mathcal{C}

Every point on the line segment between x and y stays in \mathcal{C}. Examples: balls, halfspaces, polyhedra, the probability simplex. Non-examples: the union of two disjoint balls, the set \{x : \|x\| \geq 1\}.

Definition

Convex Function

A function f: \mathcal{C} \to \mathbb{R} on a convex set \mathcal{C} is convex if for all x, y \in \mathcal{C} and \theta \in [0, 1]:

f(\theta x + (1-\theta) y) \leq \theta f(x) + (1-\theta) f(y)

Equivalently, f lies below its chords. If the inequality is strict for x \neq y and \theta \in (0, 1), f is strictly convex.
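The chord inequality can be spot-checked numerically. A minimal sketch in plain Python (a grid check, not a proof): it tests the definition for a convex function and for one that is not convex on the chosen interval.

```python
import math

def violates_chord_inequality(f, xs, thetas, tol=1e-12):
    """Return True if some (x, y, theta) violates the convexity inequality
    f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)."""
    for x in xs:
        for y in xs:
            for theta in thetas:
                lhs = f(theta * x + (1 - theta) * y)
                rhs = theta * f(x) + (1 - theta) * f(y)
                if lhs > rhs + tol:
                    return True
    return False

xs = [i / 10 - 3 for i in range(61)]        # grid on [-3, 3]
thetas = [i / 10 for i in range(11)]        # theta in {0, 0.1, ..., 1}

print(violates_chord_inequality(lambda x: x * x, xs, thetas))   # False: x^2 is convex
print(violates_chord_inequality(math.sin, xs, thetas))          # True: sin is not convex on [-3, 3]
```

A grid check can only detect violations, never certify convexity; it is a sanity check on the definition, not a substitute for it.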

Definition

First-Order Condition for Convexity

If f is differentiable, then f is convex if and only if for all x, y:

f(y) \geq f(x) + \nabla f(x)^\top (y - x)

This says: f lies above every tangent hyperplane. The tangent at any point is a global lower bound on f. This single inequality is the most useful characterization of convexity in practice.

Definition

Smoothness (L-smooth)

A differentiable function f is L-smooth if \nabla f is L-Lipschitz:

\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad \forall x, y

Equivalently, for all x, y:

f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|y - x\|^2

Smoothness says the function does not curve too sharply: L is the maximum curvature. This is the key property that makes gradient descent work: it guarantees that a gradient step of size 1/L makes sufficient progress.

Definition

Strong Convexity

A function f is \mu-strongly convex (with \mu > 0) if for all x, y:

f(y) \geq f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|y - x\|^2

This says f curves at least as much as a quadratic with curvature \mu. Strong convexity implies a unique minimizer x^* and gives:

f(x) - f(x^*) \geq \frac{\mu}{2}\|x - x^*\|^2
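Both quadratic bounds (the smoothness upper bound and the strong-convexity lower bound) can be verified on a concrete function. A minimal sketch for the least-squares objective f(w) = \frac{1}{2}\|Xw - y\|^2, assuming NumPy; L and \mu are the squared largest and smallest singular values of X.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

def f(w):
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    return X.T @ (X @ w - y)

svals = np.linalg.svd(X, compute_uv=False)
L, mu = svals[0] ** 2, svals[-1] ** 2      # smoothness / strong-convexity constants

for _ in range(100):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    gap = f(v) - f(u) - grad(u) @ (v - u)  # curvature term between two random points
    d2 = np.sum((v - u) ** 2)
    assert mu / 2 * d2 - 1e-9 <= gap <= L / 2 * d2 + 1e-9
print("both quadratic bounds hold")
```

For a quadratic the gap equals \frac{1}{2}(v-u)^\top X^\top X (v-u) exactly, so it sits between the two bounds with equality attained along the extreme singular directions.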

Definition

Condition Number

The condition number of an L-smooth, \mu-strongly convex function is \kappa = L/\mu. It measures how "elongated" the level sets are. \kappa = 1 means the function is a perfect isotropic quadratic. Large \kappa means the function is badly conditioned: steep in some directions, flat in others.

The condition number controls the convergence rate of gradient descent: O((1 - 1/\kappa)^t), so ill-conditioned problems converge slowly.

Gradient Descent

The gradient descent algorithm with step size \eta > 0:

x_{t+1} = x_t - \eta \nabla f(x_t)

Starting from x_0, repeat until convergence. The step size \eta is the single most important hyperparameter. Too large: diverge. Too small: crawl.
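The update rule is a few lines of code. A minimal sketch, assuming NumPy; the quadratic test problem and its constants are illustrative.

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Quadratic example: f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()            # smoothness constant = largest eigenvalue

x_hat = gradient_descent(grad, np.zeros(2), eta=1.0 / L, steps=500)
x_star = np.linalg.solve(A, b)
print(np.allclose(x_hat, x_star, atol=1e-6))   # True
```

With \eta = 1/L the iterates converge to the solution of the linear system; a larger step size (say \eta = 1) would diverge for this A.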

Main Theorems

Theorem

GD Convergence for Smooth Convex Functions

Statement

If f is convex and L-smooth, then gradient descent with step size \eta = 1/L satisfies:

f(x_T) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2T}

That is, the optimization error decreases as O(1/T).

Intuition

Each gradient step with \eta = 1/L makes progress proportional to \|\nabla f(x_t)\|^2/L. Even when gradients become small near the optimum, the convexity of f ensures that the iterates steadily approach x^*. The 1/T rate means you need T = O(1/\epsilon) iterations to reach \epsilon-accuracy.

Proof Sketch

Step 1 (Descent Lemma): By L-smoothness, choosing \eta = 1/L gives:

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2

Step 2: By convexity, f(x_t) - f(x^*) \leq \nabla f(x_t)^\top(x_t - x^*) \leq \|\nabla f(x_t)\| \cdot \|x_t - x^*\|.

Step 3: Track \|x_t - x^*\|^2. By the update rule:

\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2}{L}\nabla f(x_t)^\top(x_t - x^*) + \frac{1}{L^2}\|\nabla f(x_t)\|^2

Combine using convexity: \nabla f(x_t)^\top(x_t - x^*) \geq f(x_t) - f(x^*). Sum from t = 0 to T-1. The left side telescopes. Rearrange to get the O(1/T) bound.

Why It Matters

This is the baseline convergence result. The O(1/T) rate tells you that gradient descent works but is not spectacularly fast for merely convex problems. To do better, you need either (a) strong convexity or (b) acceleration (Nesterov's method gives O(1/T^2), which is optimal among first-order methods on smooth convex problems).

In ML, this rate determines the number of passes over the data needed to approximately solve the ERM problem.
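The O(1/T) guarantee can be checked empirically. A minimal sketch on a least-squares problem, assuming NumPy; the bound should hold at every iteration (here with room to spare, since least squares is additionally strongly convex).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)

f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y)
L = np.linalg.svd(X, compute_uv=False)[0] ** 2          # smoothness constant
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)          # exact minimizer
f_star = f(w_star)

w = np.zeros(5)
R2 = np.sum(w_star ** 2)                                # ||x_0 - x*||^2 with x_0 = 0
for T in range(1, 201):
    w = w - grad(w) / L                                 # one GD step with eta = 1/L
    assert f(w) - f_star <= L * R2 / (2 * T) + 1e-9     # theorem's bound at step T
print("O(1/T) bound verified for T = 1..200")
```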

Failure Mode

The bound scales with \|x_0 - x^*\|^2: bad initialization hurts. Also, the O(1/T) rate applies to the function-value gap, not the parameter distance \|x_T - x^*\|. For the parameter distance, you need strong convexity.

Theorem

GD Convergence for Smooth and Strongly Convex Functions

Statement

If f is \mu-strongly convex and L-smooth, then gradient descent with \eta = 1/L satisfies:

f(x_T) - f(x^*) \leq \frac{L}{2}\Bigl(1 - \frac{1}{\kappa}\Bigr)^T \|x_0 - x^*\|^2

where \kappa = L/\mu is the condition number. The convergence is linear (exponential decrease): the error contracts by a factor (1 - 1/\kappa) per iteration.

Intuition

Strong convexity ensures that when you are far from x^*, the gradient is large, so gradient descent makes large steps. When you are close, the gradient is small, but you no longer need large steps. The exponential convergence rate means the number of iterations to reach \epsilon-accuracy is O(\kappa \log(1/\epsilon)): logarithmic in the target accuracy.

Proof Sketch

By the descent lemma: f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2.

By strong convexity: \|\nabla f(x_t)\|^2 \geq 2\mu(f(x_t) - f(x^*)).

Combining: f(x_{t+1}) - f(x^*) \leq (1 - \mu/L)(f(x_t) - f(x^*)).

Iterate: f(x_T) - f(x^*) \leq (1 - 1/\kappa)^T (f(x_0) - f(x^*)).

Replace f(x_0) - f(x^*) with \frac{L}{2}\|x_0 - x^*\|^2 by smoothness.
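The per-step contraction in this proof sketch can be observed directly. A minimal check on a diagonal quadratic, assuming NumPy; L and \mu are simply the largest and smallest eigenvalues.

```python
import numpy as np

# Strongly convex quadratic f(x) = 0.5 x^T A x, minimized at x* = 0 with f* = 0.
A = np.diag([10.0, 1.0])                   # eigenvalues: L = 10, mu = 1, kappa = 10
L, mu = 10.0, 1.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
for _ in range(50):
    fx = f(x)
    x = x - (A @ x) / L                    # GD step with eta = 1/L
    assert f(x) <= (1 - mu / L) * fx + 1e-15   # error contracts by (1 - 1/kappa)
print("per-step contraction (1 - 1/kappa) verified")
```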

Why It Matters

This is the workhorse convergence result for regularized ERM. Adding \ell_2 regularization \lambda\|w\|^2 makes any convex loss \mu-strongly convex with \mu = 2\lambda. The condition number becomes \kappa = L/(2\lambda), and the optimization converges in O((L/\lambda)\log(1/\epsilon)) iterations.

This directly connects to algorithmic stability: the same strong convexity that gives fast optimization also gives stability parameter O(1/(\lambda n)).

Failure Mode

Large condition number \kappa means slow convergence. For ridge regression with small regularization, \kappa can be enormous. Preconditioning or second-order methods can help, but at higher per-iteration cost.

Duality Preview

Every convex optimization problem has a dual problem. For the primal:

\min_x f(x) \quad \text{subject to } g_i(x) \leq 0, \; i = 1, \ldots, m

The Lagrangian is L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x) with \lambda_i \geq 0. The dual problem is:

\max_{\lambda \geq 0} \inf_x L(x, \lambda)

Weak duality: the dual optimal value is at most the primal optimal value (always). Strong duality: equality holds (under constraint qualifications such as Slater's condition).
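Both notions show up already in one dimension. A hand-derived sketch for a toy problem; the dual function below is computed analytically for this specific instance, so the code merely evaluates it on a grid.

```python
import numpy as np

# Toy primal: minimize f(x) = (x - 2)^2 subject to x <= 1.
# Lagrangian: L(x, lam) = (x - 2)^2 + lam * (x - 1), lam >= 0.
# For fixed lam, inf_x L is attained at x = 2 - lam/2, which gives the
# dual function q(lam) = lam - lam^2 / 4.

lams = np.linspace(0, 4, 401)
q = lams - lams ** 2 / 4                    # dual function on a grid
dual_opt = q.max()                          # attained at lam = 2
primal_opt = (1 - 2) ** 2                   # x* = 1 (constraint is active)

print(dual_opt <= primal_opt + 1e-12)       # weak duality: True
print(abs(dual_opt - primal_opt) < 1e-12)   # strong duality (Slater holds): True
```

Slater's condition is satisfied here (e.g. x = 0 strictly satisfies the constraint), which is why the duality gap closes exactly.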

Why does duality matter for ML?

  • SVMs: the dual of the SVM problem is a quadratic program in the support vector coefficients, leading to the kernel trick
  • Regularization: the dual of constrained ERM relates to regularized ERM (Lagrange multiplier ↔ regularization parameter)
  • Lower bounds: dual certificates provide provable lower bounds on the optimal value, enabling stopping criteria

Canonical Examples

Example

Quadratic (least squares)

f(w) = \frac{1}{2}\|Xw - y\|^2 where X \in \mathbb{R}^{n \times d}. This is L-smooth with L = \|X^\top X\|_{\text{op}} = \sigma_{\max}(X)^2 and (if X has full column rank) \mu-strongly convex with \mu = \sigma_{\min}(X)^2. The condition number is \kappa = (\sigma_{\max}/\sigma_{\min})^2, the squared condition number of X.

GD converges in O(\kappa \log(1/\epsilon)) steps. For well-conditioned X, this is fast. For ill-conditioned X (nearly collinear features), this is slow.
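The effect of collinearity on \kappa is easy to demonstrate. A minimal sketch, assuming NumPy; the ill-conditioned design is constructed by duplicating a column up to small noise.

```python
import numpy as np

rng = np.random.default_rng(2)
X_good = rng.standard_normal((100, 5))                  # roughly isotropic features
X_bad = X_good.copy()
X_bad[:, 1] = X_bad[:, 0] + 1e-3 * rng.standard_normal(100)   # near-duplicate column

kappas = {}
for name, X in [("well-conditioned", X_good), ("nearly collinear", X_bad)]:
    s = np.linalg.svd(X, compute_uv=False)
    kappas[name] = (s[0] / s[-1]) ** 2       # kappa = (sigma_max / sigma_min)^2
    print(name, "kappa =", kappas[name])
```

The nearly collinear design blows up \kappa by several orders of magnitude, which translates directly into proportionally more GD iterations.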

Example

Logistic regression

f(w) = \frac{1}{n}\sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) is convex and smooth (but not strongly convex without regularization). Adding \lambda\|w\|^2 makes it 2\lambda-strongly convex. The smoothness constant depends on \|x_i\|: the regularized logistic loss has Lipschitz gradient with constant L = \frac{1}{4n}\sum_i \|x_i\|^2 + 2\lambda.

Example

Non-smooth: Lasso

f(w) = \frac{1}{2n}\|Xw - y\|^2 + \lambda\|w\|_1. The \ell_1 penalty is not smooth (not differentiable at 0). Standard gradient descent does not apply directly. You need proximal gradient descent, which handles the smooth part with a gradient step and the non-smooth part with a "prox" operator. The prox of \lambda\|w\|_1 is soft-thresholding, and the resulting algorithm (ISTA) converges at rate O(1/T).
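ISTA is short enough to sketch in full. A minimal implementation, assuming NumPy; the problem instance (sparse ground truth, noise level, choice of \lambda) is illustrative, not tuned.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, steps=2000):
    """Proximal gradient (ISTA) for 0.5/n * ||Xw - y||^2 + lam * ||w||_1."""
    n = X.shape[0]
    L = np.linalg.svd(X, compute_uv=False)[0] ** 2 / n  # smoothness of the quadratic part
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n                       # gradient of the smooth part
        w = soft_threshold(w - g / L, lam / L)          # gradient step, then prox
    return w

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10); w_true[:2] = [3.0, -2.0]         # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(200)

w_hat = ista(X, y, lam=0.1)
print(np.nonzero(np.abs(w_hat) > 1e-6)[0])              # support: mostly the first two coordinates
```

Note that soft-thresholding sets small coordinates exactly to zero, which is how the \ell_1 penalty produces sparse solutions; plain GD on a smoothed surrogate would not.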

Common Confusions

Watch Out

Convexity is about the objective, not the model

A linear model trained with a non-convex loss has a non-convex optimization problem. A neural network trained with cross-entropy has a non-convex problem (because the model is non-linear in parameters, even though cross-entropy is convex in predictions). Convexity of the optimization landscape depends on the composition of loss and model.

Watch Out

Smoothness and strong convexity are separate properties

Smoothness bounds curvature from above (f does not curve too fast). Strong convexity bounds curvature from below (f curves at least a certain amount). A function can be smooth but not strongly convex (e.g., f(x) = |x| smoothed near 0), or strongly convex but not smooth (e.g., f(x) = x^2 + |x|). You need both for the fast O(\kappa \log(1/\epsilon)) rate.

Watch Out

The step size 1/L is not always practical

The theoretical step size \eta = 1/L requires knowing L exactly, which is often unknown. In practice, you use line search or adaptive step sizes (Adam, AdaGrad). The theory with \eta = 1/L gives a benchmark: this is the best GD can do, and practical methods should match or beat it.
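Backtracking line search is the standard way around not knowing L. A minimal sketch, assuming NumPy; the constants eta0, beta, c are conventional defaults, not tuned values.

```python
import numpy as np

def backtracking_gd(f, grad, x0, steps=100, eta0=1.0, beta=0.5, c=0.5):
    """GD with backtracking: shrink eta until the sufficient-decrease
    (Armijo) condition f(x - eta*g) <= f(x) - c*eta*||g||^2 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        eta = eta0
        while f(x - eta * g) > f(x) - c * eta * (g @ g):
            eta *= beta                     # halve the step until decrease is sufficient
        x = x - eta * g
    return x

# No knowledge of L is needed: minimize a quadratic with "unknown" curvature.
A = np.array([[8.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, 1.0])
x_hat = backtracking_gd(lambda x: 0.5 * x @ A @ x - b @ x,
                        lambda x: A @ x - b, np.zeros(2), steps=300)
print(np.allclose(x_hat, np.linalg.solve(A, b), atol=1e-6))   # True
```

For smooth f, the inner loop terminates once eta drops below roughly 1/L, so backtracking recovers the 1/L benchmark automatically, at the cost of extra function evaluations.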

Why Optimization is Not Separate from Learning Theory

The total error of a learning algorithm decomposes as:

\underbrace{R(h_{\text{output}}) - R^*}_{\text{excess risk}} = \underbrace{R(h^*_{\mathcal{H}}) - R^*}_{\text{approximation}} + \underbrace{R(h_{\text{ERM}}) - R(h^*_{\mathcal{H}})}_{\text{estimation}} + \underbrace{R(h_{\text{output}}) - R(h_{\text{ERM}})}_{\text{optimization}}

The optimization error is the gap between what the algorithm actually finds and the true ERM minimizer. Convex optimization theory directly controls this third term. If you can solve the ERM problem to \epsilon-accuracy in T iterations, and each iteration takes O(nd) time, the total computational cost of learning is O(ndT).

Exercises

Exercise · Core

Problem

Prove that the function f(x) = \frac{1}{2}x^\top A x - b^\top x is \mu-strongly convex and L-smooth when A is symmetric with eigenvalues in [\mu, L]. What is the condition number?

Exercise · Core

Problem

How many gradient descent iterations are needed to reach \epsilon = 10^{-6} accuracy on a \mu-strongly convex, L-smooth function with \kappa = 100? Compare with the merely convex case (same L, \|x_0 - x^*\| = 1).

Exercise · Advanced

Problem

Show that adding \ell_2 regularization \lambda\|w\|^2 to a convex, L-smooth function f(w) produces a function that is 2\lambda-strongly convex and (L + 2\lambda)-smooth. What is the new condition number?


References

Canonical:

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-3, 9
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-2
  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3, 11
  • Bertsekas, Convex Optimization Algorithms (2015), Chapters 1-2

Current:

  • Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Found. & Trends in ML

  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review

Next Topics

Natural next steps from convex optimization:

  • Regularization theory: how regularization connects optimization to generalization
  • Kernels and RKHS: convex optimization in infinite-dimensional function spaces
  • Online convex optimization: extending convex optimization to sequential, adversarial settings

Last reviewed: April 2026
