Numerical Optimization
Online Convex Optimization
A general framework for sequential decision-making with convex losses: online gradient descent, follow the regularized leader, adaptive methods, and the O(sqrt(T)) regret guarantee that unifies many algorithms.
Why This Matters
Online convex optimization (OCO) is a unifying framework. Online gradient descent, AdaGrad, mirror descent, and follow-the-regularized-leader are all instances of one template. By proving regret bounds in the OCO framework, you get guarantees for all of these algorithms simultaneously. The online-to-batch conversion then transfers these guarantees to the standard stochastic optimization setting.
Many algorithms you use daily in ML (Adam, AdaGrad, SGD with momentum) have their tightest known analyses in the OCO framework.
Mental Model
An adversary repeatedly chooses a convex loss function. You must choose a decision before seeing the loss. After $T$ rounds, your regret measures how much worse you did compared to the best fixed decision in hindsight. An algorithm is "good" if regret grows sublinearly in $T$, meaning your per-round average loss converges to that of the best fixed decision.
Formal Setup
Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a convex, compact decision set. At each round $t = 1, \dots, T$:
- The learner picks $x_t \in \mathcal{K}$
- The adversary reveals a convex loss function $f_t : \mathcal{K} \to \mathbb{R}$
- The learner incurs loss $f_t(x_t)$
Regret
The regret after $T$ rounds is:

$$\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)$$

The comparator $x^\star \in \arg\min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)$ is the single best fixed decision in hindsight.
Online Gradient Descent (OGD)
Starting from $x_1 \in \mathcal{K}$, OGD updates:

$$x_{t+1} = \Pi_{\mathcal{K}}\!\left(x_t - \eta_t \nabla f_t(x_t)\right)$$

where $\Pi_{\mathcal{K}}$ is the Euclidean projection onto $\mathcal{K}$ and $\eta_t$ is the step size at round $t$.
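The update above can be sketched in a few lines. This is a minimal illustration, not a library implementation; the function names, the toy quadratic losses, and the constants are all illustrative:

```python
import numpy as np

def ogd(grad, project, x1, T, D, G):
    """Projected online gradient descent with eta_t = D / (G * sqrt(t)).
    grad(t, x): gradient of the round-t loss at x (revealed after playing x).
    project(x): Euclidean projection onto the decision set K."""
    x = np.asarray(x1, dtype=float)
    iterates = [x.copy()]
    for t in range(1, T + 1):
        eta = D / (G * np.sqrt(t))
        x = project(x - eta * grad(t, x))
        iterates.append(x.copy())
    return iterates

# Toy run: losses f_t(x) = (x - z_t)^2 on K = [-1, 1], so the diameter is
# D = 2 and |f_t'(x)| = 2|x - z_t| <= 4 gives G = 4.
rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=200)
xs = ogd(grad=lambda t, x: 2.0 * (x - z[t - 1]),
         project=lambda x: np.clip(x, -1.0, 1.0),
         x1=np.array([0.0]), T=200, D=2.0, G=4.0)
```

Note that the learner only sees the gradient of $f_t$ after committing to $x_t$, which is why `grad` is called with the current iterate inside the loop.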
Follow the Regularized Leader (FTRL)
FTRL selects:

$$x_{t+1} = \arg\min_{x \in \mathcal{K}} \left( \sum_{s=1}^{t} f_s(x) + \frac{1}{\eta} R(x) \right)$$

where $R$ is a strongly convex regularizer. With $R(x) = \tfrac{1}{2}\|x\|_2^2$ and linearized losses, FTRL is closely related to OGD.
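In the quadratic case the argmin has a closed form, which makes the OGD connection concrete. A sketch, assuming linearized losses $f_s(x) \approx \langle g_s, x \rangle$ (function and variable names are illustrative):

```python
import numpy as np

def ftrl_quadratic_step(cum_grad, eta, project):
    """FTRL on linearized losses with R(x) = ||x||^2 / 2:
    argmin_x  <sum_s g_s, x> + R(x)/eta
            = argmin_x (1/(2*eta)) * ||x + eta * sum_s g_s||^2,
    so the minimizer over a convex set K is the Euclidean projection
    of -eta * cum_grad onto K (the 'lazy projection' update)."""
    return project(-eta * cum_grad)

# On the box [-1, 1]^2: accumulate linearized gradients, replay the closed form.
project = lambda x: np.clip(x, -1.0, 1.0)
cum = np.zeros(2)
for g in [np.array([0.5, -0.2]), np.array([0.3, 0.1])]:
    cum += g
    x_next = ftrl_quadratic_step(cum, eta=0.5, project=project)
```

The design point: FTRL keeps only the cumulative gradient as state, whereas OGD keeps the current iterate; on unconstrained problems the two coincide.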
Main Theorems
Online Gradient Descent Regret Bound
Statement
Let $\mathcal{K}$ have diameter $D$ and suppose $\|\nabla f_t(x)\|_2 \le G$ for all $t$ and $x \in \mathcal{K}$. With step size $\eta_t = \frac{D}{G\sqrt{t}}$, OGD satisfies:

$$\mathrm{Regret}_T \le \tfrac{3}{2}\, G D \sqrt{T}$$

With the optimal constant step size $\eta = \frac{D}{G\sqrt{T}}$ (requiring knowledge of $T$):

$$\mathrm{Regret}_T \le G D \sqrt{T}$$
Intuition
Each step of OGD moves toward the comparator by following the gradient, but each step also incurs an error proportional to the step size. The $\sqrt{T}$ comes from balancing the cumulative step errors ($\eta G^2 T$) against the distance-to-comparator term ($D^2/\eta$). Optimizing over $\eta$ gives $\eta = \frac{D}{G\sqrt{T}}$.
Proof Sketch
For any comparator $x^\star$, convexity gives $f_t(x_t) - f_t(x^\star) \le \langle \nabla f_t(x_t),\, x_t - x^\star \rangle$. Expanding the squared distance after the update (projection only shrinks it) yields the three-point identity:

$$\langle \nabla f_t(x_t),\, x_t - x^\star \rangle \le \frac{\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2}{2\eta} + \frac{\eta}{2}\|\nabla f_t(x_t)\|^2$$

Sum over $t$. The first term telescopes to at most $\frac{D^2}{2\eta}$. The second sums to at most $\frac{\eta G^2 T}{2}$. With $\eta = \frac{D}{G\sqrt{T}}$, the total is $G D \sqrt{T}$.
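The bound can be checked numerically. A sketch on linear losses (the setup, constants, and random sign sequence are illustrative; the guarantee itself is deterministic, so the check must pass for any loss sequence):

```python
import numpy as np

# Linear losses f_t(x) = g_t * x on K = [-1, 1] (so D = 2, and |g_t| <= 1
# gives G = 1), with the schedule eta_t = D / (G * sqrt(t)); the analysis
# guarantees Regret_T <= (3/2) * G * D * sqrt(T) for this schedule.
rng = np.random.default_rng(1)
T = 1000
g = rng.choice([-1.0, 1.0], size=T)
x, total_loss = 0.0, 0.0
for t in range(1, T + 1):
    total_loss += g[t - 1] * x                 # loss incurred at x_t
    x = float(np.clip(x - (2.0 / np.sqrt(t)) * g[t - 1], -1.0, 1.0))
best_fixed = min(g.sum() * u for u in (-1.0, 1.0))  # best fixed point is an endpoint
regret = total_loss - best_fixed
assert regret <= 1.5 * 1.0 * 2.0 * np.sqrt(T)       # Regret_T <= (3/2) G D sqrt(T)
```

For linear losses on an interval, the best fixed comparator sits at an endpoint, which is why `best_fixed` only checks $u = \pm 1$.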
Why It Matters
The $O(\sqrt{T})$ regret bound is optimal for general convex losses (matching lower bounds). It means the average regret $\mathrm{Regret}_T / T \to 0$, so OGD is a no-regret algorithm. The proof technique (telescoping via Bregman divergences) is the template for analyzing nearly all first-order online methods.
Failure Mode
The bound assumes bounded gradients and a bounded domain. For unconstrained optimization or heavy-tailed gradients, the guarantee breaks. For strongly convex losses, the $O(\sqrt{T})$ rate is loose; $O(\log T)$ is achievable and optimal.
FTRL Regret Bound
Statement
Let $R$ be $\sigma$-strongly convex with respect to a norm $\|\cdot\|$ on $\mathcal{K}$. Define the dual norm bound $G = \max_t \|\nabla f_t(x_t)\|_*$. FTRL with regularizer $\frac{1}{\eta} R$ satisfies:

$$\mathrm{Regret}_T \le \frac{R(x^\star) - \min_{x \in \mathcal{K}} R(x)}{\eta} + \frac{\eta\, G^2 T}{\sigma}$$

With the optimal $\eta$ balancing the two terms, this gives $\mathrm{Regret}_T = O\!\left(G \sqrt{T \left(R(x^\star) - \min_x R(x)\right) / \sigma}\right)$.
Intuition
FTRL balances two forces: the regularizer keeps decisions stable (preventing oscillation), while the cumulative loss pulls decisions toward the best action. The regret bound trades off the initial penalty from regularization against the stability gained.
Proof Sketch
Define $\Phi_t(x) = \sum_{s=1}^{t} f_s(x) + \frac{1}{\eta} R(x)$, so that $x_{t+1} = \arg\min_{x \in \mathcal{K}} \Phi_t(x)$. Since $x_{t+1}$ minimizes $\Phi_t$, use the strong convexity of $\Phi_t$ to bound $\|x_t - x_{t+1}\|$, which controls the per-round stability term $f_t(x_t) - f_t(x_{t+1})$. Telescope and use the fact that "follow the leader" run on the sequence prefixed with $\frac{1}{\eta}R$ has nonpositive regret (the be-the-leader lemma).
Why It Matters
FTRL generalizes OGD to arbitrary geometries via the choice of regularizer. With $R(x) = \tfrac{1}{2}\|x\|_2^2$, you recover OGD. With $R(x) = \sum_i x_i \log x_i$ (negative entropy) on the probability simplex, you recover the multiplicative weights/Hedge algorithm. One framework, many algorithms.
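The entropic case also has a closed-form argmin, which is exactly the exponential-weights update. A sketch (function name and constants are illustrative):

```python
import numpy as np

def hedge_weights(cum_grad, eta):
    """FTRL with R(x) = sum_i x_i * log(x_i) over the simplex:
    the minimizer of <cum_grad, x> + R(x)/eta is
    x_i proportional to exp(-eta * cum_grad_i)."""
    logits = -eta * cum_grad
    logits -= logits.max()        # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Three actions with cumulative losses 3, 1, 2: the resulting weights
# sum to 1 and favor the action with the smallest cumulative loss.
w = hedge_weights(np.array([3.0, 1.0, 2.0]), eta=1.0)
```

The max-shift before exponentiating changes nothing mathematically (it cancels in the normalization) but avoids overflow for large cumulative losses.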
Failure Mode
Choosing the right regularizer requires knowing the geometry of the problem. A poorly matched regularizer (e.g., Euclidean regularization on the simplex) gives suboptimal regret. Computing the FTRL update requires solving a convex optimization problem at each step, which may be expensive for complex $\mathcal{K}$ and $R$.
Adaptive Methods: AdaGrad as OCO
AdaGrad Diagonal Regret Bound
Statement
AdaGrad (diagonal version) uses per-coordinate step sizes $\eta_{t,i} = \frac{\eta}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2}}$, where $g_{s,i}$ is the $i$-th component of $g_s = \nabla f_s(x_s)$. Its regret satisfies:

$$\mathrm{Regret}_T = O\!\left( D_\infty \sum_{i=1}^{d} \sqrt{\sum_{t=1}^{T} g_{t,i}^2} \right)$$

If all gradient components are bounded by $G_\infty$, this is at most $O(D_\infty G_\infty\, d \sqrt{T})$. But if gradients are sparse (most components are zero), the bound can be much smaller.
Intuition
AdaGrad gives larger step sizes to infrequent features and smaller step sizes to frequent ones. Coordinates with small cumulative gradient magnitude get more aggressive updates. This is automatic adaptation without needing to know the gradient distribution in advance.
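A minimal sketch of the diagonal update (the epsilon guard is a common implementation choice, not part of the idealized analysis; function names and the sparse toy problem are illustrative):

```python
import numpy as np

def adagrad_diagonal(grad, x0, T, eta=0.5, eps=1e-8):
    """Per-coordinate steps eta / sqrt(cumulative squared gradient).
    Rarely-updated coordinates keep large steps; frequent ones shrink."""
    x = np.asarray(x0, dtype=float)
    accum = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad(t, x)
        accum += g * g
        x = x - eta * g / (np.sqrt(accum) + eps)
    return x

# Sparse-gradient example: coordinate 0 receives a gradient every round
# (loss x[0]^2), coordinate 1 only every 10th round (loss (x[1] - 1)^2).
def grad(t, x):
    g = np.array([2.0 * x[0], 0.0])
    if t % 10 == 0:
        g[1] = 2.0 * (x[1] - 1.0)
    return g

x = adagrad_diagonal(grad, x0=np.array([1.0, 0.0]), T=500)
```

Because coordinate 1 accumulates squared gradient only on its active rounds, its effective step size stays large, and it still reaches its optimum despite seeing one-tenth as many updates.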
Proof Sketch
Apply the FTRL framework with a time-varying regularizer $R_t(x) = \frac{1}{2}\sum_i \sqrt{\sum_{s \le t} g_{s,i}^2}\; x_i^2$. The regret bound follows from bounding the regularizer term and the gradient term separately. The key insight is that the regularizer grows with the observed gradients, automatically balancing stability and responsiveness.
Why It Matters
AdaGrad shows that adaptive step sizes emerge naturally from the OCO framework. The online-to-batch conversion of AdaGrad gives convergence guarantees for stochastic optimization that adapt to the gradient noise structure. This is the theoretical foundation for Adam and other adaptive gradient descent variants used in deep learning.
Failure Mode
AdaGrad's step sizes monotonically decrease, which can be too aggressive for non-stationary problems. In deep learning, this motivates Adam (which uses exponential moving averages instead of cumulative sums). The diagonal approximation ignores correlations between coordinates.
Online-to-Batch Conversion
If an online algorithm achieves regret $R_T$ against adversarial losses, it also works for stochastic optimization. Given i.i.d. losses $f_t(\cdot) = f(\cdot\,;\xi_t)$ drawn from a distribution, the average iterate $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ satisfies:

$$\mathbb{E}\!\left[F(\bar{x}_T)\right] - \min_{x \in \mathcal{K}} F(x) \le \frac{\mathbb{E}[R_T]}{T}$$

where $F(x) = \mathbb{E}_{\xi}[f(x; \xi)]$. So $O(\sqrt{T})$ regret gives $O(1/\sqrt{T})$ optimization error. This is the standard SGD rate for convex optimization, derived with no extra work.
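The conversion can be demonstrated directly: run OGD on sampled gradients and average the iterates. A sketch under an assumed toy distribution (the loss family, domain, and constants are illustrative):

```python
import numpy as np

# Stochastic losses f(x; z) = (x - z)^2 with z ~ Uniform[0, 1] on K = [0, 1].
# F(x) = E[(x - z)^2] is minimized at x = E[z] = 0.5; the averaged OGD
# iterate should approach that minimizer at the O(1/sqrt(T)) rate.
rng = np.random.default_rng(2)
T, D, G = 4000, 1.0, 2.0          # |f'(x; z)| = 2|x - z| <= 2 on [0, 1]
x, iterates = 0.0, []
for t in range(1, T + 1):
    iterates.append(x)            # record x_t before seeing the sample
    z = rng.uniform(0.0, 1.0)
    x = float(np.clip(x - (D / (G * np.sqrt(t))) * 2.0 * (x - z), 0.0, 1.0))
x_bar = float(np.mean(iterates))  # the online-to-batch output
```

Note that the conversion outputs the *average* iterate, not the last one: the regret bound controls the average loss of the trajectory, and Jensen's inequality transfers that to $\bar{x}_T$.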
Connection to Bandit Optimization
In the bandit setting, you observe only the loss value $f_t(x_t)$, not the gradient. You must estimate gradients from function values alone. The standard approach: pick a random perturbation $u_t$ uniformly on the unit sphere and estimate the gradient via:

$$\tilde{g}_t = \frac{d}{\delta}\, f_t(x_t + \delta u_t)\, u_t$$

This is an unbiased estimator of the gradient of a smoothed version of $f_t$. Plugging it into OGD gives $O(T^{3/4})$ regret in the bandit setting, a factor of $T^{1/4}$ worse than the full-information case.
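A sketch of the estimator. For a linear loss, smoothing changes nothing, so the estimates should average to the true gradient, which makes unbiasedness easy to check empirically (function name and test setup are illustrative):

```python
import numpy as np

def spherical_grad_estimate(f, x, delta, rng):
    """One-point bandit estimate: (d / delta) * f(x + delta * u) * u,
    with u uniform on the unit sphere (a normalized Gaussian draw).
    Unbiased for the gradient of the delta-smoothed version of f."""
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return (d / delta) * f(x + delta * u) * u

# Sanity check on a linear loss f(x) = <c, x>, whose smoothed gradient is c.
rng = np.random.default_rng(3)
c = np.array([1.0, -2.0, 0.5])
f = lambda x: float(c @ x)
est = np.mean([spherical_grad_estimate(f, np.zeros(3), 0.1, rng)
               for _ in range(50_000)], axis=0)
```

The price of one-point feedback is variance: each estimate has magnitude up to $d\,|f|/\delta$, and this variance is what degrades the regret from $O(\sqrt{T})$ to $O(T^{3/4})$.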
Common Confusions
Regret is not the same as convergence
An algorithm with $O(\sqrt{T})$ regret does not necessarily converge to a single point. The iterates may keep moving. What converges is the average performance: the per-round regret $\mathrm{Regret}_T / T \to 0$.
The comparator is fixed, not adaptive
Standard regret compares against the single best fixed decision in hindsight. It does not compare against a sequence of decisions that changes over time. Competing with changing comparators requires adaptive regret or dynamic regret, which have different (larger) bounds.
OGD is not the same as SGD
OGD operates on adversarial loss sequences with no distributional assumptions. SGD assumes i.i.d. stochastic gradients. OGD's regret bound implies SGD's convergence bound via online-to-batch conversion, but not the other way around. OGD is the stronger result.
Key Takeaways
- OCO: learner picks $x_t$, adversary reveals convex $f_t$, regret measures the cumulative gap to the best fixed decision
- OGD achieves $O(\sqrt{T})$ regret, optimal for general convex losses
- FTRL unifies OGD, multiplicative weights, and mirror descent via the regularizer choice
- AdaGrad adapts step sizes using cumulative gradient information, excels with sparse gradients
- Online-to-batch conversion: $O(\sqrt{T})$ regret implies the $O(1/\sqrt{T})$ stochastic optimization rate
- Strongly convex losses give $O(\log T)$ regret, much faster than the general $O(\sqrt{T})$
Exercises
Problem
OGD with the optimal constant step size achieves regret at most $GD\sqrt{T}$. What is the per-round average regret after $T$ rounds, in terms of $G$ and $D$?
Problem
Show that for strongly convex losses (each $f_t$ is $\mu$-strongly convex), OGD with step size $\eta_t = \frac{1}{\mu t}$ achieves $\mathrm{Regret}_T = O\!\left(\frac{G^2}{\mu} \log T\right)$. Sketch the proof by modifying the standard OGD analysis.
References
Canonical:
- Shalev-Shwartz, Online Learning and Online Convex Optimization, Foundations and Trends in ML (2012)
- Hazan, Introduction to Online Convex Optimization (2016), Chapters 1-5
Current:
- Orabona, A Modern Introduction to Online Learning (2023), Chapters 2-7
- Duchi, Hazan, Singer, "Adaptive Subgradient Methods for Online Learning" (JMLR, 2011)
Next Topics
Natural extensions from OCO:
- Stochastic optimization theory: from online regret to convergence rates for i.i.d. losses
- Bandit algorithms: what happens when you only observe the loss value, not the gradient
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- No-Regret Learning (Layer 3)