
Numerical Optimization

Online Convex Optimization

A general framework for sequential decision-making with convex losses: online gradient descent, follow the regularized leader, adaptive methods, and the O(sqrt(T)) regret guarantee that unifies many algorithms.


Why This Matters

Online convex optimization (OCO) is a unifying framework. Online gradient descent, AdaGrad, mirror descent, and follow-the-regularized-leader are all instances of one template. By proving regret bounds in the OCO framework, you get guarantees for all of these algorithms simultaneously. The online-to-batch conversion then transfers these guarantees to the standard stochastic optimization setting.

Many algorithms you use daily in ML (Adam, AdaGrad, SGD with momentum) have their tightest known analyses in the OCO framework.

Mental Model

An adversary repeatedly chooses a convex loss function. You must choose a decision before seeing the loss. After T rounds, your regret measures how much worse you did compared to the best fixed decision in hindsight. An algorithm is "good" if regret grows sublinearly in T, meaning your per-round average loss converges to that of the best fixed decision.

Formal Setup

Let \mathcal{K} \subseteq \mathbb{R}^d be a convex, compact decision set. At each round t = 1, \ldots, T:

  1. The learner picks w_t \in \mathcal{K}
  2. The adversary reveals a convex loss function f_t : \mathcal{K} \to \mathbb{R}
  3. The learner incurs loss f_t(w_t)
Definition

Regret

The regret after T rounds is:

\text{Regret}_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{K}} \sum_{t=1}^{T} f_t(w)

The comparator is the single best fixed decision in hindsight.
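As a tiny numerical illustration (my own made-up numbers, not from the source), consider scalar squared losses f_t(w) = (w - z_t)^2 on \mathcal{K} = [0, 1]; for these losses the best fixed decision in hindsight is the mean of the revealed targets:

```python
import numpy as np

# Toy regret computation for scalar losses f_t(w) = (w - z_t)^2 on [0, 1].
# Both the targets and the learner's plays are illustrative values.

z = np.array([0.2, 0.9, 0.4, 0.7])       # targets revealed by the adversary
plays = np.array([0.0, 0.2, 0.55, 0.5])  # the learner's decisions w_t

learner_loss = np.sum((plays - z) ** 2)
best_fixed = z.mean()                     # argmin_w of sum_t (w - z_t)^2
comparator_loss = np.sum((best_fixed - z) ** 2)
regret = learner_loss - comparator_loss   # cumulative gap to hindsight optimum
```

Here best_fixed is 0.55 and the regret works out to 0.3025: the learner paid that much more, cumulatively, than the single best fixed point would have.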

Definition

Online Gradient Descent (OGD)

Starting from w_1 \in \mathcal{K}, OGD updates:

w_{t+1} = \Pi_{\mathcal{K}}\left(w_t - \eta_t \nabla f_t(w_t)\right)

where \Pi_{\mathcal{K}} is the Euclidean projection onto \mathcal{K} and \eta_t is the step size at round t.
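A minimal sketch of the update (my own illustration, not from the source), taking \mathcal{K} to be the unit Euclidean ball and using quadratic losses as an illustrative choice:

```python
import numpy as np

# Projected online gradient descent on the unit ball with illustrative
# losses f_t(w) = 0.5 * ||w - z_t||^2, where z_t is revealed each round.

def project_ball(w, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(0)
d, T = 5, 200
D, G = 2.0, 4.0                      # unit-ball diameter; a valid gradient bound
w = np.zeros(d)
total_loss = 0.0

for t in range(1, T + 1):
    z = rng.uniform(-1, 1, size=d)   # adversary's choice this round
    total_loss += 0.5 * float(np.dot(w - z, w - z))
    grad = w - z                     # gradient of f_t at w_t
    eta = D / (G * np.sqrt(t))       # step size from the regret theorem
    w = project_ball(w - eta * grad) # projected gradient step

avg_loss = total_loss / T
```

The projection keeps every iterate inside \mathcal{K}, which is what the diameter assumption in the regret theorem relies on.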

Definition

Follow the Regularized Leader (FTRL)

FTRL selects:

w_{t+1} = \arg\min_{w \in \mathcal{K}} \left(\sum_{s=1}^{t} f_s(w) + R(w)\right)

where R: \mathcal{K} \to \mathbb{R} is a strongly convex regularizer. With R(w) = \frac{1}{2\eta}\|w\|^2, FTRL is closely related to OGD.
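A minimal sketch of that connection (my own illustration, under assumed linearized losses f_s(w) \approx \langle g_s, w \rangle and \mathcal{K} the unit ball): with the quadratic regularizer, the FTRL argmin has the closed form w_{t+1} = \Pi_{\mathcal{K}}(-\eta \sum_s g_s), i.e. "lazy" projected gradient descent.

```python
import numpy as np

# FTRL with R(w) = ||w||^2 / (2 * eta) and linearized losses <g_s, w>.
# The argmin of <sum g_s, w> + ||w||^2/(2*eta) over the ball is the
# projection of the unconstrained minimizer -eta * sum(g_s).

def project_ball(w, radius=1.0):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(5)
d, T, eta = 3, 30, 0.1
cum_grad = np.zeros(d)

for _ in range(T):
    w = project_ball(-eta * cum_grad)   # FTRL argmin in closed form
    g = rng.normal(size=d)              # gradient of the linearized loss
    cum_grad += g

w = project_ball(-eta * cum_grad)
```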

Main Theorems

Theorem

Online Gradient Descent Regret Bound

Statement

Let \mathcal{K} have diameter D = \sup_{u,v \in \mathcal{K}} \|u - v\| and suppose \|\nabla f_t(w)\| \leq G for all t, w. With step size \eta_t = D/(G\sqrt{t}), OGD satisfies:

\text{Regret}_T \leq \frac{3}{2} DG\sqrt{T}

With the optimal constant step size \eta = D/(G\sqrt{T}) (requiring knowledge of T):

\text{Regret}_T \leq DG\sqrt{T}

Intuition

Each step of OGD moves toward the comparator by following the gradient, but each step also overshoots by an amount growing with the step size. The \sqrt{T} comes from balancing the cumulative step-size error (\sim \eta T) against the distance-to-comparator term (\sim 1/\eta). Optimizing over \eta gives \sqrt{T}.

Proof Sketch

For any comparator u \in \mathcal{K}, combine convexity with the expansion of the projected update:

f_t(w_t) - f_t(u) \leq \langle \nabla f_t(w_t), w_t - u \rangle \leq \frac{\|w_t - u\|^2 - \|w_{t+1} - u\|^2}{2\eta_t} + \frac{\eta_t}{2}\|\nabla f_t(w_t)\|^2

Sum over t = 1, \ldots, T. The first term telescopes (with time-varying \eta_t, the leftover differences are controlled by the diameter bound D). The second sums to \frac{G^2}{2}\sum_t \eta_t. With \eta_t = D/(G\sqrt{t}), the total is O(DG\sqrt{T}).
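The final sum can be made explicit with the standard integral comparison, using \eta_t = D/(G\sqrt{t}):

```latex
\frac{G^2}{2}\sum_{t=1}^{T}\eta_t
  = \frac{DG}{2}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}
  \le \frac{DG}{2}\int_{0}^{T}\frac{ds}{\sqrt{s}}
  = DG\sqrt{T}.
```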

Why It Matters

The O(\sqrt{T}) regret bound is optimal for general convex losses (matching lower bounds). This means the average regret \text{Regret}_T/T = O(1/\sqrt{T}) \to 0, so OGD is a no-regret algorithm. The proof technique (telescoping via Bregman divergences) is the template for analyzing nearly all first-order online methods.

Failure Mode

The bound assumes bounded gradients and a bounded domain. For unconstrained optimization or heavy-tailed gradients, the guarantee breaks. For strongly convex losses, the O(\sqrt{T}) rate is loose; O(\log T) is achievable and optimal.

Theorem

FTRL Regret Bound

Statement

Let R be \sigma-strongly convex with respect to a norm \|\cdot\| on \mathcal{K}. Define the dual norm bound G_* = \max_t \|\nabla f_t\|_*. FTRL with regularizer \eta^{-1} R satisfies:

\text{Regret}_T \leq \frac{\max_{w \in \mathcal{K}} R(w) - \min_{w \in \mathcal{K}} R(w)}{\eta} + \frac{\eta}{2\sigma}\sum_{t=1}^{T} \|\nabla f_t(w_t)\|_*^2

With the optimal \eta, this gives \text{Regret}_T = O\big(G_* \sqrt{T/\sigma} \cdot \sqrt{\max R - \min R}\big).

Intuition

FTRL balances two forces: the regularizer keeps decisions stable (preventing oscillation), while the cumulative loss pulls decisions toward the best action. The regret bound trades off the initial penalty from regularization against the stability gained.

Proof Sketch

Define \Phi_t(w) = \sum_{s=1}^{t} f_s(w) + \eta^{-1} R(w). Since w_{t+1} = \arg\min_{w \in \mathcal{K}} \Phi_t(w), use the strong convexity of \Phi_t to bound f_t(w_t) - f_t(w_{t+1}). Telescope and use the fact that \Phi_0(w_1) - \Phi_0(u) \leq \eta^{-1}(R(w_1) - R(u)).

Why It Matters

FTRL generalizes OGD to arbitrary geometries via the choice of regularizer. With R(w) = \frac{1}{2}\|w\|_2^2, you recover OGD. With R(w) = \sum_i w_i \log w_i (negative entropy) on the simplex, you recover the multiplicative weights/hedge algorithm. One framework, many algorithms.
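A minimal sketch of the entropy-regularized instance (my own illustration, assuming linearized losses \langle g_t, w \rangle and simulated loss vectors): on the simplex, the FTRL argmin has the closed-form multiplicative weights update.

```python
import numpy as np

# FTRL with the negative-entropy regularizer on the probability simplex:
# the argmin is w_{t+1,i} proportional to exp(-eta * sum_s g_{s,i}).

def hedge_weights(cum_grad, eta):
    """Closed-form FTRL solution: softmax of the negated cumulative losses."""
    z = -eta * cum_grad
    z -= z.max()                  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(1)
d, T, eta = 4, 50, 0.5
cum_grad = np.zeros(d)

for _ in range(T):
    w = hedge_weights(cum_grad, eta)   # play the current FTRL solution
    g = rng.uniform(0, 1, size=d)      # simulated adversarial loss vector
    cum_grad += g

w = hedge_weights(cum_grad, eta)       # final weights on the simplex
```

By construction the weights stay on the simplex, and the coordinate with the smallest cumulative loss receives the largest weight, which is exactly the hedge behavior.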

Failure Mode

Choosing the right regularizer requires knowing the geometry of the problem. A poorly matched regularizer (e.g., Euclidean regularization on the simplex) gives suboptimal regret. Computing the FTRL update requires solving a convex optimization problem at each step, which may be expensive for complex R and \mathcal{K}.

Adaptive Methods: AdaGrad as OCO

Theorem

AdaGrad Diagonal Regret Bound

Statement

AdaGrad (diagonal version) uses per-coordinate step sizes \eta_{t,j} = D/\big(2\sqrt{\sum_{s=1}^{t} g_{s,j}^2}\big), where g_{t,j} is the j-th component of \nabla f_t(w_t). Its regret satisfies:

\text{Regret}_T \leq 2D \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} g_{t,j}^2}

If all gradient components are bounded by G, this is at most 2DG\sqrt{dT}. But if gradients are sparse (most components are zero), the bound can be much smaller.

Intuition

AdaGrad gives larger step sizes to infrequent features and smaller step sizes to frequent ones. Coordinates with small cumulative gradient magnitude get more aggressive updates. This is automatic adaptation without needing to know the gradient distribution in advance.

Proof Sketch

Apply the FTRL framework with a time-varying regularizer R_t(w) = \sum_j \frac{\sqrt{\sum_{s=1}^{t} g_{s,j}^2}}{2D} w_j^2. The regret bound follows from bounding the regularizer term and the gradient term separately. The key insight is that the regularizer grows with the observed gradients, automatically balancing stability and responsiveness.

Why It Matters

AdaGrad shows that adaptive step sizes emerge naturally from the OCO framework. The online-to-batch conversion of AdaGrad gives convergence guarantees for stochastic optimization that adapt to the gradient noise structure. This is the theoretical foundation for Adam and other adaptive gradient descent variants used in deep learning.

Failure Mode

AdaGrad's step sizes monotonically decrease, which can be too aggressive for non-stationary problems. In deep learning, this motivates Adam (which uses exponential moving averages instead of cumulative sums). The diagonal approximation ignores correlations between coordinates.
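The per-coordinate step-size rule can be sketched as follows (my own illustration, assuming unconstrained updates and a hypothetical sparse gradient stream with one active coordinate per round):

```python
import numpy as np

# Diagonal AdaGrad sketch: each coordinate keeps a running sum of squared
# gradients, and its step size shrinks as that sum grows. Rarely-touched
# coordinates keep large step sizes, as described in the intuition above.

rng = np.random.default_rng(2)
d, T, D = 10, 100, 1.0
eps = 1e-12                       # guards division for untouched coordinates
w = np.zeros(d)
sq_sum = np.zeros(d)              # running per-coordinate sum of g^2

for _ in range(T):
    g = np.zeros(d)
    j = rng.integers(d)           # sparse gradient: one active coordinate
    g[j] = rng.normal()
    sq_sum += g ** 2
    eta = D / (2.0 * np.sqrt(sq_sum + eps))   # per-coordinate step sizes
    w -= eta * g                  # coordinates with short histories move more
```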

Online-to-Batch Conversion

If an online algorithm achieves \text{Regret}_T = O(\sqrt{T}) against adversarial losses, it also works for stochastic optimization. Given i.i.d. losses f_1, \ldots, f_T drawn from a distribution, the average iterate \bar{w} = \frac{1}{T}\sum_t w_t satisfies:

\mathbb{E}[f(\bar{w})] - \min_{w \in \mathcal{K}} f(w) \leq \frac{\text{Regret}_T}{T}

where f(w) = \mathbb{E}[f_t(w)]. So O(\sqrt{T}) regret gives O(1/\sqrt{T}) optimization error. This is the standard SGD rate for convex optimization, derived with no extra work.
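A minimal numerical sketch of the conversion (my own illustration, assuming scalar quadratic losses whose population minimizer is w^* = 0.5):

```python
import numpy as np

# Online-to-batch: run OGD on i.i.d. losses f_t(w) = 0.5 * (w - z_t)^2
# with z_t ~ Uniform(0, 1), so f(w) = E[f_t(w)] is minimized at w* = 0.5.
# The averaged iterate should approach that population minimizer.

rng = np.random.default_rng(3)
T = 2000
D, G = 1.0, 2.0                      # diameter of K = [0, 1]; gradient bound
w = 0.0
iterates = []

for t in range(1, T + 1):
    z = rng.uniform(0.0, 1.0)        # fresh i.i.d. sample each round
    grad = w - z                     # gradient of f_t at w_t
    w -= (D / (G * np.sqrt(t))) * grad
    w = min(max(w, 0.0), 1.0)        # projection onto K = [0, 1]
    iterates.append(w)

w_bar = float(np.mean(iterates))     # the averaged iterate of the conversion
```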

Connection to Bandit Optimization

In the bandit setting, you observe only the loss value f_t(w_t), not the gradient. You must estimate gradients from function values alone. The standard approach: pick a random perturbation u_t uniformly on the unit sphere and estimate the gradient via:

\hat{g}_t = \frac{d}{\delta} f_t(w_t + \delta u_t) u_t

This is an unbiased estimator of the gradient of a smoothed version of f_t. Plugging it into OGD yields sublinear regret in the bandit setting, though with a polynomially worse dependence on both the dimension d and the horizon T than in the full-information case.
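A sketch of the estimator (my own illustration) on the quadratic f(w) = \|w\|^2, for which the sphere-smoothed gradient coincides with the true gradient 2w; averaging many independent estimates at a fixed point should therefore land near 2w:

```python
import numpy as np

# One-point bandit gradient estimator: g_hat = (d/delta) * f(w + delta*u) * u
# with u uniform on the unit sphere. Averaged over many draws at a fixed w,
# it estimates the gradient of the smoothed loss (here, exactly 2w).

rng = np.random.default_rng(4)
d, delta, n = 3, 0.2, 100_000
w = np.array([0.5, -0.2, 0.1])

U = rng.normal(size=(n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)    # uniform on the unit sphere
vals = np.sum((w + delta * U) ** 2, axis=1)      # f(w + delta * u) per draw
est = (d / delta) * np.mean(vals[:, None] * U, axis=0)
```

Note the variance of a single estimate scales like (d/\delta)^2, which is why the bandit rates are worse than the full-information ones.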

Common Confusions

Watch Out

Regret is not the same as convergence

An algorithm with O(\sqrt{T}) regret does not necessarily converge to a single point. The iterates w_t may keep moving. What converges is the average performance: the per-round regret \text{Regret}_T/T \to 0.

Watch Out

The comparator is fixed, not adaptive

Standard regret compares against the single best fixed decision in hindsight. It does not compare against a sequence of decisions that changes over time. Competing with changing comparators requires adaptive regret or dynamic regret, which have different (larger) bounds.

Watch Out

OGD is not the same as SGD

OGD operates on adversarial loss sequences with no distributional assumptions. SGD assumes i.i.d. stochastic gradients. OGD's regret bound implies SGD's convergence bound via online-to-batch conversion, but not the other way around. OGD is the stronger result.

Key Takeaways

  • OCO: learner picks w_t, adversary reveals convex f_t, regret measures the cumulative gap
  • OGD achieves O(DG\sqrt{T}) regret, optimal for general convex losses
  • FTRL unifies OGD, multiplicative weights, and mirror descent via the regularizer choice
  • AdaGrad adapts step sizes using cumulative gradient information, excels with sparse gradients
  • Online-to-batch conversion: O(\sqrt{T}) regret implies an O(1/\sqrt{T}) stochastic optimization rate
  • Strongly convex losses give O(\log T) regret, much faster than the general O(\sqrt{T})

Exercises

ExerciseCore

Problem

OGD with constant step size \eta = D/(G\sqrt{T}) achieves regret DG\sqrt{T}. What is the per-round average regret after T = 10{,}000 rounds with D = 1 and G = 1?

ExerciseAdvanced

Problem

Show that for strongly convex losses (each f_t is \lambda-strongly convex), OGD with step size \eta_t = 1/(\lambda t) achieves \text{Regret}_T = O((G^2/\lambda)\log T). Sketch the proof by modifying the standard OGD analysis.

References

Canonical:

  • Shalev-Shwartz, Online Learning and Online Convex Optimization, Foundations and Trends in ML (2012)
  • Hazan, Introduction to Online Convex Optimization (2016), Chapters 1-5

Current:

  • Orabona, A Modern Introduction to Online Learning (2023), Chapters 2-7
  • Duchi, Hazan, Singer, "Adaptive Subgradient Methods for Online Learning" (JMLR, 2011)

Next Topics

Natural extensions from OCO:

  • Stochastic optimization theory: from online regret to convergence rates for i.i.d. losses
  • Bandit algorithms: what happens when you only observe the loss value, not the gradient

Last reviewed: April 2026
