
Numerical Optimization

Online Convex Optimization

A general framework for sequential decision-making with convex losses: online gradient descent, follow the regularized leader, adaptive methods, and the O(sqrt(T)) regret guarantee that unifies many algorithms.


Why This Matters

Online convex optimization (OCO) is a unifying framework. Online gradient descent, AdaGrad, mirror descent, and follow-the-regularized-leader are all instances of one template. By proving regret bounds in the OCO framework, you get guarantees for all of these algorithms simultaneously. The online-to-batch conversion then transfers these guarantees to the standard stochastic optimization setting.

Many algorithms you use daily in ML (Adam, AdaGrad, SGD with momentum) have their tightest known analyses in the OCO framework.

Mental Model

An adversary repeatedly chooses a convex loss function. You must choose a decision before seeing the loss. After T rounds, your regret measures how much worse you did compared to the best fixed decision in hindsight. An algorithm is "good" if regret grows sublinearly in T, meaning your per-round average loss converges to that of the best fixed decision.

Formal Setup

Let \mathcal{K} \subseteq \mathbb{R}^d be a convex, compact decision set. At each round t = 1, \ldots, T:

  1. The learner picks w_t \in \mathcal{K}
  2. The adversary reveals a convex loss function f_t : \mathcal{K} \to \mathbb{R}
  3. The learner incurs loss f_t(w_t)
Definition

Regret

The regret after T rounds is:

\text{Regret}_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{K}} \sum_{t=1}^{T} f_t(w)

The comparator is the single best fixed decision in hindsight.
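As a tiny numerical illustration (my own made-up numbers, not from the source), consider scalar squared losses f_t(w) = (w - z_t)^2 on \mathcal{K} = [0, 1]; for these losses the best fixed decision in hindsight is the mean of the revealed targets:

```python
import numpy as np

# Toy regret computation for scalar losses f_t(w) = (w - z_t)^2 on [0, 1].
# Both the targets and the learner's plays are illustrative values.

z = np.array([0.2, 0.9, 0.4, 0.7])       # targets revealed by the adversary
plays = np.array([0.0, 0.2, 0.55, 0.5])  # the learner's decisions w_t

learner_loss = np.sum((plays - z) ** 2)
best_fixed = z.mean()                     # argmin_w of sum_t (w - z_t)^2
comparator_loss = np.sum((best_fixed - z) ** 2)
regret = learner_loss - comparator_loss   # cumulative gap to hindsight optimum
```

Here best_fixed is 0.55 and the regret works out to 0.3025: the learner paid that much more, cumulatively, than the single best fixed point would have.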

Definition

Online Gradient Descent (OGD)

Starting from w_1 \in \mathcal{K}, OGD updates:

w_{t+1} = \Pi_{\mathcal{K}}\left(w_t - \eta_t \nabla f_t(w_t)\right)

where \Pi_{\mathcal{K}} is the Euclidean projection onto \mathcal{K} and \eta_t is the step size at round t.
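A minimal sketch of the update (my own illustration, not from the source), taking \mathcal{K} to be the unit Euclidean ball and using quadratic losses as an illustrative choice:

```python
import numpy as np

# Projected online gradient descent on the unit ball with illustrative
# losses f_t(w) = 0.5 * ||w - z_t||^2, where z_t is revealed each round.

def project_ball(w, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(0)
d, T = 5, 200
D, G = 2.0, 4.0                      # unit-ball diameter; a valid gradient bound
w = np.zeros(d)
total_loss = 0.0

for t in range(1, T + 1):
    z = rng.uniform(-1, 1, size=d)   # adversary's choice this round
    total_loss += 0.5 * float(np.dot(w - z, w - z))
    grad = w - z                     # gradient of f_t at w_t
    eta = D / (G * np.sqrt(t))       # step size from the regret theorem
    w = project_ball(w - eta * grad) # projected gradient step

avg_loss = total_loss / T
```

The projection keeps every iterate inside \mathcal{K}, which is what the diameter assumption in the regret theorem relies on.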

Definition

Follow the Regularized Leader (FTRL)

FTRL selects:

w_{t+1} = \arg\min_{w \in \mathcal{K}} \left(\sum_{s=1}^{t} f_s(w) + R(w)\right)

where R: \mathcal{K} \to \mathbb{R} is a strongly convex regularizer. With R(w) = \frac{1}{2\eta}\|w\|^2, FTRL is closely related to OGD.
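A minimal sketch of that connection (my own illustration, under assumed linearized losses f_s(w) \approx \langle g_s, w \rangle and \mathcal{K} the unit ball): with the quadratic regularizer, the FTRL argmin has the closed form w_{t+1} = \Pi_{\mathcal{K}}(-\eta \sum_s g_s), i.e. "lazy" projected gradient descent.

```python
import numpy as np

# FTRL with R(w) = ||w||^2 / (2 * eta) and linearized losses <g_s, w>.
# The argmin of <sum g_s, w> + ||w||^2/(2*eta) over the ball is the
# projection of the unconstrained minimizer -eta * sum(g_s).

def project_ball(w, radius=1.0):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(5)
d, T, eta = 3, 30, 0.1
cum_grad = np.zeros(d)

for _ in range(T):
    w = project_ball(-eta * cum_grad)   # FTRL argmin in closed form
    g = rng.normal(size=d)              # gradient of the linearized loss
    cum_grad += g

w = project_ball(-eta * cum_grad)
```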

Main Theorems

Theorem

Online Gradient Descent Regret Bound

Statement

Let \mathcal{K} have diameter D = \sup_{u,v \in \mathcal{K}} \|u - v\| and suppose \|\nabla f_t(w)\| \leq G for all t, w. With step size \eta_t = D/(G\sqrt{t}), OGD satisfies:

\text{Regret}_T \leq \frac{3}{2} DG\sqrt{T}

With the optimal constant step size \eta = D/(G\sqrt{T}) (requiring knowledge of T):

\text{Regret}_T \leq DG\sqrt{T}

Intuition

Each step of OGD moves toward the comparator by following the gradient, but each step also overshoots by an amount growing with the step size. The \sqrt{T} comes from balancing the cumulative step-size error (\sim \eta T) against the distance-to-comparator term (\sim 1/\eta). Optimizing over \eta gives \sqrt{T}.

Proof Sketch

For any comparator u \in \mathcal{K}, combine convexity with the expansion of the projected update:

f_t(w_t) - f_t(u) \leq \langle \nabla f_t(w_t), w_t - u \rangle \leq \frac{\|w_t - u\|^2 - \|w_{t+1} - u\|^2}{2\eta_t} + \frac{\eta_t}{2}\|\nabla f_t(w_t)\|^2

Sum over t = 1, \ldots, T. The first term telescopes (with time-varying \eta_t, the leftover differences are controlled by the diameter bound D). The second sums to \frac{G^2}{2}\sum_t \eta_t. With \eta_t = D/(G\sqrt{t}), the total is O(DG\sqrt{T}).
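The final sum can be made explicit with the standard integral comparison, using \eta_t = D/(G\sqrt{t}):

```latex
\frac{G^2}{2}\sum_{t=1}^{T}\eta_t
  = \frac{DG}{2}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}
  \le \frac{DG}{2}\int_{0}^{T}\frac{ds}{\sqrt{s}}
  = DG\sqrt{T}.
```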

Why It Matters

The O(\sqrt{T}) regret bound is optimal for general convex losses (matching lower bounds). This means the average regret \text{Regret}_T/T = O(1/\sqrt{T}) \to 0, so OGD is a no-regret algorithm. The proof technique (telescoping via Bregman divergences) is the template for analyzing nearly all first-order online methods.

Failure Mode

The bound assumes bounded gradients and a bounded domain. For unconstrained optimization or heavy-tailed gradients, the guarantee breaks. For strongly convex losses, the O(\sqrt{T}) rate is loose; O(\log T) is achievable and optimal.

Theorem

FTRL Regret Bound

Statement

Let R be \sigma-strongly convex with respect to a norm \|\cdot\| on \mathcal{K}. Define the dual norm bound G_* = \max_t \|\nabla f_t\|_*. FTRL with regularizer \eta^{-1} R satisfies:

\text{Regret}_T \leq \frac{\max_{w \in \mathcal{K}} R(w) - \min_{w \in \mathcal{K}} R(w)}{\eta} + \frac{\eta}{2\sigma}\sum_{t=1}^{T} \|\nabla f_t(w_t)\|_*^2

With the optimal \eta, this gives \text{Regret}_T = O\big(G_* \sqrt{T/\sigma} \cdot \sqrt{\max R - \min R}\big).

Intuition

FTRL balances two forces: the regularizer keeps decisions stable (preventing oscillation), while the cumulative loss pulls decisions toward the best action. The regret bound trades off the initial penalty from regularization against the stability gained.

Proof Sketch

Define \Phi_t(w) = \sum_{s=1}^{t} f_s(w) + \eta^{-1} R(w). Since w_{t+1} = \arg\min_{w \in \mathcal{K}} \Phi_t(w), use the strong convexity of \Phi_t to bound f_t(w_t) - f_t(w_{t+1}). Telescope and use the fact that \Phi_0(w_1) - \Phi_0(u) \leq \eta^{-1}(R(w_1) - R(u)).

Why It Matters

FTRL generalizes OGD to arbitrary geometries via the choice of regularizer. With R(w) = \frac{1}{2}\|w\|_2^2, you recover OGD. With R(w) = \sum_i w_i \log w_i (negative entropy) on the simplex, you recover the multiplicative weights/hedge algorithm. One framework, many algorithms.
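A minimal sketch of the entropy-regularized instance (my own illustration, assuming linearized losses \langle g_t, w \rangle and simulated loss vectors): on the simplex, the FTRL argmin has the closed-form multiplicative weights update.

```python
import numpy as np

# FTRL with the negative-entropy regularizer on the probability simplex:
# the argmin is w_{t+1,i} proportional to exp(-eta * sum_s g_{s,i}).

def hedge_weights(cum_grad, eta):
    """Closed-form FTRL solution: softmax of the negated cumulative losses."""
    z = -eta * cum_grad
    z -= z.max()                  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(1)
d, T, eta = 4, 50, 0.5
cum_grad = np.zeros(d)

for _ in range(T):
    w = hedge_weights(cum_grad, eta)   # play the current FTRL solution
    g = rng.uniform(0, 1, size=d)      # simulated adversarial loss vector
    cum_grad += g

w = hedge_weights(cum_grad, eta)       # final weights on the simplex
```

By construction the weights stay on the simplex, and the coordinate with the smallest cumulative loss receives the largest weight, which is exactly the hedge behavior.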

Failure Mode

Choosing the right regularizer requires knowing the geometry of the problem. A poorly matched regularizer (e.g., Euclidean regularization on the simplex) gives suboptimal regret. Computing the FTRL update requires solving a convex optimization problem at each step, which may be expensive for complex R and \mathcal{K}.

Adaptive Methods: AdaGrad as OCO

Theorem

AdaGrad Diagonal Regret Bound

Statement

AdaGrad (diagonal version) uses per-coordinate step sizes \eta_{t,j} = D/\big(2\sqrt{\sum_{s=1}^{t} g_{s,j}^2}\big), where g_{t,j} is the j-th component of \nabla f_t(w_t). Its regret satisfies:

\text{Regret}_T \leq 2D \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} g_{t,j}^2}

If all gradient components are bounded by G, this is at most 2DG\sqrt{dT}. But if gradients are sparse (most components are zero), the bound can be much smaller.

Intuition

AdaGrad gives larger step sizes to infrequent features and smaller step sizes to frequent ones. Coordinates with small cumulative gradient magnitude get more aggressive updates. This is automatic adaptation without needing to know the gradient distribution in advance.

Proof Sketch

Apply the FTRL framework with a time-varying regularizer R_t(w) = \sum_j \frac{\sqrt{\sum_{s=1}^{t} g_{s,j}^2}}{2D} w_j^2. The regret bound follows from bounding the regularizer term and the gradient term separately. The key insight is that the regularizer grows with the observed gradients, automatically balancing stability and responsiveness.

Why It Matters

AdaGrad shows that adaptive step sizes emerge naturally from the OCO framework. The online-to-batch conversion of AdaGrad gives convergence guarantees for stochastic optimization that adapt to the gradient noise structure. This is the theoretical foundation for Adam and other adaptive gradient descent variants used in deep learning.

Failure Mode

AdaGrad's step sizes monotonically decrease, which can be too aggressive for non-stationary problems. In deep learning, this motivates Adam (which uses exponential moving averages instead of cumulative sums). The diagonal approximation ignores correlations between coordinates.
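The per-coordinate step-size rule can be sketched as follows (my own illustration, assuming unconstrained updates and a hypothetical sparse gradient stream with one active coordinate per round):

```python
import numpy as np

# Diagonal AdaGrad sketch: each coordinate keeps a running sum of squared
# gradients, and its step size shrinks as that sum grows. Rarely-touched
# coordinates keep large step sizes, as described in the intuition above.

rng = np.random.default_rng(2)
d, T, D = 10, 100, 1.0
eps = 1e-12                       # guards division for untouched coordinates
w = np.zeros(d)
sq_sum = np.zeros(d)              # running per-coordinate sum of g^2

for _ in range(T):
    g = np.zeros(d)
    j = rng.integers(d)           # sparse gradient: one active coordinate
    g[j] = rng.normal()
    sq_sum += g ** 2
    eta = D / (2.0 * np.sqrt(sq_sum + eps))   # per-coordinate step sizes
    w -= eta * g                  # coordinates with short histories move more
```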

Online-to-Batch Conversion

If an online algorithm achieves \text{Regret}_T = O(\sqrt{T}) against adversarial losses, it also works for stochastic optimization. Given i.i.d. losses f_1, \ldots, f_T drawn from a distribution, the average iterate \bar{w} = \frac{1}{T}\sum_t w_t satisfies:

\mathbb{E}[f(\bar{w})] - \min_{w \in \mathcal{K}} f(w) \leq \frac{\text{Regret}_T}{T}

where f(w) = \mathbb{E}[f_t(w)]. So O(\sqrt{T}) regret gives O(1/\sqrt{T}) optimization error. This is the standard SGD rate for convex optimization, derived with no extra work.
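A minimal numerical sketch of the conversion (my own illustration, assuming scalar quadratic losses whose population minimizer is w^* = 0.5):

```python
import numpy as np

# Online-to-batch: run OGD on i.i.d. losses f_t(w) = 0.5 * (w - z_t)^2
# with z_t ~ Uniform(0, 1), so f(w) = E[f_t(w)] is minimized at w* = 0.5.
# The averaged iterate should approach that population minimizer.

rng = np.random.default_rng(3)
T = 2000
D, G = 1.0, 2.0                      # diameter of K = [0, 1]; gradient bound
w = 0.0
iterates = []

for t in range(1, T + 1):
    z = rng.uniform(0.0, 1.0)        # fresh i.i.d. sample each round
    grad = w - z                     # gradient of f_t at w_t
    w -= (D / (G * np.sqrt(t))) * grad
    w = min(max(w, 0.0), 1.0)        # projection onto K = [0, 1]
    iterates.append(w)

w_bar = float(np.mean(iterates))     # the averaged iterate of the conversion
```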

Connection to Bandit Optimization

In the bandit setting, you observe only the loss value f_t(w_t), not the gradient. You must estimate gradients from function values alone. The standard approach: pick a random perturbation u_t uniformly on the unit sphere and estimate the gradient via:

\hat{g}_t = \frac{d}{\delta} f_t(w_t + \delta u_t) u_t

This is an unbiased estimator of the gradient of a smoothed version of f_t. Plugging it into OGD yields sublinear regret in the bandit setting, though with a polynomially worse dependence on both the dimension d and the horizon T than in the full-information case.
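A sketch of the estimator (my own illustration) on the quadratic f(w) = \|w\|^2, for which the sphere-smoothed gradient coincides with the true gradient 2w; averaging many independent estimates at a fixed point should therefore land near 2w:

```python
import numpy as np

# One-point bandit gradient estimator: g_hat = (d/delta) * f(w + delta*u) * u
# with u uniform on the unit sphere. Averaged over many draws at a fixed w,
# it estimates the gradient of the smoothed loss (here, exactly 2w).

rng = np.random.default_rng(4)
d, delta, n = 3, 0.2, 100_000
w = np.array([0.5, -0.2, 0.1])

U = rng.normal(size=(n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)    # uniform on the unit sphere
vals = np.sum((w + delta * U) ** 2, axis=1)      # f(w + delta * u) per draw
est = (d / delta) * np.mean(vals[:, None] * U, axis=0)
```

Note the variance of a single estimate scales like (d/\delta)^2, which is why the bandit rates are worse than the full-information ones.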

Common Confusions

Watch Out

Regret is not the same as convergence

An algorithm with O(\sqrt{T}) regret does not necessarily converge to a single point. The iterates w_t may keep moving. What converges is the average performance: the per-round regret \text{Regret}_T/T \to 0.

Watch Out

The comparator is fixed, not adaptive

Standard regret compares against the single best fixed decision in hindsight. It does not compare against a sequence of decisions that changes over time. Competing with changing comparators requires adaptive regret or dynamic regret, which have different (larger) bounds.

Watch Out

OGD is not the same as SGD

OGD operates on adversarial loss sequences with no distributional assumptions. SGD assumes i.i.d. stochastic gradients. OGD's regret bound implies SGD's convergence bound via online-to-batch conversion, but not the other way around. OGD is the stronger result.

Key Takeaways

  • OCO: learner picks w_t, adversary reveals convex f_t, regret measures the cumulative gap
  • OGD achieves O(DG\sqrt{T}) regret, optimal for general convex losses
  • FTRL unifies OGD, multiplicative weights, and mirror descent via the regularizer choice
  • AdaGrad adapts step sizes using cumulative gradient information, excels with sparse gradients
  • Online-to-batch conversion: O(\sqrt{T}) regret implies an O(1/\sqrt{T}) stochastic optimization rate
  • Strongly convex losses give O(\log T) regret, much faster than the general O(\sqrt{T})

Exercises

ExerciseCore

Problem

OGD with constant step size \eta = D/(G\sqrt{T}) achieves regret DG\sqrt{T}. What is the per-round average regret after T = 10{,}000 rounds with D = 1 and G = 1?

ExerciseAdvanced

Problem

Show that for strongly convex losses (each f_t is \lambda-strongly convex), OGD with step size \eta_t = 1/(\lambda t) achieves \text{Regret}_T = O((G^2/\lambda)\log T). Sketch the proof by modifying the standard OGD analysis.

References

Canonical:

  • Shalev-Shwartz, Online Learning and Online Convex Optimization, Foundations and Trends in ML (2012)
  • Hazan, Introduction to Online Convex Optimization (2016), Chapters 1-5

Current:

  • Orabona, A Modern Introduction to Online Learning (2023), Chapters 2-7
  • Duchi, Hazan, Singer, "Adaptive Subgradient Methods for Online Learning" (JMLR, 2011)

Next Topics

Natural extensions from OCO:

  • Stochastic optimization theory: from online regret to convergence rates for i.i.d. losses
  • Bandit algorithms: what happens when you only observe the loss value, not the gradient

Last reviewed: April 2026
