
RL Theory

No-Regret Learning

Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.


Why This Matters

Online learning is the theory behind any setting where decisions are made sequentially against an environment that may be adversarial. Spam filters update as spammers adapt. Recommendation systems adjust as user preferences shift. Auction bidders revise strategies as competitors change tactics.

No-regret algorithms guarantee that, no matter what the environment does, your average performance converges to that of the best fixed action in hindsight. This is a remarkably strong guarantee: it requires no statistical assumptions about how losses are generated. The adversary can be fully adaptive.

The deepest consequence is the connection to game theory: when all players in a zero-sum game use no-regret algorithms, their average strategies converge to a Nash equilibrium. This is the theoretical foundation for self-play methods in AlphaGo, poker AI, and multi-agent reinforcement learning.

Mental Model

Imagine you must choose one of $K$ experts to follow each day. After you choose, nature reveals the losses of all experts. You want your cumulative loss to be close to that of the best expert in hindsight. You do not know which expert will be best, and nature may be adversarial.

The multiplicative weights algorithm maintains a weight for each expert, increasing the weight of experts with low loss and decreasing the weight of experts with high loss. Over time, the algorithm tracks the best expert regardless of the loss sequence.

Formal Setup and Notation

Definition

Online Learning Protocol

The online learning protocol over $T$ rounds with $K$ actions:

  1. At each round $t = 1, \ldots, T$:
    • Learner selects a distribution $p_t \in \Delta_K$ over the $K$ actions
    • Adversary (simultaneously) selects a loss vector $\ell_t \in [0,1]^K$
    • Learner incurs expected loss $\langle p_t, \ell_t \rangle = \sum_{i=1}^{K} p_t(i)\,\ell_t(i)$

The adversary may be oblivious (loss sequence fixed in advance) or adaptive (losses depend on learner's past actions).

Definition

Regret

The regret of a learning algorithm after $T$ rounds is:

$$R_T = \sum_{t=1}^{T} \langle p_t, \ell_t \rangle - \min_{i \in [K]} \sum_{t=1}^{T} \ell_t(i)$$

This is the difference between the learner's cumulative loss and the cumulative loss of the single best action in hindsight. An algorithm is no-regret if $R_T / T \to 0$ as $T \to \infty$.

Multiplicative Weights Update

Definition

Multiplicative Weights Update (MWU)

The multiplicative weights update algorithm with learning rate $\eta > 0$:

  1. Initialize weights $w_1(i) = 1$ for all $i \in [K]$
  2. At round $t$, play $p_t(i) = w_t(i) / \sum_j w_t(j)$
  3. After observing $\ell_t$, update: $w_{t+1}(i) = w_t(i) \cdot \exp(-\eta\,\ell_t(i))$

Actions with lower loss get multiplicatively higher weight. The exponential update ensures that good actions accumulate weight rapidly.
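The three steps above translate directly into code. Here is a minimal sketch (the alternating loss sequence at the end is just an illustrative adversary, not from the source):

```python
import math

def multiplicative_weights(losses, eta):
    """Run MWU over a sequence of loss vectors; return cumulative expected loss."""
    K = len(losses[0])
    w = [1.0] * K                          # step 1: uniform initial weights
    total = 0.0
    for loss in losses:
        Z = sum(w)
        p = [wi / Z for wi in w]           # step 2: play the normalized weights
        total += sum(pi * li for pi, li in zip(p, loss))
        w = [wi * math.exp(-eta * li)      # step 3: exponential down-weighting
             for wi, li in zip(w, loss)]
    return total

# illustrative adversary: two actions whose losses alternate each round
T, K = 1000, 2
losses = [[0.0, 1.0] if t % 2 == 0 else [1.0, 0.0] for t in range(T)]
eta = math.sqrt(2 * math.log(K) / T)
learner = multiplicative_weights(losses, eta)
best_fixed = min(sum(l[i] for l in losses) for i in range(K))
regret = learner - best_fixed              # stays below sqrt(2 T ln K) ≈ 37.2
```

On this sequence each fixed action accumulates loss 500, and MWU's regret stays far below the worst-case bound.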

Main Theorems

Theorem

Multiplicative Weights Regret Bound

Statement

The multiplicative weights update algorithm with learning rate $\eta = \sqrt{2 \ln K / T}$ achieves regret:

$$R_T \leq \sqrt{2T \ln K}$$

Equivalently, the average regret satisfies $R_T / T \leq \sqrt{2 \ln K / T} = O(1/\sqrt{T})$.

Intuition

The bound has two notable features. First, the dependence on the number of actions is $\sqrt{\ln K}$, not $\sqrt{K}$. You can compete with exponentially many actions at only logarithmic cost. Second, the regret grows as $O(\sqrt{T})$, meaning the average regret $R_T/T$ vanishes as $O(1/\sqrt{T})$. No matter how the adversary behaves, the algorithm converges to the performance of the best fixed action.

Proof Sketch

Define the potential $\Phi_t = \ln\!\left(\sum_i w_t(i)\right)$. Initially $\Phi_1 = \ln K$. At each step, $\Phi_{t+1} - \Phi_t \leq -\eta \langle p_t, \ell_t \rangle + \eta^2/2$ (using $e^{-x} \leq 1 - x + x^2/2$ for $x \in [0,1]$). Also, $\Phi_{T+1} \geq -\eta \min_i \sum_t \ell_t(i)$, because the best action's weight alone lower-bounds the sum. Telescoping and rearranging:

$$\sum_t \langle p_t, \ell_t \rangle - \min_i \sum_t \ell_t(i) \leq \frac{\ln K}{\eta} + \frac{\eta T}{2}$$

Setting $\eta = \sqrt{2\ln K / T}$ gives $R_T \leq \sqrt{2T\ln K}$.

Why It Matters

The $O(\sqrt{T \ln K})$ regret bound is tight: no algorithm can achieve $o(\sqrt{T \ln K})$ regret against an adaptive adversary. This makes multiplicative weights optimal up to constants. The same algorithm, under different names (Hedge, exponentiated gradient, entropic mirror descent), appears throughout machine learning, optimization, and game theory.

Failure Mode

The bound requires knowing $T$ in advance to set $\eta$. The doubling trick (restart with a doubled horizon) removes this requirement at a constant-factor cost. Also, the bound competes with the best fixed action. Competing with the best sequence of actions (shifting regret) requires stronger algorithms and incurs higher regret.
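The doubling trick can be sketched as follows: epoch $m$ lasts $2^m$ rounds and runs a fresh copy of MWU with $\eta$ tuned to that epoch length. The helper below is illustrative, not from the source:

```python
import math

def mwu_doubling(losses, K):
    """Anytime MWU: epoch m lasts 2^m rounds, each restarted with its own eta."""
    total, t, m = 0.0, 0, 0
    while t < len(losses):
        horizon = 2 ** m
        eta = math.sqrt(2 * math.log(K) / horizon)
        w = [1.0] * K                      # restart weights at each epoch boundary
        for _ in range(min(horizon, len(losses) - t)):
            Z = sum(w)
            p = [wi / Z for wi in w]
            total += sum(pi * li for pi, li in zip(p, losses[t]))
            w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses[t])]
            t += 1
        m += 1
    return total

T, K = 1000, 2
losses = [[0.0, 1.0] if t % 2 == 0 else [1.0, 0.0] for t in range(T)]
learner = mwu_doubling(losses, K)
best_fixed = min(sum(l[i] for l in losses) for i in range(K))
regret = learner - best_fixed   # sublinear without knowing T in advance
```

Summing the per-epoch bounds gives total regret within a constant factor of $\sqrt{2T\ln K}$, even though $T$ was never provided.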

Follow the Regularized Leader

Definition

Follow the Regularized Leader (FTRL)

FTRL selects the action distribution that minimizes cumulative loss plus a regularization term:

$$p_t = \arg\min_{p \in \Delta_K} \left\{ \sum_{s=1}^{t-1} \langle p, \ell_s \rangle + \frac{1}{\eta} R(p) \right\}$$

where $R(p)$ is a strongly convex regularizer. With $R(p) = \sum_i p(i) \ln p(i)$ (negative entropy), FTRL recovers exactly the multiplicative weights algorithm.
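The equivalence is easy to check numerically: the entropy-regularized minimization has the closed form $p_t(i) \propto \exp(-\eta \sum_{s<t} \ell_s(i))$, which is exactly the MWU distribution. An illustrative sketch (random losses, arbitrary $\eta$):

```python
import math
import random

def mwu_dist(losses, eta):
    """Distribution after running MWU updates on the given loss vectors."""
    K = len(losses[0])
    w = [1.0] * K
    for loss in losses:
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
    Z = sum(w)
    return [wi / Z for wi in w]

def ftrl_entropy_dist(losses, eta):
    """Closed-form FTRL with the negative-entropy regularizer:
    argmin_p <p, L> + (1/eta) sum_i p(i) ln p(i)  =>  p(i) ∝ exp(-eta * L(i))."""
    K = len(losses[0])
    L = [sum(loss[i] for loss in losses) for i in range(K)]  # cumulative losses
    e = [math.exp(-eta * Li) for Li in L]
    Z = sum(e)
    return [ei / Z for ei in e]

random.seed(0)
losses = [[random.random() for _ in range(4)] for _ in range(50)]
p_mwu = mwu_dist(losses, eta=0.1)
p_ftrl = ftrl_entropy_dist(losses, eta=0.1)
```

The two distributions agree up to floating-point error, since multiplying exponentials is the same as exponentiating the cumulative loss.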

Theorem

FTRL Regret Bound

Statement

FTRL with a regularizer $R$ that is 1-strongly convex with respect to a norm $\|\cdot\|$ achieves:

$$R_T \leq \frac{R_{\max}}{\eta} + \eta \sum_{t=1}^{T} \|\ell_t\|_*^2$$

where $R_{\max} = \max_{p \in \Delta_K} R(p) - \min_{p \in \Delta_K} R(p)$ and $\|\cdot\|_*$ is the dual norm. With the optimal $\eta$, this gives $R_T = O(\sqrt{T})$.

Intuition

FTRL balances exploitation (minimize cumulative loss so far) against exploration (the regularizer prevents the distribution from collapsing onto a single action too early). The learning rate $\eta$ controls this tradeoff. Large $\eta$ trusts past losses more (faster convergence, more variance); small $\eta$ regularizes more (slower convergence, more stability).

Why It Matters

FTRL unifies many online learning algorithms. Choosing the regularizer and norm gives different algorithms: negative entropy yields multiplicative weights, the squared $\ell_2$ norm yields online gradient descent. This framework extends naturally to online convex optimization, where the action space is continuous.

Connection to Nash Equilibria

Theorem

No-Regret Dynamics Converge to Nash Equilibria

Statement

If both players in a two-player zero-sum game use no-regret learning algorithms, then their average strategies $\bar{p} = \frac{1}{T}\sum_{t=1}^T p_t$ and $\bar{q} = \frac{1}{T}\sum_{t=1}^T q_t$ converge to a Nash equilibrium. Specifically, the pair $(\bar{p}, \bar{q})$ is an $\epsilon$-Nash equilibrium with $\epsilon = (R_T^{(1)} + R_T^{(2)})/T$, where $R_T^{(1)}$ and $R_T^{(2)}$ are the regrets of the two players.

Intuition

In a zero-sum game, player 1's loss is player 2's gain. If both players have low regret, neither can improve much by deviating to any fixed strategy. This is precisely the definition of a Nash equilibrium (up to the regret slack). The minimax theorem guarantees that Nash equilibria exist in zero-sum games, and no-regret learning provides a constructive, decentralized method to find them.
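The theorem can be checked empirically. In the sketch below (the $2 \times 2$ matrix and horizon are arbitrary illustrative choices), both players run MWU against each other, and the duality gap of the averaged strategies, which the theorem bounds by $(R_T^{(1)} + R_T^{(2)})/T$, shrinks toward zero:

```python
import math

def selfplay_mwu(A, T, eta):
    """Both players run MWU on zero-sum game A (row player's loss, entries in [0,1])."""
    K1, K2 = len(A), len(A[0])
    w1, w2 = [1.0] * K1, [1.0] * K2
    p_bar, q_bar = [0.0] * K1, [0.0] * K2
    for _ in range(T):
        p = [w / sum(w1) for w in w1]
        q = [w / sum(w2) for w in w2]
        p_bar = [a + pi / T for a, pi in zip(p_bar, p)]   # running averages
        q_bar = [a + qi / T for a, qi in zip(q_bar, q)]
        # row player's expected loss per action; column player's is the complement
        l1 = [sum(A[i][j] * q[j] for j in range(K2)) for i in range(K1)]
        l2 = [1.0 - sum(A[i][j] * p[i] for i in range(K1)) for j in range(K2)]
        w1 = [w1[i] * math.exp(-eta * l1[i]) for i in range(K1)]
        w2 = [w2[j] * math.exp(-eta * l2[j]) for j in range(K2)]
    return p_bar, q_bar

A = [[0.3, 0.9],   # arbitrary zero-sum game; losses for the row player
     [0.8, 0.2]]
T = 10_000
p_bar, q_bar = selfplay_mwu(A, T, eta=math.sqrt(2 * math.log(2) / T))
# duality gap: column player's best-response value minus row player's
row_best = min(sum(A[i][j] * q_bar[j] for j in range(2)) for i in range(2))
col_best = max(sum(A[i][j] * p_bar[i] for i in range(2)) for j in range(2))
gap = col_best - row_best
```

With the regret bound $\sqrt{2T\ln 2}$ for each player, the gap after $T = 10{,}000$ rounds is at most roughly $0.024$, and neither averaged strategy can be exploited by more than that amount.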

Why It Matters

This theorem is the theoretical foundation for self-play in AI. When you train a poker AI or a Go-playing agent by having it play against itself, you are implicitly running no-regret dynamics. The convergence guarantee says that the resulting average strategy is approximately optimal against any opponent. Counterfactual regret minimization (CFR), the algorithm behind superhuman poker AI, is a specific instantiation of this principle.

Failure Mode

Convergence is for average strategies, not last-iterate strategies. The actual play $p_t$ may cycle or oscillate even as the average converges. Also, the result is specific to zero-sum games. In general-sum games, no-regret dynamics may not converge to Nash equilibria (they converge to the weaker notion of coarse correlated equilibria).

Canonical Examples

Example

Expert advice with adversarial weather

Suppose you have $K = 3$ weather forecasters. Each day, you must decide which forecaster to follow. After you commit, the actual weather is revealed and each forecaster's loss is computed. Running multiplicative weights for $T = 10{,}000$ days gives regret at most $\sqrt{2 \times 10{,}000 \times \ln 3} \approx 148$. Your average daily loss is within $0.015$ of the best forecaster's, no matter how adversarial the weather was.
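The arithmetic in this example, checked in code:

```python
import math

bound = math.sqrt(2 * 10_000 * math.log(3))   # regret bound for K = 3, T = 10,000
avg = bound / 10_000                          # average per-day regret
```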

Common Confusions

Watch Out

Regret is not loss

Low regret does not mean low loss. If all actions have high loss, the learner's loss is also high, but its regret is low because the comparator (best fixed action) also has high loss. Regret measures relative performance, not absolute performance.

Watch Out

No-regret is about the average, not the last iterate

The no-regret guarantee says $R_T/T \to 0$, meaning average regret vanishes. On any individual round, the algorithm might perform badly. Similarly, in the game-theoretic application, convergence to Nash is for the time-averaged strategy, not the current strategy.

Watch Out

No-regret does not need stochastic assumptions

Unlike PAC learning or ERM, no-regret bounds hold against any loss sequence, including adversarially chosen ones. The adversary can observe the learner's past actions and choose losses to maximize regret. The bounds still hold.

Summary

  • Regret = learner's cumulative loss minus the best fixed action's cumulative loss
  • Multiplicative weights achieves $O(\sqrt{T \ln K})$ regret, optimal up to constants
  • FTRL with negative entropy regularization recovers multiplicative weights
  • The $\sqrt{\ln K}$ dependence means you can handle exponentially many actions
  • When both players in a zero-sum game use no-regret algorithms, average strategies converge to Nash equilibrium at rate $O(1/\sqrt{T})$
  • This is the theory behind self-play, CFR, and multi-agent RL

Exercises

ExerciseCore

Problem

You run multiplicative weights with $K = 100$ actions for $T = 10{,}000$ rounds. What is the guaranteed upper bound on regret? What is the average per-round regret?

ExerciseAdvanced

Problem

Prove that no deterministic algorithm can achieve sublinear regret against an adaptive adversary with $K = 2$ actions. Then explain why randomization is essential for no-regret learning.

ExerciseResearch

Problem

In a general-sum (not zero-sum) game, no-regret dynamics converge to coarse correlated equilibria, not Nash equilibria. Explain the difference between these two solution concepts and give an intuitive example of why Nash convergence fails outside zero-sum games.

References

Canonical:

  • Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" (1997)
  • Cesa-Bianchi & Lugosi, Prediction, Learning, and Games (2006), Chapters 2-4

Current:

  • Hazan, Introduction to Online Convex Optimization (2016), Chapters 1-5
  • Roughgarden, Twenty Lectures on Algorithmic Game Theory (2016), Lectures 17-18

Next Topics

The natural next steps from no-regret learning:

  • Bandit algorithms: no-regret learning when you only observe the loss of the action you chose
  • Online convex optimization: extending regret bounds to continuous action spaces

Last reviewed: April 2026
