
RL Theory

No-Regret Learning

Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.


Why This Matters

Online learning is the theory behind any setting where decisions are made sequentially against an environment that may be adversarial. Spam filters update as spammers adapt. Recommendation systems adjust as user preferences shift. Auction bidders revise strategies as competitors change tactics.

No-regret algorithms guarantee that, no matter what the environment does, your average performance converges to that of the best fixed action in hindsight. This is a remarkably strong guarantee: it requires no statistical assumptions about how losses are generated. The adversary can be fully adaptive.

The deepest consequence is the connection to game theory: when all players in a zero-sum game use no-regret algorithms, their average strategies converge to a Nash equilibrium. This is the theoretical foundation for self-play methods in AlphaGo, poker AI, and multi-agent reinforcement learning.

Mental Model

Imagine you must choose one of $K$ experts to follow each day. After you choose, nature reveals the losses of all experts. You want your cumulative loss to be close to that of the best expert in hindsight. You do not know which expert will be best, and nature may be adversarial.

The multiplicative weights algorithm maintains a weight for each expert, increasing the weight of experts with low loss and decreasing the weight of experts with high loss. Over time, the algorithm tracks the best expert regardless of the loss sequence.

Formal Setup and Notation

Definition

Online Learning Protocol

The online learning protocol over $T$ rounds with $K$ actions:

  1. At each round $t = 1, \ldots, T$:
    • Learner selects a distribution $p_t \in \Delta_K$ over the $K$ actions
    • Adversary (simultaneously) selects a loss vector $\ell_t \in [0,1]^K$
    • Learner incurs expected loss $\langle p_t, \ell_t \rangle = \sum_{i=1}^{K} p_t(i)\,\ell_t(i)$

The adversary may be oblivious (loss sequence fixed in advance) or adaptive (losses depend on learner's past actions).

Definition

Regret

The regret of a learning algorithm after $T$ rounds is:

$$R_T = \sum_{t=1}^{T} \langle p_t, \ell_t \rangle - \min_{i \in [K]} \sum_{t=1}^{T} \ell_t(i)$$

This is the difference between the learner's cumulative loss and the cumulative loss of the single best action in hindsight. An algorithm is no-regret if $R_T / T \to 0$ as $T \to \infty$.

Multiplicative Weights Update

Definition

Multiplicative Weights Update (MWU)

The multiplicative weights update algorithm with learning rate $\eta > 0$:

  1. Initialize weights $w_1(i) = 1$ for all $i \in [K]$
  2. At round $t$, play $p_t(i) = w_t(i) / \sum_j w_t(j)$
  3. After observing $\ell_t$, update: $w_{t+1}(i) = w_t(i) \cdot \exp(-\eta\,\ell_t(i))$

Actions with lower loss get multiplicatively higher weight. The exponential update ensures that good actions accumulate weight rapidly.
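The three steps above translate directly into code. Here is a minimal sketch (the alternating loss sequence at the end is just an illustrative adversary, not from the source):

```python
import math

def multiplicative_weights(losses, eta):
    """Run MWU over a sequence of loss vectors; return cumulative expected loss."""
    K = len(losses[0])
    w = [1.0] * K                          # step 1: uniform initial weights
    total = 0.0
    for loss in losses:
        Z = sum(w)
        p = [wi / Z for wi in w]           # step 2: play the normalized weights
        total += sum(pi * li for pi, li in zip(p, loss))
        w = [wi * math.exp(-eta * li)      # step 3: exponential down-weighting
             for wi, li in zip(w, loss)]
    return total

# illustrative adversary: two actions whose losses alternate each round
T, K = 1000, 2
losses = [[0.0, 1.0] if t % 2 == 0 else [1.0, 0.0] for t in range(T)]
eta = math.sqrt(2 * math.log(K) / T)
learner = multiplicative_weights(losses, eta)
best_fixed = min(sum(l[i] for l in losses) for i in range(K))
regret = learner - best_fixed              # stays below sqrt(2 T ln K) ≈ 37.2
```

On this sequence each fixed action accumulates loss 500, and MWU's regret stays far below the worst-case bound.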

Main Theorems

Theorem

Multiplicative Weights Regret Bound

Statement

The multiplicative weights update algorithm with learning rate $\eta = \sqrt{2 \ln K / T}$ achieves regret:

$$R_T \leq \sqrt{2T \ln K}$$

Equivalently, the average regret satisfies $R_T / T \leq \sqrt{2 \ln K / T} = O(1/\sqrt{T})$.

Intuition

The bound has two notable features. First, the dependence on the number of actions is $\sqrt{\ln K}$, not $\sqrt{K}$. You can compete with exponentially many actions at only logarithmic cost. Second, the regret grows as $O(\sqrt{T})$, meaning the average regret $R_T/T$ vanishes as $O(1/\sqrt{T})$. No matter how the adversary behaves, the algorithm converges to the performance of the best fixed action.

Proof Sketch

Define the potential $\Phi_t = \ln\!\left(\sum_i w_t(i)\right)$. Initially $\Phi_1 = \ln K$. At each step, $\Phi_{t+1} - \Phi_t \leq -\eta \langle p_t, \ell_t \rangle + \eta^2/2$ (using $e^{-x} \leq 1 - x + x^2/2$ for $x \in [0,1]$). Also, $\Phi_{T+1} \geq -\eta \min_i \sum_t \ell_t(i)$, because the best action's weight alone lower-bounds the sum. Telescoping and rearranging:

$$\sum_t \langle p_t, \ell_t \rangle - \min_i \sum_t \ell_t(i) \leq \frac{\ln K}{\eta} + \frac{\eta T}{2}$$

Setting $\eta = \sqrt{2\ln K / T}$ gives $R_T \leq \sqrt{2T\ln K}$.

Why It Matters

The $O(\sqrt{T \ln K})$ regret bound is tight: no algorithm can achieve $o(\sqrt{T \ln K})$ regret against an adaptive adversary. This makes multiplicative weights optimal up to constants. The same algorithm, under different names (Hedge, exponentiated gradient, entropic mirror descent), appears throughout machine learning, optimization, and game theory.

Failure Mode

The bound requires knowing $T$ in advance to set $\eta$. The doubling trick (restart with a doubled horizon) removes this requirement at a constant-factor cost. Also, the bound competes with the best fixed action. Competing with the best sequence of actions (shifting regret) requires stronger algorithms and incurs higher regret.
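The doubling trick can be sketched as follows: epoch $m$ lasts $2^m$ rounds and runs a fresh copy of MWU with $\eta$ tuned to that epoch length. The helper below is illustrative, not from the source:

```python
import math

def mwu_doubling(losses, K):
    """Anytime MWU: epoch m lasts 2^m rounds, each restarted with its own eta."""
    total, t, m = 0.0, 0, 0
    while t < len(losses):
        horizon = 2 ** m
        eta = math.sqrt(2 * math.log(K) / horizon)
        w = [1.0] * K                      # restart weights at each epoch boundary
        for _ in range(min(horizon, len(losses) - t)):
            Z = sum(w)
            p = [wi / Z for wi in w]
            total += sum(pi * li for pi, li in zip(p, losses[t]))
            w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses[t])]
            t += 1
        m += 1
    return total

T, K = 1000, 2
losses = [[0.0, 1.0] if t % 2 == 0 else [1.0, 0.0] for t in range(T)]
learner = mwu_doubling(losses, K)
best_fixed = min(sum(l[i] for l in losses) for i in range(K))
regret = learner - best_fixed   # sublinear without knowing T in advance
```

Summing the per-epoch bounds gives total regret within a constant factor of $\sqrt{2T\ln K}$, even though $T$ was never provided.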

Follow the Regularized Leader

Definition

Follow the Regularized Leader (FTRL)

FTRL selects the action distribution that minimizes cumulative loss plus a regularization term:

$$p_t = \arg\min_{p \in \Delta_K} \left\{ \sum_{s=1}^{t-1} \langle p, \ell_s \rangle + \frac{1}{\eta} R(p) \right\}$$

where $R(p)$ is a strongly convex regularizer. With $R(p) = \sum_i p(i) \ln p(i)$ (negative entropy), FTRL recovers exactly the multiplicative weights algorithm.
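The equivalence is easy to check numerically: the entropy-regularized minimization has the closed form $p_t(i) \propto \exp(-\eta \sum_{s<t} \ell_s(i))$, which is exactly the MWU distribution. An illustrative sketch (random losses, arbitrary $\eta$):

```python
import math
import random

def mwu_dist(losses, eta):
    """Distribution after running MWU updates on the given loss vectors."""
    K = len(losses[0])
    w = [1.0] * K
    for loss in losses:
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
    Z = sum(w)
    return [wi / Z for wi in w]

def ftrl_entropy_dist(losses, eta):
    """Closed-form FTRL with the negative-entropy regularizer:
    argmin_p <p, L> + (1/eta) sum_i p(i) ln p(i)  =>  p(i) ∝ exp(-eta * L(i))."""
    K = len(losses[0])
    L = [sum(loss[i] for loss in losses) for i in range(K)]  # cumulative losses
    e = [math.exp(-eta * Li) for Li in L]
    Z = sum(e)
    return [ei / Z for ei in e]

random.seed(0)
losses = [[random.random() for _ in range(4)] for _ in range(50)]
p_mwu = mwu_dist(losses, eta=0.1)
p_ftrl = ftrl_entropy_dist(losses, eta=0.1)
```

The two distributions agree up to floating-point error, since multiplying exponentials is the same as exponentiating the cumulative loss.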

Theorem

FTRL Regret Bound

Statement

FTRL with a regularizer $R$ that is 1-strongly convex with respect to a norm $\|\cdot\|$ achieves:

$$R_T \leq \frac{R_{\max}}{\eta} + \eta \sum_{t=1}^{T} \|\ell_t\|_*^2$$

where $R_{\max} = \max_{p \in \Delta_K} R(p) - \min_{p \in \Delta_K} R(p)$ and $\|\cdot\|_*$ is the dual norm. With the optimal $\eta$, this gives $R_T = O(\sqrt{T})$.

Intuition

FTRL balances exploitation (minimize cumulative loss so far) against exploration (the regularizer prevents the distribution from collapsing onto a single action too early). The learning rate $\eta$ controls this tradeoff. Large $\eta$ trusts past losses more (faster convergence, more variance); small $\eta$ regularizes more (slower convergence, more stability).

Why It Matters

FTRL unifies many online learning algorithms. Choosing the regularizer and norm gives different algorithms: negative entropy yields multiplicative weights, the squared $\ell_2$ norm yields online gradient descent. This framework extends naturally to online convex optimization, where the action space is continuous.

Connection to Nash Equilibria

Theorem

No-Regret Dynamics Converge to Nash Equilibria

Statement

If both players in a two-player zero-sum game use no-regret learning algorithms, then their average strategies $\bar{p} = \frac{1}{T}\sum_{t=1}^T p_t$ and $\bar{q} = \frac{1}{T}\sum_{t=1}^T q_t$ converge to a Nash equilibrium. Specifically, the pair $(\bar{p}, \bar{q})$ is an $\epsilon$-Nash equilibrium with $\epsilon = (R_T^{(1)} + R_T^{(2)})/T$, where $R_T^{(1)}$ and $R_T^{(2)}$ are the regrets of the two players.

Intuition

In a zero-sum game, player 1's loss is player 2's gain. If both players have low regret, neither can improve much by deviating to any fixed strategy. This is precisely the definition of a Nash equilibrium (up to the regret slack). The minimax theorem guarantees that Nash equilibria exist in zero-sum games, and no-regret learning provides a constructive, decentralized method to find them.
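The theorem can be checked empirically. In the sketch below (the $2 \times 2$ matrix and horizon are arbitrary illustrative choices), both players run MWU against each other, and the duality gap of the averaged strategies, which the theorem bounds by $(R_T^{(1)} + R_T^{(2)})/T$, shrinks toward zero:

```python
import math

def selfplay_mwu(A, T, eta):
    """Both players run MWU on zero-sum game A (row player's loss, entries in [0,1])."""
    K1, K2 = len(A), len(A[0])
    w1, w2 = [1.0] * K1, [1.0] * K2
    p_bar, q_bar = [0.0] * K1, [0.0] * K2
    for _ in range(T):
        p = [w / sum(w1) for w in w1]
        q = [w / sum(w2) for w in w2]
        p_bar = [a + pi / T for a, pi in zip(p_bar, p)]   # running averages
        q_bar = [a + qi / T for a, qi in zip(q_bar, q)]
        # row player's expected loss per action; column player's is the complement
        l1 = [sum(A[i][j] * q[j] for j in range(K2)) for i in range(K1)]
        l2 = [1.0 - sum(A[i][j] * p[i] for i in range(K1)) for j in range(K2)]
        w1 = [w1[i] * math.exp(-eta * l1[i]) for i in range(K1)]
        w2 = [w2[j] * math.exp(-eta * l2[j]) for j in range(K2)]
    return p_bar, q_bar

A = [[0.3, 0.9],   # arbitrary zero-sum game; losses for the row player
     [0.8, 0.2]]
T = 10_000
p_bar, q_bar = selfplay_mwu(A, T, eta=math.sqrt(2 * math.log(2) / T))
# duality gap: column player's best-response value minus row player's
row_best = min(sum(A[i][j] * q_bar[j] for j in range(2)) for i in range(2))
col_best = max(sum(A[i][j] * p_bar[i] for i in range(2)) for j in range(2))
gap = col_best - row_best
```

With the regret bound $\sqrt{2T\ln 2}$ for each player, the gap after $T = 10{,}000$ rounds is at most roughly $0.024$, and neither averaged strategy can be exploited by more than that amount.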

Why It Matters

This theorem is the theoretical foundation for self-play in AI. When you train a poker AI or a Go-playing agent by having it play against itself, you are implicitly running no-regret dynamics. The convergence guarantee says that the resulting average strategy is approximately optimal against any opponent. Counterfactual regret minimization (CFR), the algorithm behind superhuman poker AI, is a specific instantiation of this principle.

Failure Mode

Convergence is for average strategies, not last-iterate strategies. The actual play $p_t$ may cycle or oscillate even as the average converges. Also, the result is specific to zero-sum games. In general-sum games, no-regret dynamics may not converge to Nash equilibria (they converge to the weaker notion of coarse correlated equilibria).

Canonical Examples

Example

Expert advice with adversarial weather

Suppose you have $K = 3$ weather forecasters. Each day, you must decide which forecaster to follow. After you commit, the actual weather is revealed and each forecaster's loss is computed. Running multiplicative weights for $T = 10{,}000$ days gives regret at most $\sqrt{2 \times 10{,}000 \times \ln 3} \approx 148$. Your average daily loss is within $0.015$ of the best forecaster's, no matter how adversarial the weather was.
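The arithmetic in this example, checked in code:

```python
import math

bound = math.sqrt(2 * 10_000 * math.log(3))   # regret bound for K = 3, T = 10,000
avg = bound / 10_000                          # average per-day regret
```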

Common Confusions

Watch Out

Regret is not loss

Low regret does not mean low loss. If all actions have high loss, the learner's loss is also high, but its regret is low because the comparator (best fixed action) also has high loss. Regret measures relative performance, not absolute performance.

Watch Out

No-regret is about the average, not the last iterate

The no-regret guarantee says $R_T/T \to 0$, meaning average regret vanishes. On any individual round, the algorithm might perform badly. Similarly, in the game-theoretic application, convergence to Nash is for the time-averaged strategy, not the current strategy.

Watch Out

No-regret does not need stochastic assumptions

Unlike PAC learning or ERM, no-regret bounds hold against any loss sequence, including adversarially chosen ones. The adversary can observe the learner's past actions and choose losses to maximize regret. The bounds still hold.

Summary

  • Regret = learner's cumulative loss minus the best fixed action's cumulative loss
  • Multiplicative weights achieves $O(\sqrt{T \ln K})$ regret, optimal up to constants
  • FTRL with negative entropy regularization recovers multiplicative weights
  • The $\sqrt{\ln K}$ dependence means you can handle exponentially many actions
  • When both players in a zero-sum game use no-regret algorithms, average strategies converge to Nash equilibrium at rate $O(1/\sqrt{T})$
  • This is the theory behind self-play, CFR, and multi-agent RL

Exercises

ExerciseCore

Problem

You run multiplicative weights with $K = 100$ actions for $T = 10{,}000$ rounds. What is the guaranteed upper bound on regret? What is the average per-round regret?

ExerciseAdvanced

Problem

Prove that no deterministic algorithm can achieve sublinear regret against an adaptive adversary with $K = 2$ actions. Then explain why randomization is essential for no-regret learning.

ExerciseResearch

Problem

In a general-sum (not zero-sum) game, no-regret dynamics converge to coarse correlated equilibria, not Nash equilibria. Explain the difference between these two solution concepts and give an intuitive example of why Nash convergence fails outside zero-sum games.

References

Canonical:

  • Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" (1997)
  • Cesa-Bianchi & Lugosi, Prediction, Learning, and Games (2006), Chapters 2-4

Current:

  • Hazan, Introduction to Online Convex Optimization (2016), Chapters 1-5
  • Roughgarden, Twenty Lectures on Algorithmic Game Theory (2016), Lectures 17-18

Next Topics

The natural next steps from no-regret learning:

  • Bandit algorithms: no-regret learning when you only observe the loss of the action you chose
  • Online convex optimization: extending regret bounds to continuous action spaces

Last reviewed: April 2026
