RL Theory
No-Regret Learning
Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.
Why This Matters
Online learning is the theory behind any setting where decisions are made sequentially against an environment that may be adversarial. Spam filters update as spammers adapt. Recommendation systems adjust as user preferences shift. Auction bidders revise strategies as competitors change tactics.
No-regret algorithms guarantee that, no matter what the environment does, your average performance converges to that of the best fixed action in hindsight. This is a remarkably strong guarantee: it requires no statistical assumptions about how losses are generated. The adversary can be fully adaptive.
The deepest consequence is the connection to game theory: when all players in a zero-sum game use no-regret algorithms, their average strategies converge to a Nash equilibrium. This is the theoretical foundation for self-play methods in AlphaGo, poker AI, and multi-agent reinforcement learning.
Mental Model
Imagine you must choose one of $n$ experts to follow each day. After you choose, nature reveals the losses of all experts. You want your cumulative loss to be close to that of the best expert in hindsight. You do not know which expert will be best, and nature may be adversarial.
The multiplicative weights algorithm maintains a weight for each expert, increasing the weight of experts with low loss and decreasing the weight of experts with high loss. Over time, the algorithm tracks the best expert regardless of the loss sequence.
Formal Setup and Notation
Online Learning Protocol
The online learning protocol over $T$ rounds with $n$ actions:
- At each round $t = 1, \dots, T$:
  - Learner selects a distribution $p_t \in \Delta_n$ over the $n$ actions
  - Adversary (simultaneously) selects a loss vector $\ell_t \in [0, 1]^n$
  - Learner incurs expected loss $\langle p_t, \ell_t \rangle$
The adversary may be oblivious (loss sequence fixed in advance) or adaptive (losses depend on learner's past actions).
Regret
The regret of a learning algorithm after $T$ rounds is:
$$R_T = \sum_{t=1}^{T} \langle p_t, \ell_t \rangle - \min_{i \in [n]} \sum_{t=1}^{T} \ell_t(i)$$
This is the difference between the learner's cumulative loss and the cumulative loss of the single best action in hindsight. An algorithm is no-regret if $R_T / T \to 0$ as $T \to \infty$.
Multiplicative Weights Update
Multiplicative Weights Update (MWU)
The multiplicative weights update algorithm with learning rate $\eta > 0$:
- Initialize weights $w_1(i) = 1$ for all $i \in [n]$
- At round $t$, play $p_t(i) = w_t(i) / \sum_j w_t(j)$
- After observing $\ell_t$, update: $w_{t+1}(i) = w_t(i) \cdot e^{-\eta \ell_t(i)}$
Actions with lower loss get multiplicatively higher weight. The exponential update ensures that good actions accumulate weight rapidly.
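The update rule above is short enough to sketch directly. This is a minimal illustrative implementation, assuming losses in $[0, 1]$ and arbitrary (here random) loss vectors:

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Run MWU on a (T, n) array of per-round loss vectors; return the regret."""
    T, n = losses.shape
    w = np.ones(n)
    learner_loss = 0.0
    for t in range(T):
        p = w / w.sum()                      # play the normalized weights
        learner_loss += p @ losses[t]        # expected loss this round
        w = w * np.exp(-eta * losses[t])     # multiplicative update
    best_fixed = losses.sum(axis=0).min()    # best single action in hindsight
    return learner_loss - best_fixed         # regret

# Sanity check on random losses (illustrative parameters)
rng = np.random.default_rng(0)
T, n = 1000, 10
eta = np.sqrt(np.log(n) / T)
regret = multiplicative_weights(rng.random((T, n)), eta)
print(regret, "<=", 2 * np.sqrt(T * np.log(n)))  # regret stays below the bound
```

Random losses are of course not adversarial; the point of the theory is that the same bound holds for any loss sequence.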
Main Theorems
Multiplicative Weights Regret Bound
Statement
The multiplicative weights update algorithm with learning rate $\eta = \sqrt{\ln n / T}$ achieves regret:
$$R_T \le 2\sqrt{T \ln n}$$
Equivalently, the average regret $R_T / T \le 2\sqrt{\ln n / T} \to 0$.
Intuition
The bound has two notable features. First, the dependence on the number of actions is $\ln n$, not $n$. You can compete with exponentially many actions at only logarithmic cost. Second, the regret grows as $\sqrt{T}$, meaning the average regret $R_T / T$ vanishes as $T \to \infty$. No matter how the adversary behaves, the algorithm converges to the performance of the best fixed action.
Proof Sketch
Define the potential $\Phi_t = \sum_{i} w_t(i)$. Initially $\Phi_1 = n$. At each step, $\Phi_{t+1} = \sum_i w_t(i)\, e^{-\eta \ell_t(i)} \le \Phi_t \exp(-\eta \langle p_t, \ell_t \rangle + \eta^2)$ (using $e^{-x} \le 1 - x + x^2$ for $x \ge 0$). Also $\Phi_{T+1} \ge w_{T+1}(i^*) = e^{-\eta L_T^*}$, where $L_T^* = \min_i \sum_t \ell_t(i)$, because the best action's weight provides a lower bound on the potential. Telescoping and rearranging:
$$R_T \le \frac{\ln n}{\eta} + \eta T$$
Setting $\eta = \sqrt{\ln n / T}$ gives $R_T \le 2\sqrt{T \ln n}$.
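The final step, balancing the two terms of the bound by the choice of $\eta$, can be checked numerically (the values of $n$ and $T$ below are illustrative):

```python
import math

n, T = 16, 10000                                  # illustrative values
# The bound from the proof sketch, as a function of the learning rate
bound = lambda eta: math.log(n) / eta + eta * T

eta_star = math.sqrt(math.log(n) / T)             # claimed optimal choice
print(bound(eta_star))                            # equals 2*sqrt(T * ln n)
print(2 * math.sqrt(T * math.log(n)))
```

At $\eta^* = \sqrt{\ln n / T}$ both terms equal $\sqrt{T \ln n}$, which is why the optimized bound is exactly twice that.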
Why It Matters
The regret bound is tight: no algorithm can achieve regret $o(\sqrt{T \ln n})$ against an adaptive adversary. This makes multiplicative weights optimal up to constants. The same algorithm, under different names (Hedge, exponentiated gradient, entropic mirror descent), appears throughout machine learning, optimization, and game theory.
Failure Mode
The bound requires knowing $T$ in advance to set $\eta$. The doubling trick (restart with a doubled horizon) removes this requirement at a constant-factor cost. Also, the bound competes with the best fixed action. Competing with the best sequence of actions (shifting regret) requires stronger algorithms and incurs higher regret.
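The doubling trick is simple enough to sketch. The wrapper below (a minimal sketch, assuming losses in $[0, 1]$) guesses a horizon, tunes $\eta$ for that guess, and restarts with a doubled guess whenever it runs out:

```python
import numpy as np

def mwu_doubling(losses):
    """MWU without knowing T in advance: guess a horizon, restart with the
    guess doubled each time it is exhausted. Costs only a constant factor."""
    T, n = losses.shape
    total, t, horizon = 0.0, 0, 1
    while t < T:
        eta = np.sqrt(np.log(n) / horizon)    # tune eta for the guessed horizon
        w = np.ones(n)
        for s in range(t, min(t + horizon, T)):
            p = w / w.sum()
            total += p @ losses[s]
            w = w * np.exp(-eta * losses[s])
        t += horizon
        horizon *= 2                          # wrong guess? double and restart
    return total - losses.sum(axis=0).min()   # regret vs. best fixed action
```

Summing the per-phase bounds over the $O(\log T)$ phases gives regret within a constant factor of the known-horizon bound.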
Follow the Regularized Leader
Follow the Regularized Leader (FTRL)
FTRL selects the action distribution that minimizes cumulative loss plus a regularization term:
$$p_{t+1} = \arg\min_{p \in \Delta_n} \left[ \sum_{s=1}^{t} \langle p, \ell_s \rangle + \frac{1}{\eta}\, \psi(p) \right]$$
where $\psi$ is a strongly convex regularizer. With $\psi(p) = \sum_i p(i) \ln p(i)$ (negative entropy), FTRL recovers exactly the multiplicative weights algorithm.
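The equivalence can be seen concretely: with the entropic regularizer, the FTRL minimizer has the closed form $p_t(i) \propto e^{-\eta L_{t-1}(i)}$, where $L_{t-1}$ is the cumulative loss vector, which matches the product of MWU's multiplicative updates. A small numerical check (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, eta = 50, 5, 0.1
losses = rng.random((T, n))

# MWU: iterative multiplicative updates
w = np.ones(n)
p_mwu = []
for t in range(T):
    p_mwu.append(w / w.sum())
    w = w * np.exp(-eta * losses[t])

# FTRL with negative entropy: the minimizer of
#   sum_s <p, l_s> + (1/eta) sum_i p(i) log p(i)  over the simplex
# is p(i) proportional to exp(-eta * cumulative loss(i))
p_ftrl, cum = [], np.zeros(n)
for t in range(T):
    q = np.exp(-eta * cum)
    p_ftrl.append(q / q.sum())
    cum = cum + losses[t]

gap = max(np.abs(a - b).max() for a, b in zip(p_mwu, p_ftrl))
print(gap)  # essentially zero: the two algorithms play identical distributions
```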
FTRL Regret Bound
Statement
FTRL with a regularizer $\psi$ that is 1-strongly convex with respect to a norm $\|\cdot\|$ achieves:
$$R_T \le \frac{D}{\eta} + \eta \sum_{t=1}^{T} \|\ell_t\|_*^2$$
where $D = \max_{p} \psi(p) - \min_{p} \psi(p)$ and $\|\cdot\|_*$ is the dual norm. With optimal $\eta$, this gives $R_T = O(\sqrt{DT})$ when $\|\ell_t\|_* \le 1$.
Intuition
FTRL balances exploitation (minimize cumulative loss so far) against exploration (the regularizer prevents the distribution from collapsing onto a single action too early). The learning rate $\eta$ controls this tradeoff. Large $\eta$ trusts past losses more (faster convergence, more variance); small $\eta$ regularizes more (slower convergence, more stability).
Why It Matters
FTRL unifies many online learning algorithms. Choosing the regularizer and norm gives different algorithms: negative entropy yields multiplicative weights, squared norm yields online gradient descent. This framework extends naturally to online convex optimization where the action space is continuous.
Connection to Nash Equilibria
No-Regret Dynamics Converge to Nash Equilibria
Statement
If both players in a two-player zero-sum game use no-regret learning algorithms, then their average strategies $\bar{x}_T = \frac{1}{T}\sum_{t} x_t$ and $\bar{y}_T = \frac{1}{T}\sum_{t} y_t$ converge to a Nash equilibrium. Specifically, the pair $(\bar{x}_T, \bar{y}_T)$ is an $\epsilon$-Nash equilibrium with $\epsilon = (R_T^{(1)} + R_T^{(2)})/T$, where $R_T^{(1)}$ and $R_T^{(2)}$ are the regrets of the two players.
Intuition
In a zero-sum game, player 1's loss is player 2's gain. If both players have low regret, neither can improve much by deviating to any fixed strategy. This is precisely the definition of a Nash equilibrium (up to the regret slack). The minimax theorem guarantees that Nash equilibria exist in zero-sum games, and no-regret learning provides a constructive, decentralized method to find them.
Why It Matters
This theorem is the theoretical foundation for self-play in AI. When you train a poker AI or a Go-playing agent by having it play against itself, you are implicitly running no-regret dynamics. The convergence guarantee says that the resulting average strategy is approximately optimal against any opponent. Counterfactual regret minimization (CFR), the algorithm behind superhuman poker AI, is a specific instantiation of this principle.
Failure Mode
Convergence is for average strategies, not last-iterate strategies. The actual play may cycle or oscillate even as the average converges. Also, the result is specific to zero-sum games. In general-sum games, no-regret dynamics may not converge to Nash equilibria (they converge to the weaker notion of coarse correlated equilibria).
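A small self-play simulation illustrates both the theorem and the last-iterate caveat. Two MWU players play rock-paper-scissors from asymmetric starting weights (all constants below are illustrative); the current strategies cycle, but the time averages approach the uniform Nash equilibrium:

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum, game value 0)
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])

T = 20000
eta = np.sqrt(np.log(3) / T)
wx = np.array([1.0, 2.0, 4.0])          # asymmetric start so play actually moves
wy = np.ones(3)
sum_x, sum_y = np.zeros(3), np.zeros(3)
for t in range(T):
    x, y = wx / wx.sum(), wy / wy.sum()
    sum_x += x
    sum_y += y
    wx = wx * np.exp(-eta * (-A @ y))   # row maximizes payoff: loss is -A y
    wy = wy * np.exp(-eta * (A.T @ x))  # column minimizes payoff: loss is A^T x

avg_x, avg_y = sum_x / T, sum_y / T
# Exploitability: how much either player could gain by best-responding
eps = (A @ avg_y).max() - (A.T @ avg_x).min()
print(avg_x, avg_y, eps)  # averages near (1/3, 1/3, 1/3), eps small
```

The average strategies land near uniform (the unique Nash equilibrium here), while the per-round strategies keep spiraling, matching the failure mode described above.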
Canonical Examples
Expert advice with adversarial weather
Suppose you have $n$ weather forecasters. Each day, you must decide which forecaster to follow. After you commit, the actual weather is revealed and each forecaster's loss is computed. Running multiplicative weights for $T$ days gives regret at most $2\sqrt{T \ln n}$. Your average daily loss is within $2\sqrt{\ln n / T}$ of the best forecaster, no matter how adversarial the weather was.
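Plugging in concrete (hypothetical) numbers, say 10 forecasters over one year:

```python
import math

n, T = 10, 365                         # hypothetical: 10 forecasters, one year
regret_bound = 2 * math.sqrt(T * math.log(n))
print(round(regret_bound, 1))          # total regret bound: about 58.0
print(round(regret_bound / T, 3))      # per-day gap: about 0.159
```

So even against adversarial weather, the average daily loss is within about 0.16 of the best forecaster's after one year.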
Common Confusions
Regret is not loss
Low regret does not mean low loss. If all actions have high loss, the learner's loss is also high, but its regret is low because the comparator (best fixed action) also has high loss. Regret measures relative performance, not absolute performance.
No-regret is about the average, not the last iterate
The no-regret guarantee says $R_T / T \to 0$, meaning average regret vanishes. On any individual round, the algorithm might perform badly. Similarly, in the game-theoretic application, convergence to Nash is for the time-averaged strategy, not the current strategy.
No-regret does not need stochastic assumptions
Unlike PAC learning or ERM, no-regret bounds hold against any loss sequence, including adversarially chosen ones. The adversary can observe the learner's past actions and choose losses to maximize regret. The bounds still hold.
Summary
- Regret = learner's cumulative loss minus best fixed action's cumulative loss
- Multiplicative weights achieves $O(\sqrt{T \ln n})$ regret, optimal up to constants
- FTRL with negative entropy regularization recovers multiplicative weights
- The $\ln n$ dependence means you can handle exponentially many actions
- When both players in a zero-sum game use no-regret algorithms, average strategies converge to Nash equilibrium at rate $O(\sqrt{\ln n / T})$
- This is the theory behind self-play, CFR, and multi-agent RL
Exercises
Problem
You run multiplicative weights with $n$ actions for $T$ rounds. What is the guaranteed upper bound on regret? What is the average per-round regret?
Problem
Prove that no deterministic algorithm can achieve sublinear regret against an adaptive adversary with two actions. Then explain why randomization is essential for no-regret learning.
Problem
In a general-sum (not zero-sum) game, no-regret dynamics converge to coarse correlated equilibria, not Nash equilibria. Explain the difference between these two solution concepts and give an intuitive example of why Nash convergence fails outside zero-sum games.
References
Canonical:
- Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" (1997)
- Cesa-Bianchi & Lugosi, Prediction, Learning, and Games (2006), Chapters 2-4
Current:
- Hazan, Introduction to Online Convex Optimization (2016), Chapters 1-5
- Roughgarden, Twenty Lectures on Algorithmic Game Theory (2016), Lectures 17-18
Next Topics
The natural next steps from no-regret learning:
- Bandit algorithms: no-regret learning when you only observe the loss of the action you chose
- Online convex optimization: extending regret bounds to continuous action spaces
Last reviewed: April 2026
Builds on This
- Online Convex Optimization (Layer 3)
- Online Learning and Bandits (Layer 3)