
RL Theory

Self-Play and Multi-Agent RL

Self-play as a training paradigm for competitive games, fictitious play convergence, AlphaGo/AlphaZero, and the challenges of multi-agent reinforcement learning: non-stationarity, partial observability, and centralized training.


Why This Matters

Single-agent RL assumes a stationary environment described by a Markov decision process. In competitive and cooperative settings, the environment includes other learning agents, making it non-stationary. Self-play sidesteps part of this problem: an agent plays against copies of itself, generating an ever-improving curriculum. AlphaGo and AlphaZero demonstrated that self-play, combined with MCTS and neural network function approximation, can achieve superhuman performance in Go, chess, and shogi without human game data.

Multi-agent RL (MARL) generalizes beyond two-player zero-sum games to cooperative, competitive, and mixed settings. The theoretical foundations are less clean, but the practical applications (autonomous driving, communication protocols, economic modeling) are broad.

Self-Play

Definition

Self-Play

A training procedure where an agent improves by playing against itself (or against past versions of itself). At each iteration, the current policy $\pi_t$ generates opponent behavior, and the agent updates its policy by optimizing against this opponent.

The key idea: if the agent can beat its past self, it is improving. The quality of the training signal improves automatically as the agent gets stronger, unlike fixed opponent training where the curriculum stagnates.

AlphaGo (2016). Combined supervised learning from human games, self-play reinforcement learning via REINFORCE, and Monte Carlo tree search (MCTS). The supervised network provided a warm start; self-play refined it.

AlphaZero (2017). Removed human data entirely. Started from random play, used self-play with MCTS to generate training data, and trained a neural network to predict both the policy (move probabilities) and value (win probability) from the board state. Achieved superhuman play in Go, chess, and shogi.

The AlphaZero training loop:

  1. Play games using MCTS guided by the current network $f_\theta$
  2. Store $(s_t, \pi_t, z_t)$ tuples: state, MCTS policy, game outcome
  3. Train $f_\theta$ to minimize $(z - v)^2 - \pi^\top \log \mathbf{p}$, where $\mathbf{p}$ and $v$ are the network's policy and value outputs (squared error on the value, cross-entropy on the policy)
  4. Repeat
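The per-example loss in step 3 can be sketched in NumPy. This is a hedged illustration, not code from any AlphaZero implementation: the names `p_logits`, `pi`, `v`, and `z` are placeholders, and the published loss additionally includes an L2 weight-regularization term.

```python
import numpy as np

def alphazero_loss(p_logits, v, pi, z):
    """Sketch of an AlphaZero-style per-example loss: cross-entropy
    between the MCTS visit distribution pi and the network policy,
    plus squared error between predicted value v and game outcome z."""
    logp = p_logits - np.log(np.sum(np.exp(p_logits)))  # log-softmax
    policy_loss = -np.dot(pi, logp)                     # cross-entropy term
    value_loss = (v - z) ** 2                           # value regression term
    return policy_loss + value_loss

# Uniform MCTS policy over 3 moves, network predicts v = 0, game was won (z = 1):
loss = alphazero_loss(np.zeros(3), 0.0, np.full(3, 1 / 3), 1.0)
```

With uniform targets and zero logits the policy term reduces to $\log 3$, so the total is $\log 3 + 1$.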

Fictitious Play

Definition

Fictitious Play

Each player maintains a belief about the opponent's mixed strategy based on the empirical frequency of the opponent's past actions. At each round, each player best-responds to this empirical distribution:

$$\sigma_i^{t+1} = \mathrm{BR}_i\!\left(\frac{1}{t}\sum_{s=1}^{t} a_{-i}^s\right)$$

where $a_{-i}^s$ is the opponent's action at round $s$.
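A minimal simulation of this update for Rock-Paper-Scissors (a sketch; the payoff matrix encoding, initial actions, and argmax tie-breaking are illustrative assumptions):

```python
import numpy as np

# Rock-Paper-Scissors payoffs for the row player (0=Rock, 1=Paper, 2=Scissors)
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def fictitious_play(A, T=50_000):
    n, m = A.shape
    counts1, counts2 = np.zeros(n), np.zeros(m)
    a1, a2 = 0, 0  # arbitrary initial actions (both start with Rock)
    for _ in range(T):
        counts1[a1] += 1
        counts2[a2] += 1
        # Each player best-responds to the opponent's empirical frequencies
        a1 = int(np.argmax(A @ (counts2 / counts2.sum())))   # row maximizes
        a2 = int(np.argmin((counts1 / counts1.sum()) @ A))   # column minimizes
    return counts1 / T, counts2 / T

freq1, freq2 = fictitious_play(A)
```

Play cycles Rock, Paper, Scissors in runs of growing length, but the empirical frequencies drift toward the unique Nash equilibrium $(1/3, 1/3, 1/3)$.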

Theorem

Fictitious Play Convergence in Zero-Sum Games

Statement

In any finite two-player zero-sum game, the empirical distribution of play under fictitious play converges to the set of Nash equilibria. Specifically, the average strategies $\bar{\sigma}_i^t = \frac{1}{t}\sum_{s=1}^{t} a_i^s$ satisfy:

$$\max_i \left| v_i(\bar{\sigma}_1^t, \bar{\sigma}_2^t) - v_i^* \right| \to 0 \text{ as } t \to \infty$$

where $v_i^*$ is the Nash equilibrium payoff.

Intuition

In a zero-sum game, the minimax theorem guarantees a Nash equilibrium in mixed strategies. Fictitious play finds it because best-responding to the opponent's empirical average gradually eliminates strategies that are not part of any equilibrium. The averaging smooths out the cycling behavior of the pure-strategy best responses.

Proof Sketch

Robinson (1951) proved this by induction on the total number of pure strategies. The key quantity is the gap between each player's best-response payoff against the opponent's empirical average and the game value; Robinson showed this gap vanishes as $t \to \infty$. The limit must be a Nash equilibrium, since any non-equilibrium point would leave a persistent profitable deviation, contradicting the vanishing gap.

Why It Matters

Fictitious play is the simplest algorithm that provably finds Nash equilibria in zero-sum games. Self-play in RL can be viewed as an approximate, continuous version of fictitious play in which the best response is computed by RL rather than by exact optimization.

Failure Mode

Fictitious play does not converge in general-sum games: Shapley (1964) exhibited a $3 \times 3$ game in which it cycles forever. It can also be slow: its worst-case convergence rate in zero-sum games degrades rapidly with the number of pure strategies, and can be far worse than the $O(1/\sqrt{T})$ rate guaranteed by no-regret algorithms.

Connection to No-Regret Learning

Theorem

No-Regret Self-Play Converges to Nash

Statement

If both players in a two-player zero-sum game use no-regret learning algorithms with regret bounded by $R(T)$, then the average strategy profile $(\bar{\sigma}_1^T, \bar{\sigma}_2^T)$ is a $2R(T)/T$-approximate Nash equilibrium. For algorithms with $R(T) = O(\sqrt{T})$, the approximation error vanishes as $O(1/\sqrt{T})$.

Intuition

No-regret means: in hindsight, neither player could have done much better by switching to any fixed strategy. In a zero-sum game, if neither player has regret, their average play must be close to minimax. Any deviation from Nash would create exploitable regret for one of the players.

Proof Sketch

Let $v^*$ be the game value. Player 1's no-regret guarantee gives

$$\frac{1}{T}\sum_t u_1(a_1^t, a_2^t) \geq \max_{a_1} \frac{1}{T}\sum_t u_1(a_1, a_2^t) - \frac{R_1(T)}{T} \geq v^* - \frac{R_1(T)}{T},$$

and similarly for player 2. Since $u_1 + u_2 = 0$, combining the two bounds yields that the average play is within $(R_1(T) + R_2(T))/T$ of Nash.

Why It Matters

This provides the theoretical basis for why self-play works. Algorithms like multiplicative weights, regret matching, and CFR (counterfactual regret minimization) satisfy no-regret and are used in practice for poker and other imperfect information games.
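As an illustration of the average-iterate guarantee, here is regret matching in self-play on a small zero-sum game. This is a sketch: the payoff matrix is an illustrative choice, and the deterministic expected-update variant (accumulating regret against the opponent's current mixed strategy) is used instead of sampled actions.

```python
import numpy as np

# Illustrative 2x2 zero-sum game; its unique Nash equilibrium is
# p* = q* = (0.4, 0.6), with game value 0.2 for the row player.
A = np.array([[ 2.0, -1.0],
              [-1.0,  1.0]])

def regret_matching_selfplay(A, T=100_000):
    n, m = A.shape
    reg1, reg2 = np.zeros(n), np.zeros(m)
    sum1, sum2 = np.zeros(n), np.zeros(m)
    for _ in range(T):
        # Play proportionally to positive regret (uniform if none)
        p = np.maximum(reg1, 0.0)
        p = p / p.sum() if p.sum() > 0 else np.full(n, 1.0 / n)
        q = np.maximum(reg2, 0.0)
        q = q / q.sum() if q.sum() > 0 else np.full(m, 1.0 / m)
        # Accumulate regret against the opponent's current mixed strategy
        u1 = A @ q             # row player's payoff per pure action
        u2 = -(A.T @ p)        # column player's payoff per pure action
        reg1 += u1 - p @ u1
        reg2 += u2 - q @ u2
        sum1 += p
        sum2 += q
    return sum1 / T, sum2 / T

avg1, avg2 = regret_matching_selfplay(A)
```

The average strategies approach the Nash equilibrium even though individual iterates may oscillate, exactly as the theorem predicts.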

Failure Mode

The guarantee only holds for the average strategy, not the last iterate. The last iterate may cycle. Also, this only applies to two-player zero-sum games. In general-sum or multi-player settings, no-regret learning does not guarantee convergence to Nash equilibria.

Multi-Agent RL Challenges

Non-stationarity. When multiple agents learn simultaneously, each agent's optimal policy depends on the others' policies, which are also changing. From any single agent's perspective, the environment is non-stationary. Standard convergence guarantees for single-agent RL do not apply.

Partial observability. In many multi-agent settings, agents cannot observe each other's states or actions. This requires reasoning about beliefs over hidden information, turning the problem into a decentralized POMDP (Dec-POMDP), which is NEXP-complete in general.

Centralized Training with Decentralized Execution (CTDE). During training, a centralized critic can access global state and all agents' actions. During execution, each agent acts based only on its local observations. QMIX and MAPPO follow this paradigm.
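A toy sketch of the QMIX monotonic-mixing idea (all shapes, names, and values here are illustrative assumptions, not the actual QMIX architecture): a hypernetwork conditioned on the global state produces non-negative mixing weights, so the joint value is monotone in each agent's local Q-value and per-agent greedy actions are jointly greedy.

```python
import numpy as np

rng = np.random.default_rng(0)

n_agents, state_dim = 3, 8
global_state = rng.normal(size=state_dim)

# Stand-in for a learned hypernetwork mapping global state to mixing weights;
# abs() enforces non-negativity, which is what guarantees monotonicity.
hyper_W = rng.normal(size=(n_agents, state_dim))
hyper_b = rng.normal(size=state_dim)

def mix(agent_qs, state):
    W = np.abs(hyper_W @ state)   # non-negative weights => dQ_tot/dQ_i >= 0
    b = float(hyper_b @ state)
    return float(W @ agent_qs + b)

agent_qs = np.array([0.2, -0.1, 0.5])  # each agent's local Q for its chosen action
q_tot = mix(agent_qs, global_state)
```

Because the mixing weights are non-negative, raising any single agent's Q-value can never lower `q_tot`, so each agent can act greedily on its local Q at execution time while the centralized mixer is only needed during training.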

Independent learning. Each agent treats others as part of the environment and runs standard single-agent RL. Surprisingly effective in practice despite no convergence guarantees. Independent PPO is a strong MARL baseline.

Common Confusions

Watch Out

Self-play does not always converge

Self-play converges in two-player zero-sum games (under appropriate conditions). In general-sum games, cooperative games, or games with more than two players, self-play can cycle, diverge, or converge to non-Nash fixed points. The zero-sum structure is doing substantial theoretical work.

Watch Out

Nash equilibrium is not always the right solution concept

In cooperative settings, you want Pareto-optimal outcomes, not Nash equilibria. In sequential games, you want subgame-perfect equilibria. In games with communication, correlated equilibria may be more natural. The solution concept must match the game structure.

Exercises

ExerciseCore

Problem

Consider Rock-Paper-Scissors. Suppose player 1 uses fictitious play starting with Rock. Player 2 best-responds. Trace the first 6 rounds of fictitious play and compute the empirical frequency of each action for both players. Is it converging toward $(1/3, 1/3, 1/3)$?

ExerciseAdvanced

Problem

Prove that if both players in a zero-sum game have regret at most $R(T)$ after $T$ rounds, then the average strategy profile is a $2R(T)/T$-approximate Nash equilibrium.


References

Canonical:

  • Shoham & Leyton-Brown, Multiagent Systems (2009), Chapters 4, 7
  • Brown, "Iterative Solution of Games by Fictitious Play" (1951)

Current:

  • Silver et al., "Mastering the Game of Go without Human Knowledge" (Nature 2017)
  • Yu et al., "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games" (NeurIPS 2022)

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.