
RL Theory

Markov Games and Self-Play

Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.


Why This Matters

Single-agent MDPs cannot model competition or cooperation between multiple decision makers. Markov games (also called stochastic games) extend MDPs to multiple agents, each with its own policy and reward function. Zero-sum Markov games model pure competition: one agent's gain is the other's loss. The minimax theorem guarantees that these games have well-defined values, and self-play provides a practical method for computing equilibrium policies. AlphaGo, AlphaZero, and OpenAI Five all rely on self-play in Markov games.

Mental Model

An MDP has one agent choosing actions to maximize reward. A Markov game has $N$ agents, each choosing actions simultaneously at every state. The state transitions depend on the joint action of all agents. Each agent has its own reward function. A Nash equilibrium is a tuple of policies where no agent can improve its expected return by changing its policy alone, given that the other agents' policies are fixed.

Formal Setup

Definition

Markov Game

A Markov game (stochastic game) consists of: $N$ agents, state space $S$, action spaces $A_i$ for agent $i$, transition function $T(s' \mid s, a_1, \ldots, a_N)$, reward functions $R_i(s, a_1, \ldots, a_N)$ for each agent $i$, and discount factor $\gamma \in [0,1)$. At each step, all agents simultaneously choose actions, receive individual rewards, and the state transitions.

When $N = 1$, a Markov game reduces to a standard MDP. When $N = 2$ and $R_1 = -R_2$, it is a two-player zero-sum Markov game.
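The definition can be made concrete with a small sketch: a toy two-player zero-sum Markov game stored as NumPy arrays, with randomly generated transitions and rewards (the sizes and names here are illustrative, not from any particular benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# T[s, a1, a2] is a probability distribution over next states;
# R[s, a1, a2] is player 1's reward. In the zero-sum case player 2 gets -R.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, n_actions))

def step(s, a1, a2):
    """Both agents act simultaneously; sample the next state from T."""
    s_next = rng.choice(n_states, p=T[s, a1, a2])
    return R[s, a1, a2], s_next

reward, s_next = step(0, 1, 0)  # joint action (a1=1, a2=0) in state 0
```

Setting `n_agents = 1` (a single action index) recovers the usual MDP interface, matching the reduction above.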

Definition

Nash Equilibrium in Markov Games

A tuple of policies $(\pi_1^*, \ldots, \pi_N^*)$ is a Nash equilibrium if for every agent $i$ and every alternative policy $\pi_i$:

$$V_i^{\pi_1^*, \ldots, \pi_i^*, \ldots, \pi_N^*}(s) \geq V_i^{\pi_1^*, \ldots, \pi_i, \ldots, \pi_N^*}(s) \quad \forall s \in S$$

where $V_i^{\pi_1, \ldots, \pi_N}(s)$ is the expected discounted return for agent $i$ starting from state $s$ when all agents follow the specified policies.

Core Theory

Theorem

Minimax Theorem for Zero-Sum Markov Games

Statement

In a two-player zero-sum Markov game with finite state and action spaces and discount factor $\gamma \in (0,1)$, there exists a value function $V^*: S \to \mathbb{R}$ and a pair of policies $(\pi_1^*, \pi_2^*)$ such that:

$$V^*(s) = \max_{\pi_1} \min_{\pi_2} V_1^{\pi_1, \pi_2}(s) = \min_{\pi_2} \max_{\pi_1} V_1^{\pi_1, \pi_2}(s) \quad \forall s \in S$$

The policies $(\pi_1^*, \pi_2^*)$ form a Nash equilibrium. Both may be stochastic (mixed strategies over actions).

Intuition

Just as single-agent MDPs have a well-defined optimal value function, zero-sum Markov games have a well-defined game value at each state. The max player cannot guarantee more than $V^*(s)$, and the min player cannot force the value below $V^*(s)$. The minimax and maximin values coincide, so neither player benefits from moving first or second.

Proof Sketch

Define the Shapley operator $\mathcal{T}$ on value functions: $(\mathcal{T}V)(s) = \text{val}(M_s(V))$, where $M_s(V)$ is the matrix game with entries $R_1(s, a_1, a_2) + \gamma \sum_{s'} T(s' \mid s, a_1, a_2) V(s')$ and $\text{val}$ denotes the minimax value of the matrix game (which exists by von Neumann's minimax theorem). The operator $\mathcal{T}$ is a $\gamma$-contraction in the sup-norm. By Banach's fixed-point theorem, it has a unique fixed point $V^*$. The equilibrium policies are obtained from the minimax strategies of the matrix games at each state.
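The proof sketch translates directly into an algorithm: value iteration in which each backup solves a matrix game. A minimal sketch, assuming SciPy is available for the linear program that computes matrix-game values (the array shapes follow the toy representation used above and are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value and row strategy of a zero-sum matrix game via LP."""
    n_rows, n_cols = M.shape
    # Variables: row mixed strategy x (n_rows entries) and the value v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                       # maximize v
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])     # v <= x^T M[:, j] for all j
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                             # x sums to 1
    bounds = [(0, None)] * n_rows + [(None, None)]     # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]

def shapley_iteration(R, T, gamma, iters=200):
    """Value iteration with the Shapley operator: one matrix game per state.

    R[s, a1, a2] is player 1's reward; T[s, a1, a2] is the next-state
    distribution, so T[s] @ V gives the expected continuation values.
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = np.array([matrix_game_value(R[s] + gamma * T[s] @ V)[0]
                      for s in range(R.shape[0])])
    return V
```

Each iteration contracts the error by a factor of $\gamma$, mirroring the Banach fixed-point argument; with a single state and a single action the recursion collapses to the familiar geometric series $V = r/(1-\gamma)$.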

Why It Matters

This guarantees that zero-sum Markov games are theoretically well-posed: the game has a definite value and optimal strategies exist. This is the theoretical foundation for training competitive agents via self-play. If self-play converges, it converges to the Nash equilibrium value.

Failure Mode

The theorem requires finite state and action spaces. For continuous spaces, existence results require additional regularity conditions. For general-sum games ($N > 2$ or non-zero-sum), Nash equilibria exist but may not be unique, computing them is PPAD-hard, and the minimax equality fails. Self-play in general-sum games may cycle rather than converge.

Self-Play

Self-play trains an agent by having it play against copies of itself (or against past versions of itself). This connects to the broader exploration vs. exploitation trade-off: the agent must balance exploiting known strategies against exploring new ones. The key idea: if an agent improves its policy against the current opponent, and the opponent is a copy of the agent, then the agent is improving against its own weaknesses.

Fictitious Play

In fictitious play, each agent maintains a model of the opponent's strategy as the empirical average of past actions. At each round, each agent best-responds to the opponent's empirical strategy. For two-player zero-sum games, the empirical strategies converge to a Nash equilibrium (Robinson, 1951). Convergence can be slow.
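A minimal sketch of fictitious play on matching pennies (a two-action zero-sum matrix game whose unique equilibrium mixes uniformly); the empirical frequencies approach the equilibrium, illustrating both Robinson's guarantee and the slow convergence noted above:

```python
import numpy as np

# Matching pennies: the row player wins +1 on a match, loses 1 otherwise.
M = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

counts1 = np.ones(2)  # empirical action counts (ones avoid degenerate ties)
counts2 = np.ones(2)
for _ in range(20000):
    # Each player best-responds to the opponent's empirical mixed strategy.
    a1 = np.argmax(M @ (counts2 / counts2.sum()))   # row maximizes
    a2 = np.argmin((counts1 / counts1.sum()) @ M)   # column minimizes row payoff
    counts1[a1] += 1
    counts2[a2] += 1

freq1 = counts1 / counts1.sum()   # approaches the equilibrium (1/2, 1/2)
```

Note that the *empirical averages* converge; the actions actually played keep cycling in ever-longer runs, which is why fictitious play is slow.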

Neural Self-Play

AlphaGo Zero and AlphaZero use Monte Carlo tree search (MCTS) guided by a neural network that outputs both a policy and a value estimate. The network is trained on games played against itself, by regressing toward MCTS-derived targets rather than by policy gradients. The training loop: (1) play games using MCTS + the current network, (2) use game outcomes as training targets for the value head, (3) use MCTS visit counts as training targets for the policy head, (4) repeat. Each iteration improves the network, which improves the quality of self-play games, which provides better training data.

Population-Based Training

A weakness of pure self-play is that the agent may overfit to its own style and fail against diverse opponents. Population-based training maintains a pool of agents that play against each other. This encourages robust strategies and helps avoid strategy collapse. OpenAI Five used this approach for Dota 2.

Common Confusions

Watch Out

Nash equilibrium does not mean optimal play

A Nash equilibrium is a fixed point: no agent can unilaterally improve. But Nash equilibria can be Pareto-dominated. In the prisoner's dilemma, the Nash equilibrium (both defect) gives both players worse outcomes than mutual cooperation. In zero-sum games this issue does not arise because the equilibrium is minimax optimal.
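The prisoner's dilemma claim can be checked numerically with one standard payoff choice (the values 3, 0, 5, 1 are a common convention, not canonical):

```python
import numpy as np

# Player 1's payoffs; action 0 = cooperate, 1 = defect.
# Player 2's payoff matrix is the transpose, by symmetry.
P1 = np.array([[3, 0],
               [5, 1]])
P2 = P1.T

# Defect strictly dominates cooperate for player 1 (5 > 3 and 1 > 0),
# so (defect, defect) is the unique Nash equilibrium...
defect_dominates = bool(np.all(P1[1, :] > P1[0, :]))

# ...yet mutual defection gives each player 1, while mutual cooperation
# would give each player 3: the equilibrium is Pareto-dominated.
nash_payoff = int(P1[1, 1])
cooperate_payoff = int(P1[0, 0])
```

In a zero-sum game this gap cannot open up, since any payoff one player gives away is captured by the other.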

Watch Out

Self-play convergence is not guaranteed in general

Self-play converges to Nash equilibrium in two-player zero-sum games under certain conditions (e.g., fictitious play). In general-sum or multi-player games, self-play can cycle, diverge, or converge to non-equilibrium strategies. The success of self-play in Go and chess relies on the zero-sum structure.

Watch Out

Stochastic policies may be necessary

In matrix games, pure strategy Nash equilibria may not exist (e.g., rock-paper-scissors). The same applies to Markov games: the equilibrium policies may need to be stochastic, mixing over actions at some states. A deterministic best response to a deterministic opponent can be exploitable.
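A quick sketch makes the rock-paper-scissors point concrete: a deterministic strategy loses outright to its best response, while the uniform mixed strategy guarantees the game value of zero against every opponent:

```python
import numpy as np

# Row player's payoff in rock-paper-scissors (order: rock, paper, scissors).
M = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])

# Always playing rock (row 0): the opponent best-responds with paper
# and wins every round.
worst_case_pure = M[0].min()       # -1

# The uniform mixed strategy earns exactly 0 against any column.
x = np.ones(3) / 3
worst_case_mixed = (x @ M).min()   # 0
```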

Key Takeaways

  • Markov games extend MDPs to $N$ agents with individual reward functions
  • In two-player zero-sum Markov games, the minimax theorem guarantees a unique game value and the existence of (possibly stochastic) equilibrium policies
  • The Shapley operator generalizes the Bellman operator by solving a matrix game at each state
  • Self-play finds Nash equilibria in zero-sum games by iteratively improving against the agent's own policy
  • AlphaZero combines neural networks with MCTS self-play for superhuman game play
  • General-sum games are harder: equilibria may be non-unique, self-play may not converge

Exercises

ExerciseCore

Problem

Consider a two-player zero-sum Markov game with a single state (a repeated matrix game). Player 1 chooses rows, Player 2 chooses columns. The payoff matrix is:

$$M = \begin{pmatrix} 3 & -1 \\ -2 & 4 \end{pmatrix}$$

Find the Nash equilibrium mixed strategies and the game value. Assume $\gamma = 0$ (single-stage game).

ExerciseAdvanced

Problem

Prove that the Shapley operator $(\mathcal{T}V)(s) = \text{val}(R_s + \gamma P_s V)$ is a $\gamma$-contraction in the sup-norm, where $R_s$ is the reward matrix at state $s$, $P_s V$ is the matrix of expected next-state values, and $\text{val}(\cdot)$ denotes the minimax value of a matrix game.

References

Canonical:

  • Shapley, Stochastic Games, PNAS (1953)
  • Littman, Markov Games as a Framework for Multi-Agent Reinforcement Learning, ICML (1994)

Current:

  • Silver et al., Mastering the Game of Go without Human Knowledge, Nature (2017)
  • Silver et al., A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go, Science (2018)
  • Shoham and Leyton-Brown, Multiagent Systems (2009), Chapters 4, 6

Next Topics

  • Multi-agent RL algorithms for general-sum games
  • Population-based training and league training

Last reviewed: April 2026
