RL Theory
Markov Games and Self-Play
Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.
Why This Matters
Single-agent MDPs cannot model competition or cooperation between multiple decision makers. Markov games (also called stochastic games) extend MDPs to multiple agents, each with its own policy and reward function. Zero-sum Markov games model pure competition: one agent's gain is the other's loss. The minimax theorem guarantees that these games have well-defined values, and self-play provides a practical method for computing equilibrium policies. AlphaGo, AlphaZero, and OpenAI Five all rely on self-play in Markov games.
Mental Model
An MDP has one agent choosing actions to maximize reward. A Markov game has $n$ agents, each choosing actions simultaneously at every state. The state transitions depend on the joint action of all agents. Each agent has its own reward function. A Nash equilibrium is a tuple of policies where no agent can improve its expected return by changing its policy alone, given that the other agents' policies are fixed.
Formal Setup
Markov Game
A Markov game (stochastic game) consists of: $n$ agents, a state space $\mathcal{S}$, an action space $\mathcal{A}_i$ for each agent $i$, a transition function $P(s' \mid s, a_1, \dots, a_n)$, a reward function $r_i(s, a_1, \dots, a_n)$ for each agent $i$, and a discount factor $\gamma \in [0, 1)$. At each step, all agents simultaneously choose actions, receive individual rewards, and the state transitions according to $P$.
When $n = 1$, a Markov game reduces to a standard MDP. When $n = 2$ and $r_1 = -r_2$, it is a two-player zero-sum Markov game.
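As a concrete illustration, a finite two-player zero-sum Markov game can be stored as a pair of arrays. This is a minimal sketch assuming NumPy; the names (n_states, n_a1, n_a2, P, R) and the random contents are purely illustrative.

```python
import numpy as np

# Minimal sketch of a finite two-player zero-sum Markov game (illustrative names).
n_states, n_a1, n_a2 = 3, 2, 2   # states, player-1 actions, player-2 actions
gamma = 0.95                      # discount factor

# P[s, a1, a2, s'] = probability of moving to s' under joint action (a1, a2).
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_a1, n_a2))

# R[s, a1, a2] = reward to player 1; player 2 receives -R (zero-sum).
R = np.random.uniform(-1.0, 1.0, size=(n_states, n_a1, n_a2))

assert np.allclose(P.sum(axis=-1), 1.0)  # each transition distribution sums to 1
```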
Nash Equilibrium in Markov Games
A tuple of policies $(\pi_1^*, \dots, \pi_n^*)$ is a Nash equilibrium if for every agent $i$ and every alternative policy $\pi_i$:
$$V_i^{\pi_i^*, \pi_{-i}^*}(s) \;\ge\; V_i^{\pi_i, \pi_{-i}^*}(s) \quad \text{for all states } s,$$
where $V_i^{\pi_1, \dots, \pi_n}(s)$ is the expected discounted return for agent $i$ starting from state $s$ when all agents follow the specified policies, and $\pi_{-i}^*$ denotes the policies of all agents other than $i$.
Core Theory
Minimax Theorem for Zero-Sum Markov Games
Statement
In a two-player zero-sum Markov game with finite state and action spaces and discount factor $\gamma < 1$, there exists a value function $V^*$ and a pair of policies $(\pi_1^*, \pi_2^*)$ such that:
$$V^*(s) \;=\; \max_{\pi_1} \min_{\pi_2} V^{\pi_1, \pi_2}(s) \;=\; \min_{\pi_2} \max_{\pi_1} V^{\pi_1, \pi_2}(s) \;=\; V^{\pi_1^*, \pi_2^*}(s) \quad \text{for all } s.$$
The policies $(\pi_1^*, \pi_2^*)$ form a Nash equilibrium. Both may be stochastic (mixed strategies over actions).
Intuition
Just as single-agent MDPs have a well-defined optimal value function, zero-sum Markov games have a well-defined game value $V^*(s)$ at each state. The max player cannot guarantee more than $V^*(s)$, and the min player cannot force the value below $V^*(s)$. The minimax and maximin values coincide, so neither player benefits from moving first or second.
Proof Sketch
Define the Shapley operator $T$ on value functions:
$$(TV)(s) = \operatorname{val}\big(Q_V(s)\big), \qquad \big[Q_V(s)\big]_{a_1, a_2} = r(s, a_1, a_2) + \gamma \sum_{s'} P(s' \mid s, a_1, a_2)\, V(s'),$$
where $Q_V(s)$ is the matrix game with the entries above and $\operatorname{val}$ denotes the minimax value of a matrix game (which exists by von Neumann's minimax theorem). The operator $T$ is a $\gamma$-contraction in the sup-norm. By Banach's fixed point theorem, it has a unique fixed point $V^*$. The equilibrium policies are obtained from the minimax strategies of the matrix games at each state.
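The proof sketch translates directly into an algorithm: value iteration with the Shapley operator, solving the matrix game at each state by linear programming. Below is a minimal sketch assuming NumPy/SciPy and the array layout from the earlier sketch (P[s, a1, a2, s'], R[s, a1, a2]); the function names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value of a zero-sum matrix game (row player maximizes).

    Standard LP: maximize v subject to x^T M[:, j] >= v for every column j,
    with x a probability vector over rows.
    """
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                               # linprog minimizes, so minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])  # v - x^T M[:, j] <= 0 for each column j
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                          # x sums to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def shapley_value_iteration(P, R, gamma, iters=500):
    """Iterate (TV)(s) = val(R(s) + gamma * sum_{s'} P(s'|s,a1,a2) V(s'))."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        V_new = np.empty(n_states)
        for s in range(n_states):
            Q = R[s] + gamma * P[s] @ V        # matrix game at state s
            V_new[s] = matrix_game_value(Q)
        V = V_new
    return V

# Example usage with the arrays defined above:
# V_star = shapley_value_iteration(P, R, gamma=0.95)
```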
Why It Matters
This guarantees that zero-sum Markov games are theoretically well-posed: the game has a definite value and optimal strategies exist. This is the theoretical foundation for training competitive agents via self-play. If self-play converges, it converges to the Nash equilibrium value.
Failure Mode
The theorem requires finite state and action spaces. For continuous spaces, existence results require additional regularity conditions. For general-sum games (more than two players, or rewards that do not sum to zero), Nash equilibria exist but may not be unique, computing them is PPAD-hard, and the minimax equality fails. Self-play in general-sum games may cycle rather than converge.
Self-Play
Self-play trains an agent by having it play against copies of itself (or against past versions of itself). This connects to the broader exploration vs. exploitation trade-off: the agent must balance exploiting known strategies against exploring new ones. The key idea: if an agent improves its policy against the current opponent, and the opponent is a copy of the agent, then the agent is improving against its own weaknesses.
Fictitious Play
In fictitious play, each agent maintains a model of the opponent's strategy as the empirical average of past actions. At each round, each agent best-responds to the opponent's empirical strategy. For two-player zero-sum games, the empirical strategies converge to a Nash equilibrium (Robinson, 1951). Convergence can be slow.
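A minimal sketch of fictitious play on rock-paper-scissors, assuming NumPy. Each player best-responds to the opponent's empirical action frequencies; the empirical strategies drift toward the uniform equilibrium mixture, though slowly, as noted above.

```python
import numpy as np

# Payoff to the row player in rock-paper-scissors (zero-sum).
M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

counts1 = np.ones(3)  # action counts for each player (uniform prior)
counts2 = np.ones(3)

for _ in range(10000):
    emp1 = counts1 / counts1.sum()   # empirical strategy of player 1
    emp2 = counts2 / counts2.sum()   # empirical strategy of player 2
    a1 = np.argmax(M @ emp2)         # row player best-responds to emp2
    a2 = np.argmin(emp1 @ M)         # column player best-responds to emp1
    counts1[a1] += 1
    counts2[a2] += 1

print(counts1 / counts1.sum())  # approaches [1/3, 1/3, 1/3]
print(counts2 / counts2.sum())
```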
Neural Self-Play
AlphaGo Zero and AlphaZero use Monte Carlo tree search (MCTS) guided by a neural network that outputs both a policy and a value estimate. The network is trained on games played against itself, by regression and classification on self-play targets rather than by policy gradients. The training loop: (1) play games using MCTS + the current network, (2) use game outcomes as training targets for the value head, (3) use MCTS visit counts as training targets for the policy head, (4) repeat. Each iteration improves the network, which improves the quality of self-play games, which provides better training data.
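As a rough illustration of steps (2) and (3), here is a minimal sketch of the per-batch training loss, assuming PyTorch; the function name and tensor shapes are assumptions for this example, and the real systems add weight decay and many engineering details.

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, mcts_visit_dist, game_outcome):
    """Sketch of the combined loss on one batch of self-play positions.

    policy_logits:   (B, A) raw policy-head outputs
    value_pred:      (B,)   value-head predictions in [-1, 1]
    mcts_visit_dist: (B, A) normalized MCTS visit counts (policy targets)
    game_outcome:    (B,)   final result from the current player's perspective
    """
    # Value head: mean squared error against the self-play game outcome.
    value_loss = F.mse_loss(value_pred, game_outcome)
    # Policy head: cross-entropy against the MCTS visit distribution.
    log_probs = F.log_softmax(policy_logits, dim=-1)
    policy_loss = -(mcts_visit_dist * log_probs).sum(dim=-1).mean()
    return value_loss + policy_loss
```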
Population-Based Training
A weakness of pure self-play is that the agent may overfit to its own style and fail against diverse opponents. Population-based training maintains a pool of agents that play against each other. This encourages robust strategies and helps avoid strategy collapse. OpenAI Five used this approach for Dota 2.
Common Confusions
Nash equilibrium does not mean optimal play
A Nash equilibrium is a fixed point: no agent can unilaterally improve. But Nash equilibria can be Pareto-dominated. In the prisoner's dilemma, the Nash equilibrium (both defect) gives both players worse outcomes than mutual cooperation. In zero-sum games this issue does not arise because the equilibrium is minimax optimal.
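With one standard reward assignment (entries are (row player, column player) rewards; higher is better):
$$
\begin{array}{c|cc}
 & \text{Cooperate} & \text{Defect} \\
\hline
\text{Cooperate} & (3,\ 3) & (0,\ 5) \\
\text{Defect} & (5,\ 0) & (1,\ 1)
\end{array}
$$
Defect strictly dominates Cooperate for each player, so (Defect, Defect) with payoff $(1, 1)$ is the unique Nash equilibrium, yet it is Pareto-dominated by mutual cooperation's $(3, 3)$.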
Self-play convergence is not guaranteed in general
Self-play converges to Nash equilibrium in two-player zero-sum games under certain conditions (e.g., fictitious play). In general-sum or multi-player games, self-play can cycle, diverge, or converge to non-equilibrium strategies. The success of self-play in Go and chess relies on the zero-sum structure.
Stochastic policies may be necessary
In matrix games, pure strategy Nash equilibria may not exist (e.g., rock-paper-scissors). The same applies to Markov games: the equilibrium policies may need to be stochastic, mixing over actions at some states. A deterministic best response to a deterministic opponent can be exploitable.
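For example, the unique equilibrium of rock-paper-scissors mixes uniformly,
$$\pi_1^* = \pi_2^* = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right), \qquad \text{game value} = 0,$$
while any deterministic policy is beaten outright by the corresponding counter-move.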
Key Takeaways
- Markov games extend MDPs to $n$ agents with individual reward functions
- In two-player zero-sum Markov games, the minimax theorem guarantees a unique game value and the existence of (possibly non-unique) equilibrium policies
- The Shapley operator generalizes the Bellman operator by solving a matrix game at each state
- Self-play finds Nash equilibria in zero-sum games by iteratively improving against the agent's own policy
- AlphaZero combines neural networks with MCTS self-play for superhuman game play
- General-sum games are harder: equilibria may be non-unique, self-play may not converge
Exercises
Problem
Consider a two-player zero-sum Markov game with a single state (a repeated matrix game). Player 1 chooses rows, Player 2 chooses columns. The payoff matrix is:
Find the Nash equilibrium mixed strategies and the game value. Assume $\gamma = 0$ (single-stage game).
Problem
Prove that the Shapley operator $(TV)(s) = \operatorname{val}\big(R(s) + \gamma\, \bar{P}_V(s)\big)$ is a $\gamma$-contraction in the sup-norm, where $R(s)$ is the reward matrix at state $s$, $\bar{P}_V(s)$ with entries $\sum_{s'} P(s' \mid s, a_1, a_2)\, V(s')$ is the matrix of expected next-state values, and $\operatorname{val}$ denotes the minimax value of a matrix game.
References
Canonical:
- Shapley, Stochastic Games, PNAS (1953)
- Littman, Markov Games as a Framework for Multi-Agent Reinforcement Learning, ICML (1994)
Current:
- Silver et al., Mastering the Game of Go without Human Knowledge, Nature (2017)
- Silver et al., A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go, Science (2018)
- Shoham and Leyton-Brown, Multiagent Systems (2009), Chapters 4, 6
Next Topics
- Multi-agent RL algorithms for general-sum games
- Population-based training and league training
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)