
RL Theory

Markov Games and Self-Play

Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.


Why This Matters

Single-agent MDPs cannot model competition or cooperation between multiple decision makers. Markov games (also called stochastic games) extend MDPs to multiple agents, each with its own policy and reward function. Zero-sum Markov games model pure competition: one agent's gain is the other's loss. The minimax theorem guarantees that these games have well-defined values, and self-play provides a practical method for computing equilibrium policies. AlphaGo, AlphaZero, and OpenAI Five all rely on self-play in Markov games.

Mental Model

An MDP has one agent choosing actions to maximize reward. A Markov game has $N$ agents, each choosing actions simultaneously at every state. The state transitions depend on the joint action of all agents. Each agent has its own reward function. A Nash equilibrium is a tuple of policies where no agent can improve its expected return by changing its policy alone, given that the other agents' policies are fixed.

Formal Setup

Definition

Markov Game

A Markov game (stochastic game) consists of: $N$ agents, state space $S$, action spaces $A_i$ for agent $i$, transition function $T(s' \mid s, a_1, \ldots, a_N)$, reward functions $R_i(s, a_1, \ldots, a_N)$ for each agent $i$, and discount factor $\gamma \in [0,1)$. At each step, all agents simultaneously choose actions, receive individual rewards, and the state transitions.

When $N = 1$, a Markov game reduces to a standard MDP. When $N = 2$ and $R_1 = -R_2$, it is a two-player zero-sum Markov game.
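The definition can be made concrete with a small sketch: a toy two-player zero-sum Markov game stored as NumPy arrays, with randomly generated transitions and rewards (the sizes and names here are illustrative, not from any particular benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# T[s, a1, a2] is a probability distribution over next states;
# R[s, a1, a2] is player 1's reward. In the zero-sum case player 2 gets -R.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, n_actions))

def step(s, a1, a2):
    """Both agents act simultaneously; sample the next state from T."""
    s_next = rng.choice(n_states, p=T[s, a1, a2])
    return R[s, a1, a2], s_next

reward, s_next = step(0, 1, 0)  # joint action (a1=1, a2=0) in state 0
```

Setting `n_agents = 1` (a single action index) recovers the usual MDP interface, matching the reduction above.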

Definition

Nash Equilibrium in Markov Games

A tuple of policies $(\pi_1^*, \ldots, \pi_N^*)$ is a Nash equilibrium if for every agent $i$ and every alternative policy $\pi_i$:

$$V_i^{\pi_1^*, \ldots, \pi_i^*, \ldots, \pi_N^*}(s) \geq V_i^{\pi_1^*, \ldots, \pi_i, \ldots, \pi_N^*}(s) \quad \forall s \in S$$

where $V_i^{\pi_1, \ldots, \pi_N}(s)$ is the expected discounted return for agent $i$ starting from state $s$ when all agents follow the specified policies.

Core Theory

Theorem

Minimax Theorem for Zero-Sum Markov Games

Statement

In a two-player zero-sum Markov game with finite state and action spaces and discount factor $\gamma \in (0,1)$, there exists a value function $V^*: S \to \mathbb{R}$ and a pair of policies $(\pi_1^*, \pi_2^*)$ such that:

$$V^*(s) = \max_{\pi_1} \min_{\pi_2} V_1^{\pi_1, \pi_2}(s) = \min_{\pi_2} \max_{\pi_1} V_1^{\pi_1, \pi_2}(s) \quad \forall s \in S$$

The policies $(\pi_1^*, \pi_2^*)$ form a Nash equilibrium. Both may be stochastic (mixed strategies over actions).

Intuition

Just as single-agent MDPs have a well-defined optimal value function, zero-sum Markov games have a well-defined game value at each state. The max player cannot guarantee more than $V^*(s)$, and the min player cannot force the value below $V^*(s)$. The minimax and maximin values coincide, so neither player benefits from moving first or second.

Proof Sketch

Define the Shapley operator $\mathcal{T}$ on value functions: $(\mathcal{T}V)(s) = \text{val}(M_s(V))$, where $M_s(V)$ is the matrix game with entries $R_1(s, a_1, a_2) + \gamma \sum_{s'} T(s' \mid s, a_1, a_2) V(s')$ and $\text{val}$ denotes the minimax value of the matrix game (which exists by von Neumann's minimax theorem). The operator $\mathcal{T}$ is a $\gamma$-contraction in the sup-norm. By Banach's fixed-point theorem, it has a unique fixed point $V^*$. The equilibrium policies are obtained from the minimax strategies of the matrix games at each state.
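The proof sketch translates directly into an algorithm: value iteration in which each backup solves a matrix game. A minimal sketch, assuming SciPy is available for the linear program that computes matrix-game values (the array shapes follow the toy representation used above and are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value and row strategy of a zero-sum matrix game via LP."""
    n_rows, n_cols = M.shape
    # Variables: row mixed strategy x (n_rows entries) and the value v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                       # maximize v
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])     # v <= x^T M[:, j] for all j
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                             # x sums to 1
    bounds = [(0, None)] * n_rows + [(None, None)]     # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]

def shapley_iteration(R, T, gamma, iters=200):
    """Value iteration with the Shapley operator: one matrix game per state.

    R[s, a1, a2] is player 1's reward; T[s, a1, a2] is the next-state
    distribution, so T[s] @ V gives the expected continuation values.
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = np.array([matrix_game_value(R[s] + gamma * T[s] @ V)[0]
                      for s in range(R.shape[0])])
    return V
```

Each iteration contracts the error by a factor of $\gamma$, mirroring the Banach fixed-point argument; with a single state and a single action the recursion collapses to the familiar geometric series $V = r/(1-\gamma)$.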

Why It Matters

This guarantees that zero-sum Markov games are theoretically well-posed: the game has a definite value and optimal strategies exist. This is the theoretical foundation for training competitive agents via self-play. If self-play converges, it converges to the Nash equilibrium value.

Failure Mode

The theorem requires finite state and action spaces. For continuous spaces, existence results require additional regularity conditions. For general-sum games ($N > 2$ or non-zero-sum), Nash equilibria exist but may not be unique, computing them is PPAD-hard, and the minimax equality fails. Self-play in general-sum games may cycle rather than converge.

Self-Play

Self-play trains an agent by having it play against copies of itself (or against past versions of itself). This connects to the broader exploration vs. exploitation trade-off: the agent must balance exploiting known strategies against exploring new ones. The key idea: if an agent improves its policy against the current opponent, and the opponent is a copy of the agent, then the agent is improving against its own weaknesses.

Fictitious Play

In fictitious play, each agent maintains a model of the opponent's strategy as the empirical average of past actions. At each round, each agent best-responds to the opponent's empirical strategy. For two-player zero-sum games, the empirical strategies converge to a Nash equilibrium (Robinson, 1951). Convergence can be slow.
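A minimal sketch of fictitious play on matching pennies (a two-action zero-sum matrix game whose unique equilibrium mixes uniformly); the empirical frequencies approach the equilibrium, illustrating both Robinson's guarantee and the slow convergence noted above:

```python
import numpy as np

# Matching pennies: the row player wins +1 on a match, loses 1 otherwise.
M = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

counts1 = np.ones(2)  # empirical action counts (ones avoid degenerate ties)
counts2 = np.ones(2)
for _ in range(20000):
    # Each player best-responds to the opponent's empirical mixed strategy.
    a1 = np.argmax(M @ (counts2 / counts2.sum()))   # row maximizes
    a2 = np.argmin((counts1 / counts1.sum()) @ M)   # column minimizes row payoff
    counts1[a1] += 1
    counts2[a2] += 1

freq1 = counts1 / counts1.sum()   # approaches the equilibrium (1/2, 1/2)
```

Note that the *empirical averages* converge; the actions actually played keep cycling in ever-longer runs, which is why fictitious play is slow.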

Neural Self-Play

AlphaGo Zero and AlphaZero use Monte Carlo tree search (MCTS) guided by a neural network that outputs both a policy and a value estimate. The network is trained on games played against itself, by regressing toward MCTS-derived targets rather than by policy gradients. The training loop: (1) play games using MCTS + the current network, (2) use game outcomes as training targets for the value head, (3) use MCTS visit counts as training targets for the policy head, (4) repeat. Each iteration improves the network, which improves the quality of self-play games, which provides better training data.

Population-Based Training

A weakness of pure self-play is that the agent may overfit to its own style and fail against diverse opponents. Population-based training maintains a pool of agents that play against each other. This encourages robust strategies and helps avoid strategy collapse. OpenAI Five used this approach for Dota 2.

Common Confusions

Watch Out

Nash equilibrium does not mean optimal play

A Nash equilibrium is a fixed point: no agent can unilaterally improve. But Nash equilibria can be Pareto-dominated. In the prisoner's dilemma, the Nash equilibrium (both defect) gives both players worse outcomes than mutual cooperation. In zero-sum games this issue does not arise because the equilibrium is minimax optimal.
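The prisoner's dilemma claim can be checked numerically with one standard payoff choice (the values 3, 0, 5, 1 are a common convention, not canonical):

```python
import numpy as np

# Player 1's payoffs; action 0 = cooperate, 1 = defect.
# Player 2's payoff matrix is the transpose, by symmetry.
P1 = np.array([[3, 0],
               [5, 1]])
P2 = P1.T

# Defect strictly dominates cooperate for player 1 (5 > 3 and 1 > 0),
# so (defect, defect) is the unique Nash equilibrium...
defect_dominates = bool(np.all(P1[1, :] > P1[0, :]))

# ...yet mutual defection gives each player 1, while mutual cooperation
# would give each player 3: the equilibrium is Pareto-dominated.
nash_payoff = int(P1[1, 1])
cooperate_payoff = int(P1[0, 0])
```

In a zero-sum game this gap cannot open up, since any payoff one player gives away is captured by the other.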

Watch Out

Self-play convergence is not guaranteed in general

Self-play converges to Nash equilibrium in two-player zero-sum games under certain conditions (e.g., fictitious play). In general-sum or multi-player games, self-play can cycle, diverge, or converge to non-equilibrium strategies. The success of self-play in Go and chess relies on the zero-sum structure.

Watch Out

Stochastic policies may be necessary

In matrix games, pure strategy Nash equilibria may not exist (e.g., rock-paper-scissors). The same applies to Markov games: the equilibrium policies may need to be stochastic, mixing over actions at some states. A deterministic best response to a deterministic opponent can be exploitable.
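A quick sketch makes the rock-paper-scissors point concrete: a deterministic strategy loses outright to its best response, while the uniform mixed strategy guarantees the game value of zero against every opponent:

```python
import numpy as np

# Row player's payoff in rock-paper-scissors (order: rock, paper, scissors).
M = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])

# Always playing rock (row 0): the opponent best-responds with paper
# and wins every round.
worst_case_pure = M[0].min()       # -1

# The uniform mixed strategy earns exactly 0 against any column.
x = np.ones(3) / 3
worst_case_mixed = (x @ M).min()   # 0
```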

Key Takeaways

  • Markov games extend MDPs to $N$ agents with individual reward functions
  • In two-player zero-sum Markov games, the minimax theorem guarantees a unique game value and the existence of (possibly stochastic) equilibrium policies
  • The Shapley operator generalizes the Bellman operator by solving a matrix game at each state
  • Self-play finds Nash equilibria in zero-sum games by iteratively improving against the agent's own policy
  • AlphaZero combines neural networks with MCTS self-play for superhuman game play
  • General-sum games are harder: equilibria may be non-unique, self-play may not converge

Exercises

ExerciseCore

Problem

Consider a two-player zero-sum Markov game with a single state (a repeated matrix game). Player 1 chooses rows, Player 2 chooses columns. The payoff matrix is:

$$M = \begin{pmatrix} 3 & -1 \\ -2 & 4 \end{pmatrix}$$

Find the Nash equilibrium mixed strategies and the game value. Assume $\gamma = 0$ (single-stage game).

ExerciseAdvanced

Problem

Prove that the Shapley operator $(\mathcal{T}V)(s) = \text{val}(R_s + \gamma P_s V)$ is a $\gamma$-contraction in the sup-norm, where $R_s$ is the reward matrix at state $s$, $P_s V$ is the matrix of expected next-state values, and $\text{val}(\cdot)$ denotes the minimax value of a matrix game.

References

Canonical:

  • Shapley, Stochastic Games, PNAS (1953)
  • Littman, Markov Games as a Framework for Multi-Agent Reinforcement Learning, ICML (1994)

Current:

  • Silver et al., Mastering the Game of Go without Human Knowledge, Nature (2017)
  • Silver et al., A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go, Science (2018)
  • Shoham and Leyton-Brown, Multiagent Systems (2009), Chapters 4, 6

Next Topics

  • Multi-agent RL algorithms for general-sum games
  • Population-based training and league training

Last reviewed: April 2026
