
Comparison

Self-Play vs. Independent Learning

Two approaches to multi-agent reinforcement learning: self-play trains an agent against copies of itself, accepting that the opponent (and hence the training environment) keeps changing, while independent learning treats other agents as part of a fixed environment. Self-play converges to a Nash equilibrium in two-player zero-sum games (under conditions); independent learning can cycle or diverge.

What Each Does

Both self-play and independent learning are strategies for training agents in multi-agent environments. They differ in how they handle the presence of other agents.

Self-play trains an agent by having it play against copies of itself (or past versions of itself). The opponent changes as the agent improves, creating a non-stationary training environment. The agent and opponent co-evolve.

Independent learning trains each agent separately using a standard single-agent RL algorithm (Q-learning, policy gradient, etc.), treating all other agents as part of the environment. Each agent ignores the fact that other agents are also learning.

Side-by-Side Statement

Definition

Self-Play

Agent $i$ plays against agent $j$, where $j$'s policy is a copy of $i$'s current policy $\pi_i$ or a past version $\pi_i^{(t-k)}$. Training alternates between:

  1. Generate episodes with $\pi_i$ vs. $\pi_j$
  2. Update $\pi_i$ using the collected data
  3. Update $\pi_j \leftarrow \pi_i$ (or sample from a pool of past policies)

The Nash equilibrium concept is the target: find $\pi^*$ such that $\pi^*$ is a best response to itself.
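A minimal sketch of this alternation, in the simplest form where the opponent is always a frozen copy of the current policy. Here `generate_episodes`, `update_policy`, and the `policy` object are hypothetical stand-ins for the rollout machinery, the RL update, and a copyable policy; they are not from any particular library.

```python
import copy

def self_play_training(policy, num_iterations, generate_episodes, update_policy):
    """Basic self-play loop: at every iteration the opponent is a frozen
    copy of the learner's current policy (hypothetical helper functions)."""
    for _ in range(num_iterations):
        opponent = copy.deepcopy(policy)            # step 3: pi_j <- pi_i (frozen copy)
        data = generate_episodes(policy, opponent)  # step 1: pi_i vs. pi_j
        update_policy(policy, data)                 # step 2: improve pi_i from the data
    return policy
```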

Definition

Independent Learning

Each agent $i$ runs its own RL algorithm, treating the joint actions of all other agents as part of the environment transition:

$$Q_i(s, a_i) \leftarrow Q_i(s, a_i) + \alpha \left[ r_i + \gamma \max_{a'_i} Q_i(s', a'_i) - Q_i(s, a_i) \right]$$

Agent $i$ updates using its own reward signal. It does not model or observe other agents' policies, rewards, or updates.
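A tabular sketch of this update from one agent's point of view. The environment wrapper `env` (with `reset`, `step`, and `action_space`) is a hypothetical interface assumed here: it steps the other agents internally and returns only this agent's reward, which is exactly the abstraction independent learning relies on.

```python
import random
from collections import defaultdict

def independent_q_learning(env, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Standard single-agent tabular Q-learning; all other agents are hidden
    inside `env`, as independent learning assumes."""
    Q = defaultdict(float)  # Q[(state, action)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = env.action_space(s)
            if random.random() < epsilon:
                a = random.choice(actions)                   # explore
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])  # exploit
            s_next, r, done = env.step(a)  # other agents act (and learn) inside env
            target = r if done else r + gamma * max(
                Q[(s_next, a_)] for a_ in env.action_space(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # the TD update above
            s = s_next
    return Q
```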

Where Each Is Stronger

Self-play wins in two-player zero-sum games

For two-player zero-sum games (chess, Go, poker), self-play has strong theoretical and empirical support. If the learning algorithm converges to a best response against the current opponent, and the opponent pool is diverse enough, self-play converges to a Nash equilibrium.

AlphaGo, AlphaZero, and OpenAI Five all used self-play. In these settings, the Nash equilibrium is the minimax-optimal strategy, so converging to it is the right goal.

Independent learning wins on simplicity

Independent learning requires no coordination between agents. Each agent runs a standard RL algorithm without modification. There is no need to maintain opponent pools, synchronize training, or handle the non-stationarity explicitly. For problems where agents interact weakly, independent learning can work well with minimal engineering.

Where Each Fails

Self-play fails in general-sum games

In general-sum games (e.g., coordination games, social dilemmas), the Nash equilibrium may not be unique, and not all Nash equilibria are desirable. Self-play can converge to a poor equilibrium or cycle between strategies. There is no guarantee that the co-evolution leads to cooperative or socially optimal behavior.

Self-play fails with insufficient diversity

If the agent only trains against its current self, it can develop a narrow strategy that exploits specific weaknesses of that exact policy. When faced with a different opponent, the strategy collapses. This is why practical self-play systems maintain a league of past policies (as in AlphaStar) to ensure robustness.
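A sketch of a simple league, under illustrative assumptions (uniform opponent sampling, fixed snapshot interval); real systems such as AlphaStar use more elaborate win-rate-based matchmaking.

```python
import copy
import random

class League:
    """Keep periodic snapshots of the learner and sample training opponents
    from that history, so the learner cannot overfit to its single latest self."""

    def __init__(self, initial_policy, snapshot_every=1000, max_size=50):
        self.pool = [copy.deepcopy(initial_policy)]
        self.snapshot_every = snapshot_every
        self.max_size = max_size

    def maybe_snapshot(self, policy, step):
        if step > 0 and step % self.snapshot_every == 0:
            self.pool.append(copy.deepcopy(policy))
            self.pool = self.pool[-self.max_size:]  # bound memory

    def sample_opponent(self):
        return random.choice(self.pool)             # uniform over past selves
```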

Independent learning fails on non-stationarity

From agent $i$'s perspective, the environment includes agents $j \neq i$, which are all changing their policies. This violates the stationarity assumption of single-agent RL theory. Q-learning convergence proofs require a stationary MDP; when other agents learn simultaneously, the MDP changes at every step. Independent learners can cycle indefinitely in simple matrix games like matching pennies.
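A toy illustration of this cycling, treating matching pennies as a stateless repeated game with two independent, bandit-style Q-learners; the parameter values are arbitrary and the code is a sketch, not a formal experiment.

```python
import random
from collections import defaultdict

def matching_pennies_independent(steps=10000, alpha=0.1, epsilon=0.05):
    """Two independent learners on matching pennies. Player 0 wants the
    actions to match, player 1 wants them to differ. Each learner ignores
    the other; their greedy actions tend to chase each other in a cycle
    instead of settling at the 50/50 mixed-strategy equilibrium."""
    Q = [defaultdict(float), defaultdict(float)]  # Q[i][action], actions in {0, 1}
    history = []
    for _ in range(steps):
        acts = []
        for i in range(2):
            if random.random() < epsilon:
                acts.append(random.choice([0, 1]))
            else:
                acts.append(max([0, 1], key=lambda a: Q[i][a]))
        r0 = 1 if acts[0] == acts[1] else -1      # zero-sum payoffs
        for i, r in enumerate((r0, -r0)):
            Q[i][acts[i]] += alpha * (r - Q[i][acts[i]])
        history.append(tuple(acts))
    return history  # the tail typically shows the joint action still flipping
```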

Independent learning fails to coordinate

In cooperative settings where agents must coordinate (e.g., both must choose action A simultaneously for a reward), independent learners have no mechanism to align their policies. Each agent's optimal action depends on the other's, and without communication or shared training, convergence to coordinated behavior is not guaranteed.

Key Assumptions That Differ

| | Self-Play | Independent Learning |
|---|---|---|
| Opponent model | Copy of self (or past self) | None (other agents are part of the environment) |
| Stationarity | Explicitly non-stationary | Assumes stationarity (incorrectly) |
| Target concept | Nash equilibrium | Best response to a fixed environment |
| Training cost | Higher (must maintain an opponent pool) | Lower (standard single-agent RL) |
| Convergence | Guaranteed in two-player zero-sum (under conditions) | No general guarantees |
| Coordination | Implicit through co-evolution | None |

The Core Issue: Non-Stationarity

Theorem

Non-Stationarity of Independent Learning

Statement

When $K$ agents simultaneously update their policies using independent learning, agent $i$ faces a non-stationary environment. The effective transition kernel for agent $i$ is:

$$P_i(s' \mid s, a_i) = \sum_{a_{-i}} P(s' \mid s, a_i, a_{-i}) \prod_{j \neq i} \pi_j^{(t)}(a_j \mid s)$$

Because each $\pi_j^{(t)}$ changes at every step $t$, $P_i$ is time-varying. Standard RL convergence results (which require a fixed MDP) do not apply. In contrast, self-play explicitly accounts for opponent non-stationarity by training against the evolving opponent.
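A small numeric illustration of the formula, using made-up transition probabilities for two agents with two actions each: the kernel agent $i$ effectively faces shifts whenever the other agent's policy $\pi_j$ shifts, even though the underlying joint dynamics never change.

```python
# Hypothetical joint transition model: P[(s, a_i, a_j)][s'] = probability.
P = {
    ("s0", 0, 0): {"s0": 0.9, "s1": 0.1},
    ("s0", 0, 1): {"s0": 0.2, "s1": 0.8},
    ("s0", 1, 0): {"s0": 0.5, "s1": 0.5},
    ("s0", 1, 1): {"s0": 0.6, "s1": 0.4},
}

def effective_kernel(s, a_i, pi_j):
    """Marginalize the other agent's action under its current policy pi_j(a_j | s)."""
    out = {}
    for a_j, p_aj in pi_j(s).items():
        for s_next, p in P[(s, a_i, a_j)].items():
            out[s_next] = out.get(s_next, 0.0) + p_aj * p
    return out

# Early in training agent j mostly plays action 0; later it mostly plays action 1.
early = effective_kernel("s0", 0, lambda s: {0: 0.9, 1: 0.1})  # {'s0': 0.83, 's1': 0.17}
late  = effective_kernel("s0", 0, lambda s: {0: 0.1, 1: 0.9})  # {'s0': 0.27, 's1': 0.73}
# Same state, same action, different "environment": the MDP agent i sees has changed.
```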

Intuition

Independent learning pretends the world is stationary when it is not. Each agent is trying to hit a moving target: the optimal policy depends on what other agents do, and other agents keep changing. Self-play embraces this non-stationarity by making the opponent's evolution part of the training process.

What to Memorize

  1. Self-play: Train against yourself. Works in two-player zero-sum games. Needs opponent diversity.

  2. Independent learning: Ignore other agents. Simple but violates stationarity. Can cycle.

  3. The convergence divide: Self-play converges to Nash equilibrium in two-player zero-sum (under mild conditions). Independent learning has no such guarantee.

  4. Practical lesson: If you have a zero-sum competitive game, use self-play with a league of past policies. If agents interact weakly, independent learning may suffice.

When a Researcher Would Use Each

Example

Training a game-playing AI

For chess, Go, or poker, use self-play with a diverse opponent pool. The two-player zero-sum structure guarantees that the Nash equilibrium is the minimax strategy, and self-play has proven empirically effective at finding it. AlphaZero used self-play exclusively, with no human game data.

Example

Traffic signal control

For decentralized traffic control where intersections are managed by separate agents, independent learning can work if intersections interact weakly. Each intersection optimizes its own signal timing based on local traffic. The coupling between intersections is indirect and often small enough that non-stationarity is manageable.

Example

Competitive multiplayer with more than 2 players

For games with $K > 2$ players, pure self-play is less well-founded theoretically. Population-based training, which maintains a diverse population of agents and selects opponents from the population, is a common hybrid approach. This retains the co-evolution of self-play while providing opponent diversity.
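One way to sketch the opponent-selection step: weight candidates by how often they still beat the learner, so training pressure stays on its weaknesses. The `win_rate` table is a hypothetical bookkeeping structure, and the weighting scheme is illustrative rather than any specific published method.

```python
import random

def sample_opponent(population, win_rate, learner_id):
    """Prioritized matchmaking over a population of concurrently training agents:
    opponents the learner loses to more often are sampled more often.
    win_rate[i][j] is agent i's empirical win rate against agent j."""
    candidates = [j for j in population if j != learner_id]
    weights = [1.0 - win_rate[learner_id][j] + 1e-3 for j in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```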

Common Confusions

Watch Out

Self-play is not the same as fictitious play

Fictitious play maintains a model of the opponent's average historical strategy and best-responds to it. Self-play trains against the opponent's current strategy (or a sample from a pool of past strategies). Fictitious play converges in more game classes than naive self-play, but it requires tracking the opponent's strategy history.
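For contrast, a sketch of one fictitious-play step: the learner best-responds to the opponent's empirical average strategy, built from the observed action history rather than the opponent's current policy. The `payoff` table and argument names are illustrative assumptions.

```python
from collections import Counter

def fictitious_play_best_response(payoff, opponent_history, actions):
    """One fictitious-play step: estimate the opponent's average historical
    strategy and return the action with the highest expected payoff against it.
    payoff[a][b] is this player's payoff for action a vs. opponent action b."""
    counts = Counter(opponent_history)
    total = sum(counts.values()) or 1
    empirical = {b: counts[b] / total for b in actions}          # average strategy
    expected = {a: sum(payoff[a][b] * empirical[b] for b in actions) for a in actions}
    return max(expected, key=expected.get)                       # best response
```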

Watch Out

Independent learning can work in practice despite no guarantees

The theory says independent learning can cycle or diverge. In practice, for many multi-agent problems with weak coupling, independent learners converge to reasonable (if not optimal) policies. The gap between theory and practice is large here. But for strongly coupled environments (competitive games, cooperative tasks requiring coordination), independent learning reliably fails.

Watch Out

Nash equilibrium is not always the right target

In cooperative games, the socially optimal outcome may not be a Nash equilibrium. In general-sum games, different Nash equilibria can have wildly different payoffs. Self-play converges to a Nash equilibrium, not necessarily the best one. Additional mechanisms (communication, reward shaping, centralized training) are needed to target specific equilibria.