What Each Does
Both self-play and independent learning are strategies for training agents in multi-agent environments. They differ in how they handle the presence of other agents.
Self-play trains an agent by having it play against copies of itself (or past versions of itself). The opponent changes as the agent improves, creating a non-stationary training environment. The agent and opponent co-evolve.
Independent learning trains each agent separately using a standard single-agent RL algorithm (Q-learning, policy gradient, etc.), treating all other agents as part of the environment. Each agent ignores the fact that other agents are also learning.
Side-by-Side Statement
Self-Play
Agent $A$ plays against agent $B$, where $B$'s policy $\pi_{\text{opp}}$ is a copy of $A$'s current policy $\pi_t$ or a past version $\pi_{t-k}$. Training alternates between:
- Generate episodes with $\pi_t$ vs. $\pi_{\text{opp}}$
- Update $\pi_t$ using the collected data
- Update $\pi_{\text{opp}}$ (or sample from a pool of past policies)
The Nash equilibrium concept is the target: find $\pi^*$ such that $\pi^*$ is a best response to itself.
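The alternation above can be sketched as a generic loop. This is an illustrative sketch only: `update`, `make_episode`, and the policy objects are placeholders for whatever algorithm and representation you use, not any specific system's API.

```python
import random

def self_play_step(policy, update, make_episode, pool, pool_prob=0.5, n_episodes=8):
    """One iteration of a generic self-play loop (illustrative only).

    policy       -- the learner's current policy (any object)
    update       -- callable(policy, episodes) -> improved policy
    make_episode -- callable(learner, opponent) -> trajectory data
    pool         -- mutable list of past policy snapshots
    """
    # With probability pool_prob, face a sampled past self; otherwise the current self.
    opponent = (random.choice(pool)
                if pool and random.random() < pool_prob else policy)
    episodes = [make_episode(policy, opponent) for _ in range(n_episodes)]
    policy = update(policy, episodes)
    pool.append(policy)  # snapshot the improved policy for future opponents
    return policy
```

The snapshot pool is optional for plain self-play but, as discussed below, diversity of opponents matters in practice.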
Independent Learning
Each agent $i$ runs its own RL algorithm, treating the joint actions of all other agents $-i$ as part of the environment transition:

$$P_i(s' \mid s, a_i) = \sum_{a_{-i}} P(s' \mid s, a_i, a_{-i}) \prod_{j \neq i} \pi_j(a_j \mid s)$$

Agent $i$ updates $\pi_i$ using its own reward signal. It does not model or observe other agents' policies, rewards, or updates.
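As a concrete (hypothetical) instance, two stateless Q-learners can be run independently on a repeated prisoner's dilemma. Note that each update touches only that agent's own action and reward; the other agent is invisible, folded into the reward distribution:

```python
import random
from collections import defaultdict

# Prisoner's dilemma payoffs: action 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

def independent_q_learning(steps=2000, alpha=0.1, eps=0.2, seed=0):
    """Two independent stateless Q-learners on a repeated matrix game."""
    rng = random.Random(seed)
    q = [defaultdict(float), defaultdict(float)]  # per-agent action values
    for _ in range(steps):
        # Each agent acts epsilon-greedily on its OWN Q-values only.
        acts = tuple(
            rng.randrange(2) if rng.random() < eps
            else max((0, 1), key=lambda a: q[i][a])
            for i in range(2)
        )
        rewards = PAYOFF[acts]
        for i in range(2):
            # The other agent's action never appears in this update.
            q[i][acts[i]] += alpha * (rewards[i] - q[i][acts[i]])
    return q
```

Run long enough, both learners typically settle on mutual defection, the game's unique Nash equilibrium; nothing in the update accounts for the opponent learning at the same time.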
Where Each Is Stronger
Self-play wins in two-player zero-sum games
For two-player zero-sum games (chess, Go, poker), self-play has strong theoretical and empirical support. If the learning algorithm converges to a best response against the current opponent, and the opponent pool is diverse enough, self-play converges to a Nash equilibrium.
AlphaGo, AlphaZero, and OpenAI Five all used self-play. In these settings, the Nash equilibrium is the minimax-optimal strategy, so converging to it is the right goal.
Independent learning wins on simplicity
Independent learning requires no coordination between agents. Each agent runs a standard RL algorithm without modification. There is no need to maintain opponent pools, synchronize training, or handle the non-stationarity explicitly. For problems where agents interact weakly, independent learning can work well with minimal engineering.
Where Each Fails
Self-play fails in general-sum games
In general-sum games (e.g., coordination games, social dilemmas), the Nash equilibrium may not be unique, and not all Nash equilibria are desirable. Self-play can converge to a poor equilibrium or cycle between strategies. There is no guarantee that the co-evolution leads to cooperative or socially optimal behavior.
Self-play fails with insufficient diversity
If the agent only trains against its current self, it can develop a narrow strategy that exploits specific weaknesses of that exact policy. When faced with a different opponent, the strategy collapses. This is why practical self-play systems maintain a league of past policies (as in AlphaStar) to ensure robustness.
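A minimal sketch of pool-based opponent sampling, using a made-up prioritized weighting (inspired by, but not identical to, AlphaStar's league matchmaking):

```python
import random

def sample_opponent(pool, win_rate, rng=random):
    """Sample a past snapshot, weighting opponents the learner still loses to
    more heavily (a PFSP-style heuristic; not AlphaStar's exact scheme).

    pool     -- list of past policies
    win_rate -- win_rate[i]: learner's current win rate vs. pool[i]
    """
    # Hard opponents (low win rate) get quadratically larger weight;
    # the small constant keeps every snapshot sampleable.
    weights = [(1.0 - w) ** 2 + 1e-3 for w in win_rate]
    r = rng.random() * sum(weights)
    for policy, w in zip(pool, weights):
        r -= w
        if r <= 0:
            return policy
    return pool[-1]
```

Sampling hard opponents more often directs training at the current policy's remaining weaknesses, while the floor weight preserves exposure to opponents the learner already beats.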
Independent learning fails on non-stationarity
From agent $i$'s perspective, the environment includes agents $j \neq i$, which are all changing their policies. This violates the stationarity assumption of single-agent RL theory. Q-learning convergence proofs require a stationary MDP; when other agents learn simultaneously, the MDP changes at every step. Independent learners can cycle indefinitely in simple matrix games like matching pennies.
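The matching-pennies cycle can be made concrete with alternating best responses, an idealized stand-in for fully greedy independent learners:

```python
# Matching pennies: row player wants to match (H,H)/(T,T); column wants to mismatch.
ROW_PAYOFF = {('H', 'H'): 1, ('H', 'T'): -1, ('T', 'H'): -1, ('T', 'T'): 1}

def best_response_row(col_action):
    return max('HT', key=lambda a: ROW_PAYOFF[(a, col_action)])

def best_response_col(row_action):
    # Column's payoff is the negative of the row payoff (zero-sum).
    return max('HT', key=lambda a: -ROW_PAYOFF[(row_action, a)])

def alternating_best_response(start=('H', 'H'), steps=8):
    """Each player in turn best-responds to the other's last action."""
    row, col = start
    history = [(row, col)]
    for t in range(steps):
        if t % 2 == 0:
            col = best_response_col(row)   # column reacts to row
        else:
            row = best_response_row(col)   # row reacts to column
        history.append((row, col))
    return history
```

Starting from (H, H), play visits (H, T), (T, T), (T, H), and returns to (H, H): a period-4 cycle that never settles at the mixed equilibrium.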
Independent learning fails to coordinate
In cooperative settings where agents must coordinate (e.g., both must choose action A simultaneously for a reward), independent learners have no mechanism to align their policies. Each agent's optimal action depends on the other's, and without communication or shared training, convergence to coordinated behavior is not guaranteed.
Key Assumptions That Differ
| | Self-Play | Independent Learning |
|---|---|---|
| Opponent model | Copy of self (or past self) | None (other agents are part of env) |
| Stationarity | Explicitly non-stationary | Assumes stationarity (incorrectly) |
| Target concept | Nash equilibrium | Best response to fixed env |
| Training cost | Higher (must maintain opponent pool) | Lower (standard single-agent RL) |
| Convergence | Guaranteed in 2P zero-sum (under conditions) | No general guarantees |
| Coordination | Implicit through co-evolution | None |
The Core Issue: Non-Stationarity
Non-Stationarity of Independent Learning
Statement
When $n$ agents simultaneously update their policies using independent learning, agent $i$ faces a non-stationary environment. The effective transition kernel for agent $i$ at time $t$ is:

$$P_i^{(t)}(s' \mid s, a_i) = \sum_{a_{-i}} P(s' \mid s, a_i, a_{-i}) \prod_{j \neq i} \pi_j^{(t)}(a_j \mid s)$$

Because each $\pi_j^{(t)}$ changes at every step $t$, $P_i^{(t)}$ is time-varying. Standard RL convergence results (which require a fixed MDP) do not apply. In contrast, self-play explicitly accounts for opponent non-stationarity by training against the evolving opponent.
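A toy numerical check of this marginalization, with a single state, two agents, and made-up transition probabilities: as agent 1's policy shifts, agent 0's effective kernel changes even though the joint kernel stays fixed.

```python
# P[a0][a1][s'] : joint next-state probabilities for the single state
# (illustrative numbers only).
P = {
    0: {0: [0.9, 0.1], 1: [0.3, 0.7]},
    1: {0: [0.6, 0.4], 1: [0.2, 0.8]},
}

def effective_kernel(P, pi1):
    """Marginalize agent 1's policy out of the joint kernel:
    P_0(s' | a0) = sum_{a1} P(s' | a0, a1) * pi_1(a1)."""
    return {a0: [sum(pi1[a1] * P[a0][a1][sp] for a1 in (0, 1))
                 for sp in (0, 1)]
            for a0 in (0, 1)}

k_early = effective_kernel(P, {0: 0.9, 1: 0.1})  # agent 1 early in training
k_late  = effective_kernel(P, {0: 0.2, 1: 0.8})  # agent 1 after learning
# Agent 0's "environment" has changed even though P itself did not.
```

Both effective kernels are valid transition distributions (rows sum to 1), but they differ: from agent 0's point of view, the world's dynamics moved.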
Intuition
Independent learning pretends the world is stationary when it is not. Each agent is trying to hit a moving target: the optimal policy depends on what other agents do, and other agents keep changing. Self-play embraces this non-stationarity by making the opponent's evolution part of the training process.
What to Memorize
- Self-play: Train against yourself. Works in two-player zero-sum games. Needs opponent diversity.
- Independent learning: Ignore other agents. Simple but violates stationarity. Can cycle.
- The convergence divide: Self-play converges to Nash equilibrium in two-player zero-sum (under mild conditions). Independent learning has no such guarantee.
- Practical lesson: If you have a zero-sum competitive game, use self-play with a league of past policies. If agents interact weakly, independent learning may suffice.
When a Researcher Would Use Each
Training a game-playing AI
For chess, Go, or poker, use self-play with a diverse opponent pool. The two-player zero-sum structure guarantees that the Nash equilibrium is the minimax strategy, and self-play has proven empirically effective at finding it. AlphaZero used self-play exclusively, with no human game data.
Traffic signal control
For decentralized traffic control where intersections are managed by separate agents, independent learning can work if intersections interact weakly. Each intersection optimizes its own signal timing based on local traffic. The coupling between intersections is indirect and often small enough that non-stationarity is manageable.
Competitive multiplayer with more than 2 players
For games with $n > 2$ players, pure self-play is less well-founded theoretically. Population-based training, which maintains a diverse population of agents and selects opponents from the population, is a common hybrid approach. This retains the co-evolution of self-play while providing opponent diversity.
Common Confusions
Self-play is not the same as fictitious play
Fictitious play maintains a model of the opponent's average historical strategy and best-responds to it. Self-play trains against the opponent's current strategy (or a sample from a pool of past strategies). Fictitious play converges in more game classes than naive self-play, but it requires tracking the opponent's strategy history.
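A minimal fictitious-play sketch on matching pennies, with arbitrary priors. Each player best-responds to the opponent's empirical action frequencies rather than to its current policy, which is the contrast with naive self-play:

```python
from collections import Counter

# Matching pennies, row player's payoff; column gets the negative.
ROW_PAYOFF = {('H', 'H'): 1, ('H', 'T'): -1, ('T', 'H'): -1, ('T', 'T'): 1}

def fictitious_play(rounds=20000):
    """Both players simultaneously best-respond to the other's
    empirical (historical) action frequencies."""
    row_counts = Counter({'H': 1})   # arbitrary non-empty priors
    col_counts = Counter({'T': 1})
    for _ in range(rounds):
        # Row best-responds to column's historical frequencies.
        n = sum(col_counts.values())
        row = max('HT', key=lambda a: sum(
            col_counts[b] / n * ROW_PAYOFF[(a, b)] for b in 'HT'))
        # Column best-responds to row's historical frequencies.
        m = sum(row_counts.values())
        col = max('HT', key=lambda a: sum(
            row_counts[b] / m * -ROW_PAYOFF[(b, a)] for b in 'HT'))
        row_counts[row] += 1
        col_counts[col] += 1
    return (row_counts['H'] / sum(row_counts.values()),
            col_counts['H'] / sum(col_counts.values()))
```

Here the empirical frequencies drift toward the (0.5, 0.5) mixed equilibrium, consistent with the convergence of fictitious play in zero-sum games, whereas naive alternating best response cycles forever.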
Independent learning can work in practice despite no guarantees
The theory says independent learning can cycle or diverge. In practice, for many multi-agent problems with weak coupling, independent learners converge to reasonable (if not optimal) policies. The gap between theory and practice is large here. But for strongly coupled environments (competitive games, cooperative tasks requiring coordination), independent learning reliably fails.
Nash equilibrium is not always the right target
In cooperative games, the socially optimal outcome may not be a Nash equilibrium. In general-sum games, different Nash equilibria can have wildly different payoffs. Self-play converges to a Nash equilibrium, not necessarily the best one. Additional mechanisms (communication, reward shaping, centralized training) are needed to target specific equilibria.