What Each Paradigm Does
On-policy and off-policy are the two fundamental paradigms for how a reinforcement learning agent uses experience data. The distinction is simple but has deep consequences for sample efficiency, stability, and algorithm design.
On-policy: The agent learns from data generated by its current policy $\pi$. After each policy update, all old data is discarded because it was generated by a different policy.
Off-policy: The agent learns from data generated by any policy, including past versions of itself or even a completely different behavior policy $\mu$. Old data is stored in a replay buffer and reused.
Side-by-Side Formulation
On-Policy Value Estimation
The on-policy value of state $s$ under policy $\pi$ is:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$

On-policy methods estimate this by sampling trajectories from $\pi$ itself. SARSA updates Q-values using the action actually taken by $\pi$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$

where $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$.
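As a concrete illustration, the SARSA update can be sketched in tabular form. This is a minimal sketch assuming a dict-backed Q-table and a transition supplied by the caller; the function name and arguments are illustrative, not from any particular library:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA step: bootstrap from the action the policy
    actually took at s_next (on-policy)."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
# Q[(0, 1)] moves toward r + gamma * Q[(s_next, a_next)] = 1.0
```

Note that the caller must supply `a_next`, the action actually chosen at `s_next`; that is what makes the update on-policy.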
Off-Policy Value Estimation
Off-policy methods estimate the value of a target policy $\pi$ using data from a behavior policy $\mu$. Q-learning does this by bootstrapping from the maximum over actions:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

The $\max_{a'}$ makes the update independent of which action $\mu$ actually chose at $s_{t+1}$, enabling off-policy learning without importance sampling corrections.
Where Each Is Stronger
On-policy wins on stability
On-policy data always matches the current policy, so there is no distribution mismatch between the data used for learning and the policy being improved. This eliminates an entire class of instabilities:
- No stale data from old policies corrupting value estimates
- No need for importance sampling corrections that can have high variance
- Policy gradient estimates are unbiased (given enough samples)
PPO, A2C, and REINFORCE are stable and reliable precisely because they only use fresh, on-policy data.
Off-policy wins on sample efficiency
Every transition $(s, a, r, s')$ collected by an off-policy agent can be reused across many gradient updates. A replay buffer of such transitions can be sampled repeatedly, amortizing the cost of environment interaction.
In domains where environment interaction is expensive (robotics, simulation-heavy games, real-world systems), this advantage is decisive. DQN, SAC, and TD3 all exploit replay buffers to learn effectively from limited experience.
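A minimal replay buffer sketch, assuming transitions are plain tuples and uniform sampling (as in vanilla DQN); the class and method names are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions;
    the oldest transitions are evicted first."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in vanilla DQN
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
# Only the 3 most recent transitions remain; each can be sampled many times.
```

Prioritized variants replace the uniform `sample` with sampling proportional to TD error, but the reuse principle is the same.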
The Core Tradeoff
| | On-Policy | Off-Policy |
|---|---|---|
| Data source | Current policy | Any policy (replay buffer) |
| Data reuse | None (discard after update) | Extensive (replay many times) |
| Sample efficiency | Low | High |
| Stability | High (no distributional shift) | Lower (stale data, overestimation) |
| Variance | Can be high (few samples per update) | Lower (many samples available) |
| Bias | Low (data matches policy) | Can be high (distributional mismatch) |
| Classic algorithms | SARSA, REINFORCE, PPO, A2C | Q-learning, DQN, SAC, TD3 |
The Deadly Triad and Off-Policy Instability
The Deadly Triad
Statement
The combination of (1) function approximation, (2) bootstrapping (using estimated values to update estimated values), and (3) off-policy data can cause value estimates to diverge. Any two of the three components can be combined safely; the instability arises only when all three are present simultaneously.
Intuition
Off-policy data means the distribution of states in the replay buffer does not match the distribution under the current policy. Bootstrapping propagates errors from one state to another. Function approximation means correcting errors in one state can create errors in similar states. When all three interact, errors can amplify in a self-reinforcing cycle.
This is why tabular Q-learning converges (no function approximation), why on-policy TD converges with function approximation (no off-policy data), and why DQN required target networks and replay buffer tricks to stabilize deep off-policy learning.
Importance Sampling: Bridging the Gap
When using off-policy data for policy gradient methods, you need importance sampling to correct for the distribution mismatch:

$$\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}$$

The ratio $\rho_t$ corrects for the fact that actions were sampled from $\mu$, not $\pi$. The problem: these ratios can have enormous variance when $\pi$ and $\mu$ differ significantly, especially over long trajectories where the ratios multiply.
This variance explosion is a fundamental reason why pure importance-sampling-based off-policy policy gradients are impractical. Methods like V-trace, Retrace, and Tree-backup truncate or clip these ratios, accepting some bias in exchange for reduced variance.
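A sketch of per-step ratio truncation in the spirit of V-trace, assuming per-step log-probabilities under the target and behavior policies are available; the function name and default clip value are illustrative:

```python
import math

def truncated_ratios(pi_logp, mu_logp, clip=1.0):
    """Per-step importance ratios rho_t = pi(a_t|s_t) / mu(a_t|s_t),
    truncated at `clip`. Truncation bounds the variance of the
    correction at the cost of bias toward the behavior policy."""
    return [min(clip, math.exp(lp - lm)) for lp, lm in zip(pi_logp, mu_logp)]

# A single step where pi assigns e^3 (~20x) more probability than mu:
# the raw ratio is huge, and over a trajectory such ratios multiply.
# Truncation caps each step's contribution.
capped = truncated_ratios([3.0], [0.0])  # raw ratio e^3, capped to 1.0
```

Working in log-probabilities and exponentiating the difference avoids underflow when either policy assigns small probabilities.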
When a Practitioner Would Use Each
LLM alignment with RLHF
Use on-policy (PPO). Each training iteration generates new completions from the current LLM, scores them with a reward model, and updates the policy. On-policy data ensures the reward model evaluates text that the current policy would actually produce, avoiding reward hacking on out-of-distribution text.
Game playing from pixels
Use off-policy (DQN or its variants). Environment simulation is fast, but learning from pixels requires millions of frames. Replay buffers let the agent reuse frames across many updates, and the discrete action space makes Q-learning natural.
Continuous robot control in simulation
Use off-policy (SAC or TD3). Simulation provides unlimited data, but each gradient step should extract as much information as possible. The entropy regularization in SAC also provides automatic exploration in the high-dimensional continuous action space.
Multi-agent competitive settings
Use on-policy (PPO with self-play). The opponent's strategy changes as both agents learn, making old experience from the replay buffer unreliable. On-policy methods naturally adapt because they only use data from the current joint policy.
Common Confusions
Off-policy does not mean the behavior policy is fixed
A common misconception is that off-policy learning requires a separate, fixed behavior policy. In practice, the behavior policy is usually a previous version of the learning agent itself. The replay buffer contains data from many past policy versions. The key property is that the learning algorithm can use this data without requiring it to come from the current policy.
On-policy does not mean one sample per update
On-policy methods can collect large batches of data before updating. PPO typically collects thousands of timesteps per iteration and performs multiple epochs of gradient descent on that batch. The constraint is that the batch must come from the current policy, not that the batch must be small.
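The PPO-style data flow just described can be sketched as a loop: collect a fresh batch from the current policy, run several epochs of minibatch updates on it, then discard it. Everything here (function names, batch sizes, the callback structure) is a hypothetical sketch, not PPO itself:

```python
import random

def on_policy_iteration(policy_update, collect_step,
                        timesteps=2048, epochs=4, minibatch=64):
    """One on-policy iteration: a large fresh batch, several epochs of
    minibatch updates, then the batch is thrown away."""
    batch = [collect_step() for _ in range(timesteps)]  # all from current policy
    for _ in range(epochs):
        random.shuffle(batch)
        for i in range(0, len(batch), minibatch):
            policy_update(batch[i:i + minibatch])
    # batch goes out of scope here: on-policy data is never reused
    # after the policy has changed

counts = []
on_policy_iteration(lambda mb: counts.append(len(mb)), lambda: 0,
                    timesteps=128, epochs=2, minibatch=32)
```

Note the batch is large and reused within the iteration (multiple epochs), but never across iterations; that is the on-policy constraint.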
Experience replay is not the only off-policy technique
Replay buffers are the most common mechanism, but off-policy learning also includes learning from demonstrations (behavior cloning data), learning from other agents, and learning from model-generated data (in model-based RL). Any data source that is not the current policy qualifies.