What Each Paradigm Does
On-policy and off-policy are the two fundamental paradigms for how a reinforcement learning agent uses experience data. The distinction is simple but has deep consequences for sample efficiency, stability, and algorithm design.
On-policy: The agent learns from data generated by its current policy $\pi$. After each policy update, all old data is discarded because it was generated by a different policy.
Off-policy: The agent learns from data generated by any policy, including past versions of itself or even a completely different behavior policy $\mu$. Old data is stored in a replay buffer and reused.
Side-by-Side Formulation
On-Policy Value Estimation
The on-policy value of state $s$ under policy $\pi$ is:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$

On-policy methods estimate this by sampling trajectories from $\pi$ itself. SARSA updates Q-values using the action actually taken by $\pi$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$

where $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$.
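As a concrete illustration, the SARSA update can be sketched in tabular form. This is a minimal sketch assuming a dict-backed Q-table and a transition supplied by the caller; the function name and arguments are illustrative, not from any particular library:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA step: bootstrap from the action the policy
    actually took at s_next (on-policy)."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
# Q[(0, 1)] moves toward r + gamma * Q[(s_next, a_next)] = 1.0
```

Note that the caller must supply `a_next`, the action actually chosen at `s_next`; that is what makes the update on-policy.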
Off-Policy Value Estimation
Off-policy methods estimate the value of a target policy $\pi$ using data from a behavior policy $\mu$. Q-learning does this by bootstrapping from the maximum over actions:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

The $\max_{a'}$ makes the update independent of which action $\mu$ actually chose at $s_{t+1}$, enabling off-policy learning without importance sampling corrections.
Where Each Is Stronger
On-policy wins on stability
On-policy data always matches the current policy, so there is no distribution mismatch between the data used for learning and the policy being improved. This eliminates an entire class of instabilities:
- No stale data from old policies corrupting value estimates
- No need for importance sampling corrections that can have high variance
- Policy gradient estimates are unbiased (given enough samples)
PPO, A2C, and REINFORCE are stable and reliable precisely because they only use fresh, on-policy data.
Off-policy wins on sample efficiency
Every transition $(s, a, r, s')$ collected by an off-policy agent can be reused across many gradient updates. A replay buffer of such transitions can be sampled repeatedly, amortizing the cost of environment interaction.
In domains where environment interaction is expensive (robotics, simulation-heavy games, real-world systems), this advantage is decisive. DQN, SAC, and TD3 all exploit replay buffers to learn effectively from limited experience.
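A minimal replay buffer sketch, assuming transitions are plain tuples and uniform sampling (as in vanilla DQN); the class and method names are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions;
    the oldest transitions are evicted first."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in vanilla DQN
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
# Only the 3 most recent transitions remain; each can be sampled many times.
```

Prioritized variants replace the uniform `sample` with sampling proportional to TD error, but the reuse principle is the same.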
The Core Tradeoff
| | On-Policy | Off-Policy |
|---|---|---|
| Data source | Current policy | Any policy (replay buffer) |
| Data reuse | None (discard after update) | Extensive (replay many times) |
| Sample efficiency | Low | High |
| Stability | High (no distributional shift) | Lower (stale data, overestimation) |
| Variance | Can be high (few samples per update) | Lower (many samples available) |
| Bias | Low (data matches policy) | Can be high (distributional mismatch) |
| Classic algorithms | SARSA, REINFORCE, PPO, A2C | Q-learning, DQN, SAC, TD3 |
The Deadly Triad and Off-Policy Instability
The Deadly Triad
Statement
The combination of (1) function approximation, (2) bootstrapping (using estimated values to update estimated values), and (3) off-policy data can cause value estimates to diverge. Any two of the three components can be combined safely; the instability arises only when all three are present simultaneously.
Intuition
Off-policy data means the distribution of states in the replay buffer does not match the distribution under the current policy. Bootstrapping propagates errors from one state to another. Function approximation means correcting errors in one state can create errors in similar states. When all three interact, errors can amplify in a self-reinforcing cycle.
This is why tabular Q-learning converges (no function approximation), why on-policy TD converges with function approximation (no off-policy data), and why DQN required target networks and replay buffer tricks to stabilize deep off-policy learning.
Importance Sampling: Bridging the Gap
When using off-policy data for policy gradient methods, you need importance sampling to correct for the distribution mismatch:

$$\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}$$

The ratio $\rho_t$ corrects for the fact that actions were sampled from $\mu$, not $\pi$. The problem: these ratios can have enormous variance when $\pi$ and $\mu$ differ significantly, especially over long trajectories where the ratios multiply.
This variance explosion is a fundamental reason why pure importance-sampling-based off-policy policy gradients are impractical. Methods like V-trace, Retrace, and Tree-backup truncate or clip these ratios, accepting some bias in exchange for reduced variance.
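A sketch of per-step ratio truncation in the spirit of V-trace, assuming per-step log-probabilities under the target and behavior policies are available; the function name and default clip value are illustrative:

```python
import math

def truncated_ratios(pi_logp, mu_logp, clip=1.0):
    """Per-step importance ratios rho_t = pi(a_t|s_t) / mu(a_t|s_t),
    truncated at `clip`. Truncation bounds the variance of the
    correction at the cost of bias toward the behavior policy."""
    return [min(clip, math.exp(lp - lm)) for lp, lm in zip(pi_logp, mu_logp)]

# A single step where pi assigns e^3 (~20x) more probability than mu:
# the raw ratio is huge, and over a trajectory such ratios multiply.
# Truncation caps each step's contribution.
capped = truncated_ratios([3.0], [0.0])  # raw ratio e^3, capped to 1.0
```

Working in log-probabilities and exponentiating the difference avoids underflow when either policy assigns small probabilities.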
When a Practitioner Would Use Each
LLM alignment with RLHF
Use on-policy (PPO). Each training iteration generates new completions from the current LLM, scores them with a reward model, and updates the policy. On-policy data ensures the reward model evaluates text that the current policy would actually produce, avoiding reward hacking on out-of-distribution text.
Game playing from pixels
Use off-policy (DQN or its variants). Environment simulation is fast, but learning from pixels requires millions of frames. Replay buffers let the agent reuse frames across many updates, and the discrete action space makes Q-learning natural.
Continuous robot control in simulation
Use off-policy (SAC or TD3). Simulation provides unlimited data, but each gradient step should extract as much information as possible. The entropy regularization in SAC also provides automatic exploration in the high-dimensional continuous action space.
Multi-agent competitive settings
Use on-policy (PPO with self-play). The opponent's strategy changes as both agents learn, making old experience from the replay buffer unreliable. On-policy methods naturally adapt because they only use data from the current joint policy.
Common Confusions
Off-policy does not mean the behavior policy is fixed
A common misconception is that off-policy learning requires a separate, fixed behavior policy. In practice, the behavior policy is usually a previous version of the learning agent itself. The replay buffer contains data from many past policy versions. The key property is that the learning algorithm can use this data without requiring it to come from the current policy.
On-policy does not mean one sample per update
On-policy methods can collect large batches of data before updating. PPO typically collects thousands of timesteps per iteration and performs multiple epochs of gradient descent on that batch. The constraint is that the batch must come from the current policy, not that the batch must be small.
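The PPO-style data flow just described can be sketched as a loop: collect a fresh batch from the current policy, run several epochs of minibatch updates on it, then discard it. Everything here (function names, batch sizes, the callback structure) is a hypothetical sketch, not PPO itself:

```python
import random

def on_policy_iteration(policy_update, collect_step,
                        timesteps=2048, epochs=4, minibatch=64):
    """One on-policy iteration: a large fresh batch, several epochs of
    minibatch updates, then the batch is thrown away."""
    batch = [collect_step() for _ in range(timesteps)]  # all from current policy
    for _ in range(epochs):
        random.shuffle(batch)
        for i in range(0, len(batch), minibatch):
            policy_update(batch[i:i + minibatch])
    # batch goes out of scope here: on-policy data is never reused
    # after the policy has changed

counts = []
on_policy_iteration(lambda mb: counts.append(len(mb)), lambda: 0,
                    timesteps=128, epochs=2, minibatch=32)
```

Note the batch is large and reused within the iteration (multiple epochs), but never across iterations; that is the on-policy constraint.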
Experience replay is not the only off-policy technique
Replay buffers are the most common mechanism, but off-policy learning also includes learning from demonstrations (behavior cloning data), learning from other agents, and learning from model-generated data (in model-based RL). Any data source that is not the current policy qualifies.