What Each Algorithm Does
Both PPO and SAC are actor-critic methods that maintain a policy (actor) and a value function (critic). They both aim to find a policy that maximizes expected cumulative reward. The difference is how they do it: what objective they optimize, how they use data, and what auxiliary goals they pursue.
PPO (Proximal Policy Optimization) is an on-policy method that constrains policy updates to stay close to the current policy using a clipped surrogate objective.
SAC (Soft Actor-Critic) is an off-policy method that maximizes a modified objective including an entropy bonus, encouraging exploration while learning from a replay buffer.
Side-by-Side Objectives
PPO Clipped Surrogate Objective
PPO maximizes:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the estimated advantage. The clipping with $\epsilon$ prevents the policy from changing too much in a single update.
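As a concrete sketch, the clipped surrogate can be computed from per-sample log-probabilities and advantages. This is an illustrative NumPy version (function and argument names are my own; `eps=0.2` is the common default from the PPO paper), not a drop-in from any library:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximizing min(unclipped, clipped) == minimizing its negation.
    return -np.mean(np.minimum(unclipped, clipped))
```

Note that when the new and old policies coincide, the ratio is 1 everywhere and the loss reduces to the (negated) mean advantage; the clip only bites once the policy has drifted from the one that collected the data.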
SAC Maximum Entropy Objective
SAC maximizes the entropy-augmented return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t \big(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\big)\right]$$

where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the policy entropy and $\alpha$ is a temperature parameter (often automatically tuned). The entropy bonus explicitly encourages exploration and leads to more robust policies.
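The entropy term shows up concretely in the soft Bellman target that the Q-networks regress toward. A minimal scalar sketch, assuming SAC's standard twin-Q setup (the function name and default `alpha`/`gamma` values are illustrative, not from a particular library):

```python
def soft_q_target(reward, next_q1, next_q2, next_logp,
                  alpha=0.2, gamma=0.99, done=False):
    """One-step soft Bellman target: r + gamma * (min(Q1, Q2) - alpha * log pi).

    Taking the min of the two target Q-values mitigates overestimation;
    subtracting alpha * log pi adds the entropy bonus to the backed-up value.
    """
    soft_value = min(next_q1, next_q2) - alpha * next_logp
    return reward + (0.0 if done else gamma * soft_value)
```

In a real implementation these are batched tensor operations over target networks, but the structure is the same: a low log-probability (high-entropy) next action makes the target larger, so the learned Q-function values stochastic behavior.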
Where Each Is Stronger
PPO wins on simplicity and breadth
PPO works with both discrete and continuous action spaces. Its implementation is straightforward: collect a batch of trajectories, compute advantages using GAE, take a few gradient steps on the clipped objective, then discard the data. The clipping mechanism is simple to implement and does not require careful tuning of Lagrange multipliers or dual variables.
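The advantage-estimation step mentioned above can be sketched in a few lines. This is an illustrative GAE implementation (names and defaults are my own; `lam=0.95` and `gamma=0.99` are common choices), computed backward over one trajectory:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    `values` has one extra entry: the bootstrap value of the final state.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running  # discounted sum of TD errors
        adv[t] = running
    return adv
```

With `lam=0` this collapses to one-step TD errors; with `lam=1` it becomes the full Monte Carlo advantage. The single parameter trades bias against variance.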
PPO is the default algorithm for RLHF in large language models. The discrete token action space, the need for stable updates near a reference policy, and the computational cost of generating samples all favor PPO's on-policy design.
SAC wins on sample efficiency and continuous control
SAC stores all experience in a replay buffer and reuses it across many gradient updates. This makes it far more sample-efficient than PPO, which discards data after each policy update. For robotics tasks where each environment interaction is expensive, this advantage is decisive.
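The replay buffer itself is simple. A minimal FIFO sketch (class and method names are illustrative, not from a specific framework):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer for off-policy transitions."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; real systems may use prioritized schemes instead.
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```

Each environment step adds one transition, but every gradient step draws a fresh batch from the whole buffer, which is exactly why SAC can take many more gradient steps per environment interaction than PPO.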
The entropy bonus provides automatic exploration without the need for explicit exploration strategies (epsilon-greedy, noise injection). SAC policies are also more robust to perturbations because the entropy regularization encourages stochastic, multi-modal behavior.
Key Assumptions That Differ
| | PPO | SAC |
|---|---|---|
| Data usage | On-policy (discard after update) | Off-policy (replay buffer) |
| Action space | Discrete or continuous | Continuous (standard); discrete variants exist but are less common |
| Objective | Clipped surrogate (stay near old policy) | Maximum entropy (explore while exploiting) |
| Exploration | Implicit via stochastic policy | Explicit via entropy bonus |
| Sample efficiency | Low (each sample used a few times) | High (each sample reused many times) |
| Update stability | Clipping prevents large updates | Dual Q-networks and entropy tuning stabilize |
The Stability-Efficiency Tradeoff
On-Policy Stability vs. Off-Policy Efficiency
Statement
PPO makes small, stable updates but discards data after each policy iteration, so reaching a near-optimal policy on typical benchmarks requires many more environment interactions. SAC reuses data via replay buffers, often achieving similar performance with substantially fewer environment interactions, but requires careful management of stale data and distributional shift between the replay buffer and the current policy.
Intuition
On-policy methods are stable because the data distribution always matches the current policy. Off-policy methods are efficient because they reuse data, but the mismatch between the data-collecting policy and the current policy introduces bias that must be controlled. SAC handles this via the entropy regularization (which keeps the policy from collapsing to a deterministic one that diverges far from the replay distribution) and twin Q-networks (which mitigate overestimation bias).
When a Practitioner Would Use Each
RLHF for language models
Use PPO. The action space is discrete (tokens), the reward signal comes from a reward model that is expensive to query, and stability is paramount because the policy must stay close to a supervised fine-tuned reference. PPO's clipping objective naturally enforces this proximity. The KL penalty variant of PPO is standard in RLHF pipelines.
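The KL-penalized reward shaping used in RLHF pipelines can be sketched as follows. This is an illustrative version assuming the common setup (per-token KL penalty against a frozen reference model, with the reward-model score added at the final token); the function name and `beta` default are my own:

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: KL penalty every token, reward-model score at the end.

    `logp_policy` / `logp_ref` are per-token log-probs of the sampled
    response under the trained policy and the frozen reference model.
    """
    rewards = -beta * (logp_policy - logp_ref)  # penalize drift from reference
    rewards[-1] += rm_score                     # sparse score at sequence end
    return rewards
```

The penalty term is what keeps the policy near the supervised fine-tuned reference; `beta` controls how tightly.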
Robotic manipulation
Use SAC. Continuous joint-torque control, expensive real-world samples, and the need for robust multi-modal behavior all favor SAC. The entropy bonus helps the robot discover diverse grasping strategies rather than collapsing to a single brittle approach.
Atari and discrete games
Use PPO. SAC was not designed for discrete spaces, and while discrete SAC variants exist, PPO with GAE remains the standard baseline for discrete-action environments. PPO's simplicity and reliability make it the default choice.
Locomotion with sim-to-real transfer
Use SAC for learning in simulation (where sample efficiency matters less but entropy-regularized policies transfer better), then fine-tune with PPO in the real world (where stability and minimal real-world samples are critical). This hybrid approach is common in robotics research.
Common Confusions
PPO is not a trust region method
PPO is often described as an approximation to TRPO (Trust Region Policy Optimization). While inspired by TRPO, PPO does not solve a constrained optimization problem. The clipping is a heuristic that is empirically effective in continuous-control and RLHF training (Schulman et al., 2017), but it provides no hard guarantee on the KL divergence between old and new policies. For guaranteed trust regions, use TRPO.
SAC entropy is not just exploration noise
The entropy bonus in SAC is not merely a trick for exploration. It changes the optimal policy from deterministic to stochastic. The maximum entropy policy is provably more robust to model misspecification and environmental perturbations. This is a feature, not a workaround.
Sample efficiency is not the same as wall-clock efficiency
SAC is more sample-efficient (fewer environment interactions), but each gradient step is more expensive because it involves sampling from a replay buffer, updating two Q-networks, and updating the policy and temperature. PPO's simpler update can be faster per iteration. Total wall-clock time depends on the relative cost of environment steps vs. gradient steps.