What Each Algorithm Does
Both PPO and SAC are actor-critic methods that maintain a policy (actor) and a value function (critic). They both aim to find a policy that maximizes expected cumulative reward. The difference is how they do it: what objective they optimize, how they use data, and what auxiliary goals they pursue.
PPO (Proximal Policy Optimization) is an on-policy method that constrains policy updates to stay close to the current policy using a clipped surrogate objective.
SAC (Soft Actor-Critic) is an off-policy method that maximizes a modified objective including an entropy bonus, encouraging exploration while learning from a replay buffer.
Side-by-Side Objectives
PPO Clipped Surrogate Objective
PPO maximizes:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the estimated advantage. The clipping with $\epsilon$ prevents the policy from changing too much in a single update.
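As a concrete sketch, the clipped surrogate can be computed from per-sample log-probabilities and advantages. This is an illustrative NumPy version (function and argument names are my own; `eps=0.2` is the common default from the PPO paper), not a drop-in from any library:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximizing min(unclipped, clipped) == minimizing its negation.
    return -np.mean(np.minimum(unclipped, clipped))
```

Note that when the new and old policies coincide, the ratio is 1 everywhere and the loss reduces to the (negated) mean advantage; the clip only bites once the policy has drifted from the one that collected the data.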
SAC Maximum Entropy Objective
SAC maximizes the entropy-augmented return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t \big(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\big)\right]$$

where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the policy entropy and $\alpha$ is a temperature parameter (often automatically tuned). The entropy bonus explicitly encourages exploration and leads to more robust policies.
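The entropy term shows up concretely in the soft Bellman target that the Q-networks regress toward. A minimal scalar sketch, assuming SAC's standard twin-Q setup (the function name and default `alpha`/`gamma` values are illustrative, not from a particular library):

```python
def soft_q_target(reward, next_q1, next_q2, next_logp,
                  alpha=0.2, gamma=0.99, done=False):
    """One-step soft Bellman target: r + gamma * (min(Q1, Q2) - alpha * log pi).

    Taking the min of the two target Q-values mitigates overestimation;
    subtracting alpha * log pi adds the entropy bonus to the backed-up value.
    """
    soft_value = min(next_q1, next_q2) - alpha * next_logp
    return reward + (0.0 if done else gamma * soft_value)
```

In a real implementation these are batched tensor operations over target networks, but the structure is the same: a low log-probability (high-entropy) next action makes the target larger, so the learned Q-function values stochastic behavior.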
Where Each Is Stronger
PPO wins on simplicity and breadth
PPO works with both discrete and continuous action spaces. Its implementation is straightforward: collect a batch of trajectories, compute advantages using GAE, take a few gradient steps on the clipped objective, then discard the data. The clipping mechanism is simple to implement and does not require careful tuning of Lagrange multipliers or dual variables.
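The advantage-estimation step mentioned above can be sketched in a few lines. This is an illustrative GAE implementation (names and defaults are my own; `lam=0.95` and `gamma=0.99` are common choices), computed backward over one trajectory:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    `values` has one extra entry: the bootstrap value of the final state.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running  # discounted sum of TD errors
        adv[t] = running
    return adv
```

With `lam=0` this collapses to one-step TD errors; with `lam=1` it becomes the full Monte Carlo advantage. The single parameter trades bias against variance.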
PPO is the default algorithm for RLHF in large language models. The discrete token action space, the need for stable updates near a reference policy, and the computational cost of generating samples all favor PPO's on-policy design.
SAC wins on sample efficiency and continuous control
SAC stores all experience in a replay buffer and reuses it across many gradient updates. This makes it far more sample-efficient than PPO, which discards data after each policy update. For robotics tasks where each environment interaction is expensive, this advantage is decisive.
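The replay buffer itself is simple. A minimal FIFO sketch (class and method names are illustrative, not from a specific framework):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer for off-policy transitions."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; real systems may use prioritized schemes instead.
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```

Each environment step adds one transition, but every gradient step draws a fresh batch from the whole buffer, which is exactly why SAC can take many more gradient steps per environment interaction than PPO.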
The entropy bonus provides automatic exploration without the need for explicit exploration strategies (epsilon-greedy, noise injection). SAC policies are also more robust to perturbations because the entropy regularization encourages stochastic, multi-modal behavior.
Key Assumptions That Differ
| | PPO | SAC |
|---|---|---|
| Data usage | On-policy (discard after update) | Off-policy (replay buffer) |
| Action space | Discrete or continuous | Continuous (standard); discrete variants exist but are less common |
| Objective | Clipped surrogate (stay near old policy) | Maximum entropy (explore while exploiting) |
| Exploration | Implicit via stochastic policy | Explicit via entropy bonus |
| Sample efficiency | Low (each sample used a few times) | High (each sample reused many times) |
| Update stability | Clipping prevents large updates | Dual Q-networks and entropy tuning stabilize |
The Stability-Efficiency Tradeoff
On-Policy Stability vs. Off-Policy Efficiency
Statement
PPO makes small, stable updates but discards data after each policy iteration, so reaching a near-optimal policy on typical benchmarks requires many more environment interactions. SAC reuses data via replay buffers, often achieving similar performance with substantially fewer environment interactions, but requires careful management of stale data and distributional shift between the replay buffer and the current policy.
Intuition
On-policy methods are stable because the data distribution always matches the current policy. Off-policy methods are efficient because they reuse data, but the mismatch between the data-collecting policy and the current policy introduces bias that must be controlled. SAC handles this via the entropy regularization (which keeps the policy from collapsing to a deterministic one that diverges far from the replay distribution) and twin Q-networks (which mitigate overestimation bias).
When a Practitioner Would Use Each
RLHF for language models
Use PPO. The action space is discrete (tokens), the reward signal comes from a reward model that is expensive to query, and stability is paramount because the policy must stay close to a supervised fine-tuned reference. PPO's clipping objective naturally enforces this proximity. The KL penalty variant of PPO is standard in RLHF pipelines.
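The KL-penalized reward shaping used in RLHF pipelines can be sketched as follows. This is an illustrative version assuming the common setup (per-token KL penalty against a frozen reference model, with the reward-model score added at the final token); the function name and `beta` default are my own:

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: KL penalty every token, reward-model score at the end.

    `logp_policy` / `logp_ref` are per-token log-probs of the sampled
    response under the trained policy and the frozen reference model.
    """
    rewards = -beta * (logp_policy - logp_ref)  # penalize drift from reference
    rewards[-1] += rm_score                     # sparse score at sequence end
    return rewards
```

The penalty term is what keeps the policy near the supervised fine-tuned reference; `beta` controls how tightly.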
Robotic manipulation
Use SAC. Continuous joint-torque control, expensive real-world samples, and the need for robust multi-modal behavior all favor SAC. The entropy bonus helps the robot discover diverse grasping strategies rather than collapsing to a single brittle approach.
Atari and discrete games
Use PPO. SAC was not designed for discrete spaces, and while discrete SAC variants exist, PPO with GAE remains the standard baseline for discrete-action environments. PPO's simplicity and reliability make it the default choice.
Locomotion with sim-to-real transfer
Use SAC for learning in simulation (where sample efficiency matters less but entropy-regularized policies transfer better), then fine-tune with PPO in the real world (where stability and minimal real-world samples are critical). This hybrid approach is common in robotics research.
Common Confusions
PPO is not a trust region method
PPO is often described as an approximation to TRPO (Trust Region Policy Optimization). While inspired by TRPO, PPO does not solve a constrained optimization problem. The clipping is a heuristic that is empirically effective in continuous-control and RLHF training (Schulman et al., 2017), but it provides no hard guarantee on the KL divergence between old and new policies. For guaranteed trust regions, use TRPO.
SAC entropy is not just exploration noise
The entropy bonus in SAC is not merely a trick for exploration. It changes the optimal policy from deterministic to stochastic. The maximum entropy policy is provably more robust to model misspecification and environmental perturbations. This is a feature, not a workaround.
Sample efficiency is not the same as wall-clock efficiency
SAC is more sample-efficient (fewer environment interactions), but each gradient step is more expensive because it involves sampling from a replay buffer, updating two Q-networks, and updating the policy and temperature. PPO's simpler update can be faster per iteration. Total wall-clock time depends on the relative cost of environment steps vs. gradient steps.