RL Theory
Policy Optimization: PPO and TRPO
Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.
Why This Matters
Vanilla policy gradients (REINFORCE) suffer from high variance and destructive large updates. A single bad update can collapse a good policy. TRPO and PPO solve this by constraining how far the new policy can deviate from the old one.
PPO is the workhorse of modern RL. It was used to optimize the policy against a learned reward model in RLHF for ChatGPT and most subsequent LLM post-training pipelines. Understanding the clipped surrogate objective is necessary for understanding why PPO works and where it breaks.
Formal Setup
Let $\pi_\theta$ be a parameterized policy and $\pi_{\theta_{\text{old}}}$ the current policy from which we collected trajectories. Define the probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$
and let $\hat{A}_t$ be the advantage estimated from the collected data (typically via GAE, generalized advantage estimation).
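As a concrete reference, GAE over a single trajectory can be sketched in a few lines. This is a minimal pure-Python sketch, not from the source; the function name, input shapes, and the $\gamma$, $\lambda$ defaults are illustrative assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: list of r_t, length T
    values:  list of V(s_t), length T+1 (last entry is the bootstrap value)
    Returns the list of advantage estimates A_hat_t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With $\gamma = \lambda = 1$ this reduces to Monte Carlo advantages (sum of future deltas), which is a quick sanity check.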
Surrogate Objective
The conservative policy iteration (CPI) surrogate objective is:
$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right].$$
Maximizing this with respect to $\theta$ using data from $\pi_{\theta_{\text{old}}}$ is equivalent to a policy gradient step via importance sampling. Without constraints, large values of $r_t(\theta)$ can cause catastrophic updates.
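The empirical surrogate is straightforward to compute from stored log-probabilities. A hedged sketch (function name and inputs are illustrative, not from the source):

```python
import math

def cpi_surrogate(logp_new, logp_old, advantages):
    """Empirical CPI surrogate: mean over timesteps of r_t * A_hat_t,
    with the importance ratio computed in log space for stability."""
    terms = [math.exp(ln - lo) * a
             for ln, lo, a in zip(logp_new, logp_old, advantages)]
    return sum(terms) / len(terms)
```

When `logp_new == logp_old` the ratio is 1 everywhere and the surrogate reduces to the mean advantage, which is the first sanity check to run.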
TRPO: Trust Region Approach
TRPO Monotonic Improvement Guarantee
Statement
Let $\eta(\pi)$ denote the expected discounted return of policy $\pi$. Define the KL-constrained update:
$$\theta_{\text{new}} = \arg\max_\theta \; L^{\text{CPI}}(\theta) \quad \text{s.t.} \quad \bar{D}_{\text{KL}}\!\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\right) \le \delta.$$
Then the true return satisfies the lower bound
$$\eta(\pi_\theta) \;\ge\; \eta(\pi_{\theta_{\text{old}}}) + L^{\text{CPI}}(\theta) - C\, D_{\text{KL}}^{\max}\!\left(\pi_{\theta_{\text{old}}}, \pi_\theta\right),$$
for a constant $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ depending on the discount factor $\gamma$ and the maximum advantage $\epsilon = \max_{s,a}\lvert A(s,a)\rvert$. For sufficiently small $\delta$, the constrained update therefore satisfies $\eta(\pi_{\theta_{\text{new}}}) \ge \eta(\pi_{\theta_{\text{old}}})$.
Intuition
If you stay close to the old policy (small KL divergence), the surrogate objective is a good local approximation to the true objective. Optimizing the surrogate within a trust region guarantees improvement on the true objective, minus a penalty proportional to the trust region size.
Proof Sketch
Kakade and Langford (2002) showed that
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s \sim \rho_{\tilde{\pi}},\, a \sim \tilde{\pi}}\!\left[A_\pi(s, a)\right],$$
where $\rho_{\tilde{\pi}}$ is the state visitation distribution under $\tilde{\pi}$. Schulman et al. replace $\rho_{\tilde{\pi}}$ with $\rho_\pi$ (introducing error controlled by the KL divergence between the policies) and bound the approximation error using Pinsker's inequality and the bounded advantage assumption.
Why It Matters
This is one of the few results in deep RL with a theoretical improvement guarantee. It explains why constraining the policy update to a trust region prevents the catastrophic collapses seen with vanilla policy gradient methods such as REINFORCE, which have no trust region.
Failure Mode
The guarantee requires exact KL constraint enforcement, which TRPO approximates with conjugate gradient and line search. In practice, the approximation can violate the constraint, breaking the guarantee. The computational cost of the second-order optimization is also significant.
TRPO in practice: The KL-constrained optimization is solved approximately using conjugate gradient to compute the natural gradient direction, followed by a line search to enforce the KL constraint. This requires computing Hessian-vector products of the KL divergence, making each update significantly more expensive than a standard gradient step.
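The natural-gradient step can be sketched with a generic conjugate gradient solver that needs only Hessian-vector products. This is a pure-Python illustration under stated assumptions: in a real implementation the `hvp` closure would come from double backprop through the mean KL, and the result would feed the line search.

```python
def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g given only the map v -> H v.

    In TRPO, H is the KL Hessian (Fisher matrix) and g the policy
    gradient; the solution x is the natural gradient direction.
    Operates on plain Python lists for illustration.
    """
    x = [0.0] * len(g)
    r = list(g)          # residual g - H x (x starts at 0)
    p = list(g)          # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        hp = hvp(p)
        alpha = rs_old / sum(pi * hi for pi, hi in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

For an n-dimensional problem, exact CG converges in at most n iterations; TRPO truncates at ~10 iterations, which is one source of the constraint-violation failure mode above.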
PPO: Clipped Surrogate
PPO Clipped Surrogate Objective
Statement
PPO maximizes the clipped surrogate objective:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where $\epsilon$ is the clip parameter (typically $0.1$ or $0.2$).
Intuition
The clipped objective removes the incentive for the ratio $r_t(\theta)$ to move far from 1. When the advantage is positive, the objective is capped at $(1+\epsilon)\hat{A}_t$, so increasing the probability beyond $1+\epsilon$ times the old probability gives no additional reward. When the advantage is negative, decreasing the probability below $1-\epsilon$ times the old probability gives no additional reward. This creates a soft trust region without any KL computation.
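The full objective fits in a few lines. A sketch on plain Python lists, assuming `ratios` holds $r_t(\theta)$ already computed from log-probabilities:

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))  # clip the ratio
        total += min(r * a, clipped * a)             # pessimistic min
    return total / len(ratios)
```

Note the min is taken after multiplying by the advantage, so it is pessimistic in both directions: the unclipped term is kept only when it is worse.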
Proof Sketch
This is a design choice, not a derived result. The justification is empirical: the clipped objective prevents large ratio updates (matching TRPO's intent) while requiring only first-order optimization. No monotonic improvement guarantee exists for PPO, unlike TRPO.
Why It Matters
PPO achieves comparable or better empirical performance to TRPO while being drastically simpler to implement. It requires no conjugate gradient solver, no Hessian-vector products, and no line search. This simplicity made it the default algorithm for RLHF in LLM training.
Failure Mode
PPO has no formal monotonic improvement guarantee. The clip can be too loose (allowing destabilizing updates) or too tight (slowing learning). In LLM training, PPO can overoptimize the reward model, producing outputs that score high on the proxy reward but degrade on true quality. This is the reward hacking failure mode.
PPO for LLM Training
In RLHF for language models, PPO optimizes a policy $\pi_\theta$ (the LM) against a learned reward model. The setup:
- Reference policy $\pi_{\text{ref}}$: the supervised fine-tuned model.
- Reward: $R(x, y) = r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, where $r_\phi$ is the reward model and $\beta$ controls the KL penalty.
- PPO update: collect responses $y \sim \pi_\theta(\cdot \mid x)$, compute advantages, update using $L^{\text{CLIP}}$.
The KL penalty to the reference policy prevents the LM from degenerating into reward-hacking outputs. The clip parameter $\epsilon$ (typically 0.2) prevents large single-step changes.
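One common shaping of this reward applies the KL term per token and adds the reward-model score on the final token. A sketch under that assumption; the exact placement of the score and the form of the KL estimate vary across implementations:

```python
def rlhf_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token RLHF rewards: -beta * (log pi_theta - log pi_ref)
    at every token, plus the reward-model score on the last token.

    logp_policy, logp_ref: per-token log-probs of the sampled response
    under the current policy and the frozen reference model.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # sequence-level score lands on the final token
    return rewards
```

Tokens where the policy is more confident than the reference (`lp > lr`) are penalized, which is exactly the pressure that keeps $\pi_\theta$ near $\pi_{\text{ref}}$.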
Common Confusions
PPO does not have a trust region guarantee
TRPO has a formal monotonic improvement result. PPO has a heuristic clip that empirically behaves like a trust region. The two are different algorithms with different theoretical status. PPO's dominance is entirely empirical.
The KL penalty in RLHF is not the TRPO KL constraint
TRPO constrains the KL between old and new policy within each update step. The RLHF KL penalty is between the current policy and the original reference model, applied as a reward shaping term. These serve different purposes: the TRPO KL stabilizes optimization, while the RLHF KL prevents mode collapse.
Canonical Examples
PPO clip mechanics
Suppose $\hat{A}_t = +1$ (a good action) and $\epsilon = 0.2$. If the ratio $r_t = 1.5$, the unclipped term is $1.5$ and the clipped term is $1.2$. The min gives $1.2$, so the objective offers no incentive to push the ratio beyond $1.2$. If $\hat{A}_t = -1$ (a bad action) and $r_t = 0.5$, the unclipped term is $-0.5$ and the clipped term is $-0.8$. The min gives $-0.8$, so the objective offers no incentive to push the ratio below $0.8$.
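Both cases can be verified directly with a self-contained sketch of the per-timestep clipped term:

```python
def clipped_term(r, a, eps=0.2):
    # Per-timestep PPO term: min(r*A, clip(r, 1-eps, 1+eps)*A)
    clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
    return min(r * a, clipped_r * a)

print(clipped_term(1.5, 1.0))   # good action, ratio above 1+eps -> 1.2
print(clipped_term(0.5, -1.0))  # bad action, ratio below 1-eps -> -0.8
```

In both cases the min selects the clipped (constant) branch, so the gradient with respect to the ratio is zero past the clip boundary.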
Key Takeaways
- TRPO solves a KL-constrained surrogate objective with second-order methods and has a monotonic improvement guarantee
- PPO replaces the KL constraint with a clipped ratio, using only first-order optimization
- PPO has no formal improvement guarantee but matches or beats TRPO empirically
- In RLHF, PPO optimizes an LM against a reward model with a KL penalty to prevent reward hacking
- The clip parameter $\epsilon = 0.2$ is standard; the RLHF KL coefficient $\beta$ requires tuning
Exercises
Problem
For PPO with clip parameter $\epsilon = 0.2$, if the advantage $\hat{A}_t$ is positive and the current ratio satisfies $r_t(\theta) > 1 + \epsilon$, what is the value of the clipped surrogate term for that timestep, in terms of $\hat{A}_t$?
Problem
Explain why PPO can still make destructive updates despite the clip. Consider a scenario where multiple epochs of gradient updates on the same batch of data cause the ratio to drift far from 1, even though each individual step respects the clip.
References
Canonical:
- Schulman et al., "Trust Region Policy Optimization" (ICML 2015)
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
Current:
- Ouyang et al., "Training language models to follow instructions with human feedback" (NeurIPS 2022)
- Engstrom et al., "Implementation Matters in Deep Policy Gradients" (ICLR 2020)
Next Topics
- DPO vs GRPO vs RL reasoning: modern alternatives to PPO for LLM alignment
- Post-training overview: the full pipeline from SFT through RLHF
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Actor-Critic Methods (Layer 3)
- Q-Learning (Layer 2)
- Value Iteration and Policy Iteration (Layer 2)