

Policy Optimization: PPO and TRPO

Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.


Why This Matters

Vanilla policy gradients (REINFORCE) suffer from high variance and destructive large updates. A single bad update can collapse a good policy. TRPO and PPO solve this by constraining how far the new policy can deviate from the old one.

PPO is the workhorse of modern RL. It was used to optimize the policy against a learned reward model in RLHF for ChatGPT and remains central to most subsequent LLM post-training pipelines. Understanding the clipped surrogate objective is necessary for understanding why PPO works and where it breaks.

Formal Setup

Let $\pi_\theta$ be a parameterized policy and $\pi_{\theta_{\text{old}}}$ the current policy from which trajectories were collected. Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

and the advantage estimate $\hat{A}_t$ computed from the collected data (typically via generalized advantage estimation, GAE).
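
As a concrete sketch of these quantities (the array names and the single-trajectory, no-termination setup are illustrative assumptions, not any particular library's API), the ratio and a GAE advantage estimate can be computed from logged log-probabilities, rewards, and value predictions:

```python
import numpy as np

def prob_ratio(logp_new, logp_old):
    """Probability ratio r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), from log-probs."""
    return np.exp(logp_new - logp_old)

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory (no terminal masking).

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T), length T+1 with a bootstrap value at the end.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        last = delta + gamma * lam * last                       # discounted sum of TD errors
        adv[t] = last
    return adv
```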

Definition

Surrogate Objective

The conservative policy iteration (CPI) surrogate objective is:

$$L^{\text{CPI}}(\theta) = \mathbb{E}_t \left[ r_t(\theta)\, \hat{A}_t \right]$$

Maximizing this with respect to $\theta$ using data from $\pi_{\theta_{\text{old}}}$ is equivalent to a policy gradient step via importance sampling. Without constraints, large $r_t(\theta)$ can cause catastrophic updates.
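
Continuing with the same hypothetical arrays, the CPI surrogate is just the importance-weighted advantage averaged over timesteps; a minimal sketch:

```python
import numpy as np

def surrogate_cpi(logp_new, logp_old, advantages):
    """L^CPI: importance-weighted advantages, averaged over the collected timesteps."""
    ratio = np.exp(logp_new - logp_old)
    return np.mean(ratio * advantages)
```

At the first gradient step, where `logp_new == logp_old` and the ratio is 1, the gradient of this quantity reduces to the standard policy gradient; nothing in it bounds how far later steps can push the ratio.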

TRPO: Trust Region Approach

Theorem

TRPO Monotonic Improvement Guarantee

Statement

Let $\eta(\pi)$ denote the expected discounted return of policy $\pi$. Define the KL-constrained update:

$$\theta_{k+1} = \arg\max_\theta \; \mathbb{E}_t\big[r_t(\theta)\, \hat{A}_t\big] \quad \text{subject to} \quad \mathbb{E}_t\big[\mathrm{KL}[\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)]\big] \leq \delta$$

Then for sufficiently small $\delta$, the true return satisfies $\eta(\pi_{\theta_{k+1}}) \geq \eta(\pi_{\theta_k}) - C\delta$ for a constant $C$ depending on the discount factor and the maximum advantage.

Intuition

If you stay close to the old policy (small KL divergence), the surrogate objective is a good local approximation to the true objective. Optimizing the surrogate within a trust region guarantees improvement on the true objective, minus a penalty proportional to the trust region size.

Proof Sketch

Kakade and Langford (2002) showed that $\eta(\pi') = \eta(\pi) + \mathbb{E}_{s \sim d^{\pi'}}\big[\mathbb{E}_{a \sim \pi'}[A^\pi(s,a)]\big]$, where $d^{\pi'}$ is the state visitation distribution under $\pi'$. Schulman et al. replace $d^{\pi'}$ with $d^{\pi}$ (introducing an error controlled by the KL divergence) and bound the approximation error using Pinsker's inequality and the bounded-advantage assumption.

Why It Matters

This is one of the few results in deep RL with a theoretical improvement guarantee. It explains why constraining the policy update to a trust region prevents the catastrophic collapses seen with unconstrained (vanilla) policy gradient methods such as REINFORCE.

Failure Mode

The guarantee requires exact KL constraint enforcement, which TRPO approximates with conjugate gradient and line search. In practice, the approximation can violate the constraint, breaking the guarantee. The computational cost of the second-order optimization is also significant.

TRPO in practice: The KL-constrained optimization is solved approximately using conjugate gradient to compute the natural gradient direction, followed by a line search to enforce the KL constraint. This requires computing Hessian-vector products of the KL divergence, making each update significantly more expensive than a standard gradient step.
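
A rough sketch of that machinery, assuming a user-supplied `hvp(v)` that returns the KL-Hessian-vector product, a flat surrogate gradient `g`, and a hypothetical `surrogate_and_kl(theta)` callback (these names are illustrative, not from the TRPO paper or any library):

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g, where hvp(v) returns H @ v (KL Hessian-vector product)."""
    x = np.zeros_like(g)
    r = g.copy()            # residual
    p = r.copy()            # search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, g, hvp, surrogate_and_kl, delta=0.01, backtrack=10, decay=0.8):
    """One TRPO-style update: natural-gradient direction plus a KL/improvement line search.

    surrogate_and_kl(theta) -> (surrogate estimate, KL to the old policy); hypothetical callback.
    """
    x = conjugate_gradient(hvp, g)                   # x ~= H^{-1} g
    step = np.sqrt(2.0 * delta / (x @ hvp(x))) * x   # scale so the quadratic KL model equals delta
    L_old, _ = surrogate_and_kl(theta)
    for i in range(backtrack):
        theta_new = theta + (decay ** i) * step
        L_new, kl = surrogate_and_kl(theta_new)
        if kl <= delta and L_new > L_old:            # accept the first step satisfying both
            return theta_new
    return theta                                     # otherwise reject the update entirely
```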

PPO: Clipped Surrogate

Proposition

PPO Clipped Surrogate Objective

Statement

PPO maximizes the clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right]$$

where $\mathrm{clip}(r, 1-\epsilon, 1+\epsilon) = \max(1-\epsilon, \min(r, 1+\epsilon))$.

Intuition

The clipped objective removes the incentive for the ratio $r_t(\theta)$ to move far from 1. When the advantage is positive, the objective is $\min(r_t \hat{A}_t, (1+\epsilon)\hat{A}_t)$, so increasing the probability beyond $1+\epsilon$ times the old probability gives no additional reward. When the advantage is negative, decreasing the probability below $1-\epsilon$ times the old probability gives no additional benefit. This creates a soft trust region without any KL computation.
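
A minimal NumPy sketch of the clipped surrogate (to be maximized; most implementations minimize its negation). The array names are hypothetical:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """L^CLIP: per-timestep min of unclipped and clipped terms, averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

The only moving parts relative to the CPI surrogate are the `np.clip` and the elementwise minimum, which is why PPO needs no second-order machinery.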

Proof Sketch

This is a design choice, not a derived result. The justification is empirical: the clipped objective prevents large ratio updates (matching TRPO's intent) while requiring only first-order optimization. No monotonic improvement guarantee exists for PPO, unlike TRPO.

Why It Matters

PPO matches or exceeds TRPO's empirical performance while being drastically simpler to implement. It requires no conjugate gradient solver, no Hessian-vector products, and no line search. This simplicity made it the default algorithm for RLHF in LLM training.

Failure Mode

PPO has no formal monotonic improvement guarantee. The clip can be too loose (allowing destabilizing updates) or too tight (slowing learning). In LLM training, PPO can overoptimize the reward model, producing outputs that score high on the proxy reward but degrade on true quality. This is the reward hacking failure mode.

PPO for LLM Training

In RLHF for language models, PPO optimizes a policy (the LM) against a learned reward model. The setup:

  1. Reference policy $\pi_{\text{ref}}$: the supervised fine-tuned model.
  2. Reward: $r(x, y) = R_\phi(x, y) - \beta\, \mathrm{KL}[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)]$, where $R_\phi$ is the reward model and $\beta$ controls the KL penalty.
  3. PPO update: collect responses $y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$, compute advantages, and update $\theta$ using $L^{\text{CLIP}}$.

The KL penalty to the reference policy prevents the LM from degenerating into reward-hacking outputs. The clip parameter $\epsilon$ (typically 0.2) prevents large single-step changes.
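
One common way to implement the shaped reward, sketched under the assumption that the KL penalty is distributed per token and the reward-model score is added at the final token (names are hypothetical; some pipelines apply the penalty at the sequence level instead):

```python
import numpy as np

def rlhf_rewards(rm_score, logp_policy, logp_ref, beta=0.05):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref) at every token,
    with the reward-model score R_phi(x, y) added at the final token."""
    kl_per_token = logp_policy - logp_ref   # sample-based estimate of the KL contribution per token
    rewards = -beta * kl_per_token
    rewards[-1] += rm_score                 # sparse reward-model score on the last token
    return rewards
```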

Common Confusions

Watch Out

PPO does not have a trust region guarantee

TRPO has a formal monotonic improvement result. PPO has a heuristic clip that empirically behaves like a trust region. The two are different algorithms with different theoretical status. PPO's dominance is entirely empirical.

Watch Out

The KL penalty in RLHF is not the TRPO KL constraint

TRPO constrains the KL between old and new policy within each update step. The RLHF KL penalty is between the current policy and the original reference model, applied as a reward shaping term. These serve different purposes: the TRPO KL stabilizes optimization, while the RLHF KL prevents mode collapse.

Canonical Examples

Example

PPO clip mechanics

Suppose $\hat{A}_t = +1$ (a good action) and $\epsilon = 0.2$. If the ratio $r_t = 1.5$, the unclipped term is $1.5 \times 1 = 1.5$ and the clipped term is $1.2 \times 1 = 1.2$. The min gives $1.2$, a constant in $\theta$, so there is no gradient incentive to push $r_t$ beyond 1.2. If $\hat{A}_t = -1$ (a bad action) and $r_t = 0.7$, the unclipped term is $-0.7$ and the clipped term is $-0.8$. The min gives $-0.8$, again a constant, so there is no gradient incentive to push $r_t$ below 0.8.
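
The arithmetic above can be checked in a few lines of Python (values taken directly from the example):

```python
import numpy as np

def clip_term(ratio, adv, eps=0.2):
    """Per-timestep PPO term: min of unclipped and clipped ratio times advantage."""
    return min(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

print(clip_term(1.5, +1.0))   # 1.2  (positive advantage, ratio clipped at 1 + eps)
print(clip_term(0.7, -1.0))   # -0.8 (negative advantage, ratio clipped at 1 - eps)
```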

Key Takeaways

  • TRPO solves a KL-constrained surrogate objective with second-order methods and has a monotonic improvement guarantee
  • PPO replaces the KL constraint with a clipped ratio, using only first-order optimization
  • PPO has no formal improvement guarantee but matches or beats TRPO empirically
  • In RLHF, PPO optimizes an LM against a reward model with a KL penalty to prevent reward hacking
  • The clip parameter $\epsilon = 0.2$ is standard; the RLHF KL coefficient $\beta$ requires tuning

Exercises

ExerciseCore

Problem

For PPO with $\epsilon = 0.2$, if the advantage $\hat{A}_t = -2$ and the current ratio is $r_t(\theta) = 1.4$, what is the value of the clipped surrogate $\min\big(r_t \hat{A}_t,\; \mathrm{clip}(r_t, 0.8, 1.2)\, \hat{A}_t\big)$?

ExerciseAdvanced

Problem

Explain why PPO can still make destructive updates despite the clip. Consider a scenario where multiple epochs of gradient updates on the same batch of data cause the ratio $r_t$ to drift far from 1, even though each individual step respects the clip.

References

Canonical:

  • Schulman et al., "Trust Region Policy Optimization" (ICML 2015)
  • Schulman et al., "Proximal Policy Optimization Algorithms" (2017)

Current:

  • Ouyang et al., "Training language models to follow instructions with human feedback" (NeurIPS 2022)
  • Engstrom et al., "Implementation Matters in Deep Policy Gradients" (ICLR 2020)

Last reviewed: April 2026