

Policy Optimization: PPO and TRPO

Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.


Why This Matters

Vanilla policy gradients (REINFORCE) suffer from high variance and destructive large updates. A single bad update can collapse a good policy. TRPO and PPO solve this by constraining how far the new policy can deviate from the old one.

PPO is the workhorse of modern RL. It was used to optimize the policy against a learned reward model in RLHF for ChatGPT and remains central to most subsequent LLM post-training pipelines. Understanding the clipped surrogate objective is necessary for understanding why PPO works and where it breaks.

Formal Setup

Let $\pi_\theta$ be a parameterized policy and $\pi_{\theta_{\text{old}}}$ the current policy from which trajectories were collected. Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

and the advantage estimate $\hat{A}_t$ computed from the collected data (typically via generalized advantage estimation, GAE).
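
As a concrete sketch of these quantities (the array names and the single-trajectory, no-termination setup are illustrative assumptions, not any particular library's API), the ratio and a GAE advantage estimate can be computed from logged log-probabilities, rewards, and value predictions:

```python
import numpy as np

def prob_ratio(logp_new, logp_old):
    """Probability ratio r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), from log-probs."""
    return np.exp(logp_new - logp_old)

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory (no terminal masking).

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T), length T+1 with a bootstrap value at the end.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        last = delta + gamma * lam * last                       # discounted sum of TD errors
        adv[t] = last
    return adv
```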

Definition

Surrogate Objective

The conservative policy iteration (CPI) surrogate objective is:

$$L^{\text{CPI}}(\theta) = \mathbb{E}_t \left[ r_t(\theta)\, \hat{A}_t \right]$$

Maximizing this with respect to $\theta$ using data from $\pi_{\theta_{\text{old}}}$ is equivalent to a policy gradient step via importance sampling. Without constraints, large $r_t(\theta)$ can cause catastrophic updates.
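
Continuing with the same hypothetical arrays, the CPI surrogate is just the importance-weighted advantage averaged over timesteps; a minimal sketch:

```python
import numpy as np

def surrogate_cpi(logp_new, logp_old, advantages):
    """L^CPI: importance-weighted advantages, averaged over the collected timesteps."""
    ratio = np.exp(logp_new - logp_old)
    return np.mean(ratio * advantages)
```

At the first gradient step, where `logp_new == logp_old` and the ratio is 1, the gradient of this quantity reduces to the standard policy gradient; nothing in it bounds how far later steps can push the ratio.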

TRPO: Trust Region Approach

Theorem

TRPO Monotonic Improvement Guarantee

Statement

Let $\eta(\pi)$ denote the expected discounted return of policy $\pi$. Define the KL-constrained update:

$$\theta_{k+1} = \arg\max_\theta \; \mathbb{E}_t\big[r_t(\theta)\, \hat{A}_t\big] \quad \text{subject to} \quad \mathbb{E}_t\big[\mathrm{KL}[\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)]\big] \leq \delta$$

Then for sufficiently small $\delta$, the true return satisfies $\eta(\pi_{\theta_{k+1}}) \geq \eta(\pi_{\theta_k}) - C\delta$ for a constant $C$ depending on the discount factor and the maximum advantage.

Intuition

If you stay close to the old policy (small KL divergence), the surrogate objective is a good local approximation to the true objective. Optimizing the surrogate within a trust region guarantees improvement on the true objective, minus a penalty proportional to the trust region size.

Proof Sketch

Kakade and Langford (2002) showed that $\eta(\pi') = \eta(\pi) + \mathbb{E}_{s \sim d^{\pi'}}\big[\mathbb{E}_{a \sim \pi'}[A^\pi(s,a)]\big]$, where $d^{\pi'}$ is the state visitation distribution under $\pi'$. Schulman et al. replace $d^{\pi'}$ with $d^{\pi}$ (introducing an error controlled by the KL divergence) and bound the approximation error using Pinsker's inequality and the bounded-advantage assumption.

Why It Matters

This is one of the few results in deep RL with a theoretical improvement guarantee. It explains why constraining the policy update to a trust region prevents the catastrophic collapses seen with unconstrained (vanilla) policy gradient methods such as REINFORCE.

Failure Mode

The guarantee requires exact KL constraint enforcement, which TRPO approximates with conjugate gradient and line search. In practice, the approximation can violate the constraint, breaking the guarantee. The computational cost of the second-order optimization is also significant.

TRPO in practice: The KL-constrained optimization is solved approximately using conjugate gradient to compute the natural gradient direction, followed by a line search to enforce the KL constraint. This requires computing Hessian-vector products of the KL divergence, making each update significantly more expensive than a standard gradient step.
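
A rough sketch of that machinery, assuming a user-supplied `hvp(v)` that returns the KL-Hessian-vector product, a flat surrogate gradient `g`, and a hypothetical `surrogate_and_kl(theta)` callback (these names are illustrative, not from the TRPO paper or any library):

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g, where hvp(v) returns H @ v (KL Hessian-vector product)."""
    x = np.zeros_like(g)
    r = g.copy()            # residual
    p = r.copy()            # search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, g, hvp, surrogate_and_kl, delta=0.01, backtrack=10, decay=0.8):
    """One TRPO-style update: natural-gradient direction plus a KL/improvement line search.

    surrogate_and_kl(theta) -> (surrogate estimate, KL to the old policy); hypothetical callback.
    """
    x = conjugate_gradient(hvp, g)                   # x ~= H^{-1} g
    step = np.sqrt(2.0 * delta / (x @ hvp(x))) * x   # scale so the quadratic KL model equals delta
    L_old, _ = surrogate_and_kl(theta)
    for i in range(backtrack):
        theta_new = theta + (decay ** i) * step
        L_new, kl = surrogate_and_kl(theta_new)
        if kl <= delta and L_new > L_old:            # accept the first step satisfying both
            return theta_new
    return theta                                     # otherwise reject the update entirely
```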

PPO: Clipped Surrogate

Proposition

PPO Clipped Surrogate Objective

Statement

PPO maximizes the clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right]$$

where $\mathrm{clip}(r, 1-\epsilon, 1+\epsilon) = \max(1-\epsilon, \min(r, 1+\epsilon))$.

Intuition

The clipped objective removes the incentive for the ratio $r_t(\theta)$ to move far from 1. When the advantage is positive, the objective is $\min(r_t \hat{A}_t, (1+\epsilon)\hat{A}_t)$, so increasing the probability beyond $1+\epsilon$ times the old probability gives no additional reward. When the advantage is negative, decreasing the probability below $1-\epsilon$ times the old probability gives no additional benefit. This creates a soft trust region without any KL computation.
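
A minimal NumPy sketch of the clipped surrogate (to be maximized; most implementations minimize its negation). The array names are hypothetical:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """L^CLIP: per-timestep min of unclipped and clipped terms, averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

The only moving parts relative to the CPI surrogate are the `np.clip` and the elementwise minimum, which is why PPO needs no second-order machinery.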

Proof Sketch

This is a design choice, not a derived result. The justification is empirical: the clipped objective prevents large ratio updates (matching TRPO's intent) while requiring only first-order optimization. No monotonic improvement guarantee exists for PPO, unlike TRPO.

Why It Matters

PPO matches or exceeds TRPO's empirical performance while being drastically simpler to implement. It requires no conjugate gradient solver, no Hessian-vector products, and no line search. This simplicity made it the default algorithm for RLHF in LLM training.

Failure Mode

PPO has no formal monotonic improvement guarantee. The clip can be too loose (allowing destabilizing updates) or too tight (slowing learning). In LLM training, PPO can overoptimize the reward model, producing outputs that score high on the proxy reward but degrade on true quality. This is the reward hacking failure mode.

PPO for LLM Training

In RLHF for language models, PPO optimizes a policy (the LM) against a learned reward model. The setup:

  1. Reference policy $\pi_{\text{ref}}$: the supervised fine-tuned model.
  2. Reward: $r(x, y) = R_\phi(x, y) - \beta\, \mathrm{KL}[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)]$, where $R_\phi$ is the reward model and $\beta$ controls the KL penalty.
  3. PPO update: collect responses $y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$, compute advantages, and update $\theta$ using $L^{\text{CLIP}}$.

The KL penalty to the reference policy prevents the LM from degenerating into reward-hacking outputs. The clip parameter $\epsilon$ (typically 0.2) prevents large single-step changes.
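
One common way to implement the shaped reward, sketched under the assumption that the KL penalty is distributed per token and the reward-model score is added at the final token (names are hypothetical; some pipelines apply the penalty at the sequence level instead):

```python
import numpy as np

def rlhf_rewards(rm_score, logp_policy, logp_ref, beta=0.05):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref) at every token,
    with the reward-model score R_phi(x, y) added at the final token."""
    kl_per_token = logp_policy - logp_ref   # sample-based estimate of the KL contribution per token
    rewards = -beta * kl_per_token
    rewards[-1] += rm_score                 # sparse reward-model score on the last token
    return rewards
```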

Common Confusions

Watch Out

PPO does not have a trust region guarantee

TRPO has a formal monotonic improvement result. PPO has a heuristic clip that empirically behaves like a trust region. The two are different algorithms with different theoretical status. PPO's dominance is entirely empirical.

Watch Out

The KL penalty in RLHF is not the TRPO KL constraint

TRPO constrains the KL between old and new policy within each update step. The RLHF KL penalty is between the current policy and the original reference model, applied as a reward shaping term. These serve different purposes: the TRPO KL stabilizes optimization, while the RLHF KL prevents mode collapse.

Canonical Examples

Example

PPO clip mechanics

Suppose $\hat{A}_t = +1$ (a good action) and $\epsilon = 0.2$. If the ratio $r_t = 1.5$, the unclipped term is $1.5 \times 1 = 1.5$ and the clipped term is $1.2 \times 1 = 1.2$. The min gives $1.2$, a constant in $\theta$, so there is no gradient incentive to push $r_t$ beyond 1.2. If $\hat{A}_t = -1$ (a bad action) and $r_t = 0.7$, the unclipped term is $-0.7$ and the clipped term is $-0.8$. The min gives $-0.8$, again a constant, so there is no gradient incentive to push $r_t$ below 0.8.
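
The arithmetic above can be checked in a few lines of Python (values taken directly from the example):

```python
import numpy as np

def clip_term(ratio, adv, eps=0.2):
    """Per-timestep PPO term: min of unclipped and clipped ratio times advantage."""
    return min(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

print(clip_term(1.5, +1.0))   # 1.2  (positive advantage, ratio clipped at 1 + eps)
print(clip_term(0.7, -1.0))   # -0.8 (negative advantage, ratio clipped at 1 - eps)
```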

Key Takeaways

  • TRPO solves a KL-constrained surrogate objective with second-order methods and has a monotonic improvement guarantee
  • PPO replaces the KL constraint with a clipped ratio, using only first-order optimization
  • PPO has no formal improvement guarantee but matches or beats TRPO empirically
  • In RLHF, PPO optimizes an LM against a reward model with a KL penalty to prevent reward hacking
  • The clip parameter $\epsilon = 0.2$ is standard; the RLHF KL coefficient $\beta$ requires tuning

Exercises

ExerciseCore

Problem

For PPO with $\epsilon = 0.2$, if the advantage $\hat{A}_t = -2$ and the current ratio is $r_t(\theta) = 1.4$, what is the value of the clipped surrogate $\min\big(r_t \hat{A}_t,\; \mathrm{clip}(r_t, 0.8, 1.2)\, \hat{A}_t\big)$?

ExerciseAdvanced

Problem

Explain why PPO can still make destructive updates despite the clip. Consider a scenario where multiple epochs of gradient updates on the same batch of data cause the ratio $r_t$ to drift far from 1, even though each individual step respects the clip.

References

Canonical:

  • Schulman et al., "Trust Region Policy Optimization" (ICML 2015)
  • Schulman et al., "Proximal Policy Optimization Algorithms" (2017)

Current:

  • Ouyang et al., "Training language models to follow instructions with human feedback" (NeurIPS 2022)
  • Engstrom et al., "Implementation Matters in Deep Policy Gradients" (ICLR 2020)

Last reviewed: April 2026