Actor-Critic Methods
The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.
Why This Matters
REINFORCE has unbiased gradients but impractical variance. Q-learning has low variance but cannot handle continuous actions and suffers from the deadly triad. Actor-critic methods combine the best of both: a critic (value function) reduces variance, while an actor (policy network) handles continuous and high-dimensional action spaces.
PPO, the most widely used actor-critic algorithm, trains the RLHF stage of ChatGPT, Claude, and virtually every instruction-tuned language model. SAC is the workhorse for continuous robotic control. Understanding actor-critic is understanding how modern AI systems are trained.
Mental Model
The actor proposes actions. The critic evaluates them. The actor improves by increasing the probability of actions the critic says are better than average (positive advantage) and decreasing the probability of actions that are worse than average (negative advantage). The critic improves by getting better at predicting how much total reward a state will yield.
This is policy iteration made differentiable: the critic does approximate policy evaluation, the actor does approximate policy improvement.
Formal Setup and Notation
We work in the MDP framework with a parameterized policy $\pi_\theta$ (the actor) and a learned value function $V_\phi$ (the critic). The advantage function is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
TD Error
The temporal difference (TD) error at time $t$ is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This is a one-step estimate of the advantage: $\mathbb{E}[\delta_t \mid s_t, a_t] = A^\pi(s_t, a_t)$ when $V = V^\pi$.
Generalized Advantage Estimation
Generalized Advantage Estimation (GAE) is an exponentially weighted sum of multi-step TD errors:

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}$$

where $\lambda \in [0, 1]$ controls the bias-variance tradeoff:
- $\lambda = 0$: one-step TD advantage (low variance, high bias)
- $\lambda = 1$: Monte Carlo advantage (high variance, low bias)
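The GAE recursion is easiest to see in code. A minimal sketch (the function name and toy numbers below are illustrative, not from the source): compute each one-step TD error, then accumulate backward with decay $\gamma\lambda$.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward-pass GAE: A_t = delta_t + gamma * lam * A_{t+1}.

    `values` has len(rewards) + 1 entries (bootstrap value at the end).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# lam = 0 recovers one-step TD errors; lam = 1 recovers Monte Carlo advantages.
td_adv = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=0.0)
mc_adv = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
```

Note that the single backward pass computes all $T$ advantages in $O(T)$ time, which is why this form (rather than the infinite sum) appears in implementations.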
Main Theorems
GAE Bias-Variance Tradeoff
Statement
Define the $n$-step advantage estimator:

$$\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n}) - V(s_t)$$

The GAE estimator is the $\lambda$-weighted average:

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}$$

When $V = V^\pi$ (perfect critic), all estimators are unbiased. When $V \neq V^\pi$, the bias of the $n$-step estimator is $\gamma^n \, \mathbb{E}[V(s_{t+n}) - V^\pi(s_{t+n})]$: longer rollouts reduce bias from an imperfect critic. The variance grows with $n$ because more random rewards are included. $\lambda$ interpolates between these extremes.
Intuition
A perfect critic gives zero-variance advantage estimates via one-step TD. An imperfect critic introduces bias, which you can reduce by relying more on actual observed rewards (larger $\lambda$). GAE provides a smooth dial between trusting the critic ($\lambda = 0$) and trusting the data ($\lambda = 1$). In practice, $\lambda \approx 0.95$ is a common default.
Why It Matters
Without GAE, you must choose between one-step TD (biased but stable) and Monte Carlo returns (unbiased but noisy). GAE gives a principled middle ground and is used in every modern policy gradient implementation, including PPO.
A2C and A3C
A2C (Advantage Actor-Critic) runs multiple parallel environments synchronously:
- Collect $n$-step rollouts from $K$ parallel environments
- Compute GAE advantages $\hat{A}_t$ for each rollout
- Update the actor: $\theta \leftarrow \theta + \alpha \, \hat{A}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- Update the critic: minimize $(V_\phi(s_t) - \hat{R}_t)^2$, where $\hat{R}_t$ is the bootstrapped return target
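One synchronous update can be sketched on a hypothetical tabular toy problem (two actions, hand-picked learning rates; this is an illustration of the two update rules, not a full A2C implementation): the actor is a per-state logit table with a softmax, the critic a value table trained toward the bootstrapped target.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = {0: [0.0, 0.0]}       # actor parameters for state 0
values = {0: 0.0, 1: 0.0}      # critic value table
gamma, lr_actor, lr_critic = 0.99, 0.1, 0.5

# One observed transition: state 0, action 1, reward +1, next state 1.
s, a, r, s_next = 0, 1, 1.0, 1
target = r + gamma * values[s_next]     # bootstrapped return target
advantage = target - values[s]          # one-step advantage (GAE with lam = 0)

# Critic update: gradient step on the squared error toward the target.
values[s] += lr_critic * (target - values[s])

# Actor update: d/d logit_j of log softmax is (1[j == a] - pi_j).
pi = softmax(logits[s])
for j in range(2):
    logits[s][j] += lr_actor * advantage * ((1.0 if j == a else 0.0) - pi[j])
```

After the update, the logit of the rewarded action rises and the other falls, so the policy shifts probability toward the positive-advantage action, exactly as the update rule prescribes.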
A3C (Asynchronous Advantage Actor-Critic) uses asynchronous workers that each maintain a local copy of the parameters and send gradient updates to a shared parameter server. A3C was historically important (2016) but has been largely replaced by synchronous A2C and PPO, which are simpler and equally effective with modern hardware.
PPO: The Practical Standard
PPO Clipped Surrogate Objective
Statement
Define the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$. The PPO-Clip objective is:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \, \hat{A}_t \right) \right]$$

where $\epsilon$ (typically 0.1 to 0.2) limits how far the new policy can deviate from the old one in a single update.
Effect of the clipping:
- When $\hat{A}_t > 0$ (good action): the objective is $\min(r_t(\theta), 1 + \epsilon) \, \hat{A}_t$, capping the benefit of increasing the action probability
- When $\hat{A}_t < 0$ (bad action): the objective is $\max(r_t(\theta), 1 - \epsilon) \, \hat{A}_t$, capping the penalty reduction from decreasing the action probability
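Both cases fall out of one per-sample expression. A minimal sketch (the helper name and the ratio/advantage values are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Good action (A > 0): pushing the ratio past 1 + eps yields no extra objective.
capped_gain = ppo_clip_objective(2.0, 1.0)    # (1 + eps) * A = 1.2, not 2.0
# Bad action (A < 0): the objective is floored at (1 - eps) * A.
capped_cut = ppo_clip_objective(0.5, -1.0)    # (1 - eps) * A = -0.8, not -0.5
```

In both capped cases the objective is a constant in `ratio`, so its gradient with respect to the policy parameters is zero: that vanishing gradient is what prevents destructively large updates.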
Intuition
TRPO solves a constrained optimization problem (maximize surrogate subject to KL constraint), which requires second-order methods. PPO approximates this with a simple clipping trick: if the policy ratio moves too far from 1, the gradient is zeroed out, preventing destructively large updates. The result is a first-order algorithm that is nearly as stable as TRPO and much simpler to implement.
Why It Matters
PPO is the algorithm used in RLHF for training instruction-following language models. Its simplicity (no KL constraint optimization, no conjugate gradients) makes it the default choice whenever policy optimization is needed. The clipping mechanism is the key innovation. It provides a soft trust region without the computational overhead of TRPO.
The full PPO loss combines three terms:

$$L(\theta) = \mathbb{E}_t \left[ L^{\mathrm{CLIP}}(\theta) - c_1 L^{\mathrm{VF}}(\theta) + c_2 S[\pi_\theta](s_t) \right]$$

where $L^{\mathrm{VF}}(\theta) = (V_\theta(s_t) - \hat{R}_t)^2$ is the value function loss and $S$ is an entropy bonus encouraging exploration.
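Assembled as a scalar loss to minimize (signs flip because optimizers descend), a sketch with hypothetical coefficient values:

```python
def ppo_loss(clip_obj, value_pred, value_target, entropy, c1=0.5, c2=0.01):
    """Scalar PPO loss to minimize: -L_clip + c1 * value MSE - c2 * entropy.

    c1 and c2 are illustrative coefficients, not canonical values.
    """
    value_loss = (value_pred - value_target) ** 2
    return -clip_obj + c1 * value_loss - c2 * entropy
```

The entropy term enters with a negative sign so that minimizing the loss increases entropy, discouraging premature collapse to a deterministic policy.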
SAC: Maximum Entropy RL
Soft Bellman Equation
Statement
In maximum entropy RL, the objective augments the standard return with an entropy term:

$$J(\pi) = \mathbb{E}_\pi \left[ \sum_t \gamma^t \left( r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right) \right]$$

The corresponding soft Bellman equation is:

$$Q_{\mathrm{soft}}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'} \left[ \mathbb{E}_{a' \sim \pi} \left[ Q_{\mathrm{soft}}(s', a') - \alpha \log \pi(a' \mid s') \right] \right]$$

The optimal policy is the Boltzmann distribution: $\pi^*(a \mid s) \propto \exp(Q_{\mathrm{soft}}(s, a) / \alpha)$.
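For a discrete action space, the Boltzmann policy is just a temperature-scaled softmax over Q-values. A sketch (function name and Q-values are illustrative), showing how the temperature moves the policy between uniform and greedy:

```python
import math

def boltzmann_policy(q_values, alpha):
    """pi(a|s) proportional to exp(Q(s, a) / alpha); alpha is the temperature."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

q = [1.0, 2.0, 0.0]
soft = boltzmann_policy(q, alpha=10.0)    # high temperature: near-uniform
greedy = boltzmann_policy(q, alpha=0.01)  # low temperature: near-greedy
```

Subtracting the maximum before exponentiating leaves the normalized probabilities unchanged but avoids overflow at small temperatures.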
Intuition
SAC adds a reward for being uncertain. The agent is encouraged to find all good actions, not just the single best one. This leads to better exploration and more robust policies. The temperature $\alpha$ controls how much randomness the policy retains: high $\alpha$ means more exploration, while as $\alpha \to 0$ the policy approaches the standard (deterministic) optimal policy.
Why It Matters
SAC achieves state-of-the-art performance on continuous control benchmarks (robotic locomotion, manipulation). Its key advantages: (1) it is off-policy (uses a replay buffer, so it is sample-efficient), (2) the entropy bonus provides automatic exploration, and (3) it avoids the brittle hyperparameter tuning of epsilon-greedy or noise-based exploration.
SAC maintains three networks: a policy (actor) and two Q-functions (critics, using the double-Q trick to prevent overestimation), and optionally learns the temperature $\alpha$ automatically by targeting a desired entropy level.
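The soft TD target used to train both critics can be sketched as follows (function name and numbers are illustrative): take the minimum of the two target critics' predictions at the next state-action, subtract the entropy term, and discount.

```python
def sac_target(reward, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2):
    """Soft TD target with the clipped double-Q trick.

    min(q1, q2) suppresses overestimation; subtracting alpha * log pi(a'|s')
    adds the entropy bonus to the bootstrapped value.
    """
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * soft_value

# If one critic overestimates (3.0 vs 2.0), the min discards the optimistic one.
target = sac_target(1.0, 2.0, 3.0, log_prob_next=0.0, gamma=1.0, alpha=0.2)
```

Both Q-networks regress toward this same target; only the minimum of the two is propagated, so a single optimistic critic cannot inflate future value estimates.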
Why Actor-Critic Dominates
| Setting | Algorithm | Reason |
|---|---|---|
| LLM training (RLHF) | PPO | Discrete tokens, on-policy stability |
| Robotic control | SAC | Continuous actions, sample efficiency |
| Game playing | A2C / PPO | Parallel environments, scalability |
| Fine-tuning with verifiers | GRPO / PPO | Verifiable rewards; GRPO replaces the learned critic with a group-relative baseline |
The actor-critic framework is flexible enough to accommodate all these settings because it separates the policy representation (actor) from the value estimation (critic). You can swap architectures, loss functions, and training procedures while keeping the same fundamental structure.
Common Confusions
The critic is not the reward model
In RLHF, the reward model scores outputs. The critic estimates the expected cumulative reward from a state under the current policy. The reward model is fixed after training; the critic is updated continuously during RL training. They serve different roles: the reward model defines the objective, the critic helps optimize it efficiently.
PPO is not the same as TRPO
TRPO solves a constrained optimization problem using conjugate gradients and line search. PPO uses a clipped surrogate objective with standard SGD. PPO does not enforce a hard KL constraint. The clipping is a soft approximation. In practice, PPO can violate the implicit trust region, which is both a feature (faster progress) and a risk (instability).
Actor-critic is not just REINFORCE with a baseline
REINFORCE with a baseline uses a learned value function to reduce variance but still uses Monte Carlo returns for the policy gradient. Actor-critic uses the critic for bootstrapping (TD estimates), which introduces bias but dramatically reduces variance. The distinction is whether the critic is used only as a baseline or also as a bootstrap target.
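The distinction is concrete in the targets. A sketch with hypothetical rewards and value estimates: REINFORCE-with-baseline subtracts $V(s_t)$ from the full Monte Carlo return, while actor-critic bootstraps with $r_t + \gamma V(s_{t+1}) - V(s_t)$.

```python
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # V(s_0)..V(s_3); V(s_3) = 0 at episode end
gamma = 1.0

# Baseline only: Monte Carlo return minus V(s_0). Unbiased, but every
# downstream reward contributes noise.
G0 = sum(rewards)
baseline_advantage = G0 - values[0]

# Bootstrap: one-step TD. Biased if V is wrong, but only r_0 is random.
bootstrap_advantage = rewards[0] + gamma * values[1] - values[0]
```

The two estimates differ whenever $V$ disagrees with the remaining return, which is exactly the bias that bootstrapping trades for lower variance.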
Summary
- Actor-critic = policy network (actor) + value network (critic)
- Advantage centers the policy gradient, reducing variance
- GAE: exponentially weighted multi-step TD errors; $\lambda$ controls the bias-variance tradeoff
- PPO: clipped surrogate objective, simple first-order trust region approximation
- SAC: maximum entropy RL, adds entropy bonus for exploration, off-policy
- Actor-critic is the dominant paradigm: PPO for LLMs, SAC for robotics
Exercises
Problem
Compute the GAE advantage $\hat{A}_0$ for a 3-step trajectory, given rewards $r_0, r_1, r_2$, value estimates $V(s_0), \ldots, V(s_3)$, and parameters $\gamma$ and $\lambda$.
Problem
In the PPO clipped objective with clip parameter $\epsilon$, if $\hat{A}_t > 0$ and the ratio $r_t(\theta) > 1 + \epsilon$, what is the clipped objective value? What gradient signal does the actor receive?
Problem
Why does SAC use two Q-networks and take the minimum of their predictions? Relate this to the overestimation problem in Q-learning.
References
Canonical:
- Konda & Tsitsiklis, "Actor-Critic Algorithms" (NeurIPS 2000)
- Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using GAE" (ICLR 2016)
Current:
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017). PPO
- Haarnoja et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL" (ICML 2018). SAC
- Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022). RLHF with PPO
Next Topics
The natural next steps from actor-critic methods:
- RLHF and alignment: applying PPO to language model training from human preferences
- DPO vs. GRPO vs. RL reasoning: modern alternatives and extensions to the PPO-based RLHF pipeline
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Q-Learning (Layer 2)
- Value Iteration and Policy Iteration (Layer 2)