
RL Theory

Actor-Critic Methods

The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.

Advanced · Tier 2 · Stable · ~55 min

Why This Matters

REINFORCE has unbiased gradients but impractical variance. Q-learning has low variance but cannot handle continuous actions and suffers from the deadly triad. Actor-critic methods combine the best of both: a critic (value function) reduces variance, while an actor (policy network) handles continuous and high-dimensional action spaces.

PPO, the most widely used actor-critic algorithm, trains the RLHF stage of ChatGPT, Claude, and virtually every instruction-tuned language model. SAC is the workhorse for continuous robotic control. Understanding actor-critic is understanding how modern AI systems are trained.

Mental Model

The actor proposes actions. The critic evaluates them. The actor improves by increasing the probability of actions the critic says are better than average (positive advantage) and decreasing the probability of actions that are worse than average (negative advantage). The critic improves by getting better at predicting how much total reward a state will yield.

This is policy iteration made differentiable: the critic does approximate policy evaluation, the actor does approximate policy improvement.

Formal Setup and Notation

We work in the MDP framework with a parameterized policy $\pi_\theta(a|s)$ (the actor) and a learned value function $V_\phi(s)$ (the critic). The advantage function is $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$.

Definition

TD Error

The temporal difference (TD) error at time $t$ is:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

This is a one-step estimate of the advantage: $\mathbb{E}[\delta_t \mid s_t, a_t] \approx A^\pi(s_t, a_t)$ when $V_\phi \approx V^\pi$.

Definition

Generalized Advantage Estimation

Generalized Advantage Estimation (GAE) is an exponentially weighted sum of multi-step TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\lambda \in [0,1]$ controls the bias-variance tradeoff:

  • $\lambda = 0$: one-step TD advantage $\hat{A}_t = \delta_t$ (low variance, high bias)
  • $\lambda = 1$: Monte Carlo advantage $\hat{A}_t = G_t - V_\phi(s_t)$ (high variance, low bias)
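
The definition implies the backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, which makes GAE cheap to compute for a finite rollout. A minimal sketch (the function name and array conventions are illustrative, not from any particular library):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for a finite rollout.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (includes a bootstrap value for s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step TD advantages, and `lam=1` recovers the Monte Carlo advantages $G_t - V_\phi(s_t)$, with $V_\phi(s_T)$ bootstrapping truncated rollouts.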

Main Theorems

Proposition

GAE Bias-Variance Tradeoff

Statement

Define the $k$-step advantage estimator:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = -V_\phi(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V_\phi(s_{t+k})$$

The GAE estimator is the $\lambda$-weighted average:

$$\hat{A}_t^{\text{GAE}} = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}_t^{(k)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

When $V_\phi = V^\pi$ (perfect critic), all estimators are unbiased. When $V_\phi \neq V^\pi$, the bias of the $k$-step estimator is $O(\gamma^k \|V_\phi - V^\pi\|_\infty)$: longer rollouts reduce bias from an imperfect critic. The variance grows with $k$ because more random rewards are included. $\lambda$ interpolates between these extremes.
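
The first equality in the statement, $\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}$, follows from a telescoping sum and can be checked numerically. A sketch with illustrative helper names:

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each step."""
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]

def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k) = -V(s_t) + sum_{l<k} gamma^l r_{t+l} + gamma^k V(s_{t+k})."""
    discounted = sum(gamma ** l * rewards[t + l] for l in range(k))
    return -values[t] + discounted + gamma ** k * values[t + k]

# Numerical check of the telescoping identity on arbitrary numbers:
rewards = [1.0, 0.5, -0.2, 2.0]
values = [0.3, 0.7, 0.1, 0.4, 0.0]
gamma = 0.9
deltas = td_errors(rewards, values, gamma)
for k in (1, 2, 3):
    lhs = k_step_advantage(rewards, values, 0, k, gamma)
    rhs = sum(gamma ** l * deltas[l] for l in range(k))
    assert abs(lhs - rhs) < 1e-9
```

The identity is exact: the intermediate $V_\phi$ terms cancel in pairs, leaving only $-V_\phi(s_t)$, the discounted rewards, and the bootstrap term.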

Intuition

A perfect critic gives zero-variance advantage estimates via one-step TD. An imperfect critic introduces bias, which you can reduce by relying more on actual observed rewards (larger $\lambda$). GAE provides a smooth dial between trusting the critic ($\lambda = 0$) and trusting the data ($\lambda = 1$). In practice, $\lambda = 0.95$ is a common default.

Why It Matters

Without GAE, you must choose between one-step TD (biased but stable) and Monte Carlo returns (unbiased but noisy). GAE gives a principled middle ground and is used in every modern policy gradient implementation, including PPO.

A2C and A3C

A2C (Advantage Actor-Critic) runs multiple parallel environments synchronously:

  1. Collect $n$-step rollouts from $K$ parallel environments
  2. Compute GAE advantages $\hat{A}_t$ for each rollout
  3. Update the actor: $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t$
  4. Update the critic: $\phi \leftarrow \phi - \beta \sum_t \nabla_\phi (V_\phi(s_t) - G_t)^2$
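
Steps 3 and 4 are usually implemented as scalar losses whose gradients an autodiff framework computes. A plain-NumPy sketch of the per-batch loss values (function and argument names are illustrative):

```python
import numpy as np

def a2c_losses(log_probs, advantages, values, returns):
    """Scalar losses for one A2C batch.

    Minimizing actor_loss w.r.t. theta ascends sum_t log pi(a_t|s_t) * A_t;
    minimizing critic_loss w.r.t. phi regresses V(s_t) toward the return G_t.
    Advantages are treated as constants (no gradient flows through them).
    """
    actor_loss = -np.sum(log_probs * advantages)
    critic_loss = np.sum((values - returns) ** 2)
    return actor_loss, critic_loss
```

In a real implementation the two losses are differentiated with respect to separate parameter sets (or a shared trunk with two heads) and stepped with learning rates $\alpha$ and $\beta$.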

A3C (Asynchronous Advantage Actor-Critic) uses asynchronous workers that each maintain a local copy of the parameters and send gradient updates to a shared parameter server. A3C was historically important (2016) but has been largely replaced by synchronous A2C and PPO, which are simpler and equally effective with modern hardware.

PPO: The Practical Standard

Proposition

PPO Clipped Surrogate Objective

Statement

Define the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$. The PPO-Clip objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\Big( r_t(\theta) \hat{A}_t, \;\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \Big) \right]$$

where $\epsilon$ (typically 0.1 to 0.2) limits how far the new policy can deviate from the old one in a single update.

Effect of the clipping:

  • When $\hat{A}_t > 0$ (good action): the objective is $\min(r_t \hat{A}_t, (1+\epsilon)\hat{A}_t)$, capping the benefit of increasing the action probability
  • When $\hat{A}_t < 0$ (bad action): the objective is $\min(r_t \hat{A}_t, (1-\epsilon)\hat{A}_t)$, capping the benefit of decreasing the action probability
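
The clipped surrogate is a two-line computation per sample. A sketch (the function name is illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip surrogate (to be maximized, then averaged)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The min() selects the clipped branch whenever the ratio has moved
    # too far in the direction the advantage rewards, zeroing the gradient.
    return np.minimum(unclipped, clipped)
```

For example, with $\hat{A}_t < 0$ and a ratio below $1-\epsilon$, the clipped branch is selected and the gradient with respect to $\theta$ vanishes, so the update cannot push the probability down any further.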

Intuition

TRPO solves a constrained optimization problem (maximize surrogate subject to KL constraint), which requires second-order methods. PPO approximates this with a simple clipping trick: if the policy ratio moves too far from 1, the gradient is zeroed out, preventing destructively large updates. The result is a first-order algorithm that is nearly as stable as TRPO and much simpler to implement.

Why It Matters

PPO is the algorithm used in RLHF for training instruction-following language models. Its simplicity (no KL constraint optimization, no conjugate gradients) makes it the default choice whenever policy optimization is needed. The clipping mechanism is the key innovation. It provides a soft trust region without the computational overhead of TRPO.

The full PPO loss combines three terms:

$$L(\theta, \phi) = L^{\text{CLIP}}(\theta) - c_1 \, L^{\text{VF}}(\phi) + c_2 \, H[\pi_\theta]$$

where $L^{\text{VF}} = (V_\phi(s_t) - G_t)^2$ is the value function loss and $H[\pi_\theta]$ is an entropy bonus encouraging exploration.
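
Combining the three terms into one scalar for a gradient step might look like the sketch below. Signs are flipped relative to the equation because optimizers minimize; the coefficient defaults 0.5 and 0.01 are common choices, not prescriptions, and the names are illustrative:

```python
import numpy as np

def ppo_total_loss(clip_obj, value_pred, returns, entropy, c1=0.5, c2=0.01):
    """Scalar loss to minimize: -L_CLIP + c1 * L_VF - c2 * H.

    clip_obj:   per-sample clipped surrogate values
    value_pred: V_phi(s_t) predictions
    returns:    empirical returns G_t
    entropy:    per-state policy entropy H[pi_theta(.|s_t)]
    """
    vf_loss = np.mean((value_pred - returns) ** 2)
    return -np.mean(clip_obj) + c1 * vf_loss - c2 * np.mean(entropy)
```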

SAC: Maximum Entropy RL

Proposition

Soft Bellman Equation

Statement

In maximum entropy RL, the objective augments the standard return with an entropy term:

$$J(\pi) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}\left[ r_t + \alpha \, H[\pi(\cdot|s_t)] \right]$$

The corresponding soft Bellman equation is:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s,a) - \alpha \log \pi(a|s) \right]$$

$$Q^\pi(s,a) = R(s,a) + \gamma \, \mathbb{E}_{s' \sim P}\left[ V^\pi(s') \right]$$

The optimal policy is the Boltzmann distribution: $\pi^*(a|s) \propto \exp(Q^*(s,a)/\alpha)$.
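
For a discrete action set, the Boltzmann-optimal policy is just a softmax of the Q-values scaled by the temperature. A minimal sketch:

```python
import numpy as np

def boltzmann_policy(q_values, alpha=1.0):
    """pi(a|s) proportional to exp(Q(s,a) / alpha) over discrete actions."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

As $\alpha \to \infty$ the distribution approaches uniform; as $\alpha \to 0$ it concentrates on $\arg\max_a Q(s,a)$.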

Intuition

SAC adds a reward for being uncertain. The agent is encouraged to find all good actions, not just the single best one. This leads to better exploration and more robust policies. The temperature $\alpha$ controls how much randomness the policy retains: high $\alpha$ means more exploration, while low $\alpha$ approaches the standard (deterministic) optimal policy.

Why It Matters

SAC achieves state-of-the-art performance on continuous control benchmarks (robotic locomotion, manipulation). Its key advantages: (1) it is off-policy (uses a replay buffer, so it is sample-efficient), (2) the entropy bonus provides automatic exploration, and (3) it avoids the brittle hyperparameter tuning of epsilon-greedy or noise-based exploration.

SAC maintains three networks: a policy $\pi_\theta$ (actor) and two Q-functions $Q_{\phi_1}, Q_{\phi_2}$ (critics, using the double-Q trick to prevent overestimation). The temperature $\alpha$ can optionally be learned automatically by targeting a desired entropy level.
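
The critics are trained toward a soft Bellman target built from the minimum of the two target Q-networks. A sketch of the target computation (names are illustrative; in practice these are batched tensors drawn from a replay buffer, and a termination mask on the bootstrap term is omitted here for brevity):

```python
import numpy as np

def sac_q_target(reward, q1_next, q2_next, next_log_prob, gamma=0.99, alpha=0.2):
    """Soft Q target: r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')).

    q1_next, q2_next: target-network Q-values at (s', a'), with a' ~ pi(.|s')
    next_log_prob:    log pi(a'|s') for the sampled next action
    """
    # Taking the minimum of the two critics counteracts Q overestimation.
    soft_v_next = np.minimum(q1_next, q2_next) - alpha * next_log_prob
    return reward + gamma * soft_v_next
```

Note the entropy term enters with a positive sign overall: a more random next action (more negative log-probability) raises the soft value, rewarding stochasticity.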

Why Actor-Critic Dominates

Setting                       Algorithm     Reason
LLM training (RLHF)           PPO           Discrete tokens, on-policy stability
Robotic control               SAC           Continuous actions, sample efficiency
Game playing                  A2C / PPO     Parallel environments, scalability
Fine-tuning with verifiers    GRPO / PPO    Reward model as critic

The actor-critic framework is flexible enough to accommodate all these settings because it separates the policy representation (actor) from the value estimation (critic). You can swap architectures, loss functions, and training procedures while keeping the same fundamental structure.

Common Confusions

Watch Out

The critic is not the reward model

In RLHF, the reward model scores outputs. The critic estimates the expected cumulative reward from a state under the current policy. The reward model is fixed after training; the critic is updated continuously during RL training. They serve different roles: the reward model defines the objective, the critic helps optimize it efficiently.

Watch Out

PPO is not the same as TRPO

TRPO solves a constrained optimization problem using conjugate gradients and line search. PPO uses a clipped surrogate objective with standard SGD. PPO does not enforce a hard KL constraint. The clipping is a soft approximation. In practice, PPO can violate the implicit trust region, which is both a feature (faster progress) and a risk (instability).

Watch Out

Actor-critic is not just REINFORCE with a baseline

REINFORCE with a baseline uses a learned value function to reduce variance but still uses Monte Carlo returns for the policy gradient. Actor-critic uses the critic for bootstrapping (TD estimates), which introduces bias but dramatically reduces variance. The distinction is whether the critic is used only as a baseline or also as a bootstrap target.

Summary

  • Actor-critic = policy network (actor) + value network (critic)
  • Advantage $A(s,a) = Q(s,a) - V(s)$ centers the policy gradient, reducing variance
  • GAE: exponentially weighted multi-step TD errors; $\lambda$ controls the bias-variance tradeoff
  • PPO: clipped surrogate objective, simple first-order trust region approximation
  • SAC: maximum entropy RL, adds entropy bonus for exploration, off-policy
  • Actor-critic is the dominant paradigm: PPO for LLMs, SAC for robotics

Exercises

ExerciseCore

Problem

Compute the GAE advantage $\hat{A}_0^{\text{GAE}(\gamma, \lambda)}$ for a 3-step trajectory with $r_0 = 1, r_1 = 0, r_2 = 1$, $V_\phi(s_0) = 2, V_\phi(s_1) = 1, V_\phi(s_2) = 1.5, V_\phi(s_3) = 0$, $\gamma = 0.9$, $\lambda = 0.95$.

ExerciseCore

Problem

In the PPO clipped objective with $\epsilon = 0.2$, if $r_t(\theta) = 1.5$ and $\hat{A}_t = 2$, what is the clipped objective value? What gradient signal does the actor receive?

ExerciseAdvanced

Problem

Why does SAC use two Q-networks and take the minimum of their predictions? Relate this to the overestimation problem in Q-learning.

References

Canonical:

  • Konda & Tsitsiklis, "Actor-Critic Algorithms" (NeurIPS 2000)
  • Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using GAE" (ICLR 2016)

Current:

  • Schulman et al., "Proximal Policy Optimization Algorithms" (2017). Introduces PPO.
  • Haarnoja et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL" (ICML 2018). Introduces SAC.
  • Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022). RLHF with PPO.

Next Topics

The natural next steps from actor-critic methods:

  • RLHF and alignment: applying PPO to language model training from human preferences
  • DPO vs. GRPO vs. RL reasoning: modern alternatives and extensions to the PPO-based RLHF pipeline

Last reviewed: April 2026
