RL Theory
Policy Gradient Theorem
The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
Prerequisites
Why This Matters
The Bellman equations give us a way to solve MDPs when we know the transition model and the state space is small. But what about Atari games with pixel inputs? Robot control with continuous actions? Language model fine-tuning?
In these settings, we parameterize the policy as $\pi_\theta(a \mid s)$ (e.g., a neural network) and optimize the parameters $\theta$ by gradient ascent on the expected return. The policy gradient theorem tells us how to compute this gradient, even though the expected return depends on the entire trajectory distribution, which itself depends on $\theta$ in a complicated way.
This theorem is the foundation of REINFORCE, actor-critic methods, PPO, and the RLHF pipeline used to train modern language models.
Mental Model
We want to maximize $J(\theta)$, the expected total reward under the policy. The challenge is that changing $\theta$ changes the probability of every trajectory $\tau$, which is a product of many policy and transition probabilities.
The log-derivative trick resolves this: instead of differentiating through the trajectory distribution directly, we express the gradient as an expectation under the current policy. This means we can estimate the gradient by sampling trajectories, with no model of the environment needed.
Formal Setup and Notation
We work in the MDP framework with a parameterized stochastic policy $\pi_\theta(a \mid s)$. A trajectory is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$.
Policy Objective
The policy objective is the expected discounted return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big]$$

The theorem below is stated in terms of $d^{\pi_\theta}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi_\theta)$, the discounted state-visitation distribution. This is the distribution that makes the Sutton et al. (2000) policy gradient theorem exact. In most deep-RL implementations the gradient is estimated using the undiscounted empirical distribution of states visited along sampled trajectories, which introduces a well-documented bias. See Thomas (2014), "Bias in Natural Actor-Critic Algorithms," and Nota and Thomas (2020), "Is the Policy Gradient a Gradient?"
Advantage Function
The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better action $a$ is compared to the average action under $\pi$ in state $s$.

By construction, $\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^\pi(s, a)\big] = 0$ for all $s$.
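The zero-mean property of the advantage is easy to check numerically. The sketch below uses made-up $Q$-values and a softmax policy for a single state (all numbers are illustrative, not from the text) and verifies that $\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
q = rng.normal(size=n_actions)               # hypothetical Q(s, a) values for one state
logits = rng.normal(size=n_actions)
pi = np.exp(logits - logits.max())
pi /= pi.sum()                               # softmax policy pi(a | s)

v = pi @ q                                   # V(s) = E_{a ~ pi}[Q(s, a)]
advantage = q - v                            # A(s, a) = Q(s, a) - V(s)

# E_{a ~ pi}[A(s, a)] = 0 by construction
assert abs(pi @ advantage) < 1e-12
```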
Main Theorems
Policy Gradient Theorem
Statement
The gradient of the policy objective is:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$$

Equivalently, summing over time steps along a trajectory:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$
Intuition
$\nabla_\theta \log \pi_\theta(a \mid s)$ points in the direction that increases the probability of action $a$ in state $s$. The theorem says: weight this direction by how good that action is (its $Q$-value), and average over the states and actions you actually visit. Actions that lead to high returns get their probabilities increased; actions that lead to low returns get them decreased.
Proof Sketch
Write $J(\theta) = \mathbb{E}_{s_0}\big[V^{\pi_\theta}(s_0)\big]$ and differentiate. Using $V^{\pi_\theta}(s) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$:

$$\nabla_\theta V^{\pi_\theta}(s) = \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) + \sum_a \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi_\theta}(s, a)$$

The reward term $r(s, a)$ has no $\theta$-dependence, so $\nabla_\theta Q^{\pi_\theta}(s, a) = \gamma \sum_{s'} P(s' \mid s, a)\, \nabla_\theta V^{\pi_\theta}(s')$. Substituting gives a one-step recursion on $\nabla_\theta V^{\pi_\theta}$.

Unrolling the recursion yields

$$\nabla_\theta V^{\pi_\theta}(s_0) = \sum_s \rho^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a),$$

where $\rho^{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0, \pi_\theta)$ is an unnormalized discounted visit count. The normalization is exact: $d^{\pi_\theta}(s) = (1-\gamma)\, \rho^{\pi_\theta}(s)$, since $\sum_s \rho^{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t = 1/(1-\gamma)$. The factor $1/(1-\gamma)$ matters. Sutton, McAllester, Singh, and Mansour (2000) state it explicitly, but many practical implementations drop it and use an estimator based on the undiscounted empirical state distribution, which is biased. See the Thomas (2014) and Nota and Thomas (2020) references above.
The crucial thing this calculation delivers is that the messy $\nabla_\theta Q^{\pi_\theta}$ and $\nabla_\theta V^{\pi_\theta}$ terms telescope into the discounted state-visitation weights; the gradient acts on $\pi_\theta$ only.
Now apply the log-derivative trick $\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta$ and rescale by $(1-\gamma)$ to pass to the distribution $d^{\pi_\theta}$:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$$
The log-derivative step is what turns a sum over the discrete action set into an expectation you can estimate with samples, which is why this theorem is the starting point for every practical policy-gradient algorithm.
Why It Matters
This theorem makes policy optimization tractable. To estimate $\nabla_\theta J(\theta)$, sample trajectories under $\pi_\theta$, compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (easy: just backprop through the policy network), multiply by a return estimate, and average. No environment model is needed. This is the basis of all policy gradient algorithms.
Failure Mode
The naive estimator has extremely high variance. A single trajectory gives a very noisy gradient estimate. Variance reduction techniques (baselines, advantage estimation) are essential for practical use.
Baseline Does Not Change the Gradient
Statement
For any function $b(s)$ that depends only on the state:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = 0$$

Therefore, replacing $Q^{\pi_\theta}(s, a)$ with $Q^{\pi_\theta}(s, a) - b(s)$ in the policy gradient does not change its expectation but can reduce variance.
Intuition
The score function $\nabla_\theta \log \pi_\theta(a \mid s)$ has zero expectation under $\pi_\theta(\cdot \mid s)$ (a standard property of score functions). So subtracting any state-dependent quantity from the reward does not bias the gradient, but it does change the variance. And $b(s) = V^\pi(s)$ is usually a good baseline because it centers the returns around zero.
Proof Sketch

For fixed $s$, pull $b(s)$ out of the expectation and use the normalization of $\pi_\theta$: $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$.
Why It Matters
This is why subtracting the value function baseline to form the advantage is everywhere in modern RL. The baseline is $V^\pi(s)$; the advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ is what remains after subtracting it. Using the advantage instead of the raw $Q$-value dramatically reduces variance, making practical training possible. The value function is the standard practical baseline, not the variance-minimizing one. The exact minimum-variance baseline is derived in the exercises.
The REINFORCE Algorithm
The simplest policy gradient algorithm. Sample a trajectory under $\pi_\theta$, then update:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.

REINFORCE is unbiased but has high variance. In practice, subtract a baseline:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)$$
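As an illustration, here is a minimal REINFORCE loop on a hypothetical two-armed bandit (a single-state MDP, so the return is just the sampled reward). The arm payoffs, noise scale, and step size are all made up for the example:

```python
import numpy as np

# Minimal REINFORCE on a two-armed bandit; arm 1 pays more (hypothetical rewards).
rng = np.random.default_rng(0)
theta = np.zeros(2)                          # softmax logits
alpha = 0.1
reward_means = np.array([0.0, 1.0])

for _ in range(2000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(2, p=pi)                  # sample an action from the policy
    g = reward_means[a] + rng.normal(scale=0.1)   # sampled return G
    score = -pi                              # grad_theta log pi(a) for softmax:
    score[a] += 1.0                          # onehot(a) - pi
    theta += alpha * score * g               # REINFORCE update

pi = np.exp(theta - theta.max())
pi /= pi.sum()
assert pi[1] > 0.8                           # policy concentrates on the better arm
```

Even on this trivial problem the per-step updates are noisy; the policy only reliably concentrates on the better arm after many samples, which previews the variance problem discussed above.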
Actor-Critic Methods
Instead of waiting for a full trajectory to compute $G_t$, use a learned value function $V_\phi$ as both a baseline and a bootstrap target:

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

This is the TD(0) advantage estimate, the one-step TD residual $\delta_t$. Update the policy (actor) using:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

Update the value function (critic) by minimizing $\big(V_\phi(s_t) - G_t\big)^2$ or using TD learning. Actor-critic trades some bias (from bootstrapping) for dramatically lower variance compared to REINFORCE. The 1-step residual and the Monte Carlo return are two ends of a spectrum; n-step returns and the GAE estimator below interpolate between them.
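A single actor-critic update can be sketched in a few lines. All numbers below (critic values, reward, learning rates) are hypothetical, and the policy is a softmax over two actions in one state:

```python
import numpy as np

# One TD(0) actor-critic update; all quantities are illustrative.
gamma = 0.99
alpha_actor, alpha_critic = 0.05, 0.1

theta = np.array([0.3, -0.1])            # softmax logits for 2 actions in state s
v_s, v_s_next = 0.2, 0.5                 # critic estimates V_phi(s), V_phi(s')
a, r = 0, 1.0                            # observed action and reward

delta = r + gamma * v_s_next - v_s       # one-step TD residual = advantage estimate

pi = np.exp(theta - theta.max())
pi /= pi.sum()
score = -pi                              # grad_theta log pi(a | s) for softmax
score[a] += 1.0

theta = theta + alpha_actor * score * delta   # actor: push up the taken action
v_s = v_s + alpha_critic * delta              # critic: semi-gradient TD(0) step
```

Because `delta` is positive here, the logit of the taken action increases and the critic's estimate of $V(s)$ moves toward the TD target, the two halves of every actor-critic iteration.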
Natural Policy Gradient
The natural gradient is due to Amari (1998), "Natural Gradient Works Efficiently in Learning," which applies information geometry to parameter estimation. Kakade (2001), "A Natural Policy Gradient," transferred the idea to RL by identifying the policy Fisher information as the correct metric on policy space.
Natural Policy Gradient
Statement
The Fisher information matrix of the policy is:

$$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\big]$$

The natural policy gradient is:

$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$
This is the steepest ascent direction in the KL-divergence metric on policy space, rather than in Euclidean parameter space.
Intuition
The standard gradient depends on the parameterization: reparameterizing changes the gradient direction. The natural gradient is invariant to reparameterization. It asks: what is the best policy update of a given KL size? This is the right question because we care about how much the policy changes, not how much the parameters change.
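This can be made concrete for a tabular softmax policy over one state, where the Fisher matrix has the closed form $F = \mathrm{diag}(\pi) - \pi\pi^\top$. The sketch below verifies that identity; since this $F$ is singular (the logits are shift-invariant), the natural-gradient direction is computed with a pseudoinverse. The gradient vector is made up for illustration:

```python
import numpy as np

# Natural gradient for a tabular softmax policy (one state), illustrative only.
pi = np.array([0.7, 0.2, 0.1])
scores = np.eye(3) - pi                      # row a = grad log pi(a) for softmax
fisher = (pi[:, None] * scores).T @ scores   # F = sum_a pi_a s_a s_a^T

g = np.array([0.5, -0.2, -0.3])              # hypothetical vanilla gradient
nat_g = np.linalg.pinv(fisher) @ g           # F is singular for softmax; use pinv

# The Fisher matrix equals diag(pi) - pi pi^T for a softmax policy.
assert np.allclose(fisher, np.diag(pi) - np.outer(pi, pi))
```

Note how $F$ downweights directions that barely change the distribution: dividing by $F$ rescales the raw gradient so that a step of fixed size corresponds to a fixed amount of policy change, not parameter change.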
Why It Matters
The natural policy gradient leads to more stable updates and is the theoretical foundation for TRPO. PPO shares the KL-bounded surrogate framing but is a first-order method, not a natural-gradient approximation.
Connection to Trust Regions
The natural policy gradient motivates trust region methods:
TRPO (Trust Region Policy Optimization), Schulman et al. (2015), arXiv 1502.05477. Maximize the surrogate objective subject to $\bar{D}_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big) \le \delta$. The step direction is computed by linearizing the objective and quadratically approximating the KL constraint, which reduces to solving $F(\theta)\, x = \nabla_\theta J(\theta)$ for the natural-gradient direction. TRPO solves this linear system with conjugate gradients using Fisher-vector products (no explicit Fisher matrix is formed), then performs a backtracking line search along that direction to enforce the exact KL bound and verify surrogate improvement. The CG-plus-line-search structure is what makes TRPO a trust-region method, not a pure natural-gradient step.
PPO (Proximal Policy Optimization), Schulman et al. (2017). Replace the hard KL constraint with a clipped surrogate:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t\big)\Big]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. PPO is a first-order method that takes Adam steps on $L^{\mathrm{CLIP}}$ with standard backpropagation. It is not an approximate natural-gradient method, despite a common misreading. Engstrom et al. (2020), "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729), and Ilyas et al. (2020), "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553), show that PPO's empirical advantage over vanilla policy gradient comes substantially from implementation details: value-function clipping, reward and advantage normalization, orthogonal weight initialization, learning-rate annealing, and observation normalization. The clipped objective alone does not account for PPO's reported gains. PPO is the algorithm used in the RLHF pipeline for language model training, but using it should not be confused with approximating the natural gradient.
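The clipping behavior is easy to see in code. The sketch below implements the per-sample clipped surrogate (the helper name is ours, not from the paper):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# A positive advantage caps the upside once the ratio exceeds 1 + eps ...
assert abs(ppo_clip_objective(1.5, adv=2.0) - (1.2 * 2.0)) < 1e-12
# ... while a negative advantage caps how far the ratio can be pushed down.
assert abs(ppo_clip_objective(0.5, adv=-2.0) - (0.8 * -2.0)) < 1e-12
```

The min makes the bound one-sided: the objective never rewards moving the ratio further than $1 \pm \epsilon$ in the profitable direction, but it still penalizes updates that already overshot.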
Generalized Advantage Estimation
REINFORCE uses the full Monte Carlo return $G_t$. The 1-step TD advantage uses $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. These are two endpoints of a spectrum. n-step returns sit between them:

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)$$

Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438), introduce the GAE estimator as an exponentially weighted average over n-step advantages:

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

Setting $\lambda = 0$ recovers the 1-step TD advantage (low variance, high bias from $V$). Setting $\lambda = 1$ recovers Monte Carlo with a value baseline (unbiased, high variance). Intermediate $\lambda$ trades bias against variance. GAE is the standard advantage estimator used with PPO and TRPO in practice.
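A minimal GAE implementation is a single backward pass over TD residuals. The sketch below is our own helper (not the authors' reference code) on a made-up three-step trajectory, and it checks the $\lambda = 0$ endpoint:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda): exponentially weighted sum of TD residuals.
    `values` has length len(rewards) + 1 (bootstrap value for the final state)."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t for each step
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):                # backward recursion:
        running = deltas[t] + gamma * lam * running       # A_t = delta_t + (g*l) A_{t+1}
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 1.0])          # hypothetical trajectory
values = np.array([0.5, 0.4, 0.3, 0.0])      # hypothetical critic estimates

# lambda = 0 reduces to the one-step TD residual delta_t.
assert np.allclose(gae(rewards, values, lam=0.0),
                   rewards + 0.99 * values[1:] - values[:-1])
```

The backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$ computes the infinite sum in one linear pass, which is how GAE is implemented in practice.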
Actor-Critic Variants
Beyond the vanilla actor-critic above, several variants are in common use:
- A2C and A3C (Mnih et al. 2016, arXiv 1602.01783). Synchronous and asynchronous parallel actor-learners sharing a global network. A3C predates GPU-dominant training; A2C is the synchronous equivalent that is typically preferred on modern hardware.
- DPG and DDPG. The deterministic policy gradient theorem (Silver et al. 2014) extends the stochastic PG theorem to deterministic policies $\mu_\theta(s)$: the gradient becomes $\nabla_\theta J(\theta) = \mathbb{E}_s\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\big]$. DDPG (Lillicrap et al. 2016, arXiv 1509.02971) is the deep, off-policy actor-critic instantiation for continuous control.
- SAC (Haarnoja et al. 2018, arXiv 1801.01290). Soft Actor-Critic adds an entropy bonus to the reward, optimizing $\mathbb{E}\big[\sum_t r_t + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big]$. Off-policy, stochastic, and the dominant algorithm on many continuous-control benchmarks.
Why the Log-Derivative Trick Works
The identity $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$ converts a gradient of a probability into an expectation under that probability. This is the same trick used in variational inference (the ELBO gradient) and in score-based diffusion models. It works whenever you need to differentiate an expectation with respect to parameters of the distribution.

The key property is that $\mathbb{E}_{x \sim p_\theta}\big[\nabla_\theta \log p_\theta(x)\big] = \nabla_\theta \sum_x p_\theta(x) = \nabla_\theta 1 = 0$: the score function has zero mean. This is what makes baselines valid.
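For a softmax distribution the score of outcome $x$ is the one-hot vector minus the probabilities, so the zero-mean property can be checked directly (logits here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(size=5)
p = np.exp(logits - logits.max())
p /= p.sum()                     # softmax distribution p_theta

# For softmax, grad_theta log p(x) = onehot(x) - p; row x below is that score.
scores = np.eye(5) - p
mean_score = p @ scores          # E_{x ~ p}[grad_theta log p(x)]

assert np.allclose(mean_score, 0.0)
```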
Common Confusions
Policy gradients are not backpropagation through the environment
A common misconception is that policy gradients differentiate through the environment dynamics. They do not. The gradient flows only through $\pi_\theta$, the policy network. The environment is treated as a black box that produces rewards and next states. This is why policy gradients work in environments where the dynamics are unknown.
REINFORCE is unbiased but not practical without variance reduction
The raw REINFORCE estimator has the correct expectation but enormous variance. A single trajectory might have a return of 100 or -50 depending on luck. Without a baseline, you need an impractical number of samples. The advantage function and multiple-sample estimators are not optional extras; they are necessary for the algorithm to work at all.
Natural gradient is not just preconditioning
While the natural gradient can be seen as preconditioning the gradient by $F(\theta)^{-1}$, its justification is geometric: it is the steepest ascent direction when distance is measured by KL divergence between policies, not Euclidean distance between parameters. This distinction matters because the same KL change can correspond to very different parameter changes depending on the region of parameter space.
Summary
- The policy gradient is $\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$
- The log-derivative trick converts a gradient of an expectation into an expectation of a gradient
- Subtracting a state-dependent baseline does not change the expected gradient but reduces variance
- The value function $V^\pi$ is the canonical baseline; the advantage $A^\pi$ is what remains after subtraction, not the baseline itself
- The exact variance-minimizing baseline is a score-norm-weighted average of $Q$-values (Weaver and Tao 2001, Greensmith et al. 2004), not $V^\pi$
- Actor-critic: learn $V_\phi$ as a baseline, bootstrap to reduce variance at the cost of some bias
- GAE interpolates between 1-step TD advantage and the full Monte Carlo return
- Natural policy gradient (Amari 1998, Kakade 2001) uses Fisher information to make updates invariant to parameterization
- TRPO (Schulman et al. 2015) is a practical natural-gradient method using conjugate gradients and a backtracking line search
- PPO is a first-order clipped-objective method, not a natural-gradient approximation; its empirical gains come substantially from implementation details
Exercises
Problem
Show that $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$ for any state $s$, assuming $\pi_\theta(a \mid s) > 0$ for all $a$.
Problem
For a policy that is softmax over two actions with logits $\theta_1, \theta_2$, compute $\nabla_\theta \log \pi_\theta(a_1 \mid s)$.
Problem
Prove that the variance-minimizing baseline for the REINFORCE estimator is $b^*(s) = \dfrac{\mathbb{E}_{a \sim \pi_\theta}\big[\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2\, Q^\pi(s, a)\big]}{\mathbb{E}_{a \sim \pi_\theta}\big[\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2\big]}$ (a weighted average of $Q$-values, not simply $V^\pi(s)$).
Related Comparisons
References
Canonical:
- Sutton, McAllester, Singh, Mansour, "Policy Gradient Methods for RL with Function Approximation" (2000)
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist RL" (1992). Original REINFORCE.
- Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation (1998). Origin of the natural gradient.
- Kakade, "A Natural Policy Gradient" (2001). Natural gradient transferred to RL.
- Silver, Lever, Heess, Degris, Wierstra, Riedmiller, "Deterministic Policy Gradient Algorithms" (2014). DPG theorem.
Variance reduction and advantage estimation:
- Weaver and Tao, "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning" (2001)
- Greensmith, Bartlett, Baxter, "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning" (2004). Exact minimum-variance baseline.
- Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438, 2016). GAE.
- Thomas, "Bias in Natural Actor-Critic Algorithms" (2014). Discounted vs. undiscounted state distribution.
- Nota and Thomas, "Is the Policy Gradient a Gradient?" (2020)
Trust-region and PPO:
- Schulman, Levine, Abbeel, Jordan, Moritz, "Trust Region Policy Optimization" (arXiv 1502.05477, 2015). TRPO with CG and line search
- Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (arXiv 1707.06347, 2017)
- Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729, 2020)
- Ilyas, Engstrom, Santurkar, Tsipras, Janoos, Rudolph, Madry, "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553, 2020)
Actor-critic variants:
- Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning" (arXiv 1602.01783, 2016). A3C and A2C
- Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, Wierstra, "Continuous Control with Deep Reinforcement Learning" (arXiv 1509.02971, 2016). DDPG
- Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (arXiv 1801.01290, 2018)
Textbook:
- Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 3
Next Topics
The natural next steps from policy gradients:
- RLHF and alignment: applying PPO to fine-tune language models from human preferences
- Transformer architecture: the neural network parameterizing $\pi_\theta$ in modern LLMs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)