RL Theory
Policy Gradient Theorem
The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
Prerequisites
Why This Matters
The Bellman equations give us a way to solve MDPs when we know the transition model and the state space is small. But what about Atari games with pixel inputs? Robot control with continuous actions? Language model fine-tuning?
In these settings, we parameterize the policy as $\pi_\theta(a \mid s)$ (e.g., a neural network) and optimize the parameters $\theta$ by gradient ascent on the expected return. The policy gradient theorem tells us how to compute this gradient, even though the expected return depends on the entire trajectory distribution, which itself depends on $\theta$ in a complicated way.
This theorem is the foundation of REINFORCE, actor-critic methods, PPO, and the RLHF pipeline used to train modern language models.
Mental Model
We want to maximize $J(\theta)$, the expected total reward under the policy. The challenge is that changing $\theta$ changes the probability of every trajectory $\tau$, which is a product of many policy and transition probabilities.
The log-derivative trick resolves this: instead of differentiating through the trajectory distribution directly, we express the gradient as an expectation under the current policy. This means we can estimate the gradient by sampling trajectories, with no model of the environment needed.
Formal Setup and Notation
We work in the MDP framework with a parameterized stochastic policy $\pi_\theta(a \mid s)$. A trajectory is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$.
Policy Objective
The policy objective is the expected discounted return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big]$$

The theorem below is stated in terms of $d^{\pi_\theta}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi_\theta)$, the discounted state-visitation distribution. This is the distribution that makes the Sutton et al. (2000) policy gradient theorem exact. In most deep-RL implementations the gradient is estimated using the undiscounted empirical distribution of states visited along sampled trajectories, which introduces a well-documented bias. See Thomas (2014), "Bias in Natural Actor-Critic Algorithms," and Nota and Thomas (2020), "Is the Policy Gradient a Gradient?"
Advantage Function
The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better action $a$ is compared to the average action under $\pi$ in state $s$.

By construction, $\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^\pi(s, a)\big] = 0$ for all $s$.
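The zero-mean property of the advantage is easy to check numerically. The sketch below uses made-up $Q$-values and a softmax policy for a single state (all numbers are illustrative, not from the text) and verifies that $\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
q = rng.normal(size=n_actions)               # hypothetical Q(s, a) values for one state
logits = rng.normal(size=n_actions)
pi = np.exp(logits - logits.max())
pi /= pi.sum()                               # softmax policy pi(a | s)

v = pi @ q                                   # V(s) = E_{a ~ pi}[Q(s, a)]
advantage = q - v                            # A(s, a) = Q(s, a) - V(s)

# E_{a ~ pi}[A(s, a)] = 0 by construction
assert abs(pi @ advantage) < 1e-12
```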
Main Theorems
Policy Gradient Theorem
Statement
The gradient of the policy objective is:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$$

Equivalently, summing over time steps along a trajectory:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$
Intuition
$\nabla_\theta \log \pi_\theta(a \mid s)$ points in the direction that increases the probability of action $a$ in state $s$. The theorem says: weight this direction by how good that action is (its $Q$-value), and average over the states and actions you actually visit. Actions that lead to high returns get their probabilities increased; actions that lead to low returns get them decreased.
Proof Sketch
Write $J(\theta) = \mathbb{E}_{s_0}\big[V^{\pi_\theta}(s_0)\big]$ and differentiate. Using $V^{\pi_\theta}(s) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$:

$$\nabla_\theta V^{\pi_\theta}(s) = \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) + \sum_a \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi_\theta}(s, a)$$

The reward term $r(s, a)$ has no $\theta$-dependence, so $\nabla_\theta Q^{\pi_\theta}(s, a) = \gamma \sum_{s'} P(s' \mid s, a)\, \nabla_\theta V^{\pi_\theta}(s')$. Substituting gives a one-step recursion on $\nabla_\theta V^{\pi_\theta}$.

Unrolling the recursion yields

$$\nabla_\theta V^{\pi_\theta}(s_0) = \sum_s \rho^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a),$$

where $\rho^{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0, \pi_\theta)$ is an unnormalized discounted visit count. The normalization is exact: $d^{\pi_\theta}(s) = (1-\gamma)\, \rho^{\pi_\theta}(s)$, since $\sum_s \rho^{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t = 1/(1-\gamma)$. The factor $1/(1-\gamma)$ matters. Sutton, McAllester, Singh, and Mansour (2000) state it explicitly, but many practical implementations drop it and use an estimator based on the undiscounted empirical state distribution, which is biased. See the Thomas (2014) and Nota and Thomas (2020) references above.
The crucial thing this calculation delivers is that the messy $\nabla_\theta Q^{\pi_\theta}$ and $\nabla_\theta V^{\pi_\theta}$ terms telescope into the discounted state-visitation weights; the gradient acts on $\pi_\theta$ only.
Now apply the log-derivative trick $\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta$ and rescale by $(1-\gamma)$ to pass to the distribution $d^{\pi_\theta}$:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$$
The log-derivative step is what turns a sum over the discrete action set into an expectation you can estimate with samples, which is why this theorem is the starting point for every practical policy-gradient algorithm.
Why It Matters
This theorem makes policy optimization tractable. To estimate $\nabla_\theta J(\theta)$, sample trajectories under $\pi_\theta$, compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (easy: just backprop through the policy network), multiply by a return estimate, and average. No environment model is needed. This is the basis of all policy gradient algorithms.
Failure Mode
The naive estimator has extremely high variance. A single trajectory gives a very noisy gradient estimate. Variance reduction techniques (baselines, advantage estimation) are essential for practical use.
Baseline Does Not Change the Gradient
Statement
For any function $b(s)$ that depends only on the state:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = 0$$

Therefore, replacing $Q^{\pi_\theta}(s, a)$ with $Q^{\pi_\theta}(s, a) - b(s)$ in the policy gradient does not change its expectation but can reduce variance.
Intuition
The score function $\nabla_\theta \log \pi_\theta(a \mid s)$ has zero expectation under $\pi_\theta(\cdot \mid s)$ (a standard property of score functions). So subtracting any state-dependent quantity from the reward does not bias the gradient, but it does change the variance. And $b(s) = V^\pi(s)$ is usually a good baseline because it centers the returns around zero.
Proof Sketch

For fixed $s$, pull $b(s)$ out of the expectation and use the normalization of $\pi_\theta$: $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$.
Why It Matters
This is why subtracting the value function baseline to form the advantage is everywhere in modern RL. The baseline is $V^\pi(s)$; the advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ is what remains after subtracting it. Using the advantage instead of the raw $Q$-value dramatically reduces variance, making practical training possible. The value function is the standard practical baseline, not the variance-minimizing one. The exact minimum-variance baseline is derived in the exercises.
The REINFORCE Algorithm
The simplest policy gradient algorithm. Sample a trajectory under $\pi_\theta$, then update:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.

REINFORCE is unbiased but has high variance. In practice, subtract a baseline:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)$$
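As an illustration, here is a minimal REINFORCE loop on a hypothetical two-armed bandit (a single-state MDP, so the return is just the sampled reward). The arm payoffs, noise scale, and step size are all made up for the example:

```python
import numpy as np

# Minimal REINFORCE on a two-armed bandit; arm 1 pays more (hypothetical rewards).
rng = np.random.default_rng(0)
theta = np.zeros(2)                          # softmax logits
alpha = 0.1
reward_means = np.array([0.0, 1.0])

for _ in range(2000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(2, p=pi)                  # sample an action from the policy
    g = reward_means[a] + rng.normal(scale=0.1)   # sampled return G
    score = -pi                              # grad_theta log pi(a) for softmax:
    score[a] += 1.0                          # onehot(a) - pi
    theta += alpha * score * g               # REINFORCE update

pi = np.exp(theta - theta.max())
pi /= pi.sum()
assert pi[1] > 0.8                           # policy concentrates on the better arm
```

Even on this trivial problem the per-step updates are noisy; the policy only reliably concentrates on the better arm after many samples, which previews the variance problem discussed above.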
Actor-Critic Methods
Instead of waiting for a full trajectory to compute $G_t$, use a learned value function $V_\phi$ as both a baseline and a bootstrap target:

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

This is the TD(0) advantage estimate, the one-step TD residual $\delta_t$. Update the policy (actor) using:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

Update the value function (critic) by minimizing $\big(V_\phi(s_t) - G_t\big)^2$ or using TD learning. Actor-critic trades some bias (from bootstrapping) for dramatically lower variance compared to REINFORCE. The 1-step residual and the Monte Carlo return are two ends of a spectrum; n-step returns and the GAE estimator below interpolate between them.
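A single actor-critic update can be sketched in a few lines. All numbers below (critic values, reward, learning rates) are hypothetical, and the policy is a softmax over two actions in one state:

```python
import numpy as np

# One TD(0) actor-critic update; all quantities are illustrative.
gamma = 0.99
alpha_actor, alpha_critic = 0.05, 0.1

theta = np.array([0.3, -0.1])            # softmax logits for 2 actions in state s
v_s, v_s_next = 0.2, 0.5                 # critic estimates V_phi(s), V_phi(s')
a, r = 0, 1.0                            # observed action and reward

delta = r + gamma * v_s_next - v_s       # one-step TD residual = advantage estimate

pi = np.exp(theta - theta.max())
pi /= pi.sum()
score = -pi                              # grad_theta log pi(a | s) for softmax
score[a] += 1.0

theta = theta + alpha_actor * score * delta   # actor: push up the taken action
v_s = v_s + alpha_critic * delta              # critic: semi-gradient TD(0) step
```

Because `delta` is positive here, the logit of the taken action increases and the critic's estimate of $V(s)$ moves toward the TD target, the two halves of every actor-critic iteration.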
Natural Policy Gradient
The natural gradient is due to Amari (1998), "Natural Gradient Works Efficiently in Learning," which applies information geometry to parameter estimation. Kakade (2001), "A Natural Policy Gradient," transferred the idea to RL by identifying the policy Fisher information as the correct metric on policy space.
Natural Policy Gradient
Statement
The Fisher information matrix of the policy is:

$$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\big]$$

The natural policy gradient is:

$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$
This is the steepest ascent direction in the KL-divergence metric on policy space, rather than in Euclidean parameter space.
Intuition
The standard gradient depends on the parameterization: reparameterizing changes the gradient direction. The natural gradient is invariant to reparameterization. It asks: what is the best policy update of a given KL size? This is the right question because we care about how much the policy changes, not how much the parameters change.
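This can be made concrete for a tabular softmax policy over one state, where the Fisher matrix has the closed form $F = \mathrm{diag}(\pi) - \pi\pi^\top$. The sketch below verifies that identity; since this $F$ is singular (the logits are shift-invariant), the natural-gradient direction is computed with a pseudoinverse. The gradient vector is made up for illustration:

```python
import numpy as np

# Natural gradient for a tabular softmax policy (one state), illustrative only.
pi = np.array([0.7, 0.2, 0.1])
scores = np.eye(3) - pi                      # row a = grad log pi(a) for softmax
fisher = (pi[:, None] * scores).T @ scores   # F = sum_a pi_a s_a s_a^T

g = np.array([0.5, -0.2, -0.3])              # hypothetical vanilla gradient
nat_g = np.linalg.pinv(fisher) @ g           # F is singular for softmax; use pinv

# The Fisher matrix equals diag(pi) - pi pi^T for a softmax policy.
assert np.allclose(fisher, np.diag(pi) - np.outer(pi, pi))
```

Note how $F$ downweights directions that barely change the distribution: dividing by $F$ rescales the raw gradient so that a step of fixed size corresponds to a fixed amount of policy change, not parameter change.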
Why It Matters
The natural policy gradient leads to more stable updates and is the theoretical foundation for TRPO. PPO shares the KL-bounded surrogate framing but is a first-order method, not a natural-gradient approximation.
Connection to Trust Regions
The natural policy gradient motivates trust region methods:
TRPO (Trust Region Policy Optimization), Schulman et al. (2015), arXiv 1502.05477. Maximize the surrogate objective subject to $\bar{D}_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big) \le \delta$. The step direction is computed by linearizing the objective and quadratically approximating the KL constraint, which reduces to solving $F(\theta)\, x = \nabla_\theta J(\theta)$ for the natural-gradient direction. TRPO solves this linear system with conjugate gradients using Fisher-vector products (no explicit Fisher matrix is formed), then performs a backtracking line search along that direction to enforce the exact KL bound and verify surrogate improvement. The CG-plus-line-search structure is what makes TRPO a trust-region method, not a pure natural-gradient step.
PPO (Proximal Policy Optimization), Schulman et al. (2017). Replace the hard KL constraint with a clipped surrogate:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t\big)\Big]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. PPO is a first-order method that takes Adam steps on $L^{\mathrm{CLIP}}$ with standard backpropagation. It is not an approximate natural-gradient method, despite a common misreading. Engstrom et al. (2020), "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729), and Ilyas et al. (2020), "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553), show that PPO's empirical advantage over vanilla policy gradient comes substantially from implementation details: value-function clipping, reward and advantage normalization, orthogonal weight initialization, learning-rate annealing, and observation normalization. The clipped objective alone does not account for PPO's reported gains. PPO is the algorithm used in the RLHF pipeline for language model training, but using it should not be confused with approximating the natural gradient.
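The clipping behavior is easy to see in code. The sketch below implements the per-sample clipped surrogate (the helper name is ours, not from the paper):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# A positive advantage caps the upside once the ratio exceeds 1 + eps ...
assert abs(ppo_clip_objective(1.5, adv=2.0) - (1.2 * 2.0)) < 1e-12
# ... while a negative advantage caps how far the ratio can be pushed down.
assert abs(ppo_clip_objective(0.5, adv=-2.0) - (0.8 * -2.0)) < 1e-12
```

The min makes the bound one-sided: the objective never rewards moving the ratio further than $1 \pm \epsilon$ in the profitable direction, but it still penalizes updates that already overshot.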
Generalized Advantage Estimation
REINFORCE uses the full Monte Carlo return $G_t$. The 1-step TD advantage uses $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. These are two endpoints of a spectrum. n-step returns sit between them:

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)$$

Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438), introduce the GAE estimator as an exponentially weighted average over n-step advantages:

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

Setting $\lambda = 0$ recovers the 1-step TD advantage (low variance, high bias from $V$). Setting $\lambda = 1$ recovers Monte Carlo with a value baseline (unbiased, high variance). Intermediate $\lambda$ trades bias against variance. GAE is the standard advantage estimator used with PPO and TRPO in practice.
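A minimal GAE implementation is a single backward pass over TD residuals. The sketch below is our own helper (not the authors' reference code) on a made-up three-step trajectory, and it checks the $\lambda = 0$ endpoint:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda): exponentially weighted sum of TD residuals.
    `values` has length len(rewards) + 1 (bootstrap value for the final state)."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t for each step
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):                # backward recursion:
        running = deltas[t] + gamma * lam * running       # A_t = delta_t + (g*l) A_{t+1}
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 1.0])          # hypothetical trajectory
values = np.array([0.5, 0.4, 0.3, 0.0])      # hypothetical critic estimates

# lambda = 0 reduces to the one-step TD residual delta_t.
assert np.allclose(gae(rewards, values, lam=0.0),
                   rewards + 0.99 * values[1:] - values[:-1])
```

The backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$ computes the infinite sum in one linear pass, which is how GAE is implemented in practice.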
Actor-Critic Variants
Beyond the vanilla actor-critic above, several variants are in common use:
- A2C and A3C (Mnih et al. 2016, arXiv 1602.01783). Synchronous and asynchronous parallel actor-learners sharing a global network. A3C predates GPU-dominant training; A2C is the synchronous equivalent that is typically preferred on modern hardware.
- DPG and DDPG. The deterministic policy gradient theorem (Silver et al. 2014) extends the stochastic PG theorem to deterministic policies $\mu_\theta(s)$: the gradient becomes $\nabla_\theta J(\theta) = \mathbb{E}_s\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\big]$. DDPG (Lillicrap et al. 2016, arXiv 1509.02971) is the deep, off-policy actor-critic instantiation for continuous control.
- SAC (Haarnoja et al. 2018, arXiv 1801.01290). Soft Actor-Critic adds an entropy bonus to the reward, optimizing $\mathbb{E}\big[\sum_t r_t + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big]$. Off-policy, stochastic, and the dominant algorithm on many continuous-control benchmarks.
Why the Log-Derivative Trick Works
The identity $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$ converts a gradient of a probability into an expectation under that probability. This is the same trick used in variational inference (the ELBO gradient) and in score-based diffusion models. It works whenever you need to differentiate an expectation with respect to parameters of the distribution.

The key property is that $\mathbb{E}_{x \sim p_\theta}\big[\nabla_\theta \log p_\theta(x)\big] = \nabla_\theta \sum_x p_\theta(x) = \nabla_\theta 1 = 0$: the score function has zero mean. This is what makes baselines valid.
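For a softmax distribution the score of outcome $x$ is the one-hot vector minus the probabilities, so the zero-mean property can be checked directly (logits here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(size=5)
p = np.exp(logits - logits.max())
p /= p.sum()                     # softmax distribution p_theta

# For softmax, grad_theta log p(x) = onehot(x) - p; row x below is that score.
scores = np.eye(5) - p
mean_score = p @ scores          # E_{x ~ p}[grad_theta log p(x)]

assert np.allclose(mean_score, 0.0)
```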
Common Confusions
Policy gradients are not backpropagation through the environment
A common misconception is that policy gradients differentiate through the environment dynamics. They do not. The gradient flows only through $\pi_\theta$, the policy network. The environment is treated as a black box that produces rewards and next states. This is why policy gradients work in environments where the dynamics are unknown.
REINFORCE is unbiased but not practical without variance reduction
The raw REINFORCE estimator has the correct expectation but enormous variance. A single trajectory might have a return of 100 or -50 depending on luck. Without a baseline, you need an impractical number of samples. The advantage function and multiple-sample estimators are not optional extras; they are necessary for the algorithm to work at all.
Natural gradient is not just preconditioning
While the natural gradient can be seen as preconditioning the gradient by $F(\theta)^{-1}$, its justification is geometric: it is the steepest ascent direction when distance is measured by KL divergence between policies, not Euclidean distance between parameters. This distinction matters because the same KL change can correspond to very different parameter changes depending on the region of parameter space.
Summary
- The policy gradient is $\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$
- The log-derivative trick converts a gradient of an expectation into an expectation of a gradient
- Subtracting a state-dependent baseline does not change the expected gradient but reduces variance
- The value function $V^\pi$ is the canonical baseline; the advantage $A^\pi$ is what remains after subtraction, not the baseline itself
- The exact variance-minimizing baseline is a score-norm-weighted average of $Q$-values (Weaver and Tao 2001, Greensmith et al. 2004), not $V^\pi$
- Actor-critic: learn $V_\phi$ as a baseline, bootstrap to reduce variance at the cost of some bias
- GAE interpolates between 1-step TD advantage and the full Monte Carlo return
- Natural policy gradient (Amari 1998, Kakade 2001) uses Fisher information to make updates invariant to parameterization
- TRPO (Schulman et al. 2015) is a practical natural-gradient method using conjugate gradients and a backtracking line search
- PPO is a first-order clipped-objective method, not a natural-gradient approximation; its empirical gains come substantially from implementation details
Exercises
Problem
Show that $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$ for any state $s$, assuming $\pi_\theta(a \mid s) > 0$ for all $a$.
Problem
For a policy that is softmax over two actions with logits $\theta_1, \theta_2$, compute $\nabla_\theta \log \pi_\theta(a_1 \mid s)$.
Problem
Prove that the variance-minimizing baseline for the REINFORCE estimator is $b^*(s) = \dfrac{\mathbb{E}_{a \sim \pi_\theta}\big[\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2\, Q^\pi(s, a)\big]}{\mathbb{E}_{a \sim \pi_\theta}\big[\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2\big]}$ (a weighted average of $Q$-values, not simply $V^\pi(s)$).
Related Comparisons
References
Canonical:
- Sutton, McAllester, Singh, Mansour, "Policy Gradient Methods for RL with Function Approximation" (2000)
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist RL" (1992). Original REINFORCE.
- Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation (1998). Origin of the natural gradient.
- Kakade, "A Natural Policy Gradient" (2001). Natural gradient transferred to RL.
- Silver, Lever, Heess, Degris, Wierstra, Riedmiller, "Deterministic Policy Gradient Algorithms" (2014). DPG theorem.
Variance reduction and advantage estimation:
- Weaver and Tao, "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning" (2001)
- Greensmith, Bartlett, Baxter, "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning" (2004). Exact minimum-variance baseline.
- Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438, 2016). GAE.
- Thomas, "Bias in Natural Actor-Critic Algorithms" (2014). Discounted vs. undiscounted state distribution.
- Nota and Thomas, "Is the Policy Gradient a Gradient?" (2020)
Trust-region and PPO:
- Schulman, Levine, Abbeel, Jordan, Moritz, "Trust Region Policy Optimization" (arXiv 1502.05477, 2015). TRPO with CG and line search
- Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (arXiv 1707.06347, 2017)
- Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729, 2020)
- Ilyas, Engstrom, Santurkar, Tsipras, Janoos, Rudolph, Madry, "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553, 2020)
Actor-critic variants:
- Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning" (arXiv 1602.01783, 2016). A3C and A2C
- Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, Wierstra, "Continuous Control with Deep Reinforcement Learning" (arXiv 1509.02971, 2016). DDPG
- Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (arXiv 1801.01290, 2018)
Textbook:
- Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 3
Next Topics
The natural next steps from policy gradients:
- RLHF and alignment: applying PPO to fine-tune language models from human preferences
- Transformer architecture: the neural network parameterizing $\pi_\theta$ in modern LLMs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)