
RL Theory

Policy Gradient Theorem

The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.


Why This Matters

The Bellman equations give us a way to solve MDPs when we know the transition model and the state space is small. But what about Atari games with pixel inputs? Robot control with continuous actions? Language model fine-tuning?

In these settings, we parameterize the policy as \pi_\theta (e.g., a neural network) and optimize the parameters \theta by gradient ascent on the expected return. The policy gradient theorem tells us how to compute this gradient, even though the expected return depends on the entire trajectory distribution, which itself depends on \theta in a complicated way.

This theorem is the foundation of REINFORCE, actor-critic methods, PPO, and the RLHF pipeline used to train modern language models.

Mental Model

We want to maximize J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)], the expected total reward under the policy. The challenge is that changing \theta changes the probability of every trajectory \tau, which is a product of many policy and transition probabilities.

The log-derivative trick resolves this: instead of differentiating through the trajectory distribution directly, we express the gradient as an expectation under the current policy. This means we can estimate the gradient by sampling trajectories; no model of the environment is needed.

Formal Setup and Notation

We work in the MDP framework (\mathcal{S}, \mathcal{A}, P, R, \gamma) with a parameterized stochastic policy \pi_\theta(a|s). A trajectory is \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots).

Definition

Policy Objective

The policy objective is the expected discounted return:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = \sum_s d^{\pi_\theta}_\gamma(s) \sum_a \pi_\theta(a|s) Q^{\pi_\theta}(s,a)

where d^{\pi_\theta}_\gamma(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi_\theta) is the discounted state-visitation distribution. This is the distribution that makes the Sutton et al. (2000) policy gradient theorem exact. In most deep-RL implementations the gradient is estimated using the undiscounted empirical distribution of states visited along sampled trajectories, which introduces a well-documented bias. See Thomas (2014), "Bias in Natural Actor-Critic Algorithms," and Nota and Thomas (2020), "Is the Policy Gradient a Gradient?"

Definition

Advantage Function

The advantage function measures how much better action a is compared to the average action under \pi:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

By construction, \sum_a \pi(a|s) A^\pi(s,a) = 0 for all s.
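This zero-mean property is easy to verify numerically. A minimal check with arbitrary (illustrative) logits and Q-values for a single state:

```python
import math

# Verify sum_a pi(a|s) A(s,a) = 0 for one state. The logits and
# Q-values below are arbitrary illustrative numbers.
logits = [0.2, -0.3, 1.0]
z = sum(math.exp(l) for l in logits)
pi = [math.exp(l) / z for l in logits]
Q = [1.0, -2.0, 0.5]                      # arbitrary Q-values
V = sum(p * q for p, q in zip(pi, Q))     # V(s) = E_{a~pi}[Q(s,a)]
A = [q - V for q in Q]                    # advantage per action
mean_adv = sum(p * a for p, a in zip(pi, A))
print(mean_adv)                           # 0 up to float error
```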

Main Theorems

Theorem

Policy Gradient Theorem

Statement

The gradient of the policy objective is:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s,a)\right]

Equivalently, summing over time steps along a trajectory:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t) \, Q^{\pi_\theta}(s_t, a_t)\right]

Intuition

\nabla_\theta \log \pi_\theta(a|s) points in the direction that increases the probability of action a in state s. The theorem says: weight this direction by how good that action is (its Q-value), and average over the states and actions you actually visit. Actions that lead to high returns get their probabilities increased; actions that lead to low returns get decreased.

Proof Sketch

Write V^{\pi_\theta}(s) = \sum_a \pi_\theta(a \mid s) Q^{\pi_\theta}(s,a) and differentiate. Using Q^{\pi_\theta}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^{\pi_\theta}(s'):

\nabla_\theta V^{\pi_\theta}(s) = \sum_a \bigl[ \nabla_\theta \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s,a) + \pi_\theta(a \mid s) \, \nabla_\theta Q^{\pi_\theta}(s,a) \bigr].

The r(s,a) term has no \theta-dependence, so \nabla_\theta Q^{\pi_\theta}(s,a) = \gamma \sum_{s'} P(s' \mid s,a) \nabla_\theta V^{\pi_\theta}(s'). Substituting gives a one-step recursion on \nabla_\theta V^{\pi_\theta}.

Unrolling the recursion yields

\nabla_\theta V^{\pi_\theta}(s_0) = \sum_{s} \underbrace{\sum_{k=0}^{\infty} \gamma^k \Pr(s_0 \to s \text{ in } k \text{ steps}; \pi_\theta)}_{\eta(s)} \, \sum_a \nabla_\theta \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s,a),

where \eta(s) is an unnormalized discounted visit count. The normalization is exact: d^{\pi_\theta}_\gamma(s) = (1-\gamma) \, \eta(s), since \sum_s \eta(s) = \sum_{k=0}^\infty \gamma^k = 1/(1-\gamma). The (1-\gamma) factor matters: Sutton, McAllester, Singh, Mansour (2000) state it explicitly, but many practical implementations drop it and use an estimator based on the undiscounted empirical state distribution, which is biased. See the Thomas (2014) and Nota and Thomas (2020) references above.

The crucial thing this calculation delivers is that the messy \nabla_\theta Q^{\pi_\theta} and \nabla_\theta d^{\pi_\theta} terms telescope into the discounted state-visitation weights; the gradient acts on \pi_\theta only.

Now apply the log-derivative trick \nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s) and rescale \eta by (1-\gamma) to the distribution d^{\pi_\theta}_\gamma:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\bigl[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s,a) \bigr].

The log-derivative step is what turns a sum over the discrete action set into an expectation you can estimate with samples, which is why this theorem is the starting point for every practical policy-gradient algorithm.

Why It Matters

This theorem makes policy optimization tractable. To estimate \nabla_\theta J(\theta), sample trajectories under \pi_\theta, compute \nabla_\theta \log \pi_\theta(a_t|s_t) (easy: just backprop through the policy network), multiply by a return estimate, and average. No environment model is needed. This is the basis of all policy gradient algorithms.

Failure Mode

The naive estimator has extremely high variance. A single trajectory gives a very noisy gradient estimate. Variance reduction techniques (baselines, advantage estimation) are essential for practical use.

Proposition

Baseline Does Not Change the Gradient

Statement

For any function b: \mathcal{S} \to \mathbb{R} that depends only on the state:

\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s) \, b(s)\right] = 0

Therefore, replacing Q^{\pi_\theta}(s,a) with A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) in the policy gradient does not change its expectation but reduces variance.

Intuition

The score function \nabla_\theta \log \pi_\theta(a|s) has zero expectation under \pi_\theta (a standard property of score functions), so subtracting any state-dependent constant from the reward does not bias the gradient. It does, however, change the variance, and V^\pi(s) is usually a good baseline because it centers the returns around zero.

Proof Sketch

\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \, b(s)] = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s) \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \nabla_\theta 1 = 0

Why It Matters

This is why subtracting the value function baseline V^\pi(s) to form the advantage A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) is everywhere in modern RL. The baseline is V; the advantage is what remains after subtracting it. Using the advantage instead of the raw Q-value dramatically reduces variance, making practical training possible. The value function V^\pi is the standard practical baseline, not the variance-minimizing one; the exact minimum-variance baseline is derived in the exercises.
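For a softmax policy the zero-expectation identity can be checked exactly, since the score has the closed form \nabla_{\theta_i} \log \pi_\theta(a|s) = \mathbf{1}[i = a] - \pi_\theta(i|s). A minimal sketch with arbitrary illustrative logits and baseline value:

```python
import math

# Check E_{a~pi}[ grad log pi(a|s) * b(s) ] = 0 for a 3-action softmax
# policy. Logits and baseline b are arbitrary illustrative numbers.
logits = [0.5, -1.0, 2.0]
z = sum(math.exp(l) for l in logits)
pi = [math.exp(l) / z for l in logits]

def score(a):
    # For softmax: d(log pi(a)) / d(theta_i) = 1[i == a] - pi[i]
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(pi))]

b = 3.7  # any state-dependent constant
expectation = [sum(pi[a] * score(a)[i] * b for a in range(len(pi)))
               for i in range(len(pi))]
print(expectation)  # each component is 0 up to float error
```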

The REINFORCE Algorithm

The simplest policy gradient algorithm: sample a trajectory \tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T) under \pi_\theta, then update:

\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t

where G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k is the return from time t.

REINFORCE is unbiased but has high variance. In practice, subtract a baseline:

\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, (G_t - b(s_t))
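The update above can be sketched end to end on the smallest possible example. This is a hedged illustration, not a reference implementation: the environment is a toy 2-armed bandit (one state, one step per episode), and the rewards, learning rates, and episode count are illustrative assumptions.

```python
import math, random

# REINFORCE on a toy 2-armed bandit with a running-mean baseline.
# All constants below are illustrative assumptions.
random.seed(0)
theta = [0.0, 0.0]          # policy logits for 2 actions
alpha = 0.1                 # policy step size
rewards = [1.0, 0.0]        # action 0 is strictly better
baseline = 0.0              # running mean of observed returns, b(s)

for episode in range(500):
    z = sum(math.exp(t) for t in theta)
    pi = [math.exp(t) / z for t in theta]
    a = random.choices([0, 1], weights=pi)[0]
    G = rewards[a]                       # single-step return G_t
    adv = G - baseline                   # G_t - b(s_t)
    baseline += 0.05 * (G - baseline)    # update the baseline
    # For softmax, grad of log pi(a) wrt logit i is 1[i == a] - pi[i].
    for i in range(2):
        theta[i] += alpha * ((1.0 if i == a else 0.0) - pi[i]) * adv

z = sum(math.exp(t) for t in theta)
pi_final = [math.exp(t) / z for t in theta]
print(pi_final)   # probability of action 0 should approach 1
```

The running-mean baseline stands in for b(s_t); with a single state, it converges to the average return, which is exactly the V^\pi baseline of the proposition above.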

Actor-Critic Methods

Instead of waiting for a full trajectory to compute G_t, use a learned value function V_\phi(s) as both a baseline and a bootstrap target:

\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) = \delta_t

This is the TD(0) advantage estimate, the one-step TD residual \delta_t. Update the policy (actor) using:

\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \, \hat{A}_t

Update the value function (critic) by minimizing (V_\phi(s_t) - G_t)^2 or using TD learning. Actor-critic trades some bias (from bootstrapping) for dramatically lower variance compared to REINFORCE. The 1-step residual \delta_t and the Monte Carlo return G_t are two ends of a spectrum; n-step returns and the GAE(\lambda) estimator below interpolate between them.
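The actor and critic updates can be sketched on the same toy bandit. This is a hedged illustration under strong simplifying assumptions: a single state (so the critic V is one scalar), episodes ending after one step (so the TD target has no \gamma V(s') term), and illustrative constants throughout.

```python
import math, random

# One-step actor-critic on a toy 2-armed bandit.
# All constants below are illustrative assumptions.
random.seed(0)
theta = [0.0, 0.0]                      # actor logits
V = 0.0                                 # critic estimate V_phi(s)
alpha_actor, alpha_critic = 0.1, 0.1
rewards = [1.0, 0.0]

for step in range(500):
    z = sum(math.exp(t) for t in theta)
    pi = [math.exp(t) / z for t in theta]
    a = random.choices([0, 1], weights=pi)[0]
    r = rewards[a]
    delta = r - V                       # TD(0) residual (terminal next state)
    V += alpha_critic * delta           # critic moves toward the TD target
    for i in range(2):                  # actor update weighted by delta
        theta[i] += alpha_actor * ((1.0 if i == a else 0.0) - pi[i]) * delta

z = sum(math.exp(t) for t in theta)
pi_final = [math.exp(t) / z for t in theta]
print(pi_final, V)   # pi(action 0) and V should both approach 1
```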

Natural Policy Gradient

The natural gradient is due to Amari (1998), "Natural Gradient Works Efficiently in Learning," which applies information geometry to parameter estimation. Kakade (2001), "A Natural Policy Gradient," transferred the idea to RL by identifying the policy Fisher information as the correct metric on policy space.

Proposition

Natural Policy Gradient

Statement

The Fisher information matrix of the policy is:

F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \, \nabla_\theta \log \pi_\theta(a|s)^\top\right]

The natural policy gradient is:

\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta)

This is the steepest ascent direction in the KL-divergence metric on policy space, rather than in Euclidean parameter space.

Intuition

The standard gradient depends on the parameterization: reparameterizing \theta changes the gradient direction. The natural gradient is invariant to reparameterization. It asks: what is the best policy update of a given KL size? This is the right question because we care about how much the policy changes, not how much the parameters change.

Why It Matters

The natural policy gradient leads to more stable updates and is the theoretical foundation for TRPO. PPO shares the KL-bounded surrogate framing but is a first-order method, not a natural-gradient approximation.
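For a single-state softmax policy, both F(\theta) and the natural gradient can be computed exactly. A hedged numeric sketch (logits and Q-values are illustrative; the damping constant 1e-3 is an assumption, needed because the softmax Fisher matrix is singular since logits are shift-invariant):

```python
import numpy as np

# Exact Fisher matrix and damped natural gradient for a 3-action
# softmax policy in one state. Numbers are illustrative.
logits = np.array([0.5, -1.0, 2.0])
pi = np.exp(logits) / np.exp(logits).sum()
Q = np.array([1.0, 0.0, 0.5])

# Row a of `scores` is the score vector for action a: e_a - pi.
scores = np.eye(3) - pi

# Vanilla policy gradient: sum_a pi(a) * score(a) * Q(a)
grad = (pi[:, None] * scores * Q[:, None]).sum(axis=0)

# Fisher matrix: sum_a pi(a) * score(a) score(a)^T  (rank-deficient)
F = (pi[:, None, None] * (scores[:, :, None] * scores[:, None, :])).sum(axis=0)

# Damped natural gradient, as used in practice (e.g. TRPO's CG solve).
nat_grad = np.linalg.solve(F + 1e-3 * np.eye(3), grad)
print(grad, nat_grad)
```

A useful sanity check: the components of the vanilla gradient sum to zero, because each softmax score vector sums to zero.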

Connection to Trust Regions

The natural policy gradient motivates trust region methods:

TRPO (Trust Region Policy Optimization), Schulman et al. (2015), arXiv 1502.05477. Maximize the surrogate objective \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}(s,a)\right] subject to \text{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \leq \delta. The step direction is computed by linearizing the objective and quadratically approximating the KL constraint, which reduces to solving F \mathbf{v} = \nabla_\theta L for the natural-gradient direction. TRPO solves this linear system with conjugate gradients using Fisher-vector products (no explicit Fisher matrix is formed), then performs a backtracking line search along that direction to enforce the exact KL bound and verify surrogate improvement. The CG plus line-search structure is what makes TRPO a trust-region method, not a pure natural-gradient step.

PPO (Proximal Policy Optimization), Schulman et al. (2017). Replace the hard KL constraint with a clipped surrogate:

L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

where r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t). PPO is a first-order method that takes Adam steps on L^{\text{CLIP}} with standard backpropagation. It is not an approximate natural-gradient method, despite a common misreading. Engstrom et al. (2020), "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729), and Ilyas et al. (2020), "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553), show that PPO's empirical advantage over vanilla policy gradient comes substantially from implementation details: value-function clipping, reward and advantage normalization, orthogonal weight initialization, learning-rate annealing, and observation normalization. The clipped objective alone does not account for PPO's reported gains. PPO is the algorithm used in the RLHF pipeline for language model training, but using it should not be confused with approximating the natural gradient.
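The clipped objective itself is a few lines of arithmetic. A hedged sketch on a hand-picked batch of (ratio, advantage) pairs (the batch values are illustrative; \epsilon = 0.2 is the value suggested in the PPO paper):

```python
# PPO clipped surrogate on a small illustrative batch.
eps = 0.2

def clipped_term(ratio, adv):
    # min(r * A, clip(r, 1-eps, 1+eps) * A): the min makes the
    # surrogate a pessimistic (lower) bound, removing the incentive
    # to move the ratio far outside [1-eps, 1+eps].
    clipped = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * adv, clipped * adv)

ratios = [1.0, 1.5, 0.5, 1.1]   # pi_theta / pi_theta_old, illustrative
advs   = [2.0, 2.0, -1.0, -0.5]
L_clip = sum(clipped_term(r, a) for r, a in zip(ratios, advs)) / len(ratios)
print(L_clip)  # ≈ 0.7625
```

Note how the second sample (ratio 1.5, positive advantage) contributes only 1.2 * 2.0: once the ratio exceeds 1 + \epsilon, pushing it further yields no additional objective, which is the whole point of the clip.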

Generalized Advantage Estimation

REINFORCE uses the full Monte Carlo return G_t. The 1-step TD advantage uses \hat{A}_t^{(1)} = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) = \delta_t. These are two endpoints of a spectrum; n-step returns sit between them:

\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l \delta_{t+l}

Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438), introduce the GAE(\lambda) estimator as an exponentially weighted average over n-step advantages:

\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)

Setting \lambda = 0 recovers the 1-step TD advantage (low variance, high bias from V_\phi). Setting \lambda = 1 recovers Monte Carlo with a value baseline (unbiased, high variance). Intermediate \lambda \in (0,1) trades bias against variance. GAE is the standard advantage estimator used with PPO and TRPO in practice.
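The infinite sum above telescopes into a single backward pass, \hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}, which is how GAE is computed in practice. A hedged sketch with illustrative rewards and value estimates:

```python
# GAE(lambda) by a backward recursion over one trajectory.
# Rewards and values are illustrative; `last_value` would be
# V(s_T) when truncating a non-terminal episode (0.0 here: terminal).
gamma, lam = 0.99, 0.95
rewards = [1.0, 0.0, -0.5, 2.0]
values  = [0.5, 0.4, 0.6, 0.1]   # V_phi(s_t) for t = 0..3
last_value = 0.0

advantages = [0.0] * len(rewards)
gae = 0.0
for t in reversed(range(len(rewards))):
    next_v = values[t + 1] if t + 1 < len(values) else last_value
    delta = rewards[t] + gamma * next_v - values[t]   # TD residual
    gae = delta + gamma * lam * gae                   # recursive GAE form
    advantages[t] = gae

print(advantages)
```

At the final step the recursion reduces to the bare residual \delta_T, and each earlier advantage folds in the future residuals with weight (\gamma\lambda)^l, matching the sum in the definition.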

Actor-Critic Variants

Beyond the vanilla actor-critic above, several variants are in common use:

  • A2C and A3C (Mnih et al. 2016, arXiv 1602.01783). Synchronous and asynchronous parallel actor-learners sharing a global network. A3C predates GPU-dominant training; A2C is the synchronous equivalent that is typically preferred on modern hardware.
  • DPG and DDPG. The deterministic policy gradient theorem (Silver et al. 2014) extends the stochastic PG theorem to deterministic policies a = \mu_\theta(s): the gradient becomes \nabla_\theta J = \mathbb{E}_{s \sim d^\mu}[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s,a)|_{a = \mu_\theta(s)}]. DDPG (Lillicrap et al. 2016, arXiv 1509.02971) is the deep, off-policy actor-critic instantiation for continuous control.
  • SAC (Haarnoja et al. 2018, arXiv 1801.01290). Soft Actor-Critic adds an entropy bonus to the reward, optimizing \mathbb{E}[\sum_t r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))]. Off-policy, stochastic, and the dominant algorithm on many continuous-control benchmarks.

Why the Log-Derivative Trick Works

The identity \nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) converts a gradient of a probability into an expectation under that probability. This is the same trick used in variational inference (the ELBO gradient) and in score-based diffusion models. It works whenever you need to differentiate an expectation with respect to parameters of the distribution.

The key property is that \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0: the score function has zero mean. This is what makes baselines valid.

Common Confusions

Watch Out

Policy gradients are not backpropagation through the environment

A common misconception is that policy gradients differentiate through the environment dynamics. They do not. The gradient flows only through \log \pi_\theta, the policy network; the environment is treated as a black box that produces rewards and next states. This is why policy gradients work in environments where the dynamics are unknown.

Watch Out

REINFORCE is unbiased but not practical without variance reduction

The raw REINFORCE estimator has the correct expectation but enormous variance. A single trajectory might have a return of 100 or -50 depending on luck. Without a baseline, you need an impractical number of samples. The advantage function and multiple-sample estimators are not optional extras; they are necessary for the algorithm to work at all.

Watch Out

Natural gradient is not just preconditioning

While the natural gradient can be seen as preconditioning the gradient by F^{-1}, its justification is geometric: it is the steepest ascent direction when distance is measured by KL divergence between policies, not Euclidean distance between parameters. This distinction matters because the same KL change can correspond to very different parameter changes depending on the region of parameter space.

Summary

  • The policy gradient is \nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s,a)]
  • The log-derivative trick converts a gradient of an expectation into an expectation of a gradient
  • Subtracting a state-dependent baseline does not change the expected gradient but reduces variance
  • The value function V^\pi is the canonical baseline; the advantage A^\pi = Q^\pi - V^\pi is what remains after subtraction, not the baseline itself
  • The exact variance-minimizing baseline is a \|\nabla \log \pi\|^2-weighted average of Q-values (Weaver and Tao 2001; Greensmith et al. 2004), not V^\pi
  • Actor-critic: learn V_\phi as a baseline, bootstrap to reduce variance at the cost of some bias
  • GAE(\lambda) interpolates between the 1-step TD advantage and the full Monte Carlo return
  • Natural policy gradient (Amari 1998, Kakade 2001) uses Fisher information to make updates invariant to parameterization
  • TRPO (Schulman et al. 2015) is a practical natural-gradient method using conjugate gradients and a backtracking line search
  • PPO is a first-order clipped-objective method, not a natural-gradient approximation; its empirical gains come substantially from implementation details

Exercises

ExerciseCore

Problem

Show that \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0 for any state s, assuming \pi_\theta(a|s) > 0 for all a.

ExerciseCore

Problem

For a policy \pi_\theta that is softmax over two actions with logits \theta_1, \theta_2, compute \nabla_\theta \log \pi_\theta(a_1|s).

ExerciseAdvanced

Problem

Prove that the variance-minimizing baseline b^*(s) for the REINFORCE estimator is b^*(s) = \frac{\mathbb{E}_a[\|\nabla_\theta \log \pi_\theta(a|s)\|^2 \, Q^{\pi}(s,a)]}{\mathbb{E}_a[\|\nabla_\theta \log \pi_\theta(a|s)\|^2]} (a weighted average of Q-values, not simply V^\pi(s)).


References

Canonical:

  • Sutton, McAllester, Singh, Mansour, "Policy Gradient Methods for RL with Function Approximation" (2000)
  • Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist RL" (1992): original REINFORCE
  • Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation (1998): origin of the natural gradient
  • Kakade, "A Natural Policy Gradient" (2001): natural gradient transferred to RL
  • Silver, Lever, Heess, Degris, Wierstra, Riedmiller, "Deterministic Policy Gradient Algorithms" (2014): DPG theorem

Variance reduction and advantage estimation:

  • Weaver and Tao, "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning" (2001)
  • Greensmith, Bartlett, Baxter, "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning" (2004): exact minimum-variance baseline
  • Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438, 2016): GAE
  • Thomas, "Bias in Natural Actor-Critic Algorithms" (2014): discounted vs undiscounted state distribution
  • Nota and Thomas, "Is the Policy Gradient a Gradient?" (2020)

Trust-region and PPO:

  • Schulman, Levine, Abbeel, Jordan, Moritz, "Trust Region Policy Optimization" (arXiv 1502.05477, 2015): TRPO with CG and line search
  • Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (arXiv 1707.06347, 2017)
  • Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729, 2020)
  • Ilyas, Engstrom, Santurkar, Tsipras, Janoos, Rudolph, Madry, "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553, 2020)

Actor-critic variants:

  • Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning" (arXiv 1602.01783, 2016): A3C and A2C
  • Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, Wierstra, "Continuous Control with Deep Reinforcement Learning" (arXiv 1509.02971, 2016): DDPG
  • Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (arXiv 1801.01290, 2018)

Textbook:

  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 3


Last reviewed: April 2026
