RL Theory

DDPG: Deep Deterministic Policy Gradient

An off-policy actor-critic algorithm for continuous action spaces that combines a deterministic policy gradient with a DQN-style critic, using replay buffers and Polyak-averaged target networks for stability.

Advanced · Tier 2 · Stable · ~40 min

Why This Matters

Continuous action spaces break the tabular and discrete-action RL toolkit. You cannot iterate over all actions to take a max, and you cannot enumerate categorical policy logits. DDPG (Lillicrap et al., 2016, Continuous Control with Deep Reinforcement Learning) was the first deep RL algorithm to train reliably on high-dimensional continuous control tasks such as the MuJoCo locomotion suite.

DDPG is built on two pieces: the deterministic policy gradient theorem of Silver et al. (2014), which shows that $\nabla_\theta J$ has a clean form when $\pi$ is deterministic, and the DQN recipe of Mnih et al. (2015), which stabilizes off-policy value learning with replay buffers and target networks. DDPG is worth understanding even though TD3 and SAC have superseded it, because its failure modes (overestimation bias, brittle hyperparameters) motivated every continuous-control algorithm that followed.

The Deterministic Policy Gradient

Theorem

Deterministic Policy Gradient

Statement

Let $\mu_\theta: \mathcal{S} \to \mathcal{A}$ be a deterministic policy and $Q^{\mu_\theta}(s, a)$ its action-value function. Then

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\left[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu_\theta}(s, a) \,\big|_{a = \mu_\theta(s)}\right].$$

Intuition

For a stochastic policy, the gradient uses the score function $\nabla_\theta \log \pi_\theta(a \mid s)$. For a deterministic policy there is no randomness over actions, so the gradient flows through the action: the chain rule pushes the gradient from $Q$ backwards through $\mu_\theta$. Conceptually you ask "which way should I nudge the policy so that $Q$ goes up at the action it actually takes?"
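The chain rule can be made concrete in a toy 1-D setting (a hypothetical setup, not from the paper): a linear policy $\mu_\theta(s) = \theta s$ and a known quadratic critic $Q(s, a) = -(a - a^*)^2$, so both gradients are available in closed form.

```python
import numpy as np

# Hypothetical 1-D setup: linear policy mu_theta(s) = theta * s and a known
# quadratic critic Q(s, a) = -(a - a_star)^2, whose peak is at a_star.
theta = 0.0          # policy parameter
a_star = 2.0         # action where Q is maximal
s = 1.0              # a sampled state
lr = 0.1

for _ in range(200):
    a = theta * s                      # a = mu_theta(s)
    dQ_da = -2.0 * (a - a_star)        # critic's action gradient at a = mu_theta(s)
    dmu_dtheta = s                     # d mu_theta(s) / d theta
    grad_J = dmu_dtheta * dQ_da        # chain rule: deterministic policy gradient
    theta += lr * grad_J               # gradient ascent on J

print(round(theta * s, 3))  # the policy's action converges to a_star = 2.0
```

No action sampling and no score function appear anywhere: the update is driven entirely by the critic's action gradient evaluated at the policy's own action.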

Proof Sketch

Silver et al. (2014) prove this by differentiating $J(\mu_\theta) = \int d^{\mu_\theta}(s) \, Q^{\mu_\theta}(s, \mu_\theta(s)) \, ds$ and showing that the term involving $\nabla_\theta d^{\mu_\theta}$ vanishes, leaving the expression above. This is the deterministic analogue of the policy gradient theorem and relies on the same visitation-distribution telescoping.

Why It Matters

This is the theoretical basis of every continuous-control actor-critic algorithm (DDPG, TD3, SAC actor updates). Instead of sampling actions and using a high-variance score-function estimator, the gradient goes straight through $Q$ via the reparameterization-style chain rule, giving much lower variance than a stochastic-policy gradient.

Failure Mode

The theorem requires $\nabla_a Q(s, a)$ to be informative at $a = \mu_\theta(s)$. If $Q$ has sharp peaks or is badly fit, the gradient can point in arbitrary directions. In practice this is why DDPG is brittle to critic quality.

The DDPG Algorithm

DDPG maintains four networks: actor $\mu_\theta$, critic $Q_\phi$, and slow-moving target copies $\mu_{\theta'}$ and $Q_{\phi'}$.

Definition

Polyak (exponential moving) averaging

Target network parameters are updated with a small step $\tau \in (0, 1]$ (typically $\tau = 0.005$) toward the online parameters each training step. This produces a smoothly lagging target instead of a hard copy every $N$ steps as in DQN. The goal is the same: decouple the bootstrap target from the parameters being updated, preventing runaway feedback.
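The update is a one-liner per parameter tensor. A minimal sketch, assuming parameters are stored as parallel lists of NumPy arrays (a stand-in for framework parameter tensors):

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """In-place update: target <- tau * online + (1 - tau) * target."""
    for tgt, src in zip(target_params, online_params):
        tgt *= (1.0 - tau)
        tgt += tau * src

# With tau = 0.5 the target moves exactly halfway toward the online value.
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
polyak_update(target, online, tau=0.5)
print(target[0][0, 0])  # 0.5
```

With the standard $\tau = 0.005$, the target has an effective memory of roughly the last $1/\tau = 200$ online updates.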

Each training step does the following:

  1. Collect a transition $(s_t, a_t, r_t, s_{t+1})$ by acting with $\mu_\theta(s_t) + \epsilon_t$, where $\epsilon_t$ is exploration noise (originally Ornstein-Uhlenbeck, later Gaussian).
  2. Store the transition in a replay buffer $\mathcal{D}$.
  3. Sample a minibatch $B \subset \mathcal{D}$ uniformly.
  4. Critic update: minimize the mean-squared Bellman error

$$L_Q(\phi) = \mathbb{E}_{(s, a, r, s') \sim B}\left[\left(Q_\phi(s, a) - y\right)^2\right], \quad y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s')).$$

  5. Actor update: ascend the deterministic policy gradient

$$\nabla_\theta J \approx \mathbb{E}_{s \sim B}\left[\nabla_\theta \mu_\theta(s) \, \nabla_a Q_\phi(s, a) \,\big|_{a = \mu_\theta(s)}\right].$$

  6. Polyak-update both target networks: $\phi' \leftarrow \tau \phi + (1 - \tau) \phi'$ and similarly for $\theta'$.
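The training-step portion (steps 4-6) can be sketched in PyTorch. This is a minimal, illustrative single-step implementation, not the reference one: network sizes, learning rates, and the fake minibatch at the end are all assumptions.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005

# Actor mu_theta, critic Q_phi, and deep-copied target networks.
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                      nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s2, done):
    # Step 4 -- critic: regress Q_phi(s, a) toward the frozen target y.
    with torch.no_grad():
        a2 = actor_targ(s2)
        y = r + gamma * (1 - done) * critic_targ(torch.cat([s2, a2], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = ((q - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 5 -- actor: ascend Q through the chain rule (minimize -Q).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 6 -- Polyak-update both target networks.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_t in zip(net.parameters(), targ.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
    return critic_loss.item(), actor_loss.item()

# Exercise the step on a fake minibatch (stands in for a replay-buffer sample).
B = 8
losses = ddpg_step(torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
                   torch.randn(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
```

Note that the target-network forward passes sit inside `torch.no_grad()`: gradients must never flow into the bootstrap target.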

Everything is off-policy: the replay buffer mixes data from many past policies, and the deterministic policy gradient remains valid because the action gradient is evaluated at $\mu_\theta(s)$, not at the replayed action.

Why DDPG is Fragile

Watch Out

Overestimation bias in the critic

The critic target $y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s'))$ does not contain an explicit $\max$ over actions, so people often think DDPG avoids the DQN overestimation bias. It does not. Any noise in $Q_\phi$ gets exploited by the actor, which climbs up the noisy gradient until it finds actions where $Q_\phi$ is spuriously high. The critic then fits those bogus targets and the error compounds. This is the failure mode that motivated clipped-double-Q learning in TD3.
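The effect is easy to see numerically in a toy setting (an assumed illustration, not an experiment from the paper): even zero-mean noise on the critic produces a positive bias once an agent maximizes over it.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.linspace(-1, 1, 101)
q_true = -actions ** 2            # true Q peaks at a = 0 with value 0

gaps = []
for _ in range(2000):
    # Zero-mean noise stands in for critic approximation error.
    q_noisy = q_true + rng.normal(0.0, 0.1, size=actions.shape)
    i = np.argmax(q_noisy)        # the "actor" exploits the noisy critic
    gaps.append(q_noisy[i] - q_true[i])

print(np.mean(gaps) > 0)  # True: selected actions look better than they are
```

The maximizer systematically lands on actions whose noise happens to be positive, so the value the critic then bootstraps from is inflated, exactly the mechanism behind max-induced overestimation in DQN.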

Watch Out

Exploration noise is not a regularizer

Adding Gaussian or OU noise to the actor at execution time is an exploration mechanism, not a regularizer on the policy. It does not prevent the critic from overfitting to sharp peaks in action space, because the target $y$ uses the noiseless target action $\mu_{\theta'}(s')$. TD3's target policy smoothing fixes this by adding clipped noise to the target action, which is a different fix.
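For contrast, TD3's target policy smoothing can be sketched as follows (a hedged sketch; the default noise and clip values shown are the ones commonly quoted for TD3, not part of DDPG):

```python
import torch

def smoothed_target_action(actor_targ, s2, noise_std=0.2,
                           noise_clip=0.5, act_limit=1.0):
    """Clipped zero-mean noise on the *target* action, then clamp to bounds."""
    a2 = actor_targ(s2)
    noise = (torch.randn_like(a2) * noise_std).clamp(-noise_clip, noise_clip)
    return (a2 + noise).clamp(-act_limit, act_limit)

# Usage with a stand-in target actor:
a2 = smoothed_target_action(torch.tanh, torch.randn(16, 2))
print(bool(a2.abs().max() <= 1.0))  # True
```

Because the noise enters the critic target rather than the executed action, it averages the bootstrap value over a neighborhood of the target action, penalizing sharp spurious peaks in $Q$.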

Practical Notes

  • Hyperparameter brittleness: DDPG results in the literature vary wildly across seeds. Henderson et al. (2018), Deep Reinforcement Learning that Matters, documented this reproducibility crisis and used DDPG as their lead example.
  • Replay buffer size: typically $10^6$ transitions. Smaller buffers cause catastrophic forgetting; larger ones slow adaptation after the policy improves.
  • Reward scaling: DDPG is extremely sensitive to the reward scale because the critic targets are not normalized. Either scale rewards into $[-1, 1]$ or use a target network update rate tuned to the reward magnitude.
  • Action scaling: the actor usually ends in a $\tanh$ so outputs lie in $[-1, 1]$, then the environment-specific action bounds are applied by affine rescaling.
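The affine rescaling in the last bullet is a one-liner; the bounds below are illustrative:

```python
import numpy as np

def rescale_action(a_tanh, low, high):
    """Map a tanh-squashed action in [-1, 1] affinely onto [low, high]."""
    return low + 0.5 * (a_tanh + 1.0) * (high - low)

# Example with hypothetical bounds [-2, 2]:
print(rescale_action(np.array([-1.0, 0.0, 1.0]), low=-2.0, high=2.0))
# [-2.  0.  2.]
```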

Exercise (Core)

Problem

Write down the critic target $y$ used by DDPG and identify exactly which parameters are frozen (from the target networks) and which are trainable. Why would using $Q_\phi$ and $\mu_\theta$ on the right-hand side instead of $Q_{\phi'}$ and $\mu_{\theta'}$ cause training to diverge?

Exercise (Advanced)

Problem

The deterministic policy gradient theorem assumes $\mu_\theta$ is differentiable in $\theta$ and $Q$ is differentiable in $a$. Construct a simple continuous-action MDP where $Q^{\mu}(s, \cdot)$ has a flat region around the optimal action. What happens to the DDPG actor update on that MDP, and which assumption of the theorem is being stressed?

Relation to Other Algorithms

DDPG sits between DQN and TD3/SAC in the continuous-control lineage:

  • DQN (Mnih et al. 2015) gave the replay-buffer + target-network recipe for discrete actions.
  • DPG (Silver et al. 2014) gave the deterministic policy gradient theorem.
  • DDPG (Lillicrap et al. 2016) combined the two for continuous actions.
  • TD3 (Fujimoto et al. 2018) added clipped double-Q, target policy smoothing, and delayed policy updates to fix DDPG's overestimation bias and brittleness. See the TD3 page.
  • SAC (Haarnoja et al. 2018) replaced the deterministic policy with a stochastic one maximizing entropy-regularized return, trading the clean DPG form for much stronger exploration and robustness.

References

  • Lillicrap, T. P. et al. (2016). Continuous Control with Deep Reinforcement Learning. ICLR 2016. The DDPG paper.
  • Silver, D. et al. (2014). Deterministic Policy Gradient Algorithms. ICML 2014. The DPG theorem.
  • Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518. DQN recipe.
  • Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Chapter 13 on policy gradient methods.
  • Achiam, J. (2018). Spinning Up in Deep RL. OpenAI. The DDPG chapter and pseudocode.
  • Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI. The reproducibility study using DDPG.
  • Fujimoto, S. et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018. TD3 paper that diagnoses DDPG failure modes.

Last reviewed: April 2026