RL Theory

DDPG: Deep Deterministic Policy Gradient

An off-policy actor-critic algorithm for continuous action spaces that combines a deterministic policy gradient with a DQN-style critic, using replay buffers and Polyak-averaged target networks for stability.

Advanced · Tier 2 · Stable · ~40 min

Why This Matters

Continuous action spaces break the tabular and discrete-action RL toolkit. You cannot iterate over all actions to take a max, and you cannot enumerate categorical policy logits. DDPG (Lillicrap et al., 2016, Continuous Control with Deep Reinforcement Learning) was the first deep RL algorithm to train reliably on high-dimensional continuous control tasks such as the MuJoCo locomotion suite.

DDPG is built on two pieces: the deterministic policy gradient theorem of Silver et al. (2014), which shows that $\nabla_\theta J$ has a clean form when $\pi$ is deterministic, and the DQN recipe of Mnih et al. (2015), which stabilizes off-policy value learning with replay buffers and target networks. DDPG is worth understanding even though TD3 and SAC have superseded it, because its failure modes (overestimation bias, brittle hyperparameters) motivated every continuous-control algorithm that followed.

The Deterministic Policy Gradient

Theorem

Deterministic Policy Gradient

Statement

Let $\mu_\theta: \mathcal{S} \to \mathcal{A}$ be a deterministic policy and $Q^{\mu_\theta}(s, a)$ its action-value function. Then

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\left[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu_\theta}(s, a) \,\big|_{a = \mu_\theta(s)}\right].$$

Intuition

For a stochastic policy, the gradient uses the score function $\nabla_\theta \log \pi_\theta(a \mid s)$. For a deterministic policy there is no randomness over actions, so the gradient flows through the action: the chain rule pushes the gradient from $Q$ backwards through $\mu_\theta$. Conceptually you ask "which way should I nudge the policy so that $Q$ goes up at the action it actually takes?"
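The chain rule can be made concrete in a toy 1-D setting (a hypothetical setup, not from the paper): a linear policy $\mu_\theta(s) = \theta s$ and a known quadratic critic $Q(s, a) = -(a - a^*)^2$, so both gradients are available in closed form.

```python
import numpy as np

# Hypothetical 1-D setup: linear policy mu_theta(s) = theta * s and a known
# quadratic critic Q(s, a) = -(a - a_star)^2, whose peak is at a_star.
theta = 0.0          # policy parameter
a_star = 2.0         # action where Q is maximal
s = 1.0              # a sampled state
lr = 0.1

for _ in range(200):
    a = theta * s                      # a = mu_theta(s)
    dQ_da = -2.0 * (a - a_star)        # critic's action gradient at a = mu_theta(s)
    dmu_dtheta = s                     # d mu_theta(s) / d theta
    grad_J = dmu_dtheta * dQ_da        # chain rule: deterministic policy gradient
    theta += lr * grad_J               # gradient ascent on J

print(round(theta * s, 3))  # the policy's action converges to a_star = 2.0
```

No action sampling and no score function appear anywhere: the update is driven entirely by the critic's action gradient evaluated at the policy's own action.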

Proof Sketch

Silver et al. (2014) prove this by differentiating $J(\mu_\theta) = \int d^{\mu_\theta}(s) \, Q^{\mu_\theta}(s, \mu_\theta(s)) \, ds$ and showing that the term involving $\nabla_\theta d^{\mu_\theta}$ vanishes, leaving the expression above. This is the deterministic analogue of the policy gradient theorem and relies on the same visitation-distribution telescoping.

Why It Matters

This is the theoretical basis of every continuous-control actor-critic algorithm (DDPG, TD3, SAC actor updates). Instead of sampling actions and using a high-variance score-function estimator, the gradient goes straight through $Q$ via the reparameterization-style chain rule, giving much lower variance than a stochastic-policy gradient.

Failure Mode

The theorem requires $\nabla_a Q(s, a)$ to be informative at $a = \mu_\theta(s)$. If $Q$ has sharp peaks or is badly fit, the gradient can point in arbitrary directions. In practice this is why DDPG is brittle to critic quality.

The DDPG Algorithm

DDPG maintains four networks: actor $\mu_\theta$, critic $Q_\phi$, and slow-moving target copies $\mu_{\theta'}$ and $Q_{\phi'}$.

Definition

Polyak (exponential moving) averaging

Target network parameters are updated with a small step $\tau \in (0, 1]$ (typically $\tau = 0.005$) toward the online parameters each training step. This produces a smoothly lagging target instead of a hard copy every $N$ steps as in DQN. The goal is the same: decouple the bootstrap target from the parameters being updated, preventing runaway feedback.
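The update is a one-liner per parameter tensor. A minimal sketch, assuming parameters are stored as parallel lists of NumPy arrays (a stand-in for framework parameter tensors):

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """In-place update: target <- tau * online + (1 - tau) * target."""
    for tgt, src in zip(target_params, online_params):
        tgt *= (1.0 - tau)
        tgt += tau * src

# With tau = 0.5 the target moves exactly halfway toward the online value.
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
polyak_update(target, online, tau=0.5)
print(target[0][0, 0])  # 0.5
```

With the standard $\tau = 0.005$, the target has an effective memory of roughly the last $1/\tau = 200$ online updates.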

Each training step does the following:

  1. Collect a transition $(s_t, a_t, r_t, s_{t+1})$ by acting with $\mu_\theta(s_t) + \epsilon_t$, where $\epsilon_t$ is exploration noise (originally Ornstein-Uhlenbeck, later Gaussian).
  2. Store the transition in a replay buffer $\mathcal{D}$.
  3. Sample a minibatch $B \subset \mathcal{D}$ uniformly.
  4. Critic update: minimize the mean-squared Bellman error

$$L_Q(\phi) = \mathbb{E}_{(s, a, r, s') \sim B}\left[\left(Q_\phi(s, a) - y\right)^2\right], \quad y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s')).$$

  5. Actor update: ascend the deterministic policy gradient

$$\nabla_\theta J \approx \mathbb{E}_{s \sim B}\left[\nabla_\theta \mu_\theta(s) \, \nabla_a Q_\phi(s, a) \,\big|_{a = \mu_\theta(s)}\right].$$

  6. Polyak-update both target networks: $\phi' \leftarrow \tau \phi + (1 - \tau) \phi'$ and similarly for $\theta'$.
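The training-step portion (steps 4-6) can be sketched in PyTorch. This is a minimal, illustrative single-step implementation, not the reference one: network sizes, learning rates, and the fake minibatch at the end are all assumptions.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005

# Actor mu_theta, critic Q_phi, and deep-copied target networks.
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                      nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s2, done):
    # Step 4 -- critic: regress Q_phi(s, a) toward the frozen target y.
    with torch.no_grad():
        a2 = actor_targ(s2)
        y = r + gamma * (1 - done) * critic_targ(torch.cat([s2, a2], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = ((q - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 5 -- actor: ascend Q through the chain rule (minimize -Q).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 6 -- Polyak-update both target networks.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_t in zip(net.parameters(), targ.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
    return critic_loss.item(), actor_loss.item()

# Exercise the step on a fake minibatch (stands in for a replay-buffer sample).
B = 8
losses = ddpg_step(torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
                   torch.randn(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
```

Note that the target-network forward passes sit inside `torch.no_grad()`: gradients must never flow into the bootstrap target.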

Everything is off-policy: the replay buffer mixes data from many past policies, and the deterministic policy gradient remains valid because the action gradient is evaluated at $\mu_\theta(s)$, not at the replayed action.

Why DDPG is Fragile

Watch Out

Overestimation bias in the critic

The critic target $y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s'))$ does not contain an explicit $\max$ over actions, so people often think DDPG avoids the DQN overestimation bias. It does not. Any noise in $Q_\phi$ gets exploited by the actor, which climbs up the noisy gradient until it finds actions where $Q_\phi$ is spuriously high. The critic then fits those bogus targets and the error compounds. This is the failure mode that motivated clipped-double-Q learning in TD3.
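The effect is easy to see numerically in a toy setting (an assumed illustration, not an experiment from the paper): even zero-mean noise on the critic produces a positive bias once an agent maximizes over it.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.linspace(-1, 1, 101)
q_true = -actions ** 2            # true Q peaks at a = 0 with value 0

gaps = []
for _ in range(2000):
    # Zero-mean noise stands in for critic approximation error.
    q_noisy = q_true + rng.normal(0.0, 0.1, size=actions.shape)
    i = np.argmax(q_noisy)        # the "actor" exploits the noisy critic
    gaps.append(q_noisy[i] - q_true[i])

print(np.mean(gaps) > 0)  # True: selected actions look better than they are
```

The maximizer systematically lands on actions whose noise happens to be positive, so the value the critic then bootstraps from is inflated, exactly the mechanism behind max-induced overestimation in DQN.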

Watch Out

Exploration noise is not a regularizer

Adding Gaussian or OU noise to the actor at execution time is an exploration mechanism, not a regularizer on the policy. It does not prevent the critic from overfitting to sharp peaks in action space, because the target $y$ uses the noiseless target action $\mu_{\theta'}(s')$. TD3's target policy smoothing fixes this by adding clipped noise to the target action, which is a different fix.
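For contrast, TD3's target policy smoothing can be sketched as follows (a hedged sketch; the default noise and clip values shown are the ones commonly quoted for TD3, not part of DDPG):

```python
import torch

def smoothed_target_action(actor_targ, s2, noise_std=0.2,
                           noise_clip=0.5, act_limit=1.0):
    """Clipped zero-mean noise on the *target* action, then clamp to bounds."""
    a2 = actor_targ(s2)
    noise = (torch.randn_like(a2) * noise_std).clamp(-noise_clip, noise_clip)
    return (a2 + noise).clamp(-act_limit, act_limit)

# Usage with a stand-in target actor:
a2 = smoothed_target_action(torch.tanh, torch.randn(16, 2))
print(bool(a2.abs().max() <= 1.0))  # True
```

Because the noise enters the critic target rather than the executed action, it averages the bootstrap value over a neighborhood of the target action, penalizing sharp spurious peaks in $Q$.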

Practical Notes

  • Hyperparameter brittleness: DDPG results in the literature vary wildly across seeds. Henderson et al. (2018), Deep Reinforcement Learning that Matters, documented this reproducibility crisis and used DDPG as their lead example.
  • Replay buffer size: typically $10^6$ transitions. Smaller buffers cause catastrophic forgetting; larger ones slow adaptation after the policy improves.
  • Reward scaling: DDPG is extremely sensitive to the reward scale because the critic targets are not normalized. Either scale rewards into $[-1, 1]$ or use a target network update rate tuned to the reward magnitude.
  • Action scaling: the actor usually ends in a $\tanh$ so outputs lie in $[-1, 1]$, then the environment-specific action bounds are applied by affine rescaling.
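The affine rescaling in the last bullet is a one-liner; the bounds below are illustrative:

```python
import numpy as np

def rescale_action(a_tanh, low, high):
    """Map a tanh-squashed action in [-1, 1] affinely onto [low, high]."""
    return low + 0.5 * (a_tanh + 1.0) * (high - low)

# Example with hypothetical bounds [-2, 2]:
print(rescale_action(np.array([-1.0, 0.0, 1.0]), low=-2.0, high=2.0))
# [-2.  0.  2.]
```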

Exercise (Core)

Problem

Write down the critic target $y$ used by DDPG and identify exactly which parameters are frozen (from the target networks) and which are trainable. Why would using $Q_\phi$ and $\mu_\theta$ on the right-hand side instead of $Q_{\phi'}$ and $\mu_{\theta'}$ cause training to diverge?

Exercise (Advanced)

Problem

The deterministic policy gradient theorem assumes $\mu_\theta$ is differentiable in $\theta$ and $Q$ is differentiable in $a$. Construct a simple continuous-action MDP where $Q^{\mu}(s, \cdot)$ has a flat region around the optimal action. What happens to the DDPG actor update on that MDP, and which assumption of the theorem is being stressed?

Relation to Other Algorithms

DDPG sits between DQN and TD3/SAC in the continuous-control lineage:

  • DQN (Mnih et al. 2015) gave the replay-buffer + target-network recipe for discrete actions.
  • DPG (Silver et al. 2014) gave the deterministic policy gradient theorem.
  • DDPG (Lillicrap et al. 2016) combined the two for continuous actions.
  • TD3 (Fujimoto et al. 2018) added clipped double-Q, target policy smoothing, and delayed policy updates to fix DDPG's overestimation bias and brittleness. See the TD3 page.
  • SAC (Haarnoja et al. 2018) replaced the deterministic policy with a stochastic one maximizing entropy-regularized return, trading the clean DPG form for much stronger exploration and robustness.

References

  • Lillicrap, T. P. et al. (2016). Continuous Control with Deep Reinforcement Learning. ICLR 2016. The DDPG paper.
  • Silver, D. et al. (2014). Deterministic Policy Gradient Algorithms. ICML 2014. The DPG theorem.
  • Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518. DQN recipe.
  • Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Chapter 13 on policy gradient methods.
  • Achiam, J. (2018). Spinning Up in Deep RL. OpenAI. The DDPG chapter and pseudocode.
  • Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI. The reproducibility study using DDPG.
  • Fujimoto, S. et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018. TD3 paper that diagnoses DDPG failure modes.

Last reviewed: April 2026