RL Theory

TD3: Twin Delayed Deep Deterministic Policy Gradient

An off-policy actor-critic algorithm that fixes DDPG's overestimation bias with clipped double-Q learning, target policy smoothing, and delayed policy updates. The minimum-complexity robust continuous-control algorithm.

Advanced · Tier 2 · Stable · ~40 min

Why This Matters

DDPG was the first deep RL algorithm to train on continuous-control benchmarks, but it is fragile: small hyperparameter changes swing returns by an order of magnitude, and on many tasks it diverges silently because the critic overestimates $Q$-values and the actor then exploits the bogus peaks. TD3 (Fujimoto, van Hoof, Meger, 2018, Addressing Function Approximation Error in Actor-Critic Methods) is the minimum set of patches that turns DDPG into a reliable algorithm.

TD3 is worth studying because each of its three tricks targets a specific identified failure mode. The paper is a rare example in deep RL of a clean, diagnosed problem paired with a fix that generalizes. Clipped double-Q, in particular, is now standard in SAC, REDQ, and most continuous-control algorithms.

The Three Tricks

TD3 is DDPG plus three modifications. Each addresses a distinct source of error.

1. Clipped Double-Q Learning

Definition

Clipped Double-Q Target

Maintain two independent critics $Q_{\phi_1}, Q_{\phi_2}$ and their target copies. Both critics are trained against the same target, defined as the minimum of the two target-critic estimates at the next state:

$$y = r + \gamma \min_{i = 1, 2} Q_{\phi'_i}(s', \tilde a).$$

Here $\tilde a$ is the target action with smoothing applied (next trick).
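As a minimal numpy sketch of this target computation (the critic callables and the terminal-mask argument are illustrative placeholders, not notation from the paper):

```python
import numpy as np

def clipped_double_q_target(r, s_next, done, q1_target, q2_target,
                            a_smoothed, gamma=0.99):
    """Clipped double-Q target: bootstrap on the min of the two target critics.

    q1_target, q2_target: callables (state, action) -> Q estimate (placeholders).
    a_smoothed: the smoothed target action (policy output plus clipped noise).
    done: terminal mask (1.0 at episode end) so no bootstrap past termination;
          an implementation detail assumed here, not spelled out in the text.
    """
    q_min = np.minimum(q1_target(s_next, a_smoothed),
                       q2_target(s_next, a_smoothed))
    return r + gamma * (1.0 - done) * q_min
```

Both critics regress toward this same `y`; only the bootstrap uses the min.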

Proposition

Minimum of Two Unbiased Estimators is Downward-Biased

Statement

Let $\hat Q_1, \hat Q_2$ be independent real-valued random variables with $\mathbb{E}[\hat Q_i] = Q$ and $\mathrm{Var}(\hat Q_i) > 0$. Then

$$\mathbb{E}[\min(\hat Q_1, \hat Q_2)] < Q.$$

If both estimators are $\mathcal{N}(Q, \sigma^2)$, the bias is exactly $-\sigma / \sqrt{\pi}$.

Intuition

Taking the minimum of two noisy estimates is a deterministic downward transformation that cannot be undone by independence or unbiasedness of the components. In TD3, that downward bias in the bootstrap target cancels the usual upward overestimation bias of function-approximation $Q$-learning, where the actor greedily exploits whichever direction the critic is currently optimistic.

Proof Sketch

Write $\min(a, b) = \tfrac{1}{2}(a + b) - \tfrac{1}{2}|a - b|$. Taking expectations, $\mathbb{E}[\min(\hat Q_1, \hat Q_2)] = Q - \tfrac{1}{2}\mathbb{E}|\hat Q_1 - \hat Q_2|$. The absolute-difference term is strictly positive whenever the variance is positive, giving strict inequality. For iid $\mathcal{N}(Q, \sigma^2)$, $\hat Q_1 - \hat Q_2 \sim \mathcal{N}(0, 2\sigma^2)$, so $\mathbb{E}|\hat Q_1 - \hat Q_2| = 2\sigma / \sqrt{\pi}$, giving the stated bias.
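A quick Monte Carlo check of the Gaussian case (an illustrative sketch, not from the paper; the sample size is chosen so the empirical bias sits well inside tolerance of $-\sigma/\sqrt{\pi}$):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, sigma, n = 3.0, 1.0, 2_000_000

# Two independent, individually unbiased estimators of the same value Q.
q1 = rng.normal(Q, sigma, n)
q2 = rng.normal(Q, sigma, n)

empirical_bias = np.minimum(q1, q2).mean() - Q
predicted_bias = -sigma / np.sqrt(np.pi)  # exact bias for the Gaussian case
print(empirical_bias, predicted_bias)
```

The two numbers agree to about three decimal places at this sample size.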

Why It Matters

This is the theoretical basis for clipped double-Q. The bias is controllable and predictable rather than accidental. It is why Fujimoto et al. choose the minimum over more elaborate double-estimator schemes: simple, always-on pessimism that scales with critic noise.

Failure Mode

If the two critics become correlated (e.g., share too many gradient updates or an initialization artifact), the downward bias shrinks toward zero, and the trick silently stops working. The TD3 paper keeps $Q_{\phi_1}, Q_{\phi_2}$ fully separate for this reason. Also, if $Q$ itself is genuinely underestimated by the function class, the min pushes the target estimate further below the true $Q^\mu$ and slows learning.

Van Hasselt's original double Q-learning (2010) uses two estimators to decouple action selection from value evaluation. TD3's clipped double-Q is simpler: it just takes the minimum. Fujimoto et al. show that in the function-approximation regime, this deterministic pessimism is what actually helps, not the decoupling per se.

Watch Out

Why min, not average?

Averaging two noisy critics reduces variance but not bias. The actor climbs whichever critic is currently optimistic, so any positive bias gets amplified into the policy. Taking the min provides a lower bound on $Q$ that the actor still follows, and the pessimism is a feature: it slows down exploitation of high-variance regions the critic has not yet fit reliably.
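The difference is easy to see numerically. In the sketch below (not an experiment from the paper), the true value of every action is 0, yet greedily maximizing over noisy estimates inflates the averaged critic far more than the min, which cancels much of the overestimation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, sigma = 50, 20_000, 1.0

# True Q(a) = 0 for every action; each critic sees independent Gaussian noise.
q1 = rng.normal(0.0, sigma, (n_trials, n_actions))
q2 = rng.normal(0.0, sigma, (n_trials, n_actions))

# Value the actor believes it found after maximizing each combined estimate.
over_avg = np.max((q1 + q2) / 2, axis=1).mean()       # averaging: bias survives the max
over_min = np.max(np.minimum(q1, q2), axis=1).mean()  # min: pessimism offsets the max
print(over_avg, over_min)
```

Both stay positive (the max still inflates), but the min cuts the overestimation substantially, which is exactly the always-on pessimism the text describes.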

2. Target Policy Smoothing

Definition

Target Policy Smoothing

The target action is not simply $\mu_{\theta'}(s')$ but a noisy version,

$$\tilde a = \operatorname{clip}\bigl(\mu_{\theta'}(s') + \operatorname{clip}(\epsilon, -c, c),\, a_{\text{low}},\, a_{\text{high}}\bigr), \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$$

The noise is clipped to $[-c, c]$ to keep it a local regularizer (typical values $\sigma = 0.2$, $c = 0.5$).

This implements a form of bootstrapping regularization: we refuse to assume that the critic's sharp peak at $\mu_{\theta'}(s')$ is real. By averaging $Q_{\phi'_i}$ over a small neighborhood of the target action, we smooth away narrow spikes that are almost always artifacts of function approximation. The regularizer is on the target action inside the bootstrap, not on the policy during execution.
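A minimal numpy sketch of the smoothing step (the `mu_target` callable and the $[-1, 1]$ action bounds are illustrative assumptions):

```python
import numpy as np

def smoothed_target_action(mu_target, s_next, rng,
                           sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Target action with clipped Gaussian smoothing (TD3 defaults: sigma=0.2, c=0.5).

    mu_target: callable state -> action from the target actor (placeholder).
    """
    a = mu_target(s_next)
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(a)), -c, c)  # local noise
    return np.clip(a + eps, a_low, a_high)  # keep within valid action bounds
```

Note the double clip: the inner one bounds the noise itself, the outer one keeps the perturbed action legal.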

3. Delayed Policy Updates

The actor and targets are updated only every $d$ critic updates (typically $d = 2$). Rationale: the actor update uses $\nabla_a Q_{\phi_1}(s, a) \big|_{a = \mu_\theta(s)}$, which is only meaningful if $Q_{\phi_1}$ is a reasonable estimate of $Q^{\mu_\theta}$. Updating the actor against a critic that is still badly wrong just pushes the policy into regions where the critic has even less data. Letting the critic settle first means the actor follows a steadier gradient. Only $Q_{\phi_1}$ (not the $\min$) is used for the actor gradient.

The Full TD3 Update

Per environment step:

  1. Act with $a = \mu_\theta(s) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_{\text{exp}}^2)$; store $(s, a, r, s')$ in $\mathcal{D}$.
  2. Sample a minibatch $B \subset \mathcal{D}$.
  3. Compute the smoothed target action $\tilde a$ and the clipped double-Q target $y$.
  4. Update both critics: $\phi_i \leftarrow \phi_i - \eta \nabla_{\phi_i} \mathbb{E}_B[(Q_{\phi_i}(s, a) - y)^2]$ for $i = 1, 2$.
  5. Every $d$ steps:
    • Actor update: $\theta \leftarrow \theta + \eta \, \mathbb{E}_B[\nabla_\theta \mu_\theta(s) \, \nabla_a Q_{\phi_1}(s, a)\big|_{a = \mu_\theta(s)}]$.
    • Polyak-update all three target networks: $\phi'_i \leftarrow \tau \phi_i + (1 - \tau) \phi'_i$, $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$.
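The loop above can be sketched end to end. To stay self-contained, this toy version uses linear critics $Q_i(s, a) = w_i^\top [s; a]$ and a linear actor $\mu(s) = M s$ so every gradient is analytic; real implementations use neural networks and an optimizer, so treat the class and method names here as illustrative only:

```python
import numpy as np

class LinearTD3:
    """One TD3 update step with linear critics and a linear actor (toy sketch).

    Q_i(s, a) = w_i . [s; a] and mu(s) = M s, so every gradient is analytic.
    """

    def __init__(self, s_dim, a_dim, seed=0):
        init = np.random.default_rng(seed)
        self.s_dim, self.a_dim, self.step = s_dim, a_dim, 0
        self.w1 = init.normal(size=s_dim + a_dim) * 0.1
        self.w2 = init.normal(size=s_dim + a_dim) * 0.1
        self.M = init.normal(size=(a_dim, s_dim)) * 0.1
        # Target copies start equal to the online parameters.
        self.w1_t, self.w2_t, self.M_t = self.w1.copy(), self.w2.copy(), self.M.copy()

    def q(self, w, s, a):
        return np.concatenate([s, a]) @ w

    def update(self, s, a, r, s_next, rng, gamma=0.99, eta=1e-3, tau=0.005,
               sigma=0.2, c=0.5, d=2):
        # Target policy smoothing: clipped noise, then clip to the action range.
        eps = np.clip(rng.normal(0.0, sigma, self.a_dim), -c, c)
        a_t = np.clip(self.M_t @ s_next + eps, -1.0, 1.0)
        # Clipped double-Q target.
        y = r + gamma * min(self.q(self.w1_t, s_next, a_t),
                            self.q(self.w2_t, s_next, a_t))
        # Critic updates: gradient of the squared TD error w.r.t. each w_i.
        x = np.concatenate([s, a])
        self.w1 -= eta * 2.0 * (self.q(self.w1, s, a) - y) * x
        self.w2 -= eta * 2.0 * (self.q(self.w2, s, a) - y) * x
        # Delayed actor and target updates, every d critic steps.
        self.step += 1
        if self.step % d == 0:
            dq_da = self.w1[self.s_dim:]            # grad of Q1 w.r.t. a at a = mu(s)
            self.M += eta * np.outer(dq_da, s)      # deterministic policy gradient
            self.w1_t += tau * (self.w1 - self.w1_t)  # Polyak averaging
            self.w2_t += tau * (self.w2 - self.w2_t)
            self.M_t += tau * (self.M - self.M_t)
```

Note that the actor gradient reads only `w1`, matching the paper's choice of $Q_{\phi_1}$ (not the min) for the policy update, and that the targets move only inside the delayed branch.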

Ablations

The TD3 paper runs a clean ablation: each trick is added one at a time to DDPG. The headline result:

  • DDPG alone: often diverges, wide seed variance.
  • + clipped double-Q (CDQ): the biggest single fix; cuts overestimation bias by an order of magnitude.
  • + target policy smoothing (TPS): smaller but consistent improvement, and necessary in narrow-reward environments.
  • + delayed updates (DPU): mostly reduces variance across seeds rather than lifting mean performance.

Watch Out

TD3 is not SAC

Both TD3 and SAC are off-policy continuous-control actor-critic algorithms, and both use twin critics with a min target. They are not the same algorithm. SAC uses a stochastic policy maximizing entropy-regularized return; the policy has a reparameterized sample, and the actor loss includes an entropy bonus. TD3's policy is deterministic, and exploration noise is injected externally at execution. SAC is often more robust out of the box because the entropy bonus handles exploration and plateau escape that TD3 has to solve with external noise.

ExerciseCore

Problem

Explain, without just restating the paper, why taking the $\min$ of two critics biases the target downward even when both critics are individually unbiased estimators of $Q^\mu$.

ExerciseAdvanced

Problem

TD3 uses only $Q_{\phi_1}$ (not the min of both) for the actor gradient. Why not use the min there too, for consistency? What would go wrong?

What TD3 Did Not Fix

  • Exploration in sparse-reward tasks: TD3 still relies on external Gaussian noise. It will not solve Montezuma's Revenge style exploration problems.
  • Sample efficiency on pixels: TD3 was designed and tested on low-dimensional state vectors. Pixel-based continuous control needs additional machinery (DrQ, CURL, representation learning objectives).
  • Continuous-discrete hybrid actions: TD3 assumes fully continuous actions with gradients flowing through $\mu_\theta$. Hybrid action spaces typically go to SAC variants or decomposed architectures.

References

  • Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018. The TD3 paper.
  • Lillicrap, T. P. et al. (2016). Continuous Control with Deep Reinforcement Learning. ICLR 2016. The DDPG predecessor.
  • van Hasselt, H. (2010). Double Q-learning. NeurIPS 2010. The original double-estimator idea.
  • van Hasselt, H., Guez, A., and Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016. Double DQN.
  • Haarnoja, T. et al. (2018). Soft Actor-Critic. ICML 2018. The entropy-regularized continuous-control algorithm that adopted twin critics from TD3.
  • Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Chapter 13.
  • Achiam, J. (2018). Spinning Up in Deep RL. OpenAI. The TD3 chapter and reference implementation.

Last reviewed: April 2026
