
Deep RL for Control

DDPG, TD3, and SAC for continuous control, the sim-to-real gap, domain randomization, the MuJoCo benchmark history, and why model-based methods (PETS, Dreamer) are closing the sample-efficiency gap on real-robot deployments.


Why This Matters

Continuous-control RL is the regime where the action is a vector in $\mathbb{R}^d$ (joint torques, motor voltages, thrust commands) rather than a pick from a discrete menu. Q-learning does not directly apply because $\arg\max_a Q(s, a)$ over a continuous set is itself an optimization problem. Deterministic and entropy-regularized actor-critic methods sidestep this by parameterizing the policy and following the deterministic or stochastic policy gradient.
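The contrast can be made concrete with a toy one-dimensional example (our construction, not from any paper): for a discrete menu, the argmax is a lookup; for a continuous action, maximizing $Q(s, \cdot)$ is itself a small optimization, solved here by gradient ascent on a known quadratic.

```python
import numpy as np

# Toy illustration: argmax over a discrete menu vs. a continuous action set.
def q(a):
    return -(a - 0.7) ** 2             # Q(s, .) for one fixed state s

# Discrete case: enumerate the menu and pick the best entry.
menu = np.array([-1.0, 0.0, 1.0])
a_discrete = menu[np.argmax(q(menu))]  # 1.0, the menu entry closest to 0.7

# Continuous case: no menu to enumerate; ascend dQ/da instead.
a = 0.0
for _ in range(200):
    a += 0.1 * (-2.0 * (a - 0.7))      # analytic gradient of the toy Q
# a has converged to the continuous maximizer 0.7
```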

The MuJoCo physics-engine benchmarks (HalfCheetah, Hopper, Walker2d, Ant, Humanoid) became the standard testbed after 2016, and the progression from DDPG to TD3 to SAC tracks the field's understanding of what makes off-policy continuous control stable. The sim-to-real problem is the practical reason this matters: a policy that cleans up MuJoCo does not survive contact with a real robot unless the training distribution covers the physical-world variation.

Core Ideas

DDPG (Lillicrap et al., ICLR 2016, arXiv:1509.02971) extends DQN to continuous actions with a deterministic actor $\mu_\theta(s)$ and critic $Q_\phi(s, a)$, updated via the deterministic policy gradient $\nabla_\theta J = \mathbb{E}_s[\nabla_a Q_\phi(s, a)|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s)]$. DDPG is famously brittle: small hyperparameter changes flip success into failure, and the critic systematically overestimates $Q$.
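The chain rule in that gradient can be sketched in a few lines of numpy. This is a toy setup of our own, not DDPG itself: a linear actor $\mu_\theta(s) = W s$ and a critic whose action gradient is known in closed form, $Q(s, a) = -\lVert a - A s \rVert^2$ for a fixed matrix $A$.

```python
import numpy as np

# Toy deterministic policy gradient: push a linear actor toward the actions
# a fixed, closed-form critic prefers (A plays the role of the "right" policy).
rng = np.random.default_rng(0)
s_dim, a_dim = 3, 2
A = rng.normal(size=(a_dim, s_dim))   # actions the toy critic prefers
W = np.zeros((a_dim, s_dim))          # actor parameters theta

def mu(s):
    return W @ s                      # deterministic actor

def grad_a_Q(s, a):
    return -2.0 * (a - A @ s)         # closed-form nabla_a Q for the toy critic

lr = 0.02
for _ in range(2000):
    s = rng.normal(size=s_dim)
    # chain rule from the text: nabla_theta J = nabla_a Q . nabla_theta mu,
    # which for a linear actor is the outer product of grad_a_Q and s.
    W += lr * np.outer(grad_a_Q(s, mu(s)), s)
# W has been pushed toward A, the critic's preferred linear policy
```

In a real implementation both networks are deep and $\nabla_a Q$ comes from autodiff, but the update has exactly this outer-product structure.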

TD3 (Fujimoto, van Hoof, Meger, ICML 2018, arXiv:1802.09477) fixes three failure modes: (1) clipped double-Q learning takes the minimum of two critics to fight overestimation, (2) delayed actor updates run the critic for several steps before each actor step, and (3) target policy smoothing adds noise to the target action so the critic does not exploit sharp $Q$ peaks. The result is a much less fragile algorithm at the same sample budget.
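Fixes (1) and (3) both live in the target computation, which is easy to sketch. The actor and both critics below are stand-in toys of our own, not learned networks; only the shape of the target matches TD3.

```python
import numpy as np

# Sketch of TD3's target value: target policy smoothing + clipped double-Q.
rng = np.random.default_rng(1)

def mu_target(s_next):
    return np.tanh(0.5 * s_next)                    # toy target actor in [-1, 1]

def q1_target(s, a): return -(a - 0.3) ** 2 + 1.0   # toy target critic 1
def q2_target(s, a): return -(a - 0.3) ** 2 + 0.9   # toy target critic 2

def td3_target(r, s_next, gamma=0.99, sigma=0.2, noise_clip=0.5, a_max=1.0):
    # (3) target policy smoothing: clipped Gaussian noise on the target action
    eps = np.clip(sigma * rng.normal(), -noise_clip, noise_clip)
    a_next = np.clip(mu_target(s_next) + eps, -a_max, a_max)
    # (1) clipped double-Q: the pessimistic minimum of the two critics
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min

y = td3_target(r=0.5, s_next=0.2)
# the min() caps y at r + gamma * (max of the weaker critic), fighting the
# overestimation that plagues single-critic DDPG
```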

SAC (Haarnoja et al., ICML 2018, arXiv:1801.01290) replaces the deterministic actor with a stochastic Gaussian policy and adds an entropy bonus $\alpha \mathcal{H}(\pi(\cdot|s))$ to the reward, with $\alpha$ tuned automatically to hit a target entropy. Maximum-entropy RL handles exploration through the policy itself and produces policies that are more robust to perturbation. SAC is the off-policy continuous-control workhorse as of 2026.
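The automatic tuning of $\alpha$ is a one-line gradient step. This sketch assumes the common implementation form of the temperature loss, $J(\log\alpha) = -\log\alpha \, (\overline{\log\pi} + \bar{\mathcal{H}})$; the variable names and toy numbers are ours.

```python
# Sketch of SAC's automatic temperature adjustment (learn log_alpha so that
# alpha = exp(log_alpha) stays positive).
target_entropy = -2.0   # common heuristic: -action_dim, here for 2-D actions
lr = 0.1

def alpha_step(log_alpha, mean_log_pi):
    # dJ/d(log_alpha) = -(mean_log_pi + target_entropy), so descent gives:
    return log_alpha + lr * (mean_log_pi + target_entropy)

# Entropy below target (policy too deterministic): mean log pi = 3.0 means
# entropy -3.0 < target -2.0, so alpha rises to reward exploration.
la_up = alpha_step(log_alpha=0.0, mean_log_pi=3.0)    # 0.0 + 0.1 * 1.0 = 0.1

# Entropy above target: alpha falls, shifting weight back to reward.
la_down = alpha_step(log_alpha=0.0, mean_log_pi=1.0)  # 0.0 + 0.1 * -1.0 = -0.1
```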

The sim-to-real gap is the failure of a policy trained on a simulator to perform on hardware. The two dominant fixes are system identification (fit the simulator parameters to the real robot) and domain randomization (Tobin et al., IROS 2017, arXiv:1703.06907), which trains over a distribution of simulator parameters wide enough to contain reality. OpenAI's in-hand Rubik's cube manipulation (2019) and ANYmal locomotion (Hwangbo et al., Science Robotics 2019) are the standard demonstrations.
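Mechanically, domain randomization is just resampling simulator parameters per episode. The parameter names and ranges below are illustrative stand-ins, not taken from any of the cited papers.

```python
import random

# Toy sketch of domain randomization: each training episode draws simulator
# parameters from ranges wide enough to (hopefully) bracket the real robot.
PARAM_RANGES = {
    "mass_scale":     (0.7, 1.3),    # multiply each link's mass
    "friction":       (0.5, 1.5),    # ground friction coefficient
    "motor_delay_ms": (0.0, 20.0),   # actuation latency
}

def sample_sim_params(rng):
    """Draw one episode's simulator configuration."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

rng = random.Random(42)
# In a training loop, the simulator is reconfigured with a fresh draw before
# each episode's experience is collected:
episode_params = [sample_sim_params(rng) for _ in range(3)]
```

System identification narrows these ranges around measured values; domain randomization widens them until the real robot is inside the training distribution.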

Model-based RL (PETS, Chua et al., NeurIPS 2018; Dreamer family, Hafner et al., 2020-2024) learns a forward dynamics model and plans or trains a policy in the imagined rollouts. On many MuJoCo tasks, DreamerV3 reaches SAC-level returns with 10x to 100x fewer environment steps. The trade-off is that the learned model can be wrong in ways the planner exploits, so model-based methods need uncertainty estimates or short-horizon rollouts.
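The short-horizon, ensemble-averaged planning loop can be sketched in the spirit of PETS-style random shooting. Everything here is a toy of our own: the "learned" ensemble is the true dynamics $s' = s + a$ plus a per-member bias standing in for model error, and the reward $-s^2$ drives the state toward 0.

```python
import numpy as np

# Toy random-shooting MPC over an ensemble of (fake) learned dynamics models.
rng = np.random.default_rng(0)
BIASES = [0.0, 0.05, -0.05]          # one bias per ensemble member

def ensemble_step(states, a):
    """Advance one imagined state per ensemble member."""
    return [s + a + b for s, b in zip(states, BIASES)]

def plan(s0, horizon=5, n_candidates=64):
    """Score candidate action sequences by their mean imagined return across
    the ensemble; return only the first action (replan every real step)."""
    best_seq, best_ret = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        states, ret = [s0] * len(BIASES), 0.0
        for a in seq:
            states = ensemble_step(states, a)
            ret += np.mean([-s ** 2 for s in states])
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq[0]

a0 = plan(s0=2.0)   # one bounded action; the short horizon limits model exploitation
```

Averaging returns over the ensemble and keeping the horizon short are exactly the two defenses the paragraph names: disagreement between members washes out plans that only work under one (possibly wrong) model.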

Common Confusions

Watch Out

DDPG and TD3 are off-policy, but the replay buffer does not make them safe

Off-policy training reuses old transitions, but the policy still chooses actions on the real (or simulated) environment during exploration. Bad initial policies can damage hardware before the replay buffer is full. Pretraining on demonstrations or in simulation is standard practice for real-robot work.

Watch Out

Domain randomization is not the same as data augmentation

Augmentation adds perturbations to a fixed dataset. Domain randomization changes the simulator dynamics during training, so the policy interacts with a moving environment. The policy must therefore be robust, not just the perception module.



Last reviewed: April 18, 2026
