
Deep RL for Control

DDPG, TD3, and SAC for continuous control, the sim-to-real gap, domain randomization, the MuJoCo benchmark history, and why model-based methods (PETS, Dreamer) are closing the sample-efficiency gap on real-robot deployments.


Why This Matters

Continuous-control RL is the regime where the action is a vector in $\mathbb{R}^d$ (joint torques, motor voltages, thrust commands) rather than a pick from a discrete menu. Q-learning does not directly apply because $\arg\max_a Q(s, a)$ over a continuous set is itself an optimization problem. Deterministic and entropy-regularized actor-critic methods sidestep this by parameterizing the policy and following the deterministic or stochastic policy gradient.
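The contrast can be made concrete with a toy one-dimensional example (our construction, not from any paper): for a discrete menu, the argmax is a lookup; for a continuous action, maximizing $Q(s, \cdot)$ is itself a small optimization, solved here by gradient ascent on a known quadratic.

```python
import numpy as np

# Toy illustration: argmax over a discrete menu vs. a continuous action set.
def q(a):
    return -(a - 0.7) ** 2             # Q(s, .) for one fixed state s

# Discrete case: enumerate the menu and pick the best entry.
menu = np.array([-1.0, 0.0, 1.0])
a_discrete = menu[np.argmax(q(menu))]  # 1.0, the menu entry closest to 0.7

# Continuous case: no menu to enumerate; ascend dQ/da instead.
a = 0.0
for _ in range(200):
    a += 0.1 * (-2.0 * (a - 0.7))      # analytic gradient of the toy Q
# a has converged to the continuous maximizer 0.7
```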

The MuJoCo physics-engine benchmarks (HalfCheetah, Hopper, Walker2d, Ant, Humanoid) became the standard testbed after 2016, and the progression from DDPG to TD3 to SAC tracks the field's understanding of what makes off-policy continuous control stable. The sim-to-real problem is the practical reason this matters: a policy that cleans up MuJoCo does not survive contact with a real robot unless the training distribution covers the physical-world variation.

Core Ideas

DDPG (Lillicrap et al., ICLR 2016, arXiv:1509.02971) extends DQN to continuous actions with a deterministic actor $\mu_\theta(s)$ and critic $Q_\phi(s, a)$, updated via the deterministic policy gradient $\nabla_\theta J = \mathbb{E}_s[\nabla_a Q_\phi(s, a)|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s)]$. DDPG is famously brittle: small hyperparameter changes flip success into failure, and the critic systematically overestimates $Q$.
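The chain rule in that gradient can be sketched in a few lines of numpy. This is a toy setup of our own, not DDPG itself: a linear actor $\mu_\theta(s) = W s$ and a critic whose action gradient is known in closed form, $Q(s, a) = -\lVert a - A s \rVert^2$ for a fixed matrix $A$.

```python
import numpy as np

# Toy deterministic policy gradient: push a linear actor toward the actions
# a fixed, closed-form critic prefers (A plays the role of the "right" policy).
rng = np.random.default_rng(0)
s_dim, a_dim = 3, 2
A = rng.normal(size=(a_dim, s_dim))   # actions the toy critic prefers
W = np.zeros((a_dim, s_dim))          # actor parameters theta

def mu(s):
    return W @ s                      # deterministic actor

def grad_a_Q(s, a):
    return -2.0 * (a - A @ s)         # closed-form nabla_a Q for the toy critic

lr = 0.02
for _ in range(2000):
    s = rng.normal(size=s_dim)
    # chain rule from the text: nabla_theta J = nabla_a Q . nabla_theta mu,
    # which for a linear actor is the outer product of grad_a_Q and s.
    W += lr * np.outer(grad_a_Q(s, mu(s)), s)
# W has been pushed toward A, the critic's preferred linear policy
```

In a real implementation both networks are deep and $\nabla_a Q$ comes from autodiff, but the update has exactly this outer-product structure.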

TD3 (Fujimoto, van Hoof, Meger, ICML 2018, arXiv:1802.09477) fixes three failure modes: (1) clipped double-Q learning takes the minimum of two critics to fight overestimation, (2) delayed actor updates run the critic for several steps before each actor step, and (3) target policy smoothing adds noise to the target action so the critic does not exploit sharp $Q$ peaks. The result is a much less fragile algorithm at the same sample budget.
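Fixes (1) and (3) both live in the target computation, which is easy to sketch. The actor and both critics below are stand-in toys of our own, not learned networks; only the shape of the target matches TD3.

```python
import numpy as np

# Sketch of TD3's target value: target policy smoothing + clipped double-Q.
rng = np.random.default_rng(1)

def mu_target(s_next):
    return np.tanh(0.5 * s_next)                    # toy target actor in [-1, 1]

def q1_target(s, a): return -(a - 0.3) ** 2 + 1.0   # toy target critic 1
def q2_target(s, a): return -(a - 0.3) ** 2 + 0.9   # toy target critic 2

def td3_target(r, s_next, gamma=0.99, sigma=0.2, noise_clip=0.5, a_max=1.0):
    # (3) target policy smoothing: clipped Gaussian noise on the target action
    eps = np.clip(sigma * rng.normal(), -noise_clip, noise_clip)
    a_next = np.clip(mu_target(s_next) + eps, -a_max, a_max)
    # (1) clipped double-Q: the pessimistic minimum of the two critics
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min

y = td3_target(r=0.5, s_next=0.2)
# the min() caps y at r + gamma * (max of the weaker critic), fighting the
# overestimation that plagues single-critic DDPG
```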

SAC (Haarnoja et al., ICML 2018, arXiv:1801.01290) replaces the deterministic actor with a stochastic Gaussian policy and adds an entropy bonus $\alpha \mathcal{H}(\pi(\cdot|s))$ to the reward, with $\alpha$ tuned automatically to hit a target entropy. Maximum-entropy RL handles exploration through the policy itself and produces policies that are more robust to perturbation. SAC is the off-policy continuous-control workhorse as of 2026.
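The automatic tuning of $\alpha$ is a one-line gradient step. This sketch assumes the common implementation form of the temperature loss, $J(\log\alpha) = -\log\alpha \, (\overline{\log\pi} + \bar{\mathcal{H}})$; the variable names and toy numbers are ours.

```python
# Sketch of SAC's automatic temperature adjustment (learn log_alpha so that
# alpha = exp(log_alpha) stays positive).
target_entropy = -2.0   # common heuristic: -action_dim, here for 2-D actions
lr = 0.1

def alpha_step(log_alpha, mean_log_pi):
    # dJ/d(log_alpha) = -(mean_log_pi + target_entropy), so descent gives:
    return log_alpha + lr * (mean_log_pi + target_entropy)

# Entropy below target (policy too deterministic): mean log pi = 3.0 means
# entropy -3.0 < target -2.0, so alpha rises to reward exploration.
la_up = alpha_step(log_alpha=0.0, mean_log_pi=3.0)    # 0.0 + 0.1 * 1.0 = 0.1

# Entropy above target: alpha falls, shifting weight back to reward.
la_down = alpha_step(log_alpha=0.0, mean_log_pi=1.0)  # 0.0 + 0.1 * -1.0 = -0.1
```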

The sim-to-real gap is the failure of a policy trained on a simulator to perform on hardware. The two dominant fixes are system identification (fit the simulator parameters to the real robot) and domain randomization (Tobin et al., IROS 2017, arXiv:1703.06907), which trains over a distribution of simulator parameters wide enough to contain reality. OpenAI's in-hand Rubik's cube manipulation (2019) and ANYmal locomotion (Hwangbo et al., Science Robotics 2019) are the standard demonstrations.
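Mechanically, domain randomization is just resampling simulator parameters per episode. The parameter names and ranges below are illustrative stand-ins, not taken from any of the cited papers.

```python
import random

# Toy sketch of domain randomization: each training episode draws simulator
# parameters from ranges wide enough to (hopefully) bracket the real robot.
PARAM_RANGES = {
    "mass_scale":     (0.7, 1.3),    # multiply each link's mass
    "friction":       (0.5, 1.5),    # ground friction coefficient
    "motor_delay_ms": (0.0, 20.0),   # actuation latency
}

def sample_sim_params(rng):
    """Draw one episode's simulator configuration."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

rng = random.Random(42)
# In a training loop, the simulator is reconfigured with a fresh draw before
# each episode's experience is collected:
episode_params = [sample_sim_params(rng) for _ in range(3)]
```

System identification narrows these ranges around measured values; domain randomization widens them until the real robot is inside the training distribution.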

Model-based RL (PETS, Chua et al., NeurIPS 2018; Dreamer family, Hafner et al., 2020-2024) learns a forward dynamics model and plans or trains a policy in the imagined rollouts. On many MuJoCo tasks, DreamerV3 reaches SAC-level returns with 10x to 100x fewer environment steps. The trade-off is that the learned model can be wrong in ways the planner exploits, so model-based methods need uncertainty estimates or short-horizon rollouts.
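The short-horizon, ensemble-averaged planning loop can be sketched in the spirit of PETS-style random shooting. Everything here is a toy of our own: the "learned" ensemble is the true dynamics $s' = s + a$ plus a per-member bias standing in for model error, and the reward $-s^2$ drives the state toward 0.

```python
import numpy as np

# Toy random-shooting MPC over an ensemble of (fake) learned dynamics models.
rng = np.random.default_rng(0)
BIASES = [0.0, 0.05, -0.05]          # one bias per ensemble member

def ensemble_step(states, a):
    """Advance one imagined state per ensemble member."""
    return [s + a + b for s, b in zip(states, BIASES)]

def plan(s0, horizon=5, n_candidates=64):
    """Score candidate action sequences by their mean imagined return across
    the ensemble; return only the first action (replan every real step)."""
    best_seq, best_ret = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        states, ret = [s0] * len(BIASES), 0.0
        for a in seq:
            states = ensemble_step(states, a)
            ret += np.mean([-s ** 2 for s in states])
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq[0]

a0 = plan(s0=2.0)   # one bounded action; the short horizon limits model exploitation
```

Averaging returns over the ensemble and keeping the horizon short are exactly the two defenses the paragraph names: disagreement between members washes out plans that only work under one (possibly wrong) model.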

Common Confusions

Watch Out

DDPG and TD3 are off-policy, but the replay buffer does not make them safe

Off-policy training reuses old transitions, but the policy still chooses actions on the real (or simulated) environment during exploration. Bad initial policies can damage hardware before the replay buffer is full. Pretraining on demonstrations or in simulation is standard practice for real-robot work.

Watch Out

Domain randomization is not the same as data augmentation

Augmentation adds perturbations to a fixed dataset. Domain randomization changes the simulator dynamics during training, so the policy interacts with a moving environment. The policy must therefore be robust, not just the perception module.



Last reviewed: April 18, 2026
