Applied ML
Deep RL for Control
DDPG, TD3, and SAC for continuous control, the sim-to-real gap, domain randomization, the MuJoCo benchmark history, and why model-based methods (PETS, Dreamer) are closing the sample-efficiency gap on real-robot deployments.
Why This Matters
Continuous-control RL is the regime where the action space is a vector in R^n (joint torques, motor voltages, thrust commands) rather than a discrete menu. Q-learning does not directly apply because the argmax of Q(s, a) over a continuous action set is itself an optimization problem. Deterministic and entropy-regularized actor-critic methods sidestep this by parameterizing the policy and following the deterministic or stochastic policy gradient.
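A toy sketch of why the argmax is the hard part: with a known one-dimensional critic (a hypothetical quadratic, chosen only for illustration), maximizing over a continuous action set requires gradient ascent rather than a table lookup, which is exactly the role a deterministic actor plays.

```python
# In discrete Q-learning, argmax_a Q(s, a) is a table lookup; over a
# continuous action set it is itself an optimization problem, which a
# deterministic actor solves by following the critic's action gradient.
# Hypothetical quadratic critic for a fixed state: Q(s, a) = -(a - 0.7)^2.
def Q(a):
    return -(a - 0.7) ** 2

def dQ_da(a):                 # gradient of the critic w.r.t. the action
    return -2.0 * (a - 0.7)

a = 0.0                       # actor's current action for this state
for _ in range(200):          # gradient ascent: follow dQ/da
    a = a + 0.1 * dQ_da(a)

print(round(a, 3))            # converges to the maximizing action 0.7
```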
The MuJoCo physics-engine benchmarks (HalfCheetah, Hopper, Walker2d, Ant, Humanoid) became the standard testbed after 2016, and the line of work from DDPG to TD3 to SAC tracks the field's understanding of what makes off-policy continuous control stable. The sim-to-real problem is the practical reason this matters: a policy that cleans up on MuJoCo does not survive contact with a real robot unless the training distribution covers the physical-world variation.
Core Ideas
DDPG (Lillicrap et al., ICLR 2016, arXiv:1509.02971) extends DQN to continuous actions with a deterministic actor μ_θ(s) and critic Q_φ(s, a), updated via the deterministic policy gradient ∇_θ J = E[∇_a Q_φ(s, a)|_{a=μ_θ(s)} ∇_θ μ_θ(s)]. DDPG is famously brittle: small hyperparameter changes flip success and failure, and the critic systematically overestimates Q-values.
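The chain-rule structure of the deterministic policy gradient can be sketched in a few lines of numpy. This is a toy: a linear actor against a hand-coded (not learned) critic, with no replay buffer or target networks, so it illustrates only the ∇_a Q · ∇_θ μ update itself.

```python
import numpy as np

# Assume a known critic Q(s, a) = -(a - 2s)^2, so the optimal
# deterministic actor is mu(s) = 2s. We parameterize mu_theta(s) = theta*s.
theta = 0.0

def dQ_da(s, a):                 # gradient of the critic w.r.t. the action
    return -2.0 * (a - 2.0 * s)

rng = np.random.default_rng(0)
for _ in range(500):
    s = rng.uniform(0.5, 1.5)    # sampled state (stands in for a replay batch)
    a = theta * s                # actor's action at this state
    # Deterministic policy gradient: dJ/dtheta = dQ/da * dmu/dtheta
    theta += 0.05 * dQ_da(s, a) * s

print(round(theta, 2))           # approaches the optimal gain 2.0
```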
TD3 (Fujimoto, van Hoof, Meger, ICML 2018, arXiv:1802.09477) fixes three failure modes: (1) clipped double-Q learning takes the minimum of two critics to fight overestimation, (2) delayed actor updates run the critic for several steps before each actor step, and (3) target policy smoothing adds noise to the target action so the critic does not exploit sharp peaks. The result is a much less fragile algorithm at the same sample budget.
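The target computation combining fixes (1) and (3) can be sketched for a single transition as below; the critics, shapes, and constants are illustrative stand-ins, not a library API. Fix (2), delayed actor updates, is simply "run this critic update several times per actor step."

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, noise_std, noise_clip = 0.99, 0.2, 0.5

def target_actor(s_next):            # target policy mu'(s')
    return np.tanh(s_next)

def critic1(s, a):                   # two independent target critics
    return -(a - 0.5) ** 2
def critic2(s, a):
    return -(a - 0.5) ** 2 + 0.1     # critic2 is optimistic by 0.1

r, s_next, done = 1.0, 0.3, False

# Fix (3), target policy smoothing: clipped noise on the target action
# so the critic cannot exploit sharp peaks in its own value estimate.
eps = np.clip(rng.normal(0.0, noise_std), -noise_clip, noise_clip)
a_next = np.clip(target_actor(s_next) + eps, -1.0, 1.0)

# Fix (1), clipped double-Q: take the min so the optimistic critic loses.
q_next = min(critic1(s_next, a_next), critic2(s_next, a_next))
y = r + gamma * (1.0 - float(done)) * q_next   # shared TD target for both critics
```

Both critics regress toward the same target `y`; only the actor is updated against one of them.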
SAC (Haarnoja et al., ICML 2018, arXiv:1801.01290) replaces the deterministic actor with a stochastic Gaussian policy and adds an entropy bonus to the reward, with the temperature α tuned automatically to hit a target entropy. Maximum-entropy RL handles exploration through the policy itself and produces policies that are more robust to perturbation. SAC is the off-policy continuous-control workhorse as of 2026.
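The automatic temperature adjustment can be sketched as follows. The log-probabilities are simulated rather than coming from a real policy, and a real implementation would backpropagate through `log_alpha` instead of hand-coding the gradient; the mechanism shown is gradient descent on J(α) = E[-α(log π(a|s) + target_entropy)].

```python
import numpy as np

# When policy entropy is above the target, alpha shrinks (less entropy
# bonus); when it is below the target, alpha grows. Optimizing log_alpha
# keeps the temperature positive.
target_entropy = -1.0                 # common heuristic: -dim(action_space)
log_alpha, lr = 0.0, 0.1
rng = np.random.default_rng(0)

for _ in range(300):
    # Simulated batch-mean log pi(a|s): entropy ~ 0.2, well above target.
    log_pi = rng.normal(-0.2, 0.05)
    # dJ/d(log_alpha) = -alpha * (log_pi + target_entropy)
    grad = -np.exp(log_alpha) * (log_pi + target_entropy)
    log_alpha -= lr * grad

alpha = np.exp(log_alpha)             # alpha has decayed toward zero
```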
The sim-to-real gap is the failure of a policy trained on a simulator to perform on hardware. The two dominant fixes are system identification (fit the simulator parameters to the real robot) and domain randomization (Tobin et al., IROS 2017, arXiv:1703.06907), which trains over a distribution of simulator parameters wide enough to contain reality. OpenAI's in-hand Rubik's cube manipulation (2019) and ANYmal locomotion (Hwangbo et al., Science Robotics 2019) are the standard demonstrations.
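The training-loop side of domain randomization is just resampling simulator parameters per episode. A minimal sketch, with parameter names and ranges that are illustrative rather than taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_params():
    # Fresh physics parameters for each training episode, so the policy
    # faces a distribution of dynamics rather than one fixed simulator.
    return {
        "mass_scale":     rng.uniform(0.7, 1.3),   # +/-30% on link masses
        "friction":       rng.uniform(0.5, 1.2),   # ground friction coefficient
        "motor_delay_ms": rng.uniform(0.0, 20.0),  # actuation latency
        "obs_noise_std":  rng.uniform(0.0, 0.02),  # sensor noise
    }

episodes = [sample_sim_params() for _ in range(1000)]
# If the real robot's (unknown) parameters fall inside these ranges,
# a policy robust across the distribution has a chance of transferring.
masses = [p["mass_scale"] for p in episodes]
print(min(masses) >= 0.7 and max(masses) <= 1.3)
```

System identification narrows these ranges around measured values; randomization widens them to cover what you cannot measure.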
Model-based RL (PETS, Chua et al. NeurIPS 2018; Dreamer family, Hafner et al. 2020-2024) learns a forward dynamics model and plans or trains a policy in the imagined rollouts. On many MuJoCo tasks, DreamerV3 reaches SAC-level returns with 10x to 100x fewer environment steps. The trade-off is that the learned model can be wrong in ways the planner exploits, so model-based methods need uncertainty estimates or short-horizon rollouts.
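One way to see why uncertainty estimates and short horizons matter: with a toy ensemble of slightly different dynamics models (a hand-coded stand-in for a PETS-style learned ensemble), member predictions diverge with rollout depth, and truncating at a disagreement threshold bounds how far a planner can chase model error.

```python
import numpy as np

# Toy "learned" dynamics ensemble: each member has a slightly different
# drift, so imagined trajectories diverge the deeper the rollout goes.
drifts = [1.00, 1.02, 1.05]
def step(member, s, a):
    return drifts[member] * s + a

s0, horizon, threshold = 1.0, 25, 0.5
states = [np.full(len(drifts), s0)]     # per-member state trajectories
for t in range(horizon):
    a = 0.1                             # fixed action, enough for the sketch
    nxt = np.array([step(m, states[-1][m], a) for m in range(len(drifts))])
    if nxt.std() > threshold:           # ensemble disagreement = uncertainty
        break                           # truncate the imagined rollout here
    states.append(nxt)

print(len(states) - 1)  # effective rollout length before truncation
```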
Common Confusions
DDPG and TD3 are off-policy, but the replay buffer does not make them safe
Off-policy training reuses old transitions, but the policy still chooses actions on the real (or simulated) environment during exploration. Bad initial policies can damage hardware before the replay buffer is full. Pretraining on demonstrations or in simulation is standard practice for real-robot work.
Domain randomization is not the same as data augmentation
Augmentation adds perturbations to a fixed dataset. Domain randomization changes the simulator dynamics during training, so the policy interacts with a moving environment. The policy must therefore be robust, not just the perception module.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Actor-Critic Methods (Layer 3)
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Q-Learning (Layer 2)
- Value Iteration and Policy Iteration (Layer 2)