What Each Paradigm Does
Model-based and model-free RL are two strategies for solving sequential decision problems. Both aim to find a policy $\pi$ that maximizes the expected cumulative reward $\mathbb{E}\left[\sum_{t} \gamma^t r_t\right]$. The difference is whether the agent builds an explicit model of the environment.
Model-based: Learn a dynamics model and reward model from experience, then use them to plan (simulate trajectories, search over action sequences, or compute values via dynamic programming).
Model-free: Learn a value function or policy directly from environment interactions, without ever explicitly modeling how the environment works.
Side-by-Side Formulation
Model-Based RL
The agent maintains a learned dynamics model $\hat{P}(s' \mid s, a)$ and reward model $\hat{r}(s, a)$ and uses them for planning. The planning step solves:

$$\max_{a_{0:H}} \; \mathbb{E}_{s_{t+1} \sim \hat{P}(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{H} \gamma^t\, \hat{r}(s_t, a_t)\right]$$
This can be done via:
- Shooting methods: sample action sequences, evaluate them under $\hat{P}$ and $\hat{r}$, pick the best (CEM, MPPI)
- Dynamic programming: compute value functions using $\hat{P}$ (Dyna, value iteration on the model)
- Learned planning: encode the model into a latent space and plan there (Dreamer, MuZero)
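The shooting-method idea above fits in a few lines of code. Below is a minimal random-shooting MPC sketch over a toy 1-D system; `shooting_plan`, `model_step`, and `reward_fn` are illustrative names standing in for learned models, and the uniform action sampling is a simplifying assumption (CEM would iteratively refit the sampling distribution instead):

```python
import numpy as np

def shooting_plan(model_step, reward_fn, s0, horizon=10, n_candidates=500,
                  action_dim=1, rng=None):
    """Random-shooting MPC: sample candidate action sequences, roll each out
    under the learned model, and return the first action of the best sequence.

    model_step(s, a) -> s'  plays the role of the learned dynamics model;
    reward_fn(s, a)  -> r   plays the role of the learned reward model.
    """
    rng = rng or np.random.default_rng(0)
    # Sample candidate action sequences uniformly in [-1, 1]^action_dim.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = s0
        for a in actions:
            returns[i] += reward_fn(s, a)
            s = model_step(s, a)          # imagined step; real env untouched
    return candidates[np.argmax(returns), 0]  # MPC: execute first action only

# Toy example: 1-D point mass, reward for staying close to the origin
step = lambda s, a: s + 0.1 * a
reward = lambda s, a: -(s ** 2).sum()
a0 = shooting_plan(step, reward, s0=np.array([1.0]))
```

Note that every rollout happens inside the learned model; the real environment is queried only once per executed action, which is where the sample-efficiency advantage comes from.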
Model-Free RL
The agent directly approximates the optimal value function or policy without a dynamics model.
Value-based (DQN): Learn $Q^*(s, a)$ satisfying the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$$

Policy-based (PPO, REINFORCE): Directly optimize the policy parameters $\theta$:

$$\max_{\theta} \; J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \gamma^t r_t\right]$$
using policy gradient estimates from sampled trajectories.
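In the tabular case, the value-based update reduces to a one-line rule that moves $Q(s, a)$ toward the sampled Bellman target. A minimal sketch, assuming a small discrete state/action space:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One model-free Q-learning step: nudge Q(s, a) toward the sampled
    Bellman optimality target r + gamma * max_a' Q(s', a').
    No dynamics model is ever consulted -- only the observed transition."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy 2-state, 2-action table; one observed transition (s=0, a=1, r=1.0, s'=1)
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] has moved a fraction alpha of the way from 0 toward the target 1.0
```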
Where Each Is Stronger
Model-based wins on sample efficiency
A learned model can generate unlimited synthetic experience. Instead of interacting with the real environment for every training sample, the agent can "imagine" trajectories and learn from them. This is particularly valuable when real interactions are expensive:
- Dreamer v3 achieves strong Atari performance with 10 to 100x fewer environment frames than model-free methods
- MuZero learns to play Go, chess, and Atari by planning with a learned latent model, using orders of magnitude fewer games than AlphaGo
- In robotics, model-based methods can learn locomotion from minutes of real data by fitting a dynamics model and then planning against that model
Model-free wins on asymptotic performance
Model-free methods have no model to be wrong. They learn directly from real transitions, so their performance is limited only by function approximation error and exploration, not by model error. In environments where:
- The dynamics are highly complex or stochastic
- The model class cannot capture the true dynamics
- Abundant experience is available (fast simulators, parallel environments)
model-free methods often achieve higher final performance because they avoid the compounding errors of an imperfect model.
The Model Error Problem
Compounding Model Error
Statement
If the learned model $\hat{P}$ has per-step prediction error $\max_{s,a} D_{TV}\big(P(\cdot \mid s,a),\, \hat{P}(\cdot \mid s,a)\big) \le \epsilon$, then the total variation distance between the real and model-predicted state distributions after $H$ steps satisfies:

$$D_{TV}\big(\rho^H_{P},\, \rho^H_{\hat{P}}\big) \le H\epsilon$$

The value function error under the model-based policy is bounded by:

$$\big\| V^{\pi_{\hat{P}}} - V^{\pi^*} \big\|_\infty \le \frac{2\gamma\, \epsilon\, R_{\max}}{(1-\gamma)^2}$$

where $\pi_{\hat{P}}$ is the policy optimal under the learned model.
Intuition
Small per-step errors accumulate linearly with the planning horizon. Over a 50-step plan, even 1% per-step error becomes 50% total error. This is why model-based methods struggle with long-horizon tasks unless the model is very accurate or planning happens in a learned latent space where errors are more controlled.
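The linear accumulation is easy to verify numerically. A toy illustration with hypothetical 1-D dynamics and a constant 1% per-step model bias:

```python
# True dynamics vs. a model with a small per-step bias: the state-prediction
# gap grows linearly with the horizon, matching the H * epsilon bound.
true_step  = lambda s: s + 1.0
model_step = lambda s: s + 1.01   # constant 0.01 per-step prediction error

s_true = s_model = 0.0
for t in range(50):
    s_true, s_model = true_step(s_true), model_step(s_model)
gap = s_model - s_true            # 50 steps * 0.01 error/step = 0.5
```

With contractive (stable) dynamics the gap can saturate below $H\epsilon$, which is one reason the bound is worst-case; with unstable dynamics the divergence can be even faster.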
Key Assumptions That Differ
| | Model-Based | Model-Free |
|---|---|---|
| What is learned | Dynamics model $\hat{P}(s' \mid s, a)$ plus policy/value | Value function and/or policy only |
| Data usage | Generate synthetic data from model | Use only real experience |
| Sample efficiency | High (imagination augments real data) | Low (every sample is real) |
| Asymptotic performance | Limited by model accuracy | Limited by function approximation |
| Computational cost | High (model learning + planning) | Lower (no planning step) |
| Key algorithms | Dreamer, MuZero, MBPO, PETS | DQN, PPO, SAC, TD3 |
The Dyna Architecture: Bridging Both
The Dyna framework combines model-based and model-free learning:
1. Interact with the real environment, store transitions
2. Update a model-free value function from real transitions
3. Learn a dynamics model from the same transitions
4. Generate synthetic transitions from the model
5. Update the value function from synthetic transitions
Steps 4 and 5 can be repeated many times per real transition, amplifying sample efficiency. MBPO (Model-Based Policy Optimization) is a modern incarnation: it uses a learned ensemble of dynamics models to generate short synthetic rollouts, then trains SAC on the combined real and synthetic data.
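The five-step loop can be sketched as a tabular Dyna-Q iteration. This is a minimal sketch, assuming a deterministic tabular model stored as a dict (real Dyna variants can learn stochastic models; function and variable names here are illustrative):

```python
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, planning_steps=10,
                alpha=0.1, gamma=0.95, rng=None):
    """One Dyna-Q iteration over a real transition (s, a, r, s').
    Q is an |S| x |A| table; `model` maps (s, a) -> (r, s')."""
    rng = rng or np.random.default_rng(0)

    def q_update(s, a, r, s_next):
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    q_update(s, a, r, s_next)        # step 2: learn from the real transition
    model[(s, a)] = (r, s_next)      # step 3: update the learned model
    for _ in range(planning_steps):  # steps 4-5: replay imagined transitions
        (ps, pa), (pr, ps_next) = list(model.items())[rng.integers(len(model))]
        q_update(ps, pa, pr, ps_next)

# Usage: two states, two actions; one real transition replayed 10 extra times
Q, model = np.zeros((2, 2)), {}
dyna_q_step(Q, model, s=0, a=1, r=1.0, s_next=1)
```

The planning loop reuses the same Q-learning update as the real step, which is the sense in which Dyna amplifies each real transition into many learning updates.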
When a Practitioner Would Use Each
Board games and strategic planning
Use model-based (MuZero). The dynamics of board games are deterministic or low-stochasticity, and the planning horizon is long. MuZero learns a latent dynamics model and uses Monte Carlo tree search to plan, achieving superhuman performance in Go, chess, and shogi without knowing the rules.
Atari with unlimited simulator access
Use model-free (PPO or Rainbow DQN). The Atari simulator is fast and free to run. The visual dynamics are complex and hard to model accurately. Model-free methods achieve top performance by simply running billions of frames through the simulator.
Real robot learning with limited trials
Use model-based. Each real-world trial takes time, causes wear, and risks damage. Learning a dynamics model from 10 minutes of interaction and then planning with that model is far more practical than running PPO for millions of steps on a physical robot.
Continuous control in MuJoCo
Use hybrid (MBPO or Dreamer). These environments have smooth, learnable dynamics but also benefit from model-free refinement. MBPO uses model-generated data to warm-start a model-free algorithm, getting the sample efficiency of model-based methods and the asymptotic performance of model-free methods.
Common Confusions
Model-based does not mean you need a pixel-level predictor
Early model-based RL tried to predict future observations (e.g., next video frame). This is unnecessary and wasteful. Modern methods like MuZero and Dreamer learn dynamics in a compact latent space. The model predicts latent states and rewards, not raw observations. This avoids modeling irrelevant visual details.
Model-free methods do have implicit models
A well-trained Q-function implicitly encodes knowledge about the dynamics. $Q(s, a)$ reflects the expected future rewards from taking action $a$ in state $s$, which requires "knowing" which states follow. The difference is that this knowledge is implicit in the value function, not available as an explicit simulator that can be queried for arbitrary state-action pairs.
Sample efficiency is not the only cost
Model-based methods use fewer environment interactions but more computation (model training, planning). In fast simulators, environment interactions are cheap and computation is the bottleneck. In that regime, model-free methods may be faster in wall-clock time despite using more samples.