
Comparison

Model-Based vs. Model-Free RL

Model-based RL learns a dynamics model and plans internally (Dreamer, MuZero), while model-free RL learns value functions or policies directly from experience (DQN, PPO). The tradeoff is sample efficiency vs. model error.

What Each Paradigm Does

Model-based and model-free RL are two strategies for solving sequential decision problems. Both aim to find a policy $\pi$ that maximizes expected cumulative reward $\mathbb{E}[\sum_t \gamma^t r_t]$. The difference is whether the agent builds an explicit model of the environment.

Model-based: Learn a dynamics model $\hat{p}(s' \mid s,a)$ and reward model $\hat{r}(s,a)$ from experience, then use them to plan (simulate trajectories, search over action sequences, or compute values via dynamic programming).

Model-free: Learn a value function $Q(s,a)$ or policy $\pi(a \mid s)$ directly from environment interactions, without ever explicitly modeling how the environment works.
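The model-learning step can be viewed as ordinary supervised regression on observed transitions. A minimal sketch, using a hypothetical linear toy environment (the dynamics coefficients and all names below are illustrative, not from any library):

```python
import numpy as np

# Toy environment with unknown linear dynamics: s' = 0.9*s + 0.1*a.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 1))          # observed states
A = rng.normal(size=(500, 1))          # actions taken
S_next = 0.9 * S + 0.1 * A             # observed next states

# Fit the dynamics model by least squares on (s, a) -> s' pairs:
# learning \hat{p} is just supervised learning on transition data.
X = np.hstack([S, A])                  # features: [s, a]
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

print(W.ravel())                       # recovers roughly [0.9, 0.1]
```

With noiseless linear data the regression recovers the true coefficients; in practice the model is a neural network and the fit is approximate, which is the source of the model error discussed below.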

Side-by-Side Formulation

Definition

Model-Based RL

The agent maintains a learned model $\hat{M} = (\hat{p}, \hat{r})$ and uses it for planning. The planning step solves:

$$\pi^* = \arg\max_\pi \mathbb{E}_{\hat{M},\pi}\!\left[\sum_{t=0}^{H} \gamma^t \hat{r}(s_t, a_t)\right]$$

This can be done via:

  • Shooting methods: sample action sequences, evaluate them under $\hat{M}$, pick the best (CEM, MPPI)
  • Dynamic programming: compute value functions using $\hat{p}$ (Dyna, value iteration on the model)
  • Learned planning: encode the model into a latent space and plan there (Dreamer, MuZero)
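The simplest shooting method can be sketched in a few lines: sample random action sequences, roll each out under the model, and execute the first action of the best one. The toy model and reward below are hand-written stand-ins, not a real learned model (CEM and MPPI add iterative refinement of the sampling distribution on top of this):

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a):
    """Stand-in for the learned model (p-hat, r-hat)."""
    s_next = s + a                      # toy deterministic dynamics
    reward = -abs(s_next - 5.0)         # reward for staying near s = 5
    return s_next, reward

def shoot(s0, horizon=10, n_candidates=256):
    """Random shooting: sample action sequences, evaluate each under the
    model, return the first action of the best-scoring sequence."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_ret, best_first = -np.inf, 0.0
    for seq in candidates:
        s, ret = s0, 0.0
        for a in seq:
            s, r = model_step(s, a)
            ret += r
        if ret > best_ret:
            best_ret, best_first = ret, seq[0]
    return best_first

# From s=0 the target s=5 is reached by sustained positive actions,
# so the planner typically selects a positive first action.
print(shoot(0.0))
```

Only the first action is executed; the plan is then recomputed from the new state (model-predictive control), which limits how far model errors can propagate.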
Definition

Model-Free RL

The agent directly approximates the optimal value function or policy without a dynamics model.

Value-based (DQN): Learn $Q^*(s,a)$ satisfying the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$

Policy-based (PPO, REINFORCE): Directly optimize the policy parameters:

$$\theta^* = \arg\max_\theta \mathbb{E}_{\pi_\theta}\!\left[\sum_t \gamma^t r_t\right]$$

using policy gradient estimates from sampled trajectories.
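The value-based route can be sketched as tabular Q-learning, which applies a sampled version of the Bellman optimality equation to real transitions only. The 4-state chain environment below is a hand-written illustration:

```python
import numpy as np

# Tiny deterministic chain MDP: states 0..3, actions {0: left, 1: right}.
# Reaching state 3 yields reward 1 and restarts the episode at state 0.
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.5
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 3 else 0.0)

s = 0
for _ in range(2000):
    a = rng.integers(n_actions)          # pure exploration
    s_next, r = step(s, a)
    # Model-free update: sampled form of the Bellman optimality equation.
    # No dynamics model is ever built -- only (s, a, r, s') tuples are used.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == 3 else s_next

print(Q.argmax(axis=1))                  # greedy policy: "right" in states 0-2
```

After enough real transitions the greedy policy moves right everywhere, even though the agent never learned what `step` does.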

Where Each Is Stronger

Model-based wins on sample efficiency

A learned model can generate unlimited synthetic experience. Instead of interacting with the real environment for every training sample, the agent can "imagine" trajectories and learn from them. This is particularly valuable when real interactions are expensive, as with physical robots or other real-world systems where every trial has a cost.

Model-free wins on asymptotic performance

Model-free methods have no model to be wrong. They learn directly from real transitions, so their performance is limited only by function approximation error and exploration, not by model error. In environments where interaction is cheap (e.g., a fast simulator) and the dynamics are hard to model accurately, model-free methods often achieve higher final performance because they avoid the compounding errors of an imperfect model.

The Model Error Problem

Proposition

Compounding Model Error

Statement

If the learned model $\hat{p}$ has per-step prediction error $\epsilon = \max_{s,a}\, D_{\text{TV}}(\hat{p}(\cdot \mid s,a),\, p(\cdot \mid s,a))$, then the total variation distance between the real and model-predicted state distributions after $H$ steps satisfies:

$$D_{\text{TV}}(p_H, \hat{p}_H) \leq H\epsilon$$

The value function error under the model-based policy is bounded by:

$$\left|V^{\pi^*} - V^{\hat{\pi}}\right| \leq \frac{2\gamma\epsilon\, r_{\max}}{(1-\gamma)^2}$$

where $\hat{\pi}$ is the policy that is optimal under the learned model.

Intuition

Small per-step errors accumulate linearly with the planning horizon. Over a 50-step plan, even 1% per-step error becomes 50% total error. This is why model-based methods struggle with long-horizon tasks unless the model is very accurate or planning happens in a learned latent space where errors are more controlled.
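The arithmetic behind this intuition can be checked directly by plugging numbers into the two bounds from the proposition:

```python
# Linear-in-horizon bound: D_TV(p_H, p-hat_H) <= H * eps, capped at 1.
eps = 0.01                                   # 1% per-step TV error
for H in (10, 50, 100):
    print(f"H={H}: state-distribution error <= {min(1.0, H * eps):.2f}")

# Value-error bound 2*gamma*eps*r_max / (1 - gamma)^2 blows up as gamma -> 1,
# i.e. long effective horizons amplify even tiny model errors.
r_max = 1.0
for gamma in (0.9, 0.99):
    bound = 2 * gamma * eps * r_max / (1 - gamma) ** 2
    print(f"gamma={gamma}: value error <= {bound:.1f}")
```

At $\gamma = 0.99$ a 1% model error already permits a value error of nearly 200 reward units, which is why short model rollouts (as in MBPO) or latent-space planning are used to keep the effective horizon small.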

Key Assumptions That Differ

| | Model-Based | Model-Free |
| --- | --- | --- |
| What is learned | Dynamics model $\hat{p}(s' \mid s,a)$ + policy/value | Value function $Q(s,a)$ or policy $\pi(a \mid s)$ |
| Data usage | Generate synthetic data from model | Use only real experience |
| Sample efficiency | High (imagination augments real data) | Low (every sample is real) |
| Asymptotic performance | Limited by model accuracy | Limited by function approximation |
| Computational cost | High (model learning + planning) | Lower (no planning step) |
| Key algorithms | Dreamer, MuZero, MBPO, PETS | DQN, PPO, SAC, TD3 |

The Dyna Architecture: Bridging Both

The Dyna framework combines model-based and model-free learning:

  1. Interact with the real environment, store transitions
  2. Update a model-free value function from real transitions
  3. Learn a dynamics model from the same transitions
  4. Generate synthetic transitions from the model
  5. Update the value function from synthetic transitions

Steps 4 and 5 can be repeated many times per real transition, amplifying sample efficiency. MBPO (Model-Based Policy Optimization) is a modern incarnation: it uses a learned ensemble of dynamics models to generate short synthetic rollouts, then trains SAC on the combined real and synthetic data.
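The five steps above can be sketched as tabular Dyna-Q. The toy chain environment is hand-written for illustration; the "model" here is simply a memorized table of observed transitions:

```python
import numpy as np

# Tabular Dyna-Q: real Q-learning updates plus extra "imagined" updates
# replayed from a learned (here: memorized) model of past transitions.
n_states, n_actions, gamma, alpha, n_planning = 4, 2, 0.9, 0.5, 10
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
model = {}                                   # (s, a) -> (s', r)

def env_step(s, a):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 3 else 0.0)

s = 0
for _ in range(200):                         # few real interactions
    a = rng.integers(n_actions)
    s_next, r = env_step(s, a)               # 1. interact, store transition
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])  # 2. real update
    model[(s, a)] = (s_next, r)              # 3. learn the model
    for _ in range(n_planning):              # 4-5. synthetic updates
        ps, pa = list(model)[rng.integers(len(model))]
        ms, mr = model[(ps, pa)]
        Q[ps, pa] += alpha * (mr + gamma * Q[ms].max() - Q[ps, pa])
    s = 0 if s_next == 3 else s_next

print(Q.argmax(axis=1))                      # greedy policy: "right" in states 0-2
```

With 10 planning updates per real step, the policy converges using far fewer real transitions than pure Q-learning would need on the same problem.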

When a Practitioner Would Use Each

Example

Board games and strategic planning

Use model-based (MuZero). The dynamics of board games are deterministic or low-stochasticity, and the planning horizon is long. MuZero learns a latent dynamics model and uses Monte Carlo tree search to plan, achieving superhuman performance in Go, chess, and shogi without knowing the rules.

Example

Atari with unlimited simulator access

Use model-free (PPO or Rainbow DQN). The Atari simulator is fast and free to run. The visual dynamics are complex and hard to model accurately. Model-free methods achieve top performance by simply running billions of frames through the simulator.

Example

Real robot learning with limited trials

Use model-based. Each real-world trial takes time, causes wear, and risks damage. Learning a dynamics model from 10 minutes of interaction and then planning with that model is far more practical than running PPO for millions of steps on a physical robot.

Example

Continuous control in MuJoCo

Use hybrid (MBPO or Dreamer). These environments have smooth, learnable dynamics but also benefit from model-free refinement. MBPO uses model-generated data to warm-start a model-free algorithm, getting the sample efficiency of model-based methods and the asymptotic performance of model-free methods.
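MBPO's rollout scheme can be sketched schematically: short imagined trajectories branch off from real states, with each step querying a randomly chosen ensemble member. Everything below (the ensemble, the policy, the names) is a hypothetical stand-in, not the MBPO codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in ensemble of three slightly-disagreeing dynamics models.
ensemble = [lambda s, a, b=b: s + a + 0.01 * b for b in (-1.0, 0.0, 1.0)]

def branched_rollouts(real_states, policy, k=3):
    """k-step imagined transitions branched from real states. Sampling a
    random ensemble member per step propagates model uncertainty, and the
    short horizon k keeps compounding error small."""
    synthetic = []
    for s in real_states:
        for _ in range(k):
            a = policy(s)
            model = ensemble[rng.integers(len(ensemble))]
            s_next = model(s, a)
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic

data = branched_rollouts([0.0, 1.0, 2.0], policy=lambda s: 0.1)
print(len(data))   # 3 start states x 3 steps = 9 synthetic transitions
```

The synthetic transitions are then mixed with real ones in the replay buffer of a model-free learner such as SAC.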

Common Confusions

Watch Out

Model-based does not mean you need a pixel-level predictor

Early model-based RL tried to predict future observations (e.g., next video frame). This is unnecessary and wasteful. Modern methods like MuZero and Dreamer learn dynamics in a compact latent space. The model predicts latent states and rewards, not raw observations. This avoids modeling irrelevant visual details.

Watch Out

Model-free methods do have implicit models

A well-trained Q-function implicitly encodes knowledge about the dynamics. $Q(s,a)$ reflects the expected future rewards from taking action $a$ in state $s$, which requires "knowing" what states follow. The difference is that this knowledge is implicit in the value function, not available as an explicit simulator that can be queried for arbitrary state-action pairs.

Watch Out

Sample efficiency is not the only cost

Model-based methods use fewer environment interactions but more computation (model training, planning). In fast simulators, environment interactions are cheap and computation is the bottleneck. In that regime, model-free methods may be faster in wall-clock time despite using more samples.