
Comparison

Model-Based vs. Model-Free RL

Model-based RL learns a dynamics model and plans internally (Dreamer, MuZero), while model-free RL learns value functions or policies directly from experience (DQN, PPO). The tradeoff is sample efficiency vs. model error.

What Each Paradigm Does

Model-based and model-free RL are two strategies for solving sequential decision problems. Both aim to find a policy $\pi$ that maximizes expected cumulative reward $\mathbb{E}[\sum_t \gamma^t r_t]$. The difference is whether the agent builds an explicit model of the environment.

Model-based: Learn a dynamics model $\hat{p}(s' \mid s,a)$ and reward model $\hat{r}(s,a)$ from experience, then use them to plan (simulate trajectories, search over action sequences, or compute values via dynamic programming).

Model-free: Learn a value function $Q(s,a)$ or policy $\pi(a \mid s)$ directly from environment interactions, without ever explicitly modeling how the environment works.
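The model-learning step can be viewed as ordinary supervised regression on observed transitions. A minimal sketch, using a hypothetical linear toy environment (the dynamics coefficients and all names below are illustrative, not from any library):

```python
import numpy as np

# Toy environment with unknown linear dynamics: s' = 0.9*s + 0.1*a.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 1))          # observed states
A = rng.normal(size=(500, 1))          # actions taken
S_next = 0.9 * S + 0.1 * A             # observed next states

# Fit the dynamics model by least squares on (s, a) -> s' pairs:
# learning \hat{p} is just supervised learning on transition data.
X = np.hstack([S, A])                  # features: [s, a]
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

print(W.ravel())                       # recovers roughly [0.9, 0.1]
```

With noiseless linear data the regression recovers the true coefficients; in practice the model is a neural network and the fit is approximate, which is the source of the model error discussed below.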

Side-by-Side Formulation

Definition

Model-Based RL

The agent maintains a learned model $\hat{M} = (\hat{p}, \hat{r})$ and uses it for planning. The planning step solves:

$$\pi^* = \arg\max_\pi \mathbb{E}_{\hat{M},\pi}\!\left[\sum_{t=0}^{H} \gamma^t \hat{r}(s_t, a_t)\right]$$

This can be done via:

  • Shooting methods: sample action sequences, evaluate them under $\hat{M}$, pick the best (CEM, MPPI)
  • Dynamic programming: compute value functions using $\hat{p}$ (Dyna, value iteration on the model)
  • Learned planning: encode the model into a latent space and plan there (Dreamer, MuZero)
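The simplest shooting method can be sketched in a few lines: sample random action sequences, roll each out under the model, and execute the first action of the best one. The toy model and reward below are hand-written stand-ins, not a real learned model (CEM and MPPI add iterative refinement of the sampling distribution on top of this):

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a):
    """Stand-in for the learned model (p-hat, r-hat)."""
    s_next = s + a                      # toy deterministic dynamics
    reward = -abs(s_next - 5.0)         # reward for staying near s = 5
    return s_next, reward

def shoot(s0, horizon=10, n_candidates=256):
    """Random shooting: sample action sequences, evaluate each under the
    model, return the first action of the best-scoring sequence."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_ret, best_first = -np.inf, 0.0
    for seq in candidates:
        s, ret = s0, 0.0
        for a in seq:
            s, r = model_step(s, a)
            ret += r
        if ret > best_ret:
            best_ret, best_first = ret, seq[0]
    return best_first

# From s=0 the target s=5 is reached by sustained positive actions,
# so the planner typically selects a positive first action.
print(shoot(0.0))
```

Only the first action is executed; the plan is then recomputed from the new state (model-predictive control), which limits how far model errors can propagate.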
Definition

Model-Free RL

The agent directly approximates the optimal value function or policy without a dynamics model.

Value-based (DQN): Learn $Q^*(s,a)$ satisfying the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$

Policy-based (PPO, REINFORCE): Directly optimize the policy parameters:

$$\theta^* = \arg\max_\theta \mathbb{E}_{\pi_\theta}\!\left[\sum_t \gamma^t r_t\right]$$

using policy gradient estimates from sampled trajectories.
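The value-based route can be sketched as tabular Q-learning, which applies a sampled version of the Bellman optimality equation to real transitions only. The 4-state chain environment below is a hand-written illustration:

```python
import numpy as np

# Tiny deterministic chain MDP: states 0..3, actions {0: left, 1: right}.
# Reaching state 3 yields reward 1 and restarts the episode at state 0.
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.5
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 3 else 0.0)

s = 0
for _ in range(2000):
    a = rng.integers(n_actions)          # pure exploration
    s_next, r = step(s, a)
    # Model-free update: sampled form of the Bellman optimality equation.
    # No dynamics model is ever built -- only (s, a, r, s') tuples are used.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == 3 else s_next

print(Q.argmax(axis=1))                  # greedy policy: "right" in states 0-2
```

After enough real transitions the greedy policy moves right everywhere, even though the agent never learned what `step` does.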

Where Each Is Stronger

Model-based wins on sample efficiency

A learned model can generate unlimited synthetic experience. Instead of interacting with the real environment for every training sample, the agent can "imagine" trajectories and learn from them. This is particularly valuable when real interactions are expensive, as with physical robots or other real-world systems where every trial has a cost.

Model-free wins on asymptotic performance

Model-free methods have no model to be wrong. They learn directly from real transitions, so their performance is limited only by function approximation error and exploration, not by model error. In environments where interaction is cheap (e.g., a fast simulator) and the dynamics are hard to model accurately, model-free methods often achieve higher final performance because they avoid the compounding errors of an imperfect model.

The Model Error Problem

Proposition

Compounding Model Error

Statement

If the learned model $\hat{p}$ has per-step prediction error $\epsilon = \max_{s,a}\, D_{\text{TV}}(\hat{p}(\cdot \mid s,a),\, p(\cdot \mid s,a))$, then the total variation distance between the real and model-predicted state distributions after $H$ steps satisfies:

$$D_{\text{TV}}(p_H, \hat{p}_H) \leq H\epsilon$$

The value function error under the model-based policy is bounded by:

$$\left|V^{\pi^*} - V^{\hat{\pi}}\right| \leq \frac{2\gamma\epsilon\, r_{\max}}{(1-\gamma)^2}$$

where $\hat{\pi}$ is the policy that is optimal under the learned model.

Intuition

Small per-step errors accumulate linearly with the planning horizon. Over a 50-step plan, even 1% per-step error becomes 50% total error. This is why model-based methods struggle with long-horizon tasks unless the model is very accurate or planning happens in a learned latent space where errors are more controlled.
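The arithmetic behind this intuition can be checked directly by plugging numbers into the two bounds from the proposition:

```python
# Linear-in-horizon bound: D_TV(p_H, p-hat_H) <= H * eps, capped at 1.
eps = 0.01                                   # 1% per-step TV error
for H in (10, 50, 100):
    print(f"H={H}: state-distribution error <= {min(1.0, H * eps):.2f}")

# Value-error bound 2*gamma*eps*r_max / (1 - gamma)^2 blows up as gamma -> 1,
# i.e. long effective horizons amplify even tiny model errors.
r_max = 1.0
for gamma in (0.9, 0.99):
    bound = 2 * gamma * eps * r_max / (1 - gamma) ** 2
    print(f"gamma={gamma}: value error <= {bound:.1f}")
```

At $\gamma = 0.99$ a 1% model error already permits a value error of nearly 200 reward units, which is why short model rollouts (as in MBPO) or latent-space planning are used to keep the effective horizon small.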

Key Assumptions That Differ

| | Model-Based | Model-Free |
| --- | --- | --- |
| What is learned | Dynamics model $\hat{p}(s' \mid s,a)$ + policy/value | Value function $Q(s,a)$ or policy $\pi(a \mid s)$ |
| Data usage | Generate synthetic data from model | Use only real experience |
| Sample efficiency | High (imagination augments real data) | Low (every sample is real) |
| Asymptotic performance | Limited by model accuracy | Limited by function approximation |
| Computational cost | High (model learning + planning) | Lower (no planning step) |
| Key algorithms | Dreamer, MuZero, MBPO, PETS | DQN, PPO, SAC, TD3 |

The Dyna Architecture: Bridging Both

The Dyna framework combines model-based and model-free learning:

  1. Interact with the real environment, store transitions
  2. Update a model-free value function from real transitions
  3. Learn a dynamics model from the same transitions
  4. Generate synthetic transitions from the model
  5. Update the value function from synthetic transitions

Steps 4 and 5 can be repeated many times per real transition, amplifying sample efficiency. MBPO (Model-Based Policy Optimization) is a modern incarnation: it uses a learned ensemble of dynamics models to generate short synthetic rollouts, then trains SAC on the combined real and synthetic data.
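The five steps above can be sketched as tabular Dyna-Q. The toy chain environment is hand-written for illustration; the "model" here is simply a memorized table of observed transitions:

```python
import numpy as np

# Tabular Dyna-Q: real Q-learning updates plus extra "imagined" updates
# replayed from a learned (here: memorized) model of past transitions.
n_states, n_actions, gamma, alpha, n_planning = 4, 2, 0.9, 0.5, 10
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
model = {}                                   # (s, a) -> (s', r)

def env_step(s, a):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 3 else 0.0)

s = 0
for _ in range(200):                         # few real interactions
    a = rng.integers(n_actions)
    s_next, r = env_step(s, a)               # 1. interact, store transition
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])  # 2. real update
    model[(s, a)] = (s_next, r)              # 3. learn the model
    for _ in range(n_planning):              # 4-5. synthetic updates
        ps, pa = list(model)[rng.integers(len(model))]
        ms, mr = model[(ps, pa)]
        Q[ps, pa] += alpha * (mr + gamma * Q[ms].max() - Q[ps, pa])
    s = 0 if s_next == 3 else s_next

print(Q.argmax(axis=1))                      # greedy policy: "right" in states 0-2
```

With 10 planning updates per real step, the policy converges using far fewer real transitions than pure Q-learning would need on the same problem.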

When a Practitioner Would Use Each

Example

Board games and strategic planning

Use model-based (MuZero). The dynamics of board games are deterministic or low-stochasticity, and the planning horizon is long. MuZero learns a latent dynamics model and uses Monte Carlo tree search to plan, achieving superhuman performance in Go, chess, and shogi without knowing the rules.

Example

Atari with unlimited simulator access

Use model-free (PPO or Rainbow DQN). The Atari simulator is fast and free to run. The visual dynamics are complex and hard to model accurately. Model-free methods achieve top performance by simply running billions of frames through the simulator.

Example

Real robot learning with limited trials

Use model-based. Each real-world trial takes time, causes wear, and risks damage. Learning a dynamics model from 10 minutes of interaction and then planning with that model is far more practical than running PPO for millions of steps on a physical robot.

Example

Continuous control in MuJoCo

Use hybrid (MBPO or Dreamer). These environments have smooth, learnable dynamics but also benefit from model-free refinement. MBPO uses model-generated data to warm-start a model-free algorithm, getting the sample efficiency of model-based methods and the asymptotic performance of model-free methods.
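MBPO's rollout scheme can be sketched schematically: short imagined trajectories branch off from real states, with each step querying a randomly chosen ensemble member. Everything below (the ensemble, the policy, the names) is a hypothetical stand-in, not the MBPO codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in ensemble of three slightly-disagreeing dynamics models.
ensemble = [lambda s, a, b=b: s + a + 0.01 * b for b in (-1.0, 0.0, 1.0)]

def branched_rollouts(real_states, policy, k=3):
    """k-step imagined transitions branched from real states. Sampling a
    random ensemble member per step propagates model uncertainty, and the
    short horizon k keeps compounding error small."""
    synthetic = []
    for s in real_states:
        for _ in range(k):
            a = policy(s)
            model = ensemble[rng.integers(len(ensemble))]
            s_next = model(s, a)
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic

data = branched_rollouts([0.0, 1.0, 2.0], policy=lambda s: 0.1)
print(len(data))   # 3 start states x 3 steps = 9 synthetic transitions
```

The synthetic transitions are then mixed with real ones in the replay buffer of a model-free learner such as SAC.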

Common Confusions

Watch Out

Model-based does not mean you need a pixel-level predictor

Early model-based RL tried to predict future observations (e.g., next video frame). This is unnecessary and wasteful. Modern methods like MuZero and Dreamer learn dynamics in a compact latent space. The model predicts latent states and rewards, not raw observations. This avoids modeling irrelevant visual details.

Watch Out

Model-free methods do have implicit models

A well-trained Q-function implicitly encodes knowledge about the dynamics. $Q(s,a)$ reflects the expected future rewards from taking action $a$ in state $s$, which requires "knowing" what states follow. The difference is that this knowledge is implicit in the value function, not available as an explicit simulator that can be queried for arbitrary state-action pairs.

Watch Out

Sample efficiency is not the only cost

Model-based methods use fewer environment interactions but more computation (model training, planning). In fast simulators, environment interactions are cheap and computation is the bottleneck. In that regime, model-free methods may be faster in wall-clock time despite using more samples.