
Reinforcement Learning

Model-Based Reinforcement Learning

Learning a model of the environment and planning with it. Dyna architecture, learned world models, planning as simulated experience, sample efficiency, and the model-error problem.


Why This Matters

Model-free RL learns a policy or value function directly from experience. Model-based RL learns a model of the environment (transition dynamics, reward function) and uses it to plan. The core tradeoff is that model-based methods are far more sample-efficient but introduce a new failure mode: model error.

In domains where real-world interaction is expensive (robotics, drug design, autonomous driving), sample efficiency is not a luxury. A robot that needs 10 million falls to learn to walk is not practical. Model-based methods can reduce the required interaction by orders of magnitude, at the cost of trusting a learned model that may be wrong.

AlphaGo and AlphaZero used a known model (the rules of Go/chess). MuZero learned its own model. Dreamer and TD-MPC learn world models from pixels. Understanding when and why model-based methods work is critical for deploying RL beyond games.

Prerequisites

This page assumes familiarity with Bellman equations and MDPs. You should understand value iteration, policy evaluation, and the Bellman contraction property.

The Model-Based Framework

Definition

Environment Model

A learned environment model consists of a learned transition function $\hat{P}(s' \mid s, a)$ and a learned reward function $\hat{R}(s, a)$. Together they define a simulated MDP $\hat{M} = (\mathcal{S}, \mathcal{A}, \hat{P}, \hat{R}, \gamma)$ that approximates the true environment $M$.

Definition

Planning

Planning is the process of using a model (known or learned) to compute or improve a policy without interacting with the real environment. Value iteration, policy iteration, and tree search on $\hat{M}$ are all forms of planning.

Definition

Simulation-Based Value Estimate

Given a model $\hat{M}$, the agent can generate simulated trajectories $(s_0, a_0, \hat{r}_0, \hat{s}_1, a_1, \hat{r}_1, \ldots)$ by sampling $\hat{s}_{t+1} \sim \hat{P}(\cdot \mid s_t, a_t)$ and $\hat{r}_t = \hat{R}(s_t, a_t)$. These simulated trajectories can be used to estimate value functions, compute policy gradients, or generate training data for a model-free algorithm.
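As a concrete sketch, rollout generation from a tabular learned model might look like this (the array layout, with `P_hat[s, a]` a probability vector over next states and `R_hat[s, a]` a scalar, is an assumption for illustration):

```python
import numpy as np

def simulate_rollout(P_hat, R_hat, s0, policy, horizon, rng):
    """Generate one simulated trajectory from a tabular learned model.

    P_hat[s, a] is a probability vector over next states; R_hat[s, a] is a scalar.
    Returns a list of (s, a, r_hat, s_next) transitions.
    """
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        r = R_hat[s, a]                                      # r_hat = R_hat(s, a)
        s_next = rng.choice(len(P_hat[s, a]), p=P_hat[s, a])  # s_hat ~ P_hat(. | s, a)
        traj.append((s, a, r, s_next))
        s = s_next
    return traj
```

The resulting transitions can be fed to any model-free learner exactly as if they were real experience.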

The Dyna Architecture

Dyna (Sutton, 1991) is the simplest and most influential model-based RL framework. It interleaves real experience, model learning, and planning in a single loop:

  1. Act: take action $a_t$ in the real environment, observe $(r_t, s_{t+1})$
  2. Learn model: update $\hat{P}$ and $\hat{R}$ using the real transition $(s_t, a_t, r_t, s_{t+1})$
  3. Direct RL: update $Q(s_t, a_t)$ using the real transition (standard Q-learning update)
  4. Planning: for $n$ steps, sample a previously visited state-action pair $(s, a)$, simulate $(\hat{r}, \hat{s}') \sim \hat{M}$, and update $Q(s, a)$ using the simulated transition

The key insight: step 4 generates additional "experience" from the model at zero real-world cost. Each planning step is computationally cheap compared to a real interaction. By increasing $n$, the agent can extract more learning from the same amount of real experience.
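The loop above can be sketched as tabular Dyna-Q. The environment interface (`step_fn` returning `(reward, next_state, done)`, `reset_fn` returning a start state) and the deterministic last-outcome model are simplifying assumptions for illustration:

```python
import random

import numpy as np

def dyna_q(step_fn, reset_fn, n_states, n_actions, n_planning=5,
           episodes=50, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Dyna-Q: act, learn a model, do direct RL, then plan."""
    rng = random.Random(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> (r, s', done): deterministic last-outcome model
    for _ in range(episodes):
        s, done = reset_fn(), False
        while not done:
            # 1. Act (epsilon-greedy) in the real environment
            a = rng.randrange(n_actions) if rng.random() < eps else int(Q[s].argmax())
            r, s2, done = step_fn(s, a)
            # 2. Learn model from the real transition
            model[(s, a)] = (r, s2, done)
            # 3. Direct RL: standard Q-learning update on the real transition
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            # 4. Planning: n_planning simulated updates from the model
            for _ in range(n_planning):
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr + (0.0 if pdone else gamma * Q[ps2].max())
                Q[ps, pa] += alpha * (ptarget - Q[ps, pa])
            s = s2
    return Q
```

Each real step triggers `n_planning` extra updates, so value information propagates backward much faster per real interaction than in pure model-free Q-learning.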

Main Theorems

Proposition

Planning Updates with a Perfect Model are Equivalent to Real Updates

Statement

If the learned model $\hat{M}$ is the true model $M$ (i.e., $\hat{P} = P$ and $\hat{R} = R$), then Q-learning updates using simulated transitions from $\hat{M}$ converge to $Q^*$ under the same conditions as Q-learning updates from real transitions. Specifically, if the step sizes satisfy the Robbins-Monro conditions and all state-action pairs are sampled infinitely often in planning, the planning-only version converges to $Q^*$.

Intuition

A perfect model generates transitions drawn from exactly the same distribution as the real environment. From the perspective of the Q-learning update rule, there is no difference between a real sample $(s, a, r, s')$ with $r = R(s,a)$ and $s' \sim P(\cdot \mid s,a)$, and a simulated sample with the same distributions. The convergence proof for Q-learning depends only on the statistical properties of the samples, not on whether they came from real or simulated interaction.

Proof Sketch

The convergence of Q-learning (Watkins and Dayan, 1992) requires: (1) bounded rewards, (2) step sizes satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and (3) all state-action pairs visited infinitely often. If $\hat{M} = M$, the simulated transitions have the correct conditional distributions, so conditions (1) and (3) are satisfied when the planning procedure samples all pairs infinitely often. The convergence proof proceeds identically to the standard Q-learning proof via the stochastic approximation theorem.

Why It Matters

This justifies Dyna's planning step: when the model is accurate, planning is free learning. In practice the model is imperfect, so the real question is how model error affects the result. But the perfect-model case establishes the baseline: with a good model, you can achieve the same asymptotic result as model-free RL while using far fewer real interactions.
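A quick numerical check of the proposition on a toy two-state MDP: drive Q-learning entirely with samples from the (here, exact) model and compare against $Q^*$ from Q-value iteration. The specific MDP, step-size schedule, and sample count are arbitrary choices for illustration:

```python
import numpy as np

gamma = 0.9
# True 2-state, 2-action MDP; P[s, a, s'] are transition probabilities.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.3, 0.7]]])
R = np.array([[0.0, 0.0], [1.0, 0.5]])

# Ground truth Q* via Q-value iteration (Bellman optimality operator).
Q_star = np.zeros((2, 2))
for _ in range(1000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# "Planning-only" learning: Q-learning driven purely by model samples.
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
N = np.zeros((2, 2))
for _ in range(200_000):
    s, a = rng.integers(2), rng.integers(2)   # all pairs sampled infinitely often
    s2 = int(rng.random() < P[s, a, 1])       # simulate s' ~ P_hat = P
    N[s, a] += 1
    alpha = N[s, a] ** -0.7                   # Robbins-Monro step sizes
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].max() - Q[s, a])
```

With the exact model, the planning-only Q-values converge to the same fixed point as value iteration, which is the proposition's claim in miniature.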

Failure Mode

When $\hat{M} \neq M$, planning updates are biased. The agent converges to $Q^*_{\hat{M}}$ (the optimal value function of the simulated MDP), not $Q^*_M$. If the model is systematically wrong (e.g., it underestimates the probability of dangerous transitions), the resulting policy can be arbitrarily bad in the real environment. Model error compounds over long planning horizons: a small per-step error $\epsilon$ in transition prediction becomes $O(H\epsilon)$ error over an $H$-step rollout.

Theorem

Value Function Error from Model Error

Statement

Let $V^*$ be the optimal value function of the true MDP $M$ and $\hat{V}^*$ be the optimal value function of the learned MDP $\hat{M}$. If $\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_{\text{TV}} \leq \epsilon$ for all $(s,a)$ and $|\hat{R}(s,a) - R(s,a)| \leq \epsilon_R$ for all $(s,a)$, then:

$$\|V^* - \hat{V}^*\|_\infty \leq \frac{\epsilon_R + 2\gamma \epsilon V_{\max}}{1 - \gamma}$$

where $V_{\max} = R_{\max} / (1 - \gamma)$ is the maximum possible value.

Intuition

Model error contributes two terms: reward prediction error ($\epsilon_R$) and transition prediction error ($\epsilon$). The transition error is multiplied by $V_{\max}$ because a wrong transition can send the agent to a state with very different value. The $1/(1 - \gamma)$ factor amplifies both errors because the Bellman equation propagates errors from the future into the present. Small per-step errors accumulate over the effective horizon $1/(1 - \gamma)$.

Proof Sketch

For any state ss:

$$|(\mathcal{T}V)(s) - (\hat{\mathcal{T}}V)(s)| \leq \max_a |R(s,a) - \hat{R}(s,a)| + \gamma \max_a \left|\sum_{s'} \left(P(s' \mid s,a) - \hat{P}(s' \mid s,a)\right) V(s')\right|$$

The first term is at most $\epsilon_R$. The second term is at most $\gamma \cdot 2\epsilon \cdot V_{\max}$ by the total variation bound. Applying the triangle inequality to $V^* = \mathcal{T}V^*$ and $\hat{V}^* = \hat{\mathcal{T}}\hat{V}^*$ and iterating gives the result.

Why It Matters

This quantifies the price of model error. It shows that model-based RL is most dangerous when $\gamma$ is close to 1 (long horizons) and the value range is large. It also explains why short-horizon model-based planning (MPC with small lookahead) is often more robust than long-horizon planning: the error amplification is controlled by the planning horizon rather than $1/(1 - \gamma)$.
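Plugging numbers into the bound makes the horizon sensitivity concrete (the helper function and the example error values are illustrative):

```python
def value_error_bound(eps_R, eps, gamma, R_max):
    """Worst-case sup-norm gap between V* and the model's optimal value."""
    V_max = R_max / (1 - gamma)
    return (eps_R + 2 * gamma * eps * V_max) / (1 - gamma)

# Same per-step model error, two discount factors:
near = value_error_bound(0.01, 0.02, 0.90, 1.0)   # about 3.7
far = value_error_bound(0.01, 0.02, 0.99, 1.0)    # about 397
```

Because $V_{\max}$ itself carries a $1/(1-\gamma)$ factor, the transition-error term scales as $1/(1-\gamma)^2$: moving $\gamma$ from 0.9 to 0.99 inflates the worst case by roughly 100x.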

Failure Mode

The bound is worst-case and can be loose. In practice, model errors may be concentrated in rarely visited states and have little effect on the policy. The bound also assumes the same $\epsilon$ for all state-action pairs; models are typically more accurate in well-visited regions.

Model Learning

The model $\hat{M}$ is typically learned by supervised regression:

Transition model: predict $s_{t+1}$ from $(s_t, a_t)$. For discrete states, this is a classification problem. For continuous states, common approaches include:

  • Gaussian models: predict $\hat{s}_{t+1} = f_\theta(s_t, a_t) + \epsilon$ with learned mean and variance
  • Ensemble models: train $K$ independent models, use disagreement as uncertainty (PETS, Chua et al. 2018)
  • Latent dynamics: learn a latent state $z_t$ and predict transitions in latent space (Dreamer, Hafner et al. 2020)

Reward model: predict $r_t$ from $(s_t, a_t)$. Usually simpler than the transition model.

The critical design choice is how to handle model uncertainty. An overconfident model can lead the agent into regions where the model is wrong. Ensemble disagreement, Bayesian neural networks, and explicit epistemic uncertainty estimation all attempt to address this.
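A minimal sketch of the ensemble idea on toy 1-D dynamics, using bootstrapped linear models in place of the probabilistic neural networks of PETS (the data-generating process, model class, and ensemble size here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dynamics s' = s + a + noise, observed only for s in [-1, 1].
S = rng.uniform(-1, 1, size=(200, 1))
A = rng.uniform(-0.1, 0.1, size=(200, 1))
Y = S + A + 0.01 * rng.normal(size=(200, 1))

# Ensemble of K linear models, each fit on a bootstrap resample.
K, models = 5, []
for _ in range(K):
    idx = rng.integers(0, len(S), size=len(S))
    X = np.hstack([S[idx], A[idx], np.ones((len(S), 1))])
    w, *_ = np.linalg.lstsq(X, Y[idx], rcond=None)
    models.append(w)

def predict_with_uncertainty(s, a):
    """Mean prediction plus ensemble disagreement (epistemic uncertainty proxy)."""
    x = np.array([s, a, 1.0])
    preds = np.array([x @ w for w in models]).ravel()
    return preds.mean(), preds.std()
```

Inside the training region the members agree; far outside it their extrapolations diverge, and the standard deviation flags the prediction as untrustworthy, which a planner can use to avoid exploiting model errors.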

Model Predictive Control (MPC)

Instead of solving the learned MDP fully, Model Predictive Control plans only a short horizon ahead:

  1. From the current state $s_t$, simulate multiple action sequences of length $H$ using $\hat{M}$
  2. Evaluate each sequence by the cumulative predicted reward (plus a terminal value estimate)
  3. Execute only the first action of the best sequence
  4. Re-plan at the next step

MPC is robust to model error because it replans at every step and commits to only one action. Model error can accumulate for at most $H$ steps before the agent replans, never over the full task horizon. The cost: MPC is computationally expensive at inference time, requiring many forward simulations per action.
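Steps 1-4 can be sketched as random-shooting MPC over discrete actions (the interface names and the toy setup are assumptions; practical implementations use smarter samplers such as CEM and batched rollouts):

```python
import numpy as np

def mpc_random_shooting(model_step, reward_fn, s0, horizon=10, n_seq=200,
                        n_actions=2, gamma=0.99, terminal_value=None, rng=None):
    """Random-shooting MPC: sample action sequences, roll them out in the
    learned model, and return only the first action of the best sequence."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_ret, best_a0 = -np.inf, 0
    for _ in range(n_seq):
        actions = rng.integers(0, n_actions, size=horizon)
        s, ret = s0, 0.0
        for t, a in enumerate(actions):
            ret += gamma ** t * reward_fn(s, a)   # predicted reward along rollout
            s = model_step(s, a)                  # step the learned model
        if terminal_value is not None:
            ret += gamma ** horizon * terminal_value(s)  # bootstrap beyond H
        if ret > best_ret:
            best_ret, best_a0 = ret, int(actions[0])
    return best_a0
```

Only `best_a0` is executed; the remaining $H-1$ actions are discarded and the whole procedure runs again from the next real state.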

AlphaZero and MuZero

AlphaZero (Silver et al., 2018) combines a known model (game rules) with learned value and policy networks. Monte Carlo Tree Search (MCTS) uses the model for planning, guided by the neural network evaluations. The result: superhuman play in Go, chess, and shogi.

MuZero (Schrittwieser et al., 2020) extends this to settings without a known model. It learns three components:

  • Representation function $h$: encodes an observation into a latent state
  • Dynamics function $g$: predicts the next latent state and reward given a latent state and action
  • Prediction function $f$: predicts the value and policy from a latent state

MuZero plans in latent space, never reconstructing observations. It achieves superhuman performance in Atari (no known model) while matching AlphaZero in board games.
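A shape-level sketch of how the three functions compose during planning. The toy linear/tanh parameterizations are stand-ins, not MuZero's actual networks (which are deep and trained end-to-end), and real MuZero searches with MCTS rather than scoring fixed action sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for MuZero's three learned functions (shapes only, untrained).
W_h = 0.1 * rng.normal(size=(4, 8))   # representation h: obs (8,) -> latent (4,)
W_g = 0.1 * rng.normal(size=(4, 5))   # dynamics g: [latent; action] -> next latent
w_r = 0.1 * rng.normal(size=5)        # reward head (part of g)
w_v = 0.1 * rng.normal(size=4)        # value head (part of prediction f)

def h(obs):
    return np.tanh(W_h @ obs)

def g(z, a):
    x = np.append(z, a)
    return np.tanh(W_g @ x), w_r @ x

def f(z):
    return w_v @ z

def evaluate_sequence(obs, actions, gamma=0.997):
    """Score an action sequence by unrolling latent dynamics; observations
    are never reconstructed after the initial encoding."""
    z, ret = h(obs), 0.0
    for t, a in enumerate(actions):
        z, r_hat = g(z, a)
        ret += gamma ** t * r_hat
    return ret + gamma ** len(actions) * f(z)
```

The key structural point survives the simplification: after `h` encodes the root observation once, all planning arithmetic happens on the small latent vector `z`.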

Common Confusions

Watch Out

Model-based does not mean the model must be perfect

A common objection is that learned models are always wrong, so model-based RL is doomed. In practice, the model only needs to be accurate enough in the regions that matter for the current policy. Short-horizon planning (MPC) and model-free corrections (Dyna's direct RL step) can compensate for model error. The question is not "is the model perfect?" but "is the model useful?"

Watch Out

Planning horizon vs discount factor

The planning horizon $H$ in MPC is not the same as the effective horizon $1/(1 - \gamma)$. MPC can use a short horizon $H = 10$ even when $\gamma = 0.99$ (effective horizon 100). The value estimate at the end of the $H$-step rollout compensates for the truncated horizon. Short $H$ limits model error accumulation at the cost of relying on the terminal value estimate.

Watch Out

MuZero does not learn a pixel-level model

MuZero's dynamics function operates in a learned latent space, not in observation space. It does not predict future frames. This is a deliberate choice: predicting every pixel is wasteful if you only need to know the value and best action. The latent dynamics only need to capture information relevant for planning, not for reconstruction.

Key Takeaways

  • Model-based RL trades sample efficiency for computational cost and model error risk
  • Dyna interleaves real experience, model learning, and planning in a single loop
  • With a perfect model, planning updates are equivalent to real updates
  • Model error compounds over long planning horizons: $O(\epsilon / (1 - \gamma))$ value error for per-step error $\epsilon$
  • MPC limits error accumulation by replanning at every step with a short horizon
  • MuZero plans in learned latent space, avoiding the need for pixel-level predictions
  • Ensemble disagreement and uncertainty-aware planning mitigate overconfident models

Exercises

ExerciseCore

Problem

In Dyna-Q with $n = 5$ planning steps per real step, the agent takes 1000 real steps. How many total Q-learning updates does the agent perform (counting both real and planning updates)? How does this compare to pure model-free Q-learning over 1000 steps?

ExerciseCore

Problem

A learned model has per-step transition error $\epsilon = 0.05$ (total variation) with $\gamma = 0.9$ and $R_{\max} = 1$. Using the model error bound, what is the worst-case difference between $V^*$ and $\hat{V}^*$? What happens if $\gamma = 0.99$?

ExerciseAdvanced

Problem

In MPC with planning horizon $H$ and discount factor $\gamma$, the agent evaluates action sequences by $\sum_{t=0}^{H-1} \gamma^t \hat{r}_t + \gamma^H V_\theta(s_H)$. Explain why increasing $H$ reduces reliance on $V_\theta$ but increases exposure to model error. What is the optimal $H$ if the model has per-step error $\epsilon$ and the value function has error $\epsilon_V$?

References

Canonical:

  • Sutton, "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming" (Dyna, 1991)
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 8

AlphaZero/MuZero:

  • Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go" (AlphaZero, 2018)
  • Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (MuZero, 2020)

Modern World Models:

  • Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer, 2020)
  • Chua et al., "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models" (PETS, 2018)


Last reviewed: April 2026
