
Reinforcement Learning

Model-Based Reinforcement Learning

Learning a model of the environment and planning with it. Dyna architecture, learned world models, planning as simulated experience, sample efficiency, and the model-error problem.


Why This Matters

Model-free RL learns a policy or value function directly from experience. Model-based RL learns a model of the environment (transition dynamics, reward function) and uses it to plan. The core tradeoff is that model-based methods are far more sample-efficient but introduce a new failure mode: model error.

In domains where real-world interaction is expensive (robotics, drug design, autonomous driving), sample efficiency is not a luxury. A robot that needs 10 million falls to learn to walk is not practical. Model-based methods can reduce the required interaction by orders of magnitude, at the cost of trusting a learned model that may be wrong.

AlphaGo and AlphaZero used a known model (the rules of Go/chess). MuZero learned its own model. Dreamer and TD-MPC learn world models from pixels. Understanding when and why model-based methods work is critical for deploying RL beyond games.

Prerequisites

This page assumes familiarity with Bellman equations and MDPs. You should understand value iteration, policy evaluation, and the Bellman contraction property.

The Model-Based Framework

Definition

Environment Model

A learned environment model consists of a learned transition function $\hat{P}(s' \mid s, a)$ and a learned reward function $\hat{R}(s, a)$. Together they define a simulated MDP $\hat{M} = (\mathcal{S}, \mathcal{A}, \hat{P}, \hat{R}, \gamma)$ that approximates the true environment $M$.

Definition

Planning

Planning is the process of using a model (known or learned) to compute or improve a policy without interacting with the real environment. Value iteration, policy iteration, and tree search on $\hat{M}$ are all forms of planning.

Definition

Simulation-Based Value Estimate

Given a model $\hat{M}$, the agent can generate simulated trajectories $(s_0, a_0, \hat{r}_0, \hat{s}_1, a_1, \hat{r}_1, \ldots)$ by sampling $\hat{s}_{t+1} \sim \hat{P}(\cdot \mid s_t, a_t)$ and $\hat{r}_t = \hat{R}(s_t, a_t)$. These simulated trajectories can be used to estimate value functions, compute policy gradients, or generate training data for a model-free algorithm.
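As a concrete sketch, rollout generation from a tabular learned model might look like this (the array layout, with `P_hat[s, a]` a probability vector over next states and `R_hat[s, a]` a scalar, is an assumption for illustration):

```python
import numpy as np

def simulate_rollout(P_hat, R_hat, s0, policy, horizon, rng):
    """Generate one simulated trajectory from a tabular learned model.

    P_hat[s, a] is a probability vector over next states; R_hat[s, a] is a scalar.
    Returns a list of (s, a, r_hat, s_next) transitions.
    """
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        r = R_hat[s, a]                                      # r_hat = R_hat(s, a)
        s_next = rng.choice(len(P_hat[s, a]), p=P_hat[s, a])  # s_hat ~ P_hat(. | s, a)
        traj.append((s, a, r, s_next))
        s = s_next
    return traj
```

The resulting transitions can be fed to any model-free learner exactly as if they were real experience.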

The Dyna Architecture

Dyna (Sutton, 1991) is the simplest and most influential model-based RL framework. It interleaves real experience, model learning, and planning in a single loop:

  1. Act: take action $a_t$ in the real environment, observe $(r_t, s_{t+1})$
  2. Learn model: update $\hat{P}$ and $\hat{R}$ using the real transition $(s_t, a_t, r_t, s_{t+1})$
  3. Direct RL: update $Q(s_t, a_t)$ using the real transition (standard Q-learning update)
  4. Planning: for $n$ steps, sample a previously visited state-action pair $(s, a)$, simulate $(\hat{r}, \hat{s}') \sim \hat{M}$, and update $Q(s, a)$ using the simulated transition

The key insight: step 4 generates additional "experience" from the model at zero real-world cost. Each planning step is computationally cheap compared to a real interaction. By increasing $n$, the agent can extract more learning from the same amount of real experience.
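The loop above can be sketched as tabular Dyna-Q. The environment interface (`step_fn` returning `(reward, next_state, done)`, `reset_fn` returning a start state) and the deterministic last-outcome model are simplifying assumptions for illustration:

```python
import random

import numpy as np

def dyna_q(step_fn, reset_fn, n_states, n_actions, n_planning=5,
           episodes=50, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Dyna-Q: act, learn a model, do direct RL, then plan."""
    rng = random.Random(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> (r, s', done): deterministic last-outcome model
    for _ in range(episodes):
        s, done = reset_fn(), False
        while not done:
            # 1. Act (epsilon-greedy) in the real environment
            a = rng.randrange(n_actions) if rng.random() < eps else int(Q[s].argmax())
            r, s2, done = step_fn(s, a)
            # 2. Learn model from the real transition
            model[(s, a)] = (r, s2, done)
            # 3. Direct RL: standard Q-learning update on the real transition
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            # 4. Planning: n_planning simulated updates from the model
            for _ in range(n_planning):
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr + (0.0 if pdone else gamma * Q[ps2].max())
                Q[ps, pa] += alpha * (ptarget - Q[ps, pa])
            s = s2
    return Q
```

Each real step triggers `n_planning` extra updates, so value information propagates backward much faster per real interaction than in pure model-free Q-learning.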

Main Theorems

Proposition

Planning Updates with a Perfect Model are Equivalent to Real Updates

Statement

If the learned model $\hat{M}$ is the true model $M$ (i.e., $\hat{P} = P$ and $\hat{R} = R$), then Q-learning updates using simulated transitions from $\hat{M}$ converge to $Q^*$ under the same conditions as Q-learning updates from real transitions. Specifically, if the step sizes satisfy the Robbins-Monro conditions and all state-action pairs are sampled infinitely often in planning, the planning-only version converges to $Q^*$.

Intuition

A perfect model generates transitions drawn from exactly the same distribution as the real environment. From the perspective of the Q-learning update rule, there is no difference between a real sample $(s, a, r, s')$ with $r = R(s,a)$ and $s' \sim P(\cdot \mid s,a)$, and a simulated sample with the same distributions. The convergence proof for Q-learning depends only on the statistical properties of the samples, not on whether they came from real or simulated interaction.

Proof Sketch

The convergence of Q-learning (Watkins and Dayan, 1992) requires: (1) bounded rewards, (2) step sizes satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and (3) all state-action pairs visited infinitely often. If $\hat{M} = M$, the simulated transitions have the correct conditional distributions, so conditions (1) and (3) are satisfied when the planning procedure samples all pairs infinitely often. The convergence proof proceeds identically to the standard Q-learning proof via the stochastic approximation theorem.

Why It Matters

This justifies Dyna's planning step: when the model is accurate, planning is free learning. In practice the model is imperfect, so the real question is how model error affects the result. But the perfect-model case establishes the baseline: with a good model, you can achieve the same asymptotic result as model-free RL while using far fewer real interactions.
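A quick numerical check of the proposition on a toy two-state MDP: drive Q-learning entirely with samples from the (here, exact) model and compare against $Q^*$ from Q-value iteration. The specific MDP, step-size schedule, and sample count are arbitrary choices for illustration:

```python
import numpy as np

gamma = 0.9
# True 2-state, 2-action MDP; P[s, a, s'] are transition probabilities.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.3, 0.7]]])
R = np.array([[0.0, 0.0], [1.0, 0.5]])

# Ground truth Q* via Q-value iteration (Bellman optimality operator).
Q_star = np.zeros((2, 2))
for _ in range(1000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# "Planning-only" learning: Q-learning driven purely by model samples.
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
N = np.zeros((2, 2))
for _ in range(200_000):
    s, a = rng.integers(2), rng.integers(2)   # all pairs sampled infinitely often
    s2 = int(rng.random() < P[s, a, 1])       # simulate s' ~ P_hat = P
    N[s, a] += 1
    alpha = N[s, a] ** -0.7                   # Robbins-Monro step sizes
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].max() - Q[s, a])
```

With the exact model, the planning-only Q-values converge to the same fixed point as value iteration, which is the proposition's claim in miniature.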

Failure Mode

When $\hat{M} \neq M$, planning updates are biased. The agent converges to $Q^*_{\hat{M}}$ (the optimal value function of the simulated MDP), not $Q^*_M$. If the model is systematically wrong (e.g., it underestimates the probability of dangerous transitions), the resulting policy can be arbitrarily bad in the real environment. Model error compounds over long planning horizons: a small per-step error $\epsilon$ in transition prediction becomes $O(H\epsilon)$ error over an $H$-step rollout.

Theorem

Value Function Error from Model Error

Statement

Let $V^*$ be the optimal value function of the true MDP $M$ and $\hat{V}^*$ be the optimal value function of the learned MDP $\hat{M}$. If $\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_{\text{TV}} \leq \epsilon$ for all $(s,a)$ and $|\hat{R}(s,a) - R(s,a)| \leq \epsilon_R$ for all $(s,a)$, then:

$$\|V^* - \hat{V}^*\|_\infty \leq \frac{\epsilon_R + 2\gamma \epsilon V_{\max}}{1 - \gamma}$$

where $V_{\max} = R_{\max} / (1 - \gamma)$ is the maximum possible value.

Intuition

Model error contributes two terms: reward prediction error ($\epsilon_R$) and transition prediction error ($\epsilon$). The transition error is multiplied by $V_{\max}$ because a wrong transition can send the agent to a state with very different value. The $1/(1 - \gamma)$ factor amplifies both errors because the Bellman equation propagates errors from the future into the present. Small per-step errors accumulate over the effective horizon $1/(1 - \gamma)$.

Proof Sketch

For any state ss:

$$|(\mathcal{T}V)(s) - (\hat{\mathcal{T}}V)(s)| \leq \max_a |R(s,a) - \hat{R}(s,a)| + \gamma \max_a \left|\sum_{s'} \left(P(s' \mid s,a) - \hat{P}(s' \mid s,a)\right) V(s')\right|$$

The first term is at most $\epsilon_R$. The second term is at most $\gamma \cdot 2\epsilon \cdot V_{\max}$ by the total variation bound. Applying the triangle inequality to $V^* = \mathcal{T}V^*$ and $\hat{V}^* = \hat{\mathcal{T}}\hat{V}^*$ and iterating gives the result.

Why It Matters

This quantifies the price of model error. It shows that model-based RL is most dangerous when $\gamma$ is close to 1 (long horizons) and the value range is large. It also explains why short-horizon model-based planning (MPC with small lookahead) is often more robust than long-horizon planning: the error amplification is controlled by the planning horizon rather than $1/(1 - \gamma)$.
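Plugging numbers into the bound makes the horizon sensitivity concrete (the helper function and the example error values are illustrative):

```python
def value_error_bound(eps_R, eps, gamma, R_max):
    """Worst-case sup-norm gap between V* and the model's optimal value."""
    V_max = R_max / (1 - gamma)
    return (eps_R + 2 * gamma * eps * V_max) / (1 - gamma)

# Same per-step model error, two discount factors:
near = value_error_bound(0.01, 0.02, 0.90, 1.0)   # about 3.7
far = value_error_bound(0.01, 0.02, 0.99, 1.0)    # about 397
```

Because $V_{\max}$ itself carries a $1/(1-\gamma)$ factor, the transition-error term scales as $1/(1-\gamma)^2$: moving $\gamma$ from 0.9 to 0.99 inflates the worst case by roughly 100x.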

Failure Mode

The bound is worst-case and can be loose. In practice, model errors may be concentrated in rarely visited states and have little effect on the policy. The bound also assumes the same $\epsilon$ for all state-action pairs; models are typically more accurate in well-visited regions.

Model Learning

The model $\hat{M}$ is typically learned by supervised regression:

Transition model: predict $s_{t+1}$ from $(s_t, a_t)$. For discrete states, this is a classification problem. For continuous states, common approaches include:

  • Gaussian models: predict $\hat{s}_{t+1} = f_\theta(s_t, a_t) + \epsilon$ with learned mean and variance
  • Ensemble models: train $K$ independent models, use disagreement as uncertainty (PETS, Chua et al. 2018)
  • Latent dynamics: learn a latent state $z_t$ and predict transitions in latent space (Dreamer, Hafner et al. 2020)

Reward model: predict $r_t$ from $(s_t, a_t)$. Usually simpler than the transition model.

The critical design choice is how to handle model uncertainty. An overconfident model can lead the agent into regions where the model is wrong. Ensemble disagreement, Bayesian neural networks, and explicit epistemic uncertainty estimation all attempt to address this.
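A minimal sketch of the ensemble idea on toy 1-D dynamics, using bootstrapped linear models in place of the probabilistic neural networks of PETS (the data-generating process, model class, and ensemble size here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dynamics s' = s + a + noise, observed only for s in [-1, 1].
S = rng.uniform(-1, 1, size=(200, 1))
A = rng.uniform(-0.1, 0.1, size=(200, 1))
Y = S + A + 0.01 * rng.normal(size=(200, 1))

# Ensemble of K linear models, each fit on a bootstrap resample.
K, models = 5, []
for _ in range(K):
    idx = rng.integers(0, len(S), size=len(S))
    X = np.hstack([S[idx], A[idx], np.ones((len(S), 1))])
    w, *_ = np.linalg.lstsq(X, Y[idx], rcond=None)
    models.append(w)

def predict_with_uncertainty(s, a):
    """Mean prediction plus ensemble disagreement (epistemic uncertainty proxy)."""
    x = np.array([s, a, 1.0])
    preds = np.array([x @ w for w in models]).ravel()
    return preds.mean(), preds.std()
```

Inside the training region the members agree; far outside it their extrapolations diverge, and the standard deviation flags the prediction as untrustworthy, which a planner can use to avoid exploiting model errors.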

Model Predictive Control (MPC)

Instead of solving the learned MDP fully, Model Predictive Control plans only a short horizon ahead:

  1. From the current state $s_t$, simulate multiple action sequences of length $H$ using $\hat{M}$
  2. Evaluate each sequence by the cumulative predicted reward (plus a terminal value estimate)
  3. Execute only the first action of the best sequence
  4. Re-plan at the next step

MPC is robust to model error because it replans at every step and commits to only one action. Model error can accumulate for at most $H$ steps before the agent replans, never over the full task horizon. The cost: MPC is computationally expensive at inference time, requiring many forward simulations per action.
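Steps 1-4 can be sketched as random-shooting MPC over discrete actions (the interface names and the toy setup are assumptions; practical implementations use smarter samplers such as CEM and batched rollouts):

```python
import numpy as np

def mpc_random_shooting(model_step, reward_fn, s0, horizon=10, n_seq=200,
                        n_actions=2, gamma=0.99, terminal_value=None, rng=None):
    """Random-shooting MPC: sample action sequences, roll them out in the
    learned model, and return only the first action of the best sequence."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_ret, best_a0 = -np.inf, 0
    for _ in range(n_seq):
        actions = rng.integers(0, n_actions, size=horizon)
        s, ret = s0, 0.0
        for t, a in enumerate(actions):
            ret += gamma ** t * reward_fn(s, a)   # predicted reward along rollout
            s = model_step(s, a)                  # step the learned model
        if terminal_value is not None:
            ret += gamma ** horizon * terminal_value(s)  # bootstrap beyond H
        if ret > best_ret:
            best_ret, best_a0 = ret, int(actions[0])
    return best_a0
```

Only `best_a0` is executed; the remaining $H-1$ actions are discarded and the whole procedure runs again from the next real state.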

AlphaZero and MuZero

AlphaZero (Silver et al., 2018) combines a known model (game rules) with learned value and policy networks. Monte Carlo Tree Search (MCTS) uses the model for planning, guided by the neural network evaluations. The result: superhuman play in Go, chess, and shogi.

MuZero (Schrittwieser et al., 2020) extends this to settings without a known model. It learns three components:

  • Representation function $h$: encodes an observation into a latent state
  • Dynamics function $g$: predicts the next latent state and reward given a latent state and action
  • Prediction function $f$: predicts the value and policy from a latent state

MuZero plans in latent space, never reconstructing observations. It achieves superhuman performance in Atari (no known model) while matching AlphaZero in board games.
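A shape-level sketch of how the three functions compose during planning. The toy linear/tanh parameterizations are stand-ins, not MuZero's actual networks (which are deep and trained end-to-end), and real MuZero searches with MCTS rather than scoring fixed action sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for MuZero's three learned functions (shapes only, untrained).
W_h = 0.1 * rng.normal(size=(4, 8))   # representation h: obs (8,) -> latent (4,)
W_g = 0.1 * rng.normal(size=(4, 5))   # dynamics g: [latent; action] -> next latent
w_r = 0.1 * rng.normal(size=5)        # reward head (part of g)
w_v = 0.1 * rng.normal(size=4)        # value head (part of prediction f)

def h(obs):
    return np.tanh(W_h @ obs)

def g(z, a):
    x = np.append(z, a)
    return np.tanh(W_g @ x), w_r @ x

def f(z):
    return w_v @ z

def evaluate_sequence(obs, actions, gamma=0.997):
    """Score an action sequence by unrolling latent dynamics; observations
    are never reconstructed after the initial encoding."""
    z, ret = h(obs), 0.0
    for t, a in enumerate(actions):
        z, r_hat = g(z, a)
        ret += gamma ** t * r_hat
    return ret + gamma ** len(actions) * f(z)
```

The key structural point survives the simplification: after `h` encodes the root observation once, all planning arithmetic happens on the small latent vector `z`.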

Common Confusions

Watch Out

Model-based does not mean the model must be perfect

A common objection is that learned models are always wrong, so model-based RL is doomed. In practice, the model only needs to be accurate enough in the regions that matter for the current policy. Short-horizon planning (MPC) and model-free corrections (Dyna's direct RL step) can compensate for model error. The question is not "is the model perfect?" but "is the model useful?"

Watch Out

Planning horizon vs discount factor

The planning horizon $H$ in MPC is not the same as the effective horizon $1/(1 - \gamma)$. MPC can use a short horizon $H = 10$ even when $\gamma = 0.99$ (effective horizon 100). The value estimate at the end of the $H$-step rollout compensates for the truncated horizon. Short $H$ limits model error accumulation at the cost of relying on the terminal value estimate.

Watch Out

MuZero does not learn a pixel-level model

MuZero's dynamics function operates in a learned latent space, not in observation space. It does not predict future frames. This is a deliberate choice: predicting every pixel is wasteful if you only need to know the value and best action. The latent dynamics only need to capture information relevant for planning, not for reconstruction.

Key Takeaways

  • Model-based RL trades sample efficiency for computational cost and model error risk
  • Dyna interleaves real experience, model learning, and planning in a single loop
  • With a perfect model, planning updates are equivalent to real updates
  • Model error compounds over long planning horizons: $O(\epsilon / (1 - \gamma))$ value error for per-step error $\epsilon$
  • MPC limits error accumulation by replanning at every step with a short horizon
  • MuZero plans in learned latent space, avoiding the need for pixel-level predictions
  • Ensemble disagreement and uncertainty-aware planning mitigate overconfident models

Exercises

ExerciseCore

Problem

In Dyna-Q with $n = 5$ planning steps per real step, the agent takes 1000 real steps. How many total Q-learning updates does the agent perform (counting both real and planning updates)? How does this compare to pure model-free Q-learning over 1000 steps?

ExerciseCore

Problem

A learned model has per-step transition error $\epsilon = 0.05$ (total variation) with $\gamma = 0.9$ and $R_{\max} = 1$. Using the model error bound, what is the worst-case difference between $V^*$ and $\hat{V}^*$? What happens if $\gamma = 0.99$?

ExerciseAdvanced

Problem

In MPC with planning horizon $H$ and discount factor $\gamma$, the agent evaluates action sequences by $\sum_{t=0}^{H-1} \gamma^t \hat{r}_t + \gamma^H V_\theta(s_H)$. Explain why increasing $H$ reduces reliance on $V_\theta$ but increases exposure to model error. What is the optimal $H$ if the model has per-step error $\epsilon$ and the value function has error $\epsilon_V$?

References

Canonical:

  • Sutton, "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming" (Dyna, 1991)
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 8

AlphaZero/MuZero:

  • Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go" (AlphaZero, 2018)
  • Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (MuZero, 2020)

Modern World Models:

  • Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer, 2020)
  • Chua et al., "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models" (PETS, 2018)


Last reviewed: April 2026
