Reinforcement Learning
Model-Based Reinforcement Learning
Learning a model of the environment and planning with it. Dyna architecture, learned world models, planning as simulated experience, sample efficiency, and the model-error problem.
Why This Matters
Model-free RL learns a policy or value function directly from experience. Model-based RL learns a model of the environment (transition dynamics and reward function) and uses it to plan. The core tradeoff: model-based methods are far more sample-efficient, but they introduce a new failure mode: model error.
In domains where real-world interaction is expensive (robotics, drug design, autonomous driving), sample efficiency is not a luxury. A robot that needs 10 million falls to learn to walk is not practical. Model-based methods can reduce the required interaction by orders of magnitude, at the cost of trusting a learned model that may be wrong.
AlphaGo and AlphaZero used a known model (the rules of Go/chess). MuZero learned its own model. Dreamer and TD-MPC learn world models from pixels. Understanding when and why model-based methods work is critical for deploying RL beyond games.
Prerequisites
This page assumes familiarity with Bellman equations and MDPs. You should understand value iteration, policy evaluation, and the Bellman contraction property.
The Model-Based Framework
Environment Model
A learned environment model consists of a learned transition function $\hat{P}(s' \mid s, a)$ and a learned reward function $\hat{R}(s, a)$. Together they define a simulated MDP $\hat{M} = (\mathcal{S}, \mathcal{A}, \hat{P}, \hat{R}, \gamma)$ that approximates the true environment $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
Planning
Planning is the process of using a model (known or learned) to compute or improve a policy without interacting with the real environment. Value iteration, policy iteration, and tree search on $\hat{M}$ are all forms of planning.
Simulation-Based Value Estimate
Given a model $(\hat{P}, \hat{R})$, the agent can generate simulated trajectories by sampling $s' \sim \hat{P}(\cdot \mid s, a)$ and $r = \hat{R}(s, a)$. These simulated trajectories can be used to estimate value functions, compute policy gradients, or generate training data for a model-free algorithm.
The Dyna Architecture
Dyna (Sutton, 1991) is the simplest and most influential model-based RL framework. It interleaves real experience, model learning, and planning in a single loop:
1. Act: take action $a$ in the real environment from state $s$, observe $(r, s')$
2. Learn model: update $\hat{P}$ and $\hat{R}$ using the real transition $(s, a, r, s')$
3. Direct RL: update $Q(s, a)$ using the real transition (standard Q-learning update)
4. Planning: for $n$ steps, sample a previously visited state-action pair $(\tilde{s}, \tilde{a})$, simulate $\tilde{r} = \hat{R}(\tilde{s}, \tilde{a})$ and $\tilde{s}' \sim \hat{P}(\cdot \mid \tilde{s}, \tilde{a})$, and update $Q(\tilde{s}, \tilde{a})$ using the simulated transition

The key insight: step 4 generates additional "experience" from the model at zero real-world cost. Each planning step is computationally cheap compared to a real interaction. By increasing $n$, the agent can extract more learning from the same amount of real experience.
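The loop above can be sketched in tabular form. This is a minimal Dyna-Q, assuming a deterministic environment so the last observed transition per state-action pair is a sufficient model; `env_step` is a hypothetical environment function, not from the source.

```python
import random
from collections import defaultdict

def dyna_q(env_step, actions, n_real_steps=1000, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, start_state=0):
    """Tabular Dyna-Q. Assumes a deterministic environment, so storing the
    last observed (r, s') for each (s, a) is a sufficient model."""
    Q = defaultdict(float)          # Q[(s, a)]
    model = {}                      # model[(s, a)] = (r, s')
    s = start_state
    for _ in range(n_real_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        r, s2 = env_step(s, a)                          # 1. act in the real environment
        model[(s, a)] = (r, s2)                         # 2. learn the model
        target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])       # 3. direct RL update
        for _ in range(n_planning):                     # 4. planning: n simulated updates
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            ptarget = pr + gamma * max(Q[(ps2, a_)] for a_ in actions)
            Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
        s = s2
    return Q
```

On a small deterministic chain with a goal reward, the planning updates let the reward propagate back through the Q-table far faster than the real steps alone would.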
Main Theorems
Planning Updates with a Perfect Model are Equivalent to Real Updates
Statement
If the learned model is the true model (i.e., $\hat{P} = P$ and $\hat{R} = R$), then Q-learning updates using simulated transitions from $\hat{M}$ converge to $Q^*$ under the same conditions as Q-learning updates from real transitions. Specifically, if the step sizes satisfy the Robbins-Monro conditions and all state-action pairs are sampled infinitely often in planning, the planning-only version converges to $Q^*$.
Intuition
A perfect model generates transitions drawn from exactly the same distribution as the real environment. From the perspective of the Q-learning update rule, there is no difference between a real sample with $s' \sim P(\cdot \mid s, a)$ and $r = R(s, a)$, and a simulated sample with the same distributions. The convergence proof for Q-learning depends only on the statistical properties of the samples, not on whether they came from real or simulated interaction.
Proof Sketch
The convergence of Q-learning (Watkins and Dayan, 1992) requires: (1) bounded rewards, (2) step sizes satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and (3) all state-action pairs visited infinitely often. If $\hat{P} = P$ and $\hat{R} = R$, the simulated transitions have the correct conditional distributions, so conditions (1) and (3) are satisfied when the planning procedure samples all pairs infinitely often. The convergence proof proceeds identically to the standard Q-learning proof via the stochastic approximation theorem.
Why It Matters
This justifies Dyna's planning step: when the model is accurate, planning is free learning. In practice the model is imperfect, so the real question is how model error affects the result. But the perfect-model case establishes the baseline: with a good model, you can achieve the same asymptotic result as model-free RL while using far fewer real interactions.
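The perfect-model case can be checked numerically: on a small random tabular MDP, Q-learning driven purely by samples drawn from the true model converges to the same $Q^*$ that value iteration computes. A sketch under those assumptions (function names are illustrative, not from the source):

```python
import numpy as np

def q_star_value_iteration(P, R, gamma, iters=1000):
    """Q* by value iteration on a tabular MDP.
    P[s, a, s'] = transition probabilities, R[s, a] = expected reward."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)     # (nS, nA, nS) @ (nS,) -> (nS, nA)
    return Q

def q_learning_from_model(P, R, gamma, steps=200_000, seed=0):
    """Planning-only Q-learning: every sample is drawn from the model.
    With the true P and R, this targets the same Q*."""
    rng = np.random.default_rng(seed)
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    counts = np.zeros((nS, nA))
    for _ in range(steps):
        s, a = rng.integers(nS), rng.integers(nA)   # all pairs sampled infinitely often
        s2 = rng.choice(nS, p=P[s, a])              # simulate s' ~ P(. | s, a)
        counts[s, a] += 1
        alpha = counts[s, a] ** -0.6                # Robbins-Monro step sizes
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].max() - Q[s, a])
    return Q
```

Running both on the same random MDP, the simulated-sample Q-table approaches the value-iteration $Q^*$ as the number of planning updates grows.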
Failure Mode
When $\hat{M} \neq M$, planning updates are biased. The agent converges to $Q^*_{\hat{M}}$ (the optimal value function of the simulated MDP), not $Q^*_M$. If the model is systematically wrong (e.g., it underestimates the probability of dangerous transitions), the resulting policy can be arbitrarily bad in the real environment. Model error compounds over long planning horizons: a small per-step error $\epsilon$ in transition prediction becomes $O(H\epsilon)$ error over an $H$-step rollout.
Value Function Error from Model Error
Statement
Let $V^*_M$ be the optimal value function of the true MDP $M$ and $V^*_{\hat{M}}$ be the optimal value function of the learned MDP $\hat{M}$. If $|R(s, a) - \hat{R}(s, a)| \le \epsilon_R$ for all $(s, a)$ and $\|P(\cdot \mid s, a) - \hat{P}(\cdot \mid s, a)\|_1 \le \epsilon_P$ for all $(s, a)$, then:

$$\|V^*_M - V^*_{\hat{M}}\|_\infty \le \frac{\epsilon_R + \gamma\,\epsilon_P\,V_{\max}}{1 - \gamma}$$

where $V_{\max} = R_{\max} / (1 - \gamma)$ is the maximum possible value.
Intuition
Model error contributes two terms: reward prediction error ($\epsilon_R$) and transition prediction error ($\epsilon_P$). The transition error is multiplied by $V_{\max}$ because a wrong transition can send the agent to a state with very different value. The factor $1 / (1 - \gamma)$ amplifies both errors because the Bellman equation propagates errors from the future into the present. Small per-step errors accumulate over the effective horizon $1 / (1 - \gamma)$.
Proof Sketch
For any state $s$, expanding the Bellman optimality equations of $M$ and $\hat{M}$:

$$|V^*_M(s) - V^*_{\hat{M}}(s)| \le \max_a \left( |R(s, a) - \hat{R}(s, a)| + \gamma \left| \sum_{s'} P(s' \mid s, a)\, V^*_M(s') - \sum_{s'} \hat{P}(s' \mid s, a)\, V^*_{\hat{M}}(s') \right| \right)$$

The first term is at most $\epsilon_R$. Splitting the second term with the triangle inequality, the transition mismatch contributes at most $\gamma\,\epsilon_P\,V_{\max}$ by the total variation bound, and the remaining value mismatch contributes at most $\gamma\,\|V^*_M - V^*_{\hat{M}}\|_\infty$. Taking the supremum over $s$ and rearranging gives the result.
Why It Matters
This quantifies the price of model error. It shows that model-based RL is most dangerous when $\gamma$ is close to 1 (long horizons) and the value range $V_{\max}$ is large. It also explains why short-horizon model-based planning (MPC with small lookahead) is often more robust than long-horizon planning: the error amplification is controlled by the planning horizon $H$ rather than $1 / (1 - \gamma)$.
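To get a feel for how the bound scales, a quick numeric evaluation (function name and example values are illustrative):

```python
def model_error_bound(eps_r, eps_p, gamma, r_max):
    """Worst-case gap between V*_M and V*_Mhat from the bound above:
    (eps_R + gamma * eps_P * V_max) / (1 - gamma), with V_max = R_max / (1 - gamma)."""
    v_max = r_max / (1 - gamma)
    return (eps_r + gamma * eps_p * v_max) / (1 - gamma)

# Same 1% transition error, two discount factors:
near = model_error_bound(eps_r=0.0, eps_p=0.01, gamma=0.9, r_max=1.0)    # -> 0.9
far = model_error_bound(eps_r=0.0, eps_p=0.01, gamma=0.99, r_max=1.0)    # -> 99.0
```

Moving $\gamma$ from 0.9 to 0.99 inflates the worst case by two orders of magnitude, because $\gamma$ enters both through $V_{\max}$ and through the outer $1/(1-\gamma)$ factor.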
Failure Mode
The bound is worst-case and can be loose. In practice, model errors may be concentrated in rarely visited states and have little effect on the policy. The bound also assumes the same $\epsilon_P$ for all state-action pairs; models are typically more accurate in well-visited regions.
Model Learning
The model is typically learned by supervised regression:
Transition model: predict $s'$ from $(s, a)$. For discrete states, this is a classification problem. For continuous states, common approaches include:
- Gaussian models: predict $s' \sim \mathcal{N}(\mu_\theta(s, a), \Sigma_\theta(s, a))$ with learned mean and variance
- Ensemble models: train $K$ independent models, use disagreement as uncertainty (PETS, Chua et al. 2018)
- Latent dynamics: learn a latent state $z_t$ and predict transitions in latent space (Dreamer, Hafner et al. 2020)

Reward model: predict $r$ from $(s, a)$. Usually simpler than the transition model.
The critical design choice is how to handle model uncertainty. An overconfident model can lead the agent into regions where the model is wrong. Ensemble disagreement, Bayesian neural networks, and explicit epistemic uncertainty estimation all attempt to address this.
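A minimal sketch of ensemble disagreement, using bootstrapped linear models in place of the neural networks PETS uses; the uncertainty signal works the same way (function names are illustrative):

```python
import numpy as np

def fit_ensemble(X, Y, k=5, seed=0):
    """Fit k linear dynamics models (s, a) -> s' on bootstrap resamples of the
    data. A linear stand-in for a neural-network ensemble; the disagreement
    principle is identical."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])        # append a bias column
    models = []
    for _ in range(k):
        idx = rng.integers(len(X), size=len(X))      # bootstrap resample
        W, *_ = np.linalg.lstsq(Xb[idx], Y[idx], rcond=None)
        models.append(W)
    return models

def predict_with_uncertainty(models, x):
    """Mean next-state prediction plus ensemble disagreement (std across members)."""
    xb = np.append(x, 1.0)
    preds = np.stack([xb @ W for W in models])
    return preds.mean(axis=0), preds.std(axis=0)     # high std = do not trust the model here
```

Queried far outside the training distribution, the members extrapolate differently and the disagreement grows; uncertainty-aware planners penalize or avoid such states.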
Model Predictive Control (MPC)
Instead of solving the learned MDP fully, Model Predictive Control plans only a short horizon ahead:
- From the current state $s_t$, simulate multiple action sequences $(a_t, \ldots, a_{t+H-1})$ of length $H$ using $\hat{P}$ and $\hat{R}$
- Evaluate each sequence by the cumulative predicted reward $\sum_{k=0}^{H-1} \gamma^k \hat{r}_{t+k}$ (plus a terminal value estimate $\gamma^H \hat{V}(s_{t+H})$)
- Execute only the first action of the best sequence
- Re-plan at the next step
MPC is robust to model error because it replans at every step and only commits to one action. Long-horizon model errors never compound because the horizon is short. The cost: MPC is computationally expensive at inference time, requiring many forward simulations per action.
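A minimal random-shooting MPC sketch under simple assumptions: discrete actions, a deterministic learned model `model_step`, and a terminal value estimate `value_fn` (both hypothetical names, not from the source):

```python
import numpy as np

def mpc_random_shooting(s, model_step, value_fn, n_actions, horizon=5,
                        n_samples=200, gamma=0.99, seed=0):
    """Random-shooting MPC: sample candidate action sequences, roll each out
    through the learned model, return only the first action of the best one."""
    rng = np.random.default_rng(seed)
    best_return, best_first_action = -np.inf, 0
    for _ in range(n_samples):
        seq = rng.integers(n_actions, size=horizon)
        s_sim, ret = s, 0.0
        for t, a in enumerate(seq):
            r, s_sim = model_step(s_sim, a)         # simulate one step in the model
            ret += gamma ** t * r
        ret += gamma ** horizon * value_fn(s_sim)   # terminal value estimate
        if ret > best_return:
            best_return, best_first_action = ret, int(seq[0])
    return best_first_action                        # execute this, then replan next step
```

The caller executes the returned action in the real environment and calls the planner again from the next state, so model errors beyond the first step never bind the agent to a bad plan.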
AlphaZero and MuZero
AlphaZero (Silver et al., 2018) combines a known model (game rules) with learned value and policy networks. Monte Carlo Tree Search (MCTS) uses the model for planning, guided by the neural network evaluations. The result: superhuman play in Go, chess, and shogi.
MuZero (Schrittwieser et al., 2020) extends this to settings without a known model. It learns three components:
- Representation function $h$: encodes an observation into a latent state, $z_t = h(o_t)$
- Dynamics function $g$: predicts the next latent state and reward given a latent state and action, $(z_{t+1}, r_t) = g(z_t, a_t)$
- Prediction function $f$: predicts the value and policy from a latent state, $(v_t, p_t) = f(z_t)$
MuZero plans in latent space, never reconstructing observations. It achieves superhuman performance in Atari (no known model) while matching AlphaZero in board games.
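The data flow can be sketched with untrained linear stand-ins for $h$, $g$, and $f$. This is purely illustrative: real MuZero trains these networks end-to-end and plans with MCTS rather than open-loop rollouts, but the key structural point survives: after encoding the root observation once, everything happens in latent space.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_LAT, N_ACT = 8, 4, 3     # toy sizes, chosen arbitrarily

# Untrained linear stand-ins for MuZero's three learned networks.
W_h = 0.1 * rng.standard_normal((D_LAT, D_OBS))           # representation h: obs -> latent
W_g = 0.1 * rng.standard_normal((D_LAT, D_LAT + N_ACT))   # dynamics g: (latent, action) -> latent
w_r = 0.1 * rng.standard_normal(D_LAT + N_ACT)            # reward head of the dynamics function
w_v = 0.1 * rng.standard_normal(D_LAT)                    # value head of the prediction function f

def h(obs):
    """Encode the observation once, at the root of the search."""
    return np.tanh(W_h @ obs)

def g(z, a):
    """Step entirely in latent space; observations are never reconstructed."""
    za = np.concatenate([z, np.eye(N_ACT)[a]])            # one-hot action appended to latent
    return np.tanh(W_g @ za), float(w_r @ za)

def rollout_value(obs, actions, gamma=0.99):
    """Score an action sequence by unrolling the latent dynamics from h(obs)."""
    z, ret = h(obs), 0.0
    for t, a in enumerate(actions):
        z, r = g(z, a)
        ret += gamma ** t * r
    return ret + gamma ** len(actions) * float(w_v @ z)   # bootstrap with the value head
```

Note that nothing in the rollout maps a latent state back to observation space: the latent dynamics only need to support value and policy prediction, not reconstruction.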
Common Confusions
Model-based does not mean the model must be perfect
A common objection is that learned models are always wrong, so model-based RL is doomed. In practice, the model only needs to be accurate enough in the regions that matter for the current policy. Short-horizon planning (MPC) and model-free corrections (Dyna's direct RL step) can compensate for model error. The question is not "is the model perfect?" but "is the model useful?"
Planning horizon vs discount factor
The planning horizon $H$ in MPC is not the same as the effective horizon $1 / (1 - \gamma)$. MPC can use a short horizon even when $\gamma = 0.99$ (effective horizon 100). The value estimate at the end of the $H$-step rollout compensates for the truncated horizon. Short $H$ limits model error accumulation at the cost of relying on the terminal value estimate.
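A two-line calculation makes the distinction concrete (the specific values are illustrative):

```python
gamma, H = 0.99, 10                    # long effective horizon, short MPC lookahead

effective_horizon = 1 / (1 - gamma)    # ~100 steps matter under this discount
terminal_weight = gamma ** H           # weight on the terminal value estimate, ~0.904
```

With these numbers roughly 90% of the planned return still rides on $\hat{V}(s_H)$, which is exactly the tradeoff described above: short $H$ caps model error accumulation but leans heavily on the terminal value estimate.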
MuZero does not learn a pixel-level model
MuZero's dynamics function operates in a learned latent space, not in observation space. It does not predict future frames. This is a deliberate choice: predicting every pixel is wasteful if you only need to know the value and best action. The latent dynamics only need to capture information relevant for planning, not for reconstruction.
Key Takeaways
- Model-based RL trades sample efficiency for computational cost and model error risk
- Dyna interleaves real experience, model learning, and planning in a single loop
- With a perfect model, planning updates are equivalent to real updates
- Model error compounds over long planning horizons: per-step error $\epsilon$ becomes $O(H\epsilon)$ over an $H$-step rollout
- MPC limits error accumulation by replanning at every step with a short horizon
- MuZero plans in learned latent space, avoiding the need for pixel-level predictions
- Ensemble disagreement and uncertainty-aware planning mitigate overconfident models
Exercises
Problem
In Dyna-Q with $n$ planning steps per real step, the agent takes 1000 real steps. How many total Q-learning updates does the agent perform (counting both real and planning updates)? How does this compare to pure model-free Q-learning over 1000 steps?
Problem
A learned model has per-step transition error $\epsilon_P$ (total variation) with discount factor $\gamma$ and reward bound $R_{\max}$. Using the model error bound, what is the worst-case difference between $V^*_M$ and $V^*_{\hat{M}}$? What happens as $\gamma \to 1$?
Problem
In MPC with planning horizon $H$ and discount factor $\gamma$, the agent evaluates action sequences by $\sum_{k=0}^{H-1} \gamma^k \hat{R}(s_k, a_k) + \gamma^H \hat{V}(s_H)$. Explain why increasing $H$ reduces reliance on $\hat{V}$ but increases exposure to model error. What is the optimal $H$ if the model has per-step error $\epsilon$ and the value estimate $\hat{V}$ has error $\epsilon_V$?
References
Canonical:
- Sutton, "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming" (Dyna, 1991)
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 8
AlphaZero/MuZero:
- Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go" (AlphaZero, 2018)
- Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (MuZero, 2020)
Modern World Models:
- Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer, 2020)
- Chua et al., "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models" (PETS, 2018)
Next Topics
- World models and planning: deeper treatment of latent dynamics, Dreamer, video prediction models
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Bellman Equations (Layer 2)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)