
Beyond LLMs

World Models and Planning

Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.

Advanced · Tier 2 · Frontier · ~60 min

Why This Matters

Model-free RL (Q-learning, PPO) treats the environment as a black box and learns purely from trial and error. This is sample-inefficient: DQN needs hundreds of millions of frames to learn Atari games that humans master in minutes. It is also unsafe: you cannot test an action before executing it.

World models invert this: learn a model of the environment, then plan inside it. Imagine trajectories, evaluate actions, and choose the best plan, all without touching the real environment. This is how humans navigate: we simulate outcomes mentally before acting. World models bring this capability to RL agents.

Mental Model

Think of a chess player analyzing a position. They do not need to play physical moves on a board. They simulate sequences of moves in their head, evaluate the resulting positions, and choose the best line. A world model is the learned "board" in the agent's head. Planning is the search over imagined move sequences.

The central tradeoff: a learned model is never perfect. Plans based on an imperfect model can be worse than model-free learning if the model errors compound over long horizons. The art of model-based RL is managing this tradeoff.

Formal Setup

Definition

Learned World Model

A learned world model consists of:

  • A representation function $h_\theta: \mathcal{O} \to \mathcal{Z}$ mapping observations to latent states
  • A dynamics model $f_\theta: \mathcal{Z} \times \mathcal{A} \to \mathcal{Z}$ predicting the next latent state
  • A reward predictor $r_\theta: \mathcal{Z} \times \mathcal{A} \to \mathbb{R}$ predicting the immediate reward
  • Optionally, a decoder $d_\theta: \mathcal{Z} \to \mathcal{O}$ reconstructing observations (used for training, not planning)

Given a current observation $o_t$, the model can simulate forward: $z_t = h_\theta(o_t)$, $z_{t+1} = f_\theta(z_t, a_t)$, $\hat{r}_t = r_\theta(z_t, a_t)$, and so on for any sequence of actions.
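As a concrete sketch, this forward simulation is just a loop over the three model components. The toy functions below (and their specific coefficients) are hypothetical stand-ins for learned networks, chosen only to make the rollout runnable:

```python
# Toy latent world model. h, f, r here are hand-written stand-ins
# for the learned representation, dynamics, and reward networks.

def h(obs):                 # representation: observation -> latent state
    return 0.5 * obs

def f(z, a):                # dynamics: (latent, action) -> next latent
    return 0.9 * z + 0.1 * a

def r(z, a):                # reward: (latent, action) -> predicted reward
    return -abs(z) - 0.01 * a * a

def simulate(obs, actions):
    """Roll the model forward from obs under a fixed action sequence,
    accumulating predicted rewards along the way."""
    z = h(obs)
    total = 0.0
    for a in actions:
        total += r(z, a)
        z = f(z, a)
    return total, z

ret, z_final = simulate(2.0, [0.0, -1.0, -1.0])
```

Nothing here touches an environment: the entire trajectory, including its return estimate, comes from the model.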

Definition

Planning

Planning uses the world model to select actions by searching over imagined trajectories. Given the current latent state $z_t$, planning evaluates candidate action sequences $\{a_t, a_{t+1}, \ldots, a_{t+H}\}$ by simulating them through $f_\theta$ and summing predicted rewards:

$$\hat{R}(a_{t:t+H}) = \sum_{k=0}^{H} \gamma^k r_\theta(z_{t+k}, a_{t+k}), \quad z_{t+k+1} = f_\theta(z_{t+k}, a_{t+k})$$

The agent executes the first action of the best plan and replans at the next step.
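The simplest instantiation of this loop is random-shooting model-predictive control: sample candidate action sequences, score each inside the model, and execute only the first action of the best one. The dynamics and reward below are hypothetical toy functions, not a learned model:

```python
import random

# Random-shooting MPC sketch. f and r are toy stand-ins for the
# learned dynamics f_theta and reward predictor r_theta.

def f(z, a):
    return 0.9 * z + 0.1 * a

def r(z, a):
    return -(z - 1.0) ** 2   # reward peaks when the latent is near 1.0

def plan(z0, horizon=5, candidates=200, gamma=0.99, rng=None):
    """Sample candidate action sequences, score them in the model,
    and return the first action of the best-scoring sequence."""
    rng = rng or random.Random(0)
    best_ret, best_first = float("-inf"), 0.0
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z, ret = z0, 0.0
        for k, a in enumerate(actions):
            ret += gamma ** k * r(z, a)
            z = f(z, a)
        if ret > best_ret:
            best_ret, best_first = ret, actions[0]
    return best_first

a0 = plan(z0=0.0)
```

Replanning at every step (rather than executing the whole sequence open-loop) is what keeps compounding model error in check.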

Definition

Simulation Lemma

The simulation lemma quantifies how model errors affect planning quality. If the model has per-step error $\epsilon$ (in transition prediction), then over a horizon $H$, the value estimate error grows as $O(\epsilon H^2)$ in the worst case. This quadratic growth in horizon is the fundamental limitation of model-based planning.

Main Theorems

Theorem

Model-Based RL Regret via Simulation Lemma

Statement

Let $\hat{P}$ be a learned transition model with $\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1 \leq \epsilon$ for all $(s,a)$. Let $\hat{\pi}$ be the policy obtained by planning optimally in $\hat{P}$. Then the performance gap between $\hat{\pi}$ and the true optimal policy $\pi^*$ satisfies:

$$V^{\pi^*}(s) - V^{\hat{\pi}}(s) \leq \frac{2\gamma \epsilon R_{\max}}{(1-\gamma)^2}$$

For a finite horizon $H$, the bound becomes $O(\epsilon H^2 R_{\max})$.

Intuition

Each step of planning introduces an error of order $\epsilon$ (the model is wrong by $\epsilon$ in TV distance). Over an effective horizon of $1/(1-\gamma)$ steps, these errors accumulate. The $(1-\gamma)^{-2}$ dependence means that long-horizon problems (small $1-\gamma$) amplify model errors quadratically. This is why model-based methods struggle with long-horizon planning unless the model is very accurate.

Proof Sketch

Decompose the value difference using a telescoping sum over time steps. At each step, the value under the true dynamics differs from the value under the model dynamics by at most $\gamma \epsilon \|V^*\|_\infty \leq \gamma \epsilon R_{\max}/(1-\gamma)$. Summing over the effective horizon $1/(1-\gamma)$ gives the result.

Why It Matters

This theorem explains both the promise and the limitation of world models. The promise: if $\epsilon$ is small, model-based methods can find near-optimal policies without ever executing suboptimal actions in the real environment (sample efficiency). The limitation: the quadratic dependence on horizon means small model errors become large planning errors over long time scales. This motivates learning in latent space (where models can be more accurate) and short-horizon planning with replanning.
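A quick numeric check makes the horizon dependence vivid. The snippet below simply evaluates the bound $2\gamma\epsilon R_{\max}/(1-\gamma)^2$ at a fixed model error:

```python
# Evaluating the simulation-lemma bound 2*gamma*eps*R_max / (1-gamma)^2
# for a fixed per-step model error eps = 0.01: halving (1 - gamma)
# roughly quadruples the worst-case planning error.

def value_gap_bound(eps, gamma, r_max=1.0):
    return 2.0 * gamma * eps * r_max / (1.0 - gamma) ** 2

bounds = {g: value_gap_bound(0.01, g) for g in (0.8, 0.9, 0.95)}
# gamma = 0.8  -> about 0.4
# gamma = 0.9  -> about 1.8
# gamma = 0.95 -> about 7.6
```

Even a 1% per-step error becomes a multiple of $R_{\max}$ once the effective horizon reaches a few dozen steps.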

Failure Mode

When model errors are correlated across states (systematic bias rather than random noise), the actual performance gap can be much larger than the worst-case bound. A model that consistently predicts slower dynamics, for example, produces systematically overconfident plans.

Dreamer: Latent World Models

The Dreamer family (v1, v2, v3) learns a latent-space world model and trains a policy entirely on imagined trajectories.

Architecture:

  1. Encoder $h_\theta$: maps image observations to latent states
  2. Recurrent State Space Model (RSSM): combines deterministic recurrence with stochastic latent variables for dynamics prediction
  3. Reward predictor and continuation predictor (predicts episode termination)
  4. Decoder: reconstructs observations from latent states (for model training)

Training loop:
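As a hedged sketch (with trivial stubs standing in for the learned networks; all class and method names here are hypothetical, not Dreamer's actual API), the loop alternates three phases: collect real experience, train the world model on replayed sequences, and train the actor-critic on imagined rollouts:

```python
import random

# Minimal Dreamer-style training loop with stub components.

class WorldModel:
    def encode(self, obs):      return 0.5 * obs           # h_theta
    def step(self, z, a):       return 0.9 * z + 0.1 * a   # RSSM dynamics
    def reward(self, z, a):     return -abs(z)             # reward head
    def train_on(self, batch):  pass  # reconstruction + reward + KL losses

class Policy:
    def act(self, z, rng):      return rng.uniform(-1.0, 1.0)
    def update(self, imagined): pass  # backprop through the model

def train(env_step, n_iters=3, imagine_h=5, seed=0):
    rng, model, policy = random.Random(seed), WorldModel(), Policy()
    replay, obs = [], 0.0
    for _ in range(n_iters):
        # 1. Collect real experience with the current policy.
        z = model.encode(obs)
        a = policy.act(z, rng)
        obs, rew = env_step(obs, a)
        replay.append((obs, a, rew))
        # 2. Train the world model on replayed experience.
        model.train_on(replay)
        # 3. Imagine trajectories from a replayed state; update actor-critic.
        imagined = []
        for _ in range(imagine_h):
            a = policy.act(z, rng)
            imagined.append((z, a, model.reward(z, a)))
            z = model.step(z, a)
        policy.update(imagined)
    return replay

log = train(lambda o, a: (o + a, -abs(o)))
```

Note the asymmetry: the environment is queried once per iteration, but imagination can roll out as many steps as compute allows.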

Proposition

Dreamer Imagination-Based Policy Optimization

Statement

Dreamer optimizes the policy $\pi_\theta$ by maximizing the expected imagined return:

$$J(\theta) = \mathbb{E}_{\pi_\theta, f_\theta} \left[ \sum_{t=0}^{H} \gamma^t \hat{r}_t \right]$$

where the expectation is over trajectories generated by rolling out $\pi_\theta$ in the learned world model $f_\theta$. The policy gradient is computed by backpropagating through the differentiable world model (no REINFORCE needed).

The value function $V_\psi$ is trained on imagined trajectories to compute $\lambda$-returns for the actor update, analogous to GAE in model-free actor-critic.

Intuition

Because the world model is a differentiable neural network, you can compute analytic gradients of the imagined return with respect to the policy parameters. This is structurally different from model-free policy gradients, which must estimate gradients from sampled rewards. Dreamer turns RL into supervised learning: the "data" is imagined trajectories, and the "labels" are the predicted rewards.
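The $\lambda$-return targets mentioned above are a standard backward recursion over the imagined trajectory. The function below is a textbook TD($\lambda$) target computation, not the exact Dreamer v3 implementation:

```python
# Lambda-returns over an imagined trajectory:
#   G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
# with the final step bootstrapped from its value estimate.

def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
    """rewards[t] = r_t, next_values[t] = V(s_{t+1}).
    Returns the list of lambda-return targets G_t."""
    G = next_values[-1]          # bootstrap beyond the horizon
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * G)
        out[t] = G
    return out
```

At $\lambda = 0$ this reduces to one-step TD targets $r_t + \gamma V(s_{t+1})$; at $\lambda = 1$ it becomes the bootstrapped Monte Carlo return.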

Why It Matters

Dreamer achieves state-of-the-art sample efficiency on visual control tasks. Training the policy on imagined data means the agent can improve without additional real-world interactions. Dreamer v3 matches or exceeds model-free methods across diverse domains (Atari, DMC, Minecraft) while using 10-50x fewer environment steps.

MuZero: Learned Model + Tree Search

MuZero (DeepMind, 2020) combines a learned model with Monte Carlo Tree Search (MCTS), achieving superhuman performance on Go, chess, shogi, and Atari without knowing the rules of any game.

Key components:

  1. Representation function hh: maps observation to initial latent state
  2. Dynamics function gg: given latent state and action, predicts next latent state and immediate reward
  3. Prediction function ff: given latent state, predicts policy and value (as in AlphaZero)

Critical insight: MuZero's dynamics function does not predict observations (pixels). It predicts latent states that are useful for planning. The model is trained end-to-end to produce accurate value and policy predictions after multiple steps of model rollout, not to reconstruct the environment faithfully.

The MCTS planning procedure uses the learned model to simulate forward and backpropagate value estimates through the search tree, just as AlphaZero does with the true game rules.
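A stripped-down sketch of that search is below. The toy h, g, f are hypothetical stand-ins for the trained networks, and real MuZero additionally uses Dirichlet exploration noise at the root, value normalization, and batched network inference:

```python
import math

# MuZero-style MCTS over a learned-model interface (toy stand-ins).

ACTIONS = (0, 1)

def h(obs):          # representation: observation -> latent
    return float(obs)

def g(z, a):         # dynamics: (latent, action) -> (next latent, reward)
    z_next = 0.5 * z + (1.0 if a == 1 else -1.0)
    return z_next, -abs(z_next)          # reward favors latents near 0

def f(z):            # prediction: latent -> (policy prior, value)
    return {0: 0.5, 1: 0.5}, -abs(z)

class Node:
    def __init__(self, z, prior):
        self.z, self.prior = z, prior
        self.visits, self.value_sum, self.reward = 0, 0.0, 0.0
        self.children = {}

def ucb(parent, child, c=1.25):
    q = child.value_sum / child.visits if child.visits else 0.0
    return q + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def mcts(obs, n_sims=50, gamma=0.99):
    root = Node(h(obs), 1.0)
    for _ in range(n_sims):
        node, path = root, [root]
        # Select down the tree by the pUCT rule until reaching a leaf.
        while node.children:
            _, node = max(node.children.items(),
                          key=lambda kv: ucb(path[-1], kv[1]))
            path.append(node)
        # Expand the leaf with the dynamics model, evaluate with f.
        priors, value = f(node.z)
        for a in ACTIONS:
            z_next, reward = g(node.z, a)
            child = Node(z_next, priors[a])
            child.reward = reward
            node.children[a] = child
        # Backpropagate the value estimate, discounting through rewards.
        for n in reversed(path):
            n.visits += 1
            n.value_sum += value
            value = n.reward + gamma * value
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

best_action = mcts(obs=2.0)
```

Every simulation here runs entirely in latent space: g is called in place of the environment's transition function, exactly the substitution MuZero makes relative to AlphaZero.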

Video World Models

A recent frontier: using large pretrained video generation models as world simulators. The idea is that a model trained to predict future video frames has implicitly learned physics, object permanence, and dynamics.

Approach:

  1. Train (or use a pretrained) video diffusion model on large-scale video data
  2. Condition on the current frame and a proposed action (e.g., joystick input)
  3. Generate future frames as a simulation of the action's consequences
  4. Use the generated video for planning or policy training

Key challenges:

  • Controllability: standard video models predict what will happen, not what happens given a specific action. Action-conditioned generation requires architectural changes or fine-tuning
  • Consistency: generated videos can drift or hallucinate over long horizons
  • Speed: diffusion-based generation is slow, limiting the number of imagined trajectories that can be evaluated for planning

This approach has shown promising results in game environments and simple robotic settings, but the computational cost and consistency challenges remain significant barriers for real-time planning.

Model-Free vs. Model-Based

| | Model-Free | Model-Based |
| --- | --- | --- |
| Sample efficiency | Low (millions of steps) | High (thousands of steps) |
| Computation per step | Low | High (model rollouts + planning) |
| Asymptotic performance | Can be optimal | Limited by model accuracy |
| Safety | Must try dangerous actions | Can simulate before acting |
| Long-horizon tasks | Robust (no compounding error) | Degrades as horizon grows |

In practice, the best systems combine both: use a model for short-horizon planning and value estimation, but ground decisions in real experience to correct model errors. Dreamer exemplifies this: the model generates training data, but the policy is evaluated in the real environment.

Common Confusions

Watch Out

World models do not need to predict pixels

Early world models (Ha & Schmidhuber, 2018) generated pixel-level predictions. Modern approaches (MuZero, Dreamer) learn latent dynamics that never produce pixels during planning. The decoder is a training aid, not a planning component. Predicting in latent space is faster, more compact, and avoids wasting model capacity on irrelevant visual details.

Watch Out

Planning does not require a perfect model

A common objection is that model errors make planning useless. In reality, even crude models enable useful planning when combined with (1) short planning horizons with frequent replanning, (2) uncertainty estimation to avoid relying on uncertain predictions, and (3) real-world experience to correct model-based decisions. MuZero demonstrates superhuman performance despite imperfect latent dynamics.

Watch Out

LLMs are not world models in the RL sense

Language models can predict consequences of actions described in text, but they do not learn dynamics in a way that supports systematic search and planning. An RL world model must support repeated forward simulation at arbitrary action sequences, which current LLMs cannot do efficiently or accurately for physical environments. The relationship between LLM "world knowledge" and formal world models is an open research question.

Summary

  • World models: learn $f_\theta(z_t, a_t) \to z_{t+1}$, then plan by simulating imagined trajectories
  • Simulation lemma: model error $\epsilon$ causes $O(\epsilon/(1-\gamma)^2)$ planning error, quadratic in the effective horizon
  • Dreamer: latent RSSM world model, policy trained on imagined trajectories, backpropagation through differentiable model
  • MuZero: learned latent model + MCTS, does not predict observations, trained end-to-end for value/policy accuracy
  • Video world models: pretrained video generators as environment simulators
  • Model-based RL trades computation for sample efficiency

Exercises

ExerciseCore

Problem

If a learned model has per-step TV distance error $\epsilon = 0.01$ and $\gamma = 0.99$, what is the worst-case value estimation error according to the simulation lemma? Assume $R_{\max} = 1$.

ExerciseAdvanced

Problem

MuZero does not train its dynamics model to predict observations, only to produce accurate value and policy predictions after $K$ steps of model rollout. Why is this better than training the model to minimize observation prediction error? Give a concrete example where the two objectives disagree.

ExerciseResearch

Problem

The simulation lemma gives an $O(\epsilon H^2)$ error bound for planning with an imperfect model. Can you design a planning algorithm that achieves $O(\epsilon H)$ error instead? Under what additional assumptions?


References

Canonical:

  • Ha & Schmidhuber, "World Models" (NeurIPS 2018)
  • Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (Nature 2020; MuZero)

Current:

  • Hafner et al., "Mastering Diverse Domains through World Models" (2023; Dreamer v3)
  • Yang et al., "Learning to Model the World with Language" (2024; language-augmented world models)
  • Bruce et al., "Genie: Generative Interactive Environments" (2024; video world models)

Last reviewed: April 2026
