Beyond LLMs
World Models and Planning
Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.
Why This Matters
Model-free RL (Q-learning, PPO) treats the environment as a black box and learns purely from trial and error. This is sample-inefficient: DQN needs hundreds of millions of frames to learn Atari games that humans master in minutes. It is also unsafe: you cannot test an action before executing it.
World models invert this: learn a model of the environment, then plan inside it. Imagine trajectories, evaluate actions, and choose the best plan, all without touching the real environment. This is how humans navigate: we simulate outcomes mentally before acting. World models bring this capability to RL agents.
Mental Model
Think of a chess player analyzing a position. They do not need to play physical moves on a board. They simulate sequences of moves in their head, evaluate the resulting positions, and choose the best line. A world model is the learned "board" in the agent's head. Planning is the search over imagined move sequences.
The central tradeoff: a learned model is never perfect. Plans based on an imperfect model can be worse than model-free learning if the model errors compound over long horizons. The art of model-based RL is managing this tradeoff.
Formal Setup
Learned World Model
A learned world model consists of:
- A representation function $z_t = \phi(o_t)$ mapping observations to latent states
- A dynamics model $\hat{z}_{t+1} = f(z_t, a_t)$ predicting the next latent state
- A reward predictor $\hat{r}_t = r(z_t, a_t)$ predicting immediate reward
- Optionally, a decoder $\hat{o}_t = d(z_t)$ reconstructing observations (used for training, not planning)
Given a current observation $o_t$, the model can simulate forward: $z_t = \phi(o_t)$, $\hat{z}_{t+1} = f(z_t, a_t)$, $\hat{z}_{t+2} = f(\hat{z}_{t+1}, a_{t+1})$, and so on for any sequence of actions.
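As a concrete (toy) illustration, the three components and forward simulation can be sketched with random linear maps standing in for learned networks. All names here (`encode`, `dynamics`, `reward`, `imagine`) are illustrative, not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent world model: every component is a random linear map.
# In practice each would be a trained neural network.
OBS_DIM, LATENT_DIM, ACT_DIM = 8, 4, 2
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1               # phi
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM + ACT_DIM)) * 0.1  # f
w_rew = rng.normal(size=LATENT_DIM)                                # r

def encode(obs):            # z_t = phi(o_t)
    return W_enc @ obs

def dynamics(z, a):         # z_{t+1} = f(z_t, a_t)
    return np.tanh(W_dyn @ np.concatenate([z, a]))

def reward(z):              # r_t = r(z_t)
    return float(w_rew @ z)

def imagine(obs, actions):
    """Roll the model forward for a sequence of actions, never
    touching the real environment; returns the predicted rewards."""
    z = encode(obs)
    rewards = []
    for a in actions:
        z = dynamics(z, a)
        rewards.append(reward(z))
    return rewards

obs = rng.normal(size=OBS_DIM)
plan = [rng.normal(size=ACT_DIM) for _ in range(5)]
print(imagine(obs, plan))   # five predicted rewards, one per imagined step
```
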
Planning
Planning uses the world model to select actions by searching over imagined trajectories. Given the current latent state $z_t$, planning evaluates candidate action sequences by simulating them through $f$ and summing predicted rewards:

$$a_{t:t+H}^{*} = \arg\max_{a_t, \ldots, a_{t+H}} \sum_{k=0}^{H} \gamma^{k}\, \hat{r}(\hat{z}_{t+k}, a_{t+k})$$
The agent executes the first action of the best plan and replans at the next step.
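A minimal planner in this spirit is random shooting with model-predictive control (MPC): sample candidate action sequences, score each one inside the model, execute only the first action of the best sequence, then replan. The sketch below uses a hand-coded 1-D model as a stand-in for a learned one; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy deterministic "learned" model: 1-D position; an action nudges
# the state, and reward is negative distance to a goal at x = 1.0.
def f(z, a):
    return z + 0.1 * a          # dynamics (stand-in)

def r(z):
    return -abs(z - 1.0)        # reward (stand-in)

def plan_random_shooting(z0, horizon=10, n_candidates=256, gamma=0.99):
    """Sample candidate action sequences, score each by the discounted
    sum of predicted rewards, and return the best sequence."""
    best_seq, best_ret = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        z, ret = z0, 0.0
        for k, a in enumerate(seq):
            z = f(z, a)
            ret += gamma**k * r(z)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq, best_ret

# MPC loop: execute the first action of the best plan, then replan.
z = 0.0
for step in range(3):
    seq, _ = plan_random_shooting(z)
    z = f(z, seq[0])            # in reality: act in the true environment
print(round(z, 3))              # position after three planned steps
```

With frequent replanning, each executed action is re-selected from the current state, which limits how far compounding model errors can propagate.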
Simulation Lemma
The simulation lemma quantifies how model errors affect planning quality. If the model has per-step transition-prediction error $\epsilon$ (in total variation distance), then over a horizon $H$, the value estimate error grows as $O(H^{2}\epsilon)$ in the worst case. This quadratic growth in horizon is the fundamental limitation of model-based planning.
Main Theorems
Model-Based RL Regret via Simulation Lemma
Statement
Let $\hat{P}$ be a learned transition model with $\|\hat{P}(\cdot \mid s, a) - P(\cdot \mid s, a)\|_{TV} \leq \epsilon$ for all $(s, a)$. Let $\hat{\pi}$ be the policy obtained by planning optimally in $\hat{P}$. Then the performance gap between $\hat{\pi}$ and the true optimal policy $\pi^{*}$ satisfies:

$$V^{\pi^{*}} - V^{\hat{\pi}} \leq \frac{2\gamma\, \epsilon\, R_{\max}}{(1-\gamma)^{2}}$$

For a finite horizon $H$, the bound becomes $O(\epsilon H^{2} R_{\max})$.
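To get a feel for the scaling, a few lines of arithmetic evaluate the discounted bound for illustrative values of $\epsilon$ and $\gamma$ (the numbers are chosen here for illustration, not taken from the text):

```python
# Worst-case planning gap from the simulation lemma:
#   V* - V(pi_hat) <= 2 * gamma * eps * R_max / (1 - gamma)^2
def planning_gap_bound(eps, gamma, r_max=1.0):
    return 2 * gamma * eps * r_max / (1 - gamma) ** 2

# Same model error eps = 0.01, increasingly long effective
# horizons 1 / (1 - gamma): the bound blows up quadratically.
for gamma in (0.9, 0.99, 0.999):
    print(gamma, planning_gap_bound(0.01, gamma))
```

A 1% per-step model error is nearly harmless at an effective horizon of 10, but the same error dominates the value scale at an effective horizon of 1000.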
Intuition
Each step of planning introduces an error of order $\epsilon R_{\max}/(1-\gamma)$ (the model is wrong by $\epsilon$ in TV distance, applied to a value function bounded by $R_{\max}/(1-\gamma)$). Over an effective horizon of $1/(1-\gamma)$ steps, these errors accumulate. The $1/(1-\gamma)^{2}$ dependence means that long-horizon problems (small $1-\gamma$) amplify model errors quadratically. This is why model-based methods struggle with long-horizon planning unless the model is very accurate.
Proof Sketch
Decompose the value difference using a telescoping sum over time steps. At each step, the value under the true dynamics differs from the value under the model dynamics by at most $2\gamma\epsilon R_{\max}/(1-\gamma)$ (the TV error applied to a value function bounded by $R_{\max}/(1-\gamma)$). Summing over the effective horizon $1/(1-\gamma)$ gives the result.
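In symbols, the telescoping step can be sketched as follows (a standard form of the argument, with notation as above):

```latex
% Telescoping decomposition for a fixed policy \pi:
V_{P}^{\pi}(s) - V_{\hat{P}}^{\pi}(s)
  = \sum_{t=0}^{\infty} \gamma^{t+1}\,
    \mathbb{E}_{(s_t, a_t) \sim \hat{P},\, \pi}
    \Big[ \big\langle P(\cdot \mid s_t, a_t) - \hat{P}(\cdot \mid s_t, a_t),\;
          V_{P}^{\pi} \big\rangle \Big]
% Each inner product is at most 2\epsilon \|V_P^\pi\|_\infty
%   \le 2\epsilon R_{\max}/(1-\gamma),
% and summing \gamma^{t+1} over t contributes a \gamma/(1-\gamma) factor,
% yielding the 2\gamma\epsilon R_{\max}/(1-\gamma)^2 bound.
```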
Why It Matters
This theorem explains both the promise and the limitation of world models. The promise: if $\epsilon$ is small, model-based methods can find near-optimal policies without ever executing suboptimal actions in the real environment (sample efficiency). The limitation: the quadratic dependence on horizon means small model errors become large planning errors over long time scales. This motivates learning in latent space (where models can be more accurate) and short-horizon planning with replanning.
Failure Mode
When model errors are correlated across states (systematic bias rather than random noise), the actual performance gap can be much larger than the worst-case bound. A model that consistently predicts slower dynamics, for example, produces systematically overconfident plans.
Dreamer: Latent World Models
The Dreamer family (v1, v2, v3) learns a latent-space world model and trains a policy entirely on imagined trajectories.
Architecture:
- Encoder: maps image observations $o_t$ to latent states $z_t$
- Recurrent State Space Model (RSSM): combines deterministic recurrence with stochastic latent variables for dynamics prediction
- Reward predictor and continuation predictor (predicts episode termination)
- Decoder: reconstructs observations from latent states (for model training)
Training loop:
1. Collect experience in the real environment with the current policy and add it to a replay buffer
2. Train the world model on replayed sequences (reconstruction, reward, and continuation losses)
3. Roll out the policy in imagination from replayed states; train the critic on $\lambda$-returns and the actor by backpropagating through the model
4. Repeat
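Schematically, the loop can be written as pseudocode; every interface here (`world_model.update`, `actor.act`, and so on) is hypothetical, standing in for the real Dreamer components:

```python
# Schematic Dreamer-style training loop (structure only; the model,
# actor, and critic updates are stand-ins for real gradient steps).
def train_dreamer(env, world_model, actor, critic, replay,
                  n_iterations=1000, imagine_horizon=15):
    obs = env.reset()
    for _ in range(n_iterations):
        # 1. Act in the real environment and store the experience.
        action = actor.act(world_model.encode(obs))
        next_obs, reward, done = env.step(action)
        replay.add(obs, action, reward, done)
        obs = env.reset() if done else next_obs

        # 2. Train the world model on replayed sequences
        #    (reconstruction + reward + dynamics losses).
        batch = replay.sample_sequences()
        world_model.update(batch)

        # 3. Imagine trajectories from replayed starting states and
        #    train the actor and critic purely on imagined data.
        start_states = world_model.encode_batch(batch)
        imagined = world_model.rollout(actor, start_states, imagine_horizon)
        critic.update(imagined)           # regress lambda-returns
        actor.update(imagined, critic)    # backprop through the model
```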
Dreamer Imagination-Based Policy Optimization
Statement
Dreamer optimizes the policy $\pi_\theta$ by maximizing the expected imagined return:

$$\max_{\theta}\; \mathbb{E}_{\hat{p},\, \pi_\theta}\left[\sum_{k=0}^{H} \gamma^{k}\, \hat{r}_{t+k}\right]$$

where the expectation is over trajectories generated by rolling out $\pi_\theta$ in the learned world model $\hat{p}$. The policy gradient is computed by backpropagating through the differentiable world model (no REINFORCE needed).
The value function is trained on imagined trajectories to compute $\lambda$-returns for the actor update, analogous to GAE in model-free actor-critic.
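A small, self-contained sketch of the $\lambda$-return recursion; this follows the standard TD($\lambda$) recursion (Dreamer's implementation differs in details such as batching and value normalization):

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """TD(lambda)-style returns over an imagined trajectory.

    rewards: predicted rewards r_1..r_H
    values:  critic values    V(z_1)..V(z_{H+1})  (one extra for bootstrap)
    """
    H = len(rewards)
    returns = np.empty(H)
    next_ret = values[-1]                 # bootstrap from the final value
    for t in reversed(range(H)):
        # G_t = r_t + gamma * ((1 - lam) * V(z_{t+1}) + lam * G_{t+1})
        next_ret = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_ret)
        returns[t] = next_ret
    return returns

r = np.ones(5)                # constant imagined reward
v = np.zeros(6)               # critic initialized to zero
print(lambda_returns(r, v))   # discounted sums of future imagined rewards
```

Setting `lam=1.0` recovers the plain Monte Carlo return; `lam=0.0` recovers the one-step bootstrapped target.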
Intuition
Because the world model is a differentiable neural network, you can compute analytic gradients of the imagined return with respect to the policy parameters. This is structurally different from model-free policy gradients, which must estimate gradients from sampled rewards. Dreamer turns RL into supervised learning: the "data" is imagined trajectories, and the "labels" are the predicted rewards.
Why It Matters
Dreamer achieves state-of-the-art sample efficiency on visual control tasks. Training the policy on imagined data means the agent can improve without additional real-world interactions. Dreamer v3 matches or exceeds model-free methods across diverse domains (Atari, DMC, Minecraft) while using 10-50x fewer environment steps.
MuZero: Learned Model + Tree Search
MuZero (DeepMind, 2020) combines a learned model with Monte Carlo Tree Search (MCTS), achieving superhuman performance on Go, chess, shogi, and Atari without knowing the rules of any game.
Key components:
- Representation function $h$: maps observation $o$ to an initial latent state $s^{0} = h(o)$
- Dynamics function $g$: given latent state and action, predicts the next latent state and immediate reward, $(s^{k+1}, r^{k+1}) = g(s^{k}, a^{k+1})$
- Prediction function $f$: given a latent state, predicts policy and value, $(p^{k}, v^{k}) = f(s^{k})$ (as in AlphaZero)
Critical insight: MuZero's dynamics function does not predict observations (pixels). It predicts latent states that are useful for planning. The model is trained end-to-end to produce accurate value and policy predictions after multiple steps of model rollout, not to reconstruct the environment faithfully.
The MCTS planning procedure uses the learned model to simulate forward and backpropagate value estimates through the search tree, just as AlphaZero does with the true game rules.
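A heavily simplified sketch of MCTS over a learned latent model: the two "networks" below are hand-coded stand-ins, and real MuZero adds value normalization, Dirichlet exploration noise, and batched network evaluation, among other refinements:

```python
import math
import numpy as np

N_ACTIONS = 3

# Stand-ins for MuZero's learned functions on an abstract latent state.
def dynamics(s, a):                      # g: (s, a) -> (s', r)
    return np.tanh(s + 0.1 * (a + 1)), float(np.sum(s)) * 0.01

def prediction(s):                       # f: s -> (policy prior, value)
    return np.ones(N_ACTIONS) / N_ACTIONS, float(np.tanh(np.sum(s)))

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children, self.state, self.reward = {}, None, 0.0
    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def ucb(parent, child, c=1.25):
    # PUCT-style score: exploitation + prior-weighted exploration bonus.
    return child.value() + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def mcts(root_state, n_sims=50, gamma=0.99):
    root = Node(prior=1.0)
    root.state = root_state
    priors, _ = prediction(root_state)
    root.children = {a: Node(priors[a]) for a in range(N_ACTIONS)}
    for _ in range(n_sims):
        node, path = root, [root]
        while node.children:             # select down the tree by UCB
            a, node = max(node.children.items(), key=lambda kv: ucb(path[-1], kv[1]))
            path.append(node)
        parent = path[-2]
        node.state, node.reward = dynamics(parent.state, a)   # expand via the model
        priors, value = prediction(node.state)
        node.children = {b: Node(priors[b]) for b in range(N_ACTIONS)}
        for n in reversed(path):         # backpropagate the value estimate
            n.visits += 1
            n.value_sum += value
            value = n.reward + gamma * value
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts(np.zeros(2)))                 # most-visited root action
```

Note that the search never calls an environment simulator: every expansion uses the learned `dynamics` function, exactly the substitution MuZero makes relative to AlphaZero.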
Video World Models
A recent frontier: using large pretrained video generation models as world simulators. The idea is that a model trained to predict future video frames has implicitly learned physics, object permanence, and dynamics.
Approach:
- Train (or use a pretrained) video diffusion model on large-scale video data
- Condition on the current frame and a proposed action (e.g., joystick input)
- Generate future frames as a simulation of the action's consequences
- Use the generated video for planning or policy training
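Schematically, planning with a video world model looks like the following pseudocode; `video_model.generate` and `reward_model` are hypothetical interfaces, not a real API:

```python
# Hypothetical interfaces; no real video-model API is assumed.
def plan_with_video_model(video_model, reward_model, frame, candidate_actions):
    """Score each candidate action sequence by generating its imagined
    video and summing frame-level predicted rewards."""
    best_seq, best_score = None, float("-inf")
    for seq in candidate_actions:
        # Slow step: diffusion sampling per candidate limits how many
        # trajectories can be evaluated in practice.
        frames = video_model.generate(frame, actions=seq)
        score = sum(reward_model(f) for f in frames)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq
```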
Key challenges:
- Controllability: standard video models predict what will happen, not what happens given a specific action. Action-conditioned generation requires architectural changes or fine-tuning
- Consistency: generated videos can drift or hallucinate over long horizons
- Speed: diffusion-based generation is slow, limiting the number of imagined trajectories that can be evaluated for planning
This approach has shown promising results in game environments and simple robotic settings, but the computational cost and consistency challenges remain significant barriers for real-time planning.
Model-Free vs. Model-Based
| | Model-Free | Model-Based |
|---|---|---|
| Sample efficiency | Low (millions of steps) | High (thousands of steps) |
| Computation per step | Low | High (model rollouts + planning) |
| Asymptotic performance | Can be optimal | Limited by model accuracy |
| Safety | Must try dangerous actions | Can simulate before acting |
| Long-horizon tasks | Robust (no compounding error) | Degrades as horizon grows |
In practice, the best systems combine both: use a model for short-horizon planning and value estimation, but ground decisions in real experience to correct model errors. Dreamer exemplifies this: the model generates training data, but the policy is evaluated in the real environment.
Common Confusions
World models do not need to predict pixels
Early world models (Ha & Schmidhuber, 2018) generated pixel-level predictions. Modern approaches (MuZero, Dreamer) learn latent dynamics that never produce pixels during planning. The decoder is a training aid, not a planning component. Predicting in latent space is faster, more compact, and avoids wasting model capacity on irrelevant visual details.
Planning does not require a perfect model
A common objection is that model errors make planning useless. In reality, even crude models enable useful planning when combined with (1) short planning horizons with frequent replanning, (2) uncertainty estimation to avoid relying on uncertain predictions, and (3) real-world experience to correct model-based decisions. MuZero demonstrates superhuman performance despite imperfect latent dynamics.
LLMs are not world models in the RL sense
Language models can predict consequences of actions described in text, but they do not learn dynamics in a way that supports systematic search and planning. An RL world model must support repeated forward simulation at arbitrary action sequences, which current LLMs cannot do efficiently or accurately for physical environments. The relationship between LLM "world knowledge" and formal world models is an open research question.
Summary
- World models: learn the dynamics $f$ and reward $\hat{r}$, then plan by simulating imagined trajectories
- Simulation lemma: model error $\epsilon$ causes planning error quadratic in the effective horizon
- Dreamer: latent RSSM world model, policy trained on imagined trajectories, backpropagation through differentiable model
- MuZero: learned latent model + MCTS, does not predict observations, trained end-to-end for value/policy accuracy
- Video world models: pretrained video generators as environment simulators
- Model-based RL trades computation for sample efficiency
Exercises
Problem
If a learned model has per-step TV distance error $\epsilon$ and discount factor $\gamma$, what is the worst-case value estimation error according to the simulation lemma? Assume rewards are bounded by $R_{\max}$.
Problem
MuZero does not train its dynamics model to predict observations, only to produce accurate value and policy predictions after $K$ steps of model rollout. Why is this better than training the model to minimize observation prediction error? Give a concrete example where the two objectives disagree.
Problem
The simulation lemma gives an $O(\epsilon H^{2})$ error bound for planning with an imperfect model. Can you design a planning algorithm that achieves $O(\epsilon H)$ error instead? Under what additional assumptions?
References
Canonical:
- Ha & Schmidhuber, "World Models" (NeurIPS 2018)
- Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (Nature 2020). MuZero
Current:
- Hafner et al., "Mastering Diverse Domains through World Models" (2023). Dreamer v3
- Yang et al., "Learning to Model the World with Language" (2024). language-augmented world models
- Bruce et al., "Genie: Generative Interactive Environments" (2024). video world models
Next Topics
The natural next steps from world models:
- JEPA and joint embedding: predicting in representation space as the foundation for world models
- Agentic RL and tool use: applying world models and planning to autonomous agents
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
Builds on This
- Video World Models (Layer 5)
- World Model Evaluation (Layer 5)