Beyond LLMs
World Model Evaluation
How to measure whether a learned world model is useful: prediction accuracy, controllability (sim-to-real transfer), planning quality, and why long-horizon evaluation is hard.
Why This Matters
A world model is only useful if policies trained inside it work in the real world. High prediction accuracy on single-step forecasts does not guarantee that multi-step plans will succeed. Evaluating world models requires metrics that capture what matters for downstream decision-making, not just perceptual fidelity.
Mental Model
Think of three levels of evaluation, from weakest to strongest:
- Prediction accuracy: does the model predict the next observation correctly?
- Controllability: can a policy trained entirely in the model succeed in the real environment?
- Planning quality: do plans optimized in the model lead to high real-world reward?
Each level is strictly harder than the previous. A model can score well on prediction but fail on controllability.
Formal Setup
World Model
A learned world model $\hat{p}_\theta(s_{t+1} \mid s_t, a_t)$ approximates the true environment dynamics $p(s_{t+1} \mid s_t, a_t)$. The model may predict in observation space (pixel-level), latent space (learned representation), or a mixture.
Single-Step Prediction Error
The single-step prediction error measures how well the model predicts the immediate next state:
$$\epsilon = \mathbb{E}_{(s,a) \sim \mu}\Big[D\big(p(\cdot \mid s, a),\ \hat{p}_\theta(\cdot \mid s, a)\big)\Big]$$
where $D$ is a divergence measure (KL, Wasserstein, MSE in observation space, etc.) and $\mu$ is a distribution over state-action pairs.
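As a concrete (if crude) instance, $D$ can be taken to be MSE in observation space and the expectation estimated by Monte Carlo over sampled transitions. The sketch below is illustrative, not from the source: `single_step_error`, the `model_step(s, a)` callable, and the toy dynamics are all assumed names.

```python
import numpy as np

def single_step_error(model_step, transitions):
    """Monte Carlo estimate of E_{(s,a)~mu}[ D(p, p_hat) ] with
    D = mean squared error in observation space (a common proxy
    when transition densities are unavailable).

    model_step:  callable (s, a) -> predicted next state (assumed API)
    transitions: iterable of (s, a, s_next) from the real environment
    """
    errors = [np.mean((model_step(s, a) - s_next) ** 2)
              for s, a, s_next in transitions]
    return float(np.mean(errors))

# Toy check: true dynamics s' = s + a; the model carries a constant
# 0.1 bias, so the estimated error is the squared bias, ~0.01.
rng = np.random.default_rng(0)
data = [(s, a, s + a)
        for s, a in ((rng.normal(size=3), rng.normal(size=3))
                     for _ in range(1000))]
biased_model = lambda s, a: s + a + 0.1
print(single_step_error(biased_model, data))  # ~0.01
```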
Main Theorems
Compounding Error in Model-Based RL
Statement
Let $\hat{p}$ be a learned model with single-step total variation error at most $\epsilon$ under any policy. Let $\hat{\pi}$ be a policy optimized in $\hat{p}$ achieving imagined return $\hat{J}(\hat{\pi})$. Then the true return $J(\hat{\pi})$ satisfies:
$$\big|J(\hat{\pi}) - \hat{J}(\hat{\pi})\big| \le 2\,\epsilon H^2 R_{\max}$$
for horizon $H$, or with effective horizon $1/(1-\gamma)$ in the discounted infinite case:
$$\big|J(\hat{\pi}) - \hat{J}(\hat{\pi})\big| \le \frac{2\gamma\,\epsilon R_{\max}}{(1-\gamma)^2}.$$
Intuition
At each step, the model's state distribution drifts from reality by at most $\epsilon$ in TV distance. Over $H$ steps these errors accumulate additively (not multiplicatively, thanks to the triangle inequality for TV distance), but the reward at every subsequent step is affected. The $1/(1-\gamma)^2$ dependence arises because the effective horizon is $1/(1-\gamma)$ and each step contributes $O(\epsilon R_{\max}/(1-\gamma))$ error.
Proof Sketch
Apply a simulation lemma: the difference in value functions between two MDPs is bounded by the expected sum of TV distances between their transition distributions along the trajectory. With per-step TV error $\epsilon$ and $1/(1-\gamma)$ effective steps, the bound follows by telescoping.
Why It Matters
This bound quantifies the sim-to-real gap. It shows that even small per-step errors can compound into large value function errors for long horizons. Reducing $\epsilon$ by a factor of 2 reduces the planning error bound by a factor of 2, but because the bound is quadratic in the effective horizon, extending the horizon by a factor of 2 quadruples it. This is why model-based RL with learned models works best at short horizons.
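To get a feel for the scaling, it helps to plug in numbers. A minimal sketch (the function name is invented; the constants follow one common derivation of the simulation lemma and vary across the literature):

```python
def model_value_gap(eps, r_max, horizon=None, gamma=None):
    """Worst-case value gap from the compounding-error bound.

    Finite horizon H:   2 * eps * H^2 * r_max
    Discounted case:    2 * gamma * eps * r_max / (1 - gamma)^2
    """
    if horizon is not None:
        return 2 * eps * horizon ** 2 * r_max
    return 2 * gamma * eps * r_max / (1 - gamma) ** 2

# Halving eps halves the bound; moving gamma from 0.9 to 0.95
# doubles the effective horizon and roughly quadruples the bound.
print(model_value_gap(0.01, 1.0, gamma=0.9))    # ~1.8
print(model_value_gap(0.005, 1.0, gamma=0.9))   # ~0.9
print(model_value_gap(0.01, 1.0, gamma=0.95))   # ~7.6
```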
Failure Mode
The bound is worst-case and often loose. In practice, errors may cancel across steps (the trajectory stays in a region where the model is accurate) or the policy may be robust to small perturbations. Conversely, the bound assumes a uniform $\epsilon$, but models are typically worse in rarely-visited states, so on-policy error can be much larger than the average error.
Evaluation Metrics in Practice
Prediction Metrics
- MSE / PSNR: for pixel-level predictions. Easy to compute but does not capture perceptual quality.
- SSIM: structural similarity, slightly better than MSE for images.
- FVD (Fréchet Video Distance): compares distributions of generated and real video clips using a pretrained feature extractor. The standard metric for video world models.
- LPIPS: learned perceptual image patch similarity. Correlates better with human judgment than MSE.
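Of these, MSE and PSNR are simple enough to compute directly; a minimal NumPy sketch (SSIM, LPIPS, and FVD require windowed statistics or pretrained networks and are better taken from a library):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two frames."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    m = mse(pred, target)
    return float("inf") if m == 0 else float(10 * np.log10(max_val ** 2 / m))

# Two 8-bit grayscale frames differing by a constant offset of 5
a = np.full((64, 64), 100.0)
b = a + 5.0
print(mse(a, b))   # 25.0
print(psnr(a, b))  # ~34.15 dB
```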
Controllability Metrics
- Sim-to-real transfer rate: train a policy in the model, deploy in reality, measure success rate.
- Reward correlation: does the rank ordering of policies by imagined reward match their rank ordering by real reward?
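Reward correlation can be checked with any rank statistic. Below is a self-contained Spearman correlation sketch (no ties assumed; the imagined and real return values are invented for illustration):

```python
def rank(values):
    """Ranks of values in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: +1 means the model orders
    policies exactly as reality does; -1 means the reverse."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(xs), rank(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

imagined = [10.0, 8.0, 6.0, 4.0, 2.0]  # returns inside the model
real     = [9.0, 8.5, 5.0, 4.5, 1.0]   # same policies in reality
print(spearman(imagined, real))         # 1.0: ordering preserved
```

Note that a model can be numerically pessimistic or optimistic about every policy and still score 1.0 here; only the ordering matters for policy selection.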
Planning Metrics
- Closed-loop return: run model-predictive control (MPC) with the learned model as the simulator, measure real-world return.
- Open-loop trajectory error: predict $k$ steps into the future and compare to reality. This degrades quickly for large $k$.
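Open-loop trajectory error can be measured by rolling the model and the environment forward under the same action sequence. A toy sketch with deterministic dynamics (all names and the 5% gain bias are invented for illustration) shows the characteristic growth with horizon:

```python
import numpy as np

def open_loop_error(model_step, env_step, s0, actions):
    """Roll the model forward open-loop and report the per-step L2
    error against the real environment (deterministic dynamics)."""
    s_model, s_real, errs = s0.copy(), s0.copy(), []
    for a in actions:
        s_model = model_step(s_model, a)
        s_real = env_step(s_real, a)
        errs.append(float(np.linalg.norm(s_model - s_real)))
    return errs

# Toy linear system: the model overestimates the dynamics gain by 5%
env = lambda s, a: 1.00 * s + a
model = lambda s, a: 1.05 * s + a
errs = open_loop_error(model, env, np.ones(2), [np.zeros(2)] * 10)
print([round(e, 2) for e in errs])  # error grows with horizon k
```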
Common Confusions
Good prediction does not imply good planning
A model can predict video frames with low FVD but still produce bad plans. This happens when the model's errors are concentrated on decision-relevant features (object positions, contact events) rather than visual details (textures, lighting). Evaluation must target the features that matter for the task.
Training loss is not evaluation
A world model's training loss (e.g., reconstruction loss, KL divergence in a VAE) measures in-distribution fit. Evaluation must test out-of-distribution generalization, especially for states the policy will visit that differ from the training data. A model can have low training loss and high planning error if the policy exploits model inaccuracies.
FVD evaluates distributions, not individual trajectories
FVD compares the distribution of generated videos to the distribution of real videos. A high FVD can result from the model generating plausible but wrong trajectories (mode mixing) or from missing some modes of the real distribution. It does not tell you whether a specific generated trajectory matches a specific real trajectory.
Exercises
Problem
A learned model has single-step prediction error $\epsilon$ in total variation. Using the compounding error bound, express the maximum value function error for a policy in terms of the effective horizon $1/(1-\gamma)$ and the reward bound $R_{\max}$.
Problem
A researcher reports that their world model achieves state-of-the-art FVD on a video prediction benchmark but the model-based RL agent trained in it performs worse than a model-free baseline. Give three possible explanations for this discrepancy.
References
Canonical:
- Janner et al., "When to Trust Your Model: Model-Based Policy Optimization," NeurIPS (2019)
- Luo et al., "A Survey on Model-Based Reinforcement Learning" (2022), arXiv:2206.09328
Current:
- Unterthiner et al., "FVD: A New Metric for Video Generation," ICLR Workshop (2019)
- Hafner et al., "Mastering Diverse Domains through World Models," arXiv:2301.04104 (2023)
- Zhang et al., Dive into Deep Learning (2023), Chapters 14-17
Next Topics
- Model-based RL: using world models for policy optimization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- World Models and Planning (Layer 4)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rⁿ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)