
World Model Evaluation

How to measure whether a learned world model is useful: prediction accuracy, controllability (sim-to-real transfer), planning quality, and why long-horizon evaluation is hard.

Advanced · Tier 3 · Frontier · ~40 min

Why This Matters

A world model is only useful if policies trained inside it work in the real world. High prediction accuracy on single-step forecasts does not guarantee that multi-step plans will succeed. Evaluating world models requires metrics that capture what matters for downstream decision-making, not just perceptual fidelity.

Mental Model

Think of three levels of evaluation, from weakest to strongest:

  1. Prediction accuracy: does the model predict the next observation correctly?
  2. Controllability: can a policy trained entirely in the model succeed in the real environment?
  3. Planning quality: do plans optimized in the model lead to high real-world reward?

Each level is strictly harder than the previous. A model can score well on prediction but fail on controllability.

Formal Setup

Definition

World Model

A learned world model $\hat{p}$ approximates the true environment dynamics $p(s_{t+1}, r_t \mid s_t, a_t)$. The model may predict in observation space (pixel-level), latent space (learned representation), or a mixture.

Definition

Single-Step Prediction Error

The single-step prediction error measures how well the model predicts the immediate next state:

$\epsilon_1 = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ d\big(\hat{p}(\cdot \mid s, a),\, p(\cdot \mid s, a)\big) \right]$

where $d$ is a divergence measure (KL, Wasserstein, MSE in observation space, etc.) and $\mathcal{D}$ is a distribution over state-action pairs.
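As a concrete instance, the expectation can be estimated empirically with $d$ taken to be MSE in observation space. A minimal sketch (the function name and array shapes are illustrative, not from any particular library):

```python
import numpy as np

def single_step_mse(model_next_obs: np.ndarray, true_next_obs: np.ndarray) -> float:
    """Empirical single-step prediction error with d = MSE in observation space.

    Both arrays have shape (batch, *obs_dims): the model's predicted next
    observations and the environment's actual next observations for the
    same (s, a) pairs drawn from the evaluation distribution D.
    """
    assert model_next_obs.shape == true_next_obs.shape
    # Mean over all observation dimensions, then over the batch.
    per_sample = np.mean((model_next_obs - true_next_obs) ** 2,
                         axis=tuple(range(1, model_next_obs.ndim)))
    return float(per_sample.mean())

# Toy check: a perfect model has zero error.
obs = np.random.rand(8, 3, 4)
print(single_step_mse(obs, obs))  # 0.0
```

Swapping the squared difference for a KL or Wasserstein estimate changes only the inner term; the batched-expectation structure stays the same.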

Main Theorems

Proposition

Compounding Error in Model-Based RL

Statement

Let $\hat{p}$ be a learned model with single-step total variation error at most $\epsilon$ under any policy. Let $\pi$ be a policy optimized in $\hat{p}$ achieving imagined return $\hat{V}^{\pi}$. Then the true return $V^{\pi}$ satisfies:

$|V^{\pi} - \hat{V}^{\pi}| \leq \dfrac{2\gamma \epsilon T}{1 - \gamma}$

for horizon $T$, or, with effective horizon $1/(1-\gamma)$ in the discounted infinite-horizon case:

$|V^{\pi} - \hat{V}^{\pi}| = O\!\left(\dfrac{\epsilon}{(1-\gamma)^2}\right)$

Intuition

At each step, the model's state distribution drifts from reality by at most $\epsilon$ in TV distance. Over $T$ steps these errors accumulate additively (by the triangle inequality for TV distance they add rather than multiply), and the reward at each step is affected by all the drift so far. The $1/(1-\gamma)^2$ dependence arises because the effective horizon is $1/(1-\gamma)$ and each step contributes $O(\epsilon/(1-\gamma))$ error.

Proof Sketch

Apply a simulation lemma: the difference in value functions between two MDPs is bounded by the expected sum of TV distances between their transition distributions along the trajectory. With per-step TV error $\epsilon$ and $T$ effective steps, the bound follows by telescoping.
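Spelled out slightly more, and under the extra assumption that per-step rewards are normalized to $[0, 1]$ (so the value gap is in return units), the telescoping step looks like:

$|V^{\pi} - \hat{V}^{\pi}| \;\le\; \sum_{t=0}^{T-1} \gamma^{t}\, 2\,\mathrm{TV}\!\big(d_t^{\pi}, \hat{d}_t^{\pi}\big) \;\le\; \sum_{t=0}^{T-1} \gamma^{t}\, 2 t \epsilon, \qquad \sum_{t=0}^{\infty} \gamma^{t} t \;=\; \frac{\gamma}{(1-\gamma)^{2}},$

where $d_t^{\pi}$ and $\hat{d}_t^{\pi}$ are the state distributions at time $t$ under $\pi$ in the true and learned dynamics: the first inequality is the simulation lemma, and the second uses the fact that per-step drift of at most $\epsilon$ gives $\mathrm{TV}(d_t^{\pi}, \hat{d}_t^{\pi}) \le t\epsilon$. Summing the geometric-arithmetic series yields the $O(\epsilon/(1-\gamma)^2)$ form above.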

Why It Matters

This bound quantifies the sim-to-real gap. It shows that even small per-step errors can compound into large value function errors for long horizons. Reducing $\epsilon$ by a factor of 2 reduces the planning error by a factor of 2, but doubling the horizon also doubles the error. This is why model-based RL with learned models works best at short horizons.
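To get a feel for the magnitudes, the finite-horizon bound is easy to evaluate numerically. A sketch (assuming rewards normalized to $[0, 1]$ so the bound is in return units):

```python
def compounding_error_bound(epsilon: float, gamma: float, horizon: int) -> float:
    """Worst-case value gap 2 * gamma * epsilon * T / (1 - gamma)
    from the compounding-error proposition (rewards assumed in [0, 1])."""
    return 2.0 * gamma * epsilon * horizon / (1.0 - gamma)

# Even a 1% per-step TV error blows up over a long effective horizon:
print(compounding_error_bound(0.01, 0.99, 100))  # roughly 198
print(compounding_error_bound(0.01, 0.9, 10))    # roughly 1.8
```

The first case shows the worst-case bound exceeding the maximum possible undiscounted return, i.e. the bound becomes vacuous: exactly the long-horizon regime where learned models stop being trustworthy.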

Failure Mode

The bound is worst-case and often loose. In practice, errors may cancel across steps (the trajectory stays in a region where the model is accurate) or the policy may be robust to small perturbations. Conversely, the bound assumes a uniform $\epsilon$, but models are typically worse in rarely-visited states, so on-policy error can be much larger than average error.

Evaluation Metrics in Practice

Prediction Metrics

  • MSE / PSNR: for pixel-level predictions. Easy to compute but does not capture perceptual quality.
  • SSIM: structural similarity, slightly better than MSE for images.
  • FVD (Fréchet Video Distance): compares distributions of generated and real video clips using a pretrained feature extractor. The standard metric for video world models.
  • LPIPS: learned perceptual image patch similarity. Correlates better with human judgment than MSE.
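PSNR, for instance, is a direct function of pixel MSE. A minimal sketch (assuming images scaled to [0, max_val]):

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * float(np.log10(max_val ** 2 / mse))
```

Libraries such as scikit-image provide reference implementations of both this and SSIM (`skimage.metrics.peak_signal_noise_ratio`, `skimage.metrics.structural_similarity`), which are preferable in practice.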

Controllability Metrics

  • Sim-to-real transfer rate: train a policy in the model, deploy in reality, measure success rate.
  • Reward correlation: does the rank ordering of policies by imagined reward match their rank ordering by real reward?
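Reward correlation is usually summarized with a rank statistic such as Spearman's correlation over a set of candidate policies. A self-contained sketch that ignores tie handling (in practice `scipy.stats.spearmanr` computes the same quantity with ties handled properly):

```python
import numpy as np

def spearman_rank_correlation(imagined: np.ndarray, real: np.ndarray) -> float:
    """Spearman correlation between imagined and real returns of a policy set.

    A value near 1 means the model ranks policies the same way reality does,
    which is what matters when the model is used to select policies.
    Ties are not handled (assumes distinct return values).
    """
    def ranks(x: np.ndarray) -> np.ndarray:
        order = np.argsort(x)
        r = np.empty(len(x), dtype=float)
        r[order] = np.arange(len(x))
        return r
    # Pearson correlation of the two rank vectors.
    return float(np.corrcoef(ranks(imagined), ranks(real))[0, 1])
```

A model can have a large absolute reward bias yet still be perfectly useful for policy selection if this rank correlation is high.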

Planning Metrics

  • Closed-loop return: run model-predictive control (MPC) with the learned model as the simulator, measure real-world return.
  • Open-loop trajectory error: predict $T$ steps into the future and compare to reality. This degrades quickly as $T$ grows.
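The open-loop evaluation can be sketched as follows, with `model_step` and `env_step` as hypothetical one-step transition functions standing in for a learned model and a real simulator:

```python
import numpy as np

def open_loop_errors(model_step, env_step, s0, actions):
    """Roll the learned model forward open-loop on a fixed action sequence
    and compare its state to the real environment's at every horizon.

    model_step and env_step are (state, action) -> next_state functions;
    returns the per-horizon L2 state errors, which typically grow with T.
    """
    s_model = np.asarray(s0, dtype=float)
    s_env = np.asarray(s0, dtype=float)
    errors = []
    for a in actions:
        s_model = model_step(s_model, a)  # imagined rollout
        s_env = env_step(s_env, a)        # ground-truth rollout
        errors.append(float(np.linalg.norm(s_model - s_env)))
    return errors

# Toy example: a model that overestimates the action effect by 10%
# accumulates error linearly with horizon.
env = lambda s, a: s + a
model = lambda s, a: s + 1.1 * a
print(open_loop_errors(model, env, [0.0], [np.array([1.0])] * 3))
```

Note this is open-loop by construction: the model never gets to re-anchor on real observations, which is exactly why the curve degrades faster than closed-loop MPC performance.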

Common Confusions

Watch Out

Good prediction does not imply good planning

A model can predict video frames with low FVD but still produce bad plans. This happens when the model's errors are concentrated on decision-relevant features (object positions, contact events) rather than visual details (textures, lighting). Evaluation must target the features that matter for the task.

Watch Out

Training loss is not evaluation

A world model's training loss (e.g., reconstruction loss, KL divergence in a VAE) measures in-distribution fit. Evaluation must test out-of-distribution generalization, especially for states the policy will visit that differ from the training data. A model can have low training loss and high planning error if the policy exploits model inaccuracies.

Watch Out

FVD evaluates distributions, not individual trajectories

FVD compares the distribution of generated videos to the distribution of real videos. A high FVD can result from the model generating plausible but wrong trajectories (mode mixing) or from missing some modes of the real distribution. It does not tell you whether a specific generated trajectory matches a specific real trajectory.
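For reference, the Fréchet distance underlying FVD has a closed form between the two Gaussians fit to feature statistics. A sketch using scipy, where `mu1`/`sigma1` are the feature mean and covariance of real clips and `mu2`/`sigma2` of generated clips (the feature-extraction step itself is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    FVD applies this formula to Gaussian fits of pretrained-network features
    of real and generated videos; it is a distribution-level quantity and
    says nothing about any individual trajectory.
    """
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions give zero distance regardless of the trajectories
# that produced the statistics.
mu, sigma = np.zeros(2), np.eye(2)
print(frechet_distance(mu, sigma, mu, sigma))
```

This makes the watch-out concrete: the distance depends only on aggregate moments, so per-trajectory correctness must be evaluated separately (e.g. with open-loop trajectory error).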

Exercises

ExerciseCore

Problem

A learned model has single-step prediction error $\epsilon = 0.01$ in total variation. Using the compounding error bound, estimate the maximum value function error for a policy with effective horizon $T = 100$ and $\gamma = 0.99$.

ExerciseAdvanced

Problem

A researcher reports that their world model achieves state-of-the-art FVD on a video prediction benchmark but the model-based RL agent trained in it performs worse than a model-free baseline. Give three possible explanations for this discrepancy.

References

Canonical:

  • Janner et al., "When to Trust Your Model: Model-Based Policy Optimization," NeurIPS (2019)
  • Luo et al., "A Survey on Model-Based Reinforcement Learning" (2022), arXiv:2206.09328

Current:

  • Unterthiner et al., "FVD: A New Metric for Video Generation," ICLR Workshop (2019)

  • Hafner et al., "Mastering Diverse Domains through World Models," arXiv:2301.04104 (2023)

  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17

Next Topics

  • Model-based RL: using world models for policy optimization

Last reviewed: April 2026
