Beyond LLMs
Video World Models
Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.
Why This Matters
Training RL agents (built on Markov decision processes) and robots in the real world is slow, expensive, and dangerous. Simulators are fast and safe but require hand-engineered physics and graphics. Video world models offer a third option: learn a simulator directly from video data, then train agents inside it.
The core idea is simple. A video generation model (often built on diffusion models) already knows how the visual world evolves over time. If you condition that model on actions (joystick inputs, motor commands), it becomes an interactive simulator that predicts what happens next given what you do.
Mental Model
A video world model is a learned transition function:

$\hat{o}_{t+1} \sim p_\theta(o_{t+1} \mid o_{1:t}, a_t)$

where $o_{1:t}$ is the history of observed frames and $a_t$ is the action taken at time $t$. The model generates the next frame (or a latent representation of it) conditioned on past observations and the action. By chaining predictions autoregressively, you can roll out entire trajectories without interacting with the real environment.
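A minimal sketch of this autoregressive rollout, with `model.predict` and `policy.act` as hypothetical stand-ins for a real implementation:

```python
# Sketch of an autoregressive world-model rollout. The `model` and `policy`
# interfaces are illustrative assumptions, not a specific library's API.

def rollout(model, policy, initial_frames, horizon):
    """Roll out `horizon` steps entirely inside the learned model."""
    frames = list(initial_frames)              # o_1, ..., o_t
    actions = []
    for _ in range(horizon):
        a = policy.act(frames)                 # choose an action from history
        next_frame = model.predict(frames, a)  # o_{t+1} ~ p(. | o_{1:t}, a_t)
        frames.append(next_frame)
        actions.append(a)
    return frames, actions
```

The loop never touches a real environment: each predicted frame is appended to the history and fed back in as conditioning for the next step.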
Formal Setup and Notation
Video World Model
A video world model is a generative model that predicts future visual observations conditioned on past observations and actions. The model can be:
- Pixel-space: directly generates RGB frames
- Latent-space: generates compressed representations via a learned encoder/decoder pair
Training data consists of trajectories collected from real environments or existing video datasets.
Imagination-Based Planning
Given a world model $\hat{p}_\theta$, a policy $\pi$, and a reward model $\hat{r}$, imagination-based planning evaluates a policy by:

$\hat{V}^\pi = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t \, \hat{r}(\hat{o}_t, a_t)\right]$

where $a_t \sim \pi(\cdot \mid \hat{o}_{1:t})$ and $\hat{o}_{t+1} \sim \hat{p}_\theta(\cdot \mid \hat{o}_{1:t}, a_t)$. All rollouts happen inside the model. No real environment interaction is needed.
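The imagined evaluation can be sketched as a Monte Carlo estimate; every interface here (`model.predict`, `policy.act`, `reward_model.reward`) is an illustrative assumption:

```python
# Monte Carlo sketch of imagination-based policy evaluation. All component
# interfaces are hypothetical stand-ins for learned models.

def imagined_return(model, policy, reward_model, initial_frames,
                    horizon, gamma=0.99, num_rollouts=16):
    """Average discounted return over imagined rollouts; no real env needed."""
    total = 0.0
    for _ in range(num_rollouts):
        frames = list(initial_frames)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy.act(frames)
            frames.append(model.predict(frames, a))   # imagined next frame
            ret += discount * reward_model.reward(frames[-1], a)
            discount *= gamma
        total += ret
    return total / num_rollouts
```

With a stochastic model or policy, averaging over rollouts reduces the variance of the estimate; with deterministic components one rollout suffices.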
Core Definitions
The fidelity of a video world model is how closely its generated trajectories match real environment dynamics. High fidelity means agents trained in the model transfer well to the real world.
The controllability of a video world model measures how reliably actions produce the intended effects in generated frames. A model with low controllability may ignore action inputs and generate plausible but uncontrollable videos.
The horizon is how many steps the model can predict before error accumulates and predictions become unreliable. Compounding errors are the primary failure mode of autoregressive rollouts.
Main Theorems
Simulation Lemma for Learned World Models
Statement
Let $p$ be the true environment and $\hat{p}$ a learned world model. If the total variation distance between one-step transition distributions satisfies $D_{\mathrm{TV}}\big(p(\cdot \mid s, a), \hat{p}(\cdot \mid s, a)\big) \le \epsilon$ for all $(s, a)$, then for any policy $\pi$:

$\big|V^\pi_p - V^\pi_{\hat{p}}\big| \le \frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^2}$

where $V^\pi_p$ and $V^\pi_{\hat{p}}$ are the expected returns in the true and model environments.
Intuition
Small per-step prediction errors compound over the planning horizon. The factor $\frac{1}{(1-\gamma)^2}$ reflects this compounding: one factor of $\frac{1}{1-\gamma}$ comes from the effective horizon length, and another from the accumulation of distributional shift. This bound motivates keeping rollout horizons short and model accuracy high.
Proof Sketch
Telescope the difference in value functions across time steps. At each step, the one-step error contributes at most $\epsilon$ in TV distance, which translates to a difference in expected future return of at most $\frac{2\epsilon R_{\max}}{1-\gamma}$. Summing these discounted contributions, $\sum_{t \ge 0} \gamma^{t+1} \cdot \frac{2\epsilon R_{\max}}{1-\gamma} = \frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^2}$, gives the result.
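To get a feel for how severe the bound is, a quick numerical check (the sample values below are illustrative, not from the source):

```python
def simulation_gap_bound(gamma, eps, r_max):
    """Upper bound |V - V_hat| <= 2*gamma*eps*R_max / (1 - gamma)^2."""
    return 2.0 * gamma * eps * r_max / (1.0 - gamma) ** 2

# Even a 1% per-step TV error allows a large value gap when gamma = 0.99:
gap = simulation_gap_bound(gamma=0.99, eps=0.01, r_max=1.0)  # ≈ 198
```

With rewards bounded by 1, the maximum possible return is $\frac{R_{\max}}{1-\gamma} = 100$, so a bound near 198 is vacuous: at long effective horizons, per-step accuracy must be far better than 1% for the guarantee to say anything.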
Why It Matters
This bound quantifies the fundamental tradeoff in model-based RL: longer imagination rollouts give more data but accumulate more error. Practical systems like Dreamer use short rollout horizons (15-50 steps) precisely because of this compounding.
Failure Mode
The bound assumes uniform accuracy across all states and actions. In practice, the world model may be accurate in regions the agent has visited but poor in novel regions. Off-distribution states can produce arbitrarily bad predictions, making the uniform assumption unrealistic for exploration.
Key Approaches
Vid2World: Action-Conditioned Video Diffusion
Vid2World fine-tunes a pretrained video diffusion model to accept action inputs. The architecture modifies the denoising network to take the actions as input alongside the noisy frames, rather than the frames alone. Key insight: the pretrained model already knows visual dynamics; fine-tuning teaches it to condition on actions.
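The conditioning pattern can be illustrated with a toy denoising step: embed the action (here as a one-hot vector) and concatenate it to the frame features before applying the network. This is a schematic sketch with made-up weights, not Vid2World's actual architecture:

```python
# Toy sketch of action conditioning in a denoising network. The weights are
# fixed stand-ins for a trained network; real systems use deep architectures.

D_FRAME, NUM_ACTIONS = 4, 3

# Toy linear "denoiser" mapping (frame features + action embedding) -> frame.
WEIGHTS = [[(i + 1) * (j + 1) * 0.01 for j in range(D_FRAME + NUM_ACTIONS)]
           for i in range(D_FRAME)]

def denoise_step(noisy_frame, action_id):
    """One denoising step conditioned on the action: concatenate a one-hot
    action embedding to the frame features, then apply the (toy) network."""
    x = noisy_frame + [1.0 if k == action_id else 0.0 for k in range(NUM_ACTIONS)]
    return [sum(w * v for w, v in zip(row, x)) for row in WEIGHTS]

frame = [0.0] * D_FRAME
# Different actions yield different predictions from the same frame:
out_left = denoise_step(frame, 0)
out_right = denoise_step(frame, 1)
```

The point of the sketch is controllability: because the action enters the network's input, changing the action changes the denoised prediction even when the visual history is identical.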
Genie: World Models from Internet Video
Genie learns a world model from unlabeled internet video (no action labels). It infers a latent action space from observed transitions: if the camera moves left, that becomes a latent action. At inference time, users provide actions in this learned latent space to interact with the model. This removes the need for action-labeled training data.
Dreamer: Latent-Space World Models for RL
Dreamer (V1 through V3) learns a world model in a compact latent space rather than pixel space. The Recurrent State-Space Model (RSSM) maintains a latent state $z_t$ and predicts $z_{t+1}$ given $z_t$ and $a_t$. A decoder maps latent states back to pixels for visualization. Training the policy and value function happens entirely in latent space, which is computationally cheaper than pixel-space rollouts.
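A toy sketch of rolling latent dynamics forward in the spirit of an RSSM; the linear update and prior here are illustrative stand-ins for Dreamer's learned recurrent networks and stochastic heads:

```python
# Toy latent-space imagination loop. Real RSSMs use a GRU plus learned
# stochastic/deterministic heads; these linear maps are stand-ins.

def transition(h, z, a):
    """Deterministic recurrent update h' = f(h, z, a) (toy linear version)."""
    return [0.9 * hi + 0.1 * zi + 0.05 * a for hi, zi in zip(h, z)]

def prior(h):
    """Predict the next latent from the recurrent state alone (toy version)."""
    return [0.5 * hi for hi in h]

def imagine(h0, z0, actions):
    """Roll the latent dynamics forward without ever decoding to pixels."""
    h, z, latents = h0, z0, []
    for a in actions:
        h = transition(h, z, a)
        z = prior(h)
        latents.append(z)
    return latents
```

Because the rollout stays in the low-dimensional latent space, imagining thousands of trajectories is cheap; the decoder is only needed when a human wants to inspect the frames.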
Canonical Examples
Training an Atari agent inside a world model
Dreamer-V3 achieves human-level performance on Atari by: (1) collecting a small amount of real environment data, (2) training an RSSM world model on this data, (3) rolling out thousands of imagined trajectories inside the model, (4) training a policy using these imagined trajectories. The agent requires 10-100x fewer real environment steps than model-free methods because most training happens in imagination.
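The four steps can be arranged as a schematic training driver; every component interface here (`env.collect`, `world_model.fit`, `world_model.imagine`, `agent.update`) is a hypothetical stand-in, not Dreamer's actual code:

```python
# Schematic Dreamer-style loop: alternate a little real-world data collection
# with a lot of imagined training. All interfaces are illustrative stand-ins.

def dreamer_style_training(env, world_model, agent, iterations,
                           real_steps=100, imagined_rollouts=1000):
    replay = []
    for _ in range(iterations):
        replay += env.collect(agent, real_steps)       # (1) real data
        world_model.fit(replay)                        # (2) train world model
        trajs = [world_model.imagine(agent)
                 for _ in range(imagined_rollouts)]    # (3) imagined rollouts
        agent.update(trajs)                            # (4) policy learning
    return agent
```

The sample-efficiency win comes from the ratio in the loop: far more gradient updates are driven by imagined trajectories than by real environment steps.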
Common Confusions
Video generation is not the same as a world model
Sora and similar video generation models produce visually impressive videos but do not accept action inputs. Without action conditioning, they cannot serve as interactive simulators. A world model must respond to actions; a video generator only needs to produce plausible continuations.
High visual fidelity does not imply good dynamics
A video world model can produce photorealistic frames that violate physics. A ball might pass through a wall or an object might teleport. Visual quality and dynamical accuracy are separate properties. For RL training, dynamical accuracy matters far more than visual quality.
Compounding error is not just noise accumulation
Each prediction step shifts the state distribution. After $k$ steps, the model may be in a region of state space it has never seen during training. The error at step $k$ is not $k$ times the one-step error; it can be much worse because the model is extrapolating. This distributional shift is qualitatively different from additive noise.
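A toy numerical illustration of the compounding: when the true and learned dynamics differ slightly in a multiplicative factor (values below are made up for the demo), the rollout gap grows geometrically rather than linearly in the number of steps:

```python
# Toy compounding-error demo: a small multiplicative mismatch between true
# and model dynamics produces a gap that grows far faster than linearly.

def rollout_gap(steps, x0=1.0, true_a=1.05, model_a=1.06):
    """Absolute gap between true and model rollouts after `steps` steps."""
    x_true, x_model = x0, x0
    for _ in range(steps):
        x_true *= true_a
        x_model *= model_a
    return abs(x_model - x_true)

one_step = rollout_gap(1)     # small one-step error
fifty_step = rollout_gap(50)  # much larger than 50 * one_step
```

Even this toy example understates the real failure mode: a learned model does not merely amplify a fixed bias, it extrapolates arbitrarily once the rollout leaves the training distribution.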
Exercises
Problem
The simulation lemma gives a bound of $\frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^2}$ for the value difference between true and model environments. For given values of $\gamma$ and $R_{\max}$, what per-step TV accuracy $\epsilon$ do you need so that the value difference is at most 0.1?
Problem
Genie learns a latent action space from unlabeled video. What assumptions about the video data are needed for the inferred actions to be meaningful? Can you construct a dataset where the inferred latent actions are completely unrelated to any physically meaningful control?
References
Canonical:
- Ha & Schmidhuber, World Models (2018), arXiv:1803.10122
- Hafner et al., Dreamer V3 (2023), arXiv:2301.04104
Current:
- Bruce et al., Genie: Generative Interactive Environments (2024), arXiv:2402.15391
- Yang et al., Video as World Models (2024), survey of action-conditioned video generation
- Zhang et al., Dive into Deep Learning (2023), Chapters 14-17
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- World Models and Planning (Layer 4)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Diffusion Models (Layer 4)
- Variational Autoencoders (Layer 3)
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)