Beyond LLMs
Video World Models
Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.
Why This Matters
Training RL agents (built on Markov decision processes) and robots in the real world is slow, expensive, and dangerous. Simulators are fast and safe but require hand-engineered physics and graphics. Video world models offer a third option: learn a simulator directly from video data, then train agents inside it.
The core idea is simple. A video generation model (often built on diffusion models) already knows how the visual world evolves over time. If you condition that model on actions (joystick inputs, motor commands), it becomes an interactive simulator that predicts what happens next given what you do.
Mental Model
A video world model is a learned transition function:

$\hat{o}_{t+1} \sim p_\theta(o_{t+1} \mid o_{1:t}, a_t)$

where $o_{1:t}$ is the history of observed frames and $a_t$ is the action taken at time $t$. The model generates the next frame (or a latent representation of it) conditioned on past observations and the action. By chaining predictions autoregressively, you can roll out entire trajectories without interacting with the real environment.
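A minimal sketch of this autoregressive rollout, with `model.predict` and `policy.act` as hypothetical stand-ins for a real implementation:

```python
# Sketch of an autoregressive world-model rollout. The `model` and `policy`
# interfaces are illustrative assumptions, not a specific library's API.

def rollout(model, policy, initial_frames, horizon):
    """Roll out `horizon` steps entirely inside the learned model."""
    frames = list(initial_frames)              # o_1, ..., o_t
    actions = []
    for _ in range(horizon):
        a = policy.act(frames)                 # choose an action from history
        next_frame = model.predict(frames, a)  # o_{t+1} ~ p(. | o_{1:t}, a_t)
        frames.append(next_frame)
        actions.append(a)
    return frames, actions
```

The loop never touches a real environment: each predicted frame is appended to the history and fed back in as conditioning for the next step.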
Formal Setup and Notation
Video World Model
A video world model is a generative model that predicts future visual observations conditioned on past observations and actions. The model can be:
- Pixel-space: directly generates RGB frames
- Latent-space: generates compressed representations via a learned encoder/decoder pair
Training data consists of trajectories collected from real environments or existing video datasets.
Imagination-Based Planning
Given a world model $\hat{p}_\theta$, a policy $\pi$, and a reward model $\hat{r}$, imagination-based planning evaluates a policy by:

$\hat{V}^\pi = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t \, \hat{r}(\hat{o}_t, a_t)\right]$

where $a_t \sim \pi(\cdot \mid \hat{o}_{1:t})$ and $\hat{o}_{t+1} \sim \hat{p}_\theta(\cdot \mid \hat{o}_{1:t}, a_t)$. All rollouts happen inside the model. No real environment interaction is needed.
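The imagined evaluation can be sketched as a Monte Carlo estimate; every interface here (`model.predict`, `policy.act`, `reward_model.reward`) is an illustrative assumption:

```python
# Monte Carlo sketch of imagination-based policy evaluation. All component
# interfaces are hypothetical stand-ins for learned models.

def imagined_return(model, policy, reward_model, initial_frames,
                    horizon, gamma=0.99, num_rollouts=16):
    """Average discounted return over imagined rollouts; no real env needed."""
    total = 0.0
    for _ in range(num_rollouts):
        frames = list(initial_frames)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy.act(frames)
            frames.append(model.predict(frames, a))   # imagined next frame
            ret += discount * reward_model.reward(frames[-1], a)
            discount *= gamma
        total += ret
    return total / num_rollouts
```

With a stochastic model or policy, averaging over rollouts reduces the variance of the estimate; with deterministic components one rollout suffices.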
Core Definitions
The fidelity of a video world model is how closely its generated trajectories match real environment dynamics. High fidelity means agents trained in the model transfer well to the real world.
The controllability of a video world model measures how reliably actions produce the intended effects in generated frames. A model with low controllability may ignore action inputs and generate plausible but uncontrollable videos.
The horizon is how many steps the model can predict before error accumulates and predictions become unreliable. Compounding errors are the primary failure mode of autoregressive rollouts.
Main Theorems
Simulation Lemma for Learned World Models
Statement
Let $p$ be the true environment and $\hat{p}$ a learned world model. If the total variation distance between one-step transition distributions satisfies $D_{\mathrm{TV}}\big(p(\cdot \mid s, a), \hat{p}(\cdot \mid s, a)\big) \le \epsilon$ for all $(s, a)$, then for any policy $\pi$:

$\big|V^\pi_p - V^\pi_{\hat{p}}\big| \le \frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^2}$

where $V^\pi_p$ and $V^\pi_{\hat{p}}$ are the expected returns in the true and model environments.
Intuition
Small per-step prediction errors compound over the planning horizon. The factor $\frac{1}{(1-\gamma)^2}$ reflects this compounding: one factor of $\frac{1}{1-\gamma}$ comes from the effective horizon length, and another from the accumulation of distributional shift. This bound motivates keeping rollout horizons short and model accuracy high.
Proof Sketch
Telescope the difference in value functions across time steps. At each step, the one-step error contributes at most $\epsilon$ in TV distance, which translates to a difference in expected future return of at most $\frac{2\epsilon R_{\max}}{1-\gamma}$. Summing these discounted contributions, $\sum_{t \ge 0} \gamma^{t+1} \cdot \frac{2\epsilon R_{\max}}{1-\gamma} = \frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^2}$, gives the result.
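To get a feel for how severe the bound is, a quick numerical check (the sample values below are illustrative, not from the source):

```python
def simulation_gap_bound(gamma, eps, r_max):
    """Upper bound |V - V_hat| <= 2*gamma*eps*R_max / (1 - gamma)^2."""
    return 2.0 * gamma * eps * r_max / (1.0 - gamma) ** 2

# Even a 1% per-step TV error allows a large value gap when gamma = 0.99:
gap = simulation_gap_bound(gamma=0.99, eps=0.01, r_max=1.0)  # ≈ 198
```

With rewards bounded by 1, the maximum possible return is $\frac{R_{\max}}{1-\gamma} = 100$, so a bound near 198 is vacuous: at long effective horizons, per-step accuracy must be far better than 1% for the guarantee to say anything.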
Why It Matters
This bound quantifies the fundamental tradeoff in model-based RL: longer imagination rollouts give more data but accumulate more error. Practical systems like Dreamer use short rollout horizons (15-50 steps) precisely because of this compounding.
Failure Mode
The bound assumes uniform accuracy across all states and actions. In practice, the world model may be accurate in regions the agent has visited but poor in novel regions. Off-distribution states can produce arbitrarily bad predictions, making the uniform assumption unrealistic for exploration.
Key Approaches
Vid2World: Action-Conditioned Video Diffusion
Vid2World fine-tunes a pretrained video diffusion model to accept action inputs. The architecture modifies the denoising network to take the actions as input alongside the noisy frames, rather than the frames alone. Key insight: the pretrained model already knows visual dynamics; fine-tuning teaches it to condition on actions.
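The conditioning pattern can be illustrated with a toy denoising step: embed the action (here as a one-hot vector) and concatenate it to the frame features before applying the network. This is a schematic sketch with made-up weights, not Vid2World's actual architecture:

```python
# Toy sketch of action conditioning in a denoising network. The weights are
# fixed stand-ins for a trained network; real systems use deep architectures.

D_FRAME, NUM_ACTIONS = 4, 3

# Toy linear "denoiser" mapping (frame features + action embedding) -> frame.
WEIGHTS = [[(i + 1) * (j + 1) * 0.01 for j in range(D_FRAME + NUM_ACTIONS)]
           for i in range(D_FRAME)]

def denoise_step(noisy_frame, action_id):
    """One denoising step conditioned on the action: concatenate a one-hot
    action embedding to the frame features, then apply the (toy) network."""
    x = noisy_frame + [1.0 if k == action_id else 0.0 for k in range(NUM_ACTIONS)]
    return [sum(w * v for w, v in zip(row, x)) for row in WEIGHTS]

frame = [0.0] * D_FRAME
# Different actions yield different predictions from the same frame:
out_left = denoise_step(frame, 0)
out_right = denoise_step(frame, 1)
```

The point of the sketch is controllability: because the action enters the network's input, changing the action changes the denoised prediction even when the visual history is identical.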
Genie: World Models from Internet Video
Genie learns a world model from unlabeled internet video (no action labels). It infers a latent action space from observed transitions: if the camera moves left, that becomes a latent action. At inference time, users provide actions in this learned latent space to interact with the model. This removes the need for action-labeled training data.
Dreamer: Latent-Space World Models for RL
Dreamer (V1 through V3) learns a world model in a compact latent space rather than pixel space. The Recurrent State-Space Model (RSSM) maintains a latent state $z_t$ and predicts $z_{t+1}$ given $z_t$ and $a_t$. A decoder maps latent states back to pixels for visualization. Training the policy and value function happens entirely in latent space, which is computationally cheaper than pixel-space rollouts.
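A toy sketch of rolling latent dynamics forward in the spirit of an RSSM; the linear update and prior here are illustrative stand-ins for Dreamer's learned recurrent networks and stochastic heads:

```python
# Toy latent-space imagination loop. Real RSSMs use a GRU plus learned
# stochastic/deterministic heads; these linear maps are stand-ins.

def transition(h, z, a):
    """Deterministic recurrent update h' = f(h, z, a) (toy linear version)."""
    return [0.9 * hi + 0.1 * zi + 0.05 * a for hi, zi in zip(h, z)]

def prior(h):
    """Predict the next latent from the recurrent state alone (toy version)."""
    return [0.5 * hi for hi in h]

def imagine(h0, z0, actions):
    """Roll the latent dynamics forward without ever decoding to pixels."""
    h, z, latents = h0, z0, []
    for a in actions:
        h = transition(h, z, a)
        z = prior(h)
        latents.append(z)
    return latents
```

Because the rollout stays in the low-dimensional latent space, imagining thousands of trajectories is cheap; the decoder is only needed when a human wants to inspect the frames.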
Canonical Examples
Training an Atari agent inside a world model
Dreamer-V3 achieves human-level performance on Atari by: (1) collecting a small amount of real environment data, (2) training an RSSM world model on this data, (3) rolling out thousands of imagined trajectories inside the model, (4) training a policy using these imagined trajectories. The agent requires 10-100x fewer real environment steps than model-free methods because most training happens in imagination.
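The four steps can be arranged as a schematic training driver; every component interface here (`env.collect`, `world_model.fit`, `world_model.imagine`, `agent.update`) is a hypothetical stand-in, not Dreamer's actual code:

```python
# Schematic Dreamer-style loop: alternate a little real-world data collection
# with a lot of imagined training. All interfaces are illustrative stand-ins.

def dreamer_style_training(env, world_model, agent, iterations,
                           real_steps=100, imagined_rollouts=1000):
    replay = []
    for _ in range(iterations):
        replay += env.collect(agent, real_steps)       # (1) real data
        world_model.fit(replay)                        # (2) train world model
        trajs = [world_model.imagine(agent)
                 for _ in range(imagined_rollouts)]    # (3) imagined rollouts
        agent.update(trajs)                            # (4) policy learning
    return agent
```

The sample-efficiency win comes from the ratio in the loop: far more gradient updates are driven by imagined trajectories than by real environment steps.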
Common Confusions
Video generation is not the same as a world model
Sora and similar video generation models produce visually impressive videos but do not accept action inputs. Without action conditioning, they cannot serve as interactive simulators. A world model must respond to actions; a video generator only needs to produce plausible continuations.
High visual fidelity does not imply good dynamics
A video world model can produce photorealistic frames that violate physics. A ball might pass through a wall or an object might teleport. Visual quality and dynamical accuracy are separate properties. For RL training, dynamical accuracy matters far more than visual quality.
Compounding error is not just noise accumulation
Each prediction step shifts the state distribution. After $k$ steps, the model may be in a region of state space it has never seen during training. The error at step $k$ is not $k$ times the one-step error; it can be much worse because the model is extrapolating. This distributional shift is qualitatively different from additive noise.
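A toy numerical illustration of the compounding: when the true and learned dynamics differ slightly in a multiplicative factor (values below are made up for the demo), the rollout gap grows geometrically rather than linearly in the number of steps:

```python
# Toy compounding-error demo: a small multiplicative mismatch between true
# and model dynamics produces a gap that grows far faster than linearly.

def rollout_gap(steps, x0=1.0, true_a=1.05, model_a=1.06):
    """Absolute gap between true and model rollouts after `steps` steps."""
    x_true, x_model = x0, x0
    for _ in range(steps):
        x_true *= true_a
        x_model *= model_a
    return abs(x_model - x_true)

one_step = rollout_gap(1)     # small one-step error
fifty_step = rollout_gap(50)  # much larger than 50 * one_step
```

Even this toy example understates the real failure mode: a learned model does not merely amplify a fixed bias, it extrapolates arbitrarily once the rollout leaves the training distribution.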
Exercises
Problem
The simulation lemma gives a bound of $\frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^2}$ for the value difference between true and model environments. For given values of $\gamma$ and $R_{\max}$, what per-step TV accuracy $\epsilon$ do you need so that the value difference is at most 0.1?
Problem
Genie learns a latent action space from unlabeled video. What assumptions about the video data are needed for the inferred actions to be meaningful? Can you construct a dataset where the inferred latent actions are completely unrelated to any physically meaningful control?
References
Canonical:
- Ha & Schmidhuber, World Models (2018), arXiv:1803.10122
- Hafner et al., Dreamer V3 (2023), arXiv:2301.04104
Current:
- Bruce et al., Genie: Generative Interactive Environments (2024), arXiv:2402.15391
- Yang et al., Video as World Models (2024), survey of action-conditioned video generation
- Zhang et al., Dive into Deep Learning (2023), Chapters 14-17
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- World Models and Planning (Layer 4)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Diffusion Models (Layer 4)
- Variational Autoencoders (Layer 3)
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)