
Comparison

Autoregressive Models vs. JEPA

Two competing paradigms for learning world models: autoregressive models predict raw tokens or pixels sequentially, while JEPA predicts abstract representations in a learned latent space without generating observable outputs.

What Each Does

Both autoregressive (AR) models and Joint Embedding Predictive Architectures (JEPA) learn to predict future information from context. They differ in what they predict and how they represent it.

Autoregressive models predict the next observable token (or pixel, or frame) in a sequence. The output is in the same space as the input. Generation is sequential: each prediction is fed back as input.

JEPA predicts the representation of a target from the representation of a context. The prediction happens in a learned abstract space, not in pixel or token space. JEPA does not generate observable outputs directly.

Side-by-Side Statement

Definition

Autoregressive Prediction

Given a sequence $x_1, \ldots, x_{t-1}$, an autoregressive model parameterizes:

$$p(x_t \mid x_1, \ldots, x_{t-1}) = \text{softmax}(W \cdot f_\theta(x_1, \ldots, x_{t-1}))$$

The model is trained by maximizing the log-likelihood $\sum_t \log p(x_t \mid x_{<t})$. At generation time, tokens are sampled one at a time and appended to the context.
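The factorization above can be sketched in a few lines of NumPy. This is a toy illustration, not a real model: `f_theta` stands in for a trained transformer or RNN, and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                      # toy vocabulary size and hidden width
W = rng.normal(size=(V, d))       # output projection (random, untrained)

def f_theta(context):
    """Toy context encoder: mean of fixed fake embeddings. Stands in for a
    real network summarizing x_1 .. x_{t-1} into a single vector."""
    E = np.cos(np.outer(np.asarray(context, dtype=float), np.arange(d)))
    return E.mean(axis=0)

def next_token_dist(context):
    """p(x_t | x_<t) = softmax(W · f_theta(context))."""
    logits = W @ f_theta(context)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Training objective: sum_t log p(x_t | x_<t) over an observed sequence.
seq = [1, 4, 2, 7]
log_lik = sum(np.log(next_token_dist(seq[:t])[seq[t]]) for t in range(1, len(seq)))

# Generation: pick one token at a time and append it to the context.
ctx = [1]
for _ in range(3):
    ctx.append(int(next_token_dist(ctx).argmax()))
```

Note how generation is inherently sequential: each step's output becomes part of the next step's input.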

Definition

JEPA Prediction

Given a context $x$ and target $y$, JEPA learns:

$$\hat{s}_y = g_\phi(s_x, z)$$

where $s_x = f_\theta(x)$ is the context representation, $z$ specifies what to predict (e.g., a masked region), and $\hat{s}_y$ predicts the representation $s_y = \bar{f}_{\bar{\theta}}(y)$ produced by a target encoder. The loss is $\|g_\phi(s_x, z) - \bar{f}_{\bar{\theta}}(y)\|^2$ in representation space.
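This loss can be sketched with toy linear encoders standing in for real networks. All shapes, the scalar conditioning variable `z`, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_rep = 16, 4

theta = rng.normal(size=(d_rep, d_in))      # online encoder f_theta (toy linear map)
theta_bar = theta.copy()                    # target encoder \bar{f}, typically an EMA copy
phi = rng.normal(size=(d_rep, d_rep + 1))   # predictor g_phi

def encode(params, v):
    return params @ v

def predict(phi, s_x, z):
    # Condition the predictor on z (e.g., which region was masked).
    return phi @ np.concatenate([s_x, [z]])

x, y, z = rng.normal(size=d_in), rng.normal(size=d_in), 0.5
s_x = encode(theta, x)
s_y = encode(theta_bar, y)                  # target representation; no gradient flows here
loss = np.sum((predict(phi, s_x, z) - s_y) ** 2)   # MSE in representation space
```

The key point is that `loss` compares two low-dimensional representations, never raw inputs: no pixel or token is ever reconstructed.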

Where Each Is Stronger

Autoregressive wins for text generation

Autoregressive models dominate language. The discrete token vocabulary makes next-token prediction well-posed: there are finitely many possible outputs at each step, and cross-entropy loss is natural. The entire modern LLM stack (GPT, Llama, Claude) uses autoregressive prediction.

Autoregressive models also provide exact log-likelihoods, which enables principled evaluation via perplexity, and the sequential generation process naturally handles variable-length outputs.
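Because the model assigns an exact probability to every token, perplexity falls out directly: it is the exponential of the average negative log-likelihood per token. A sketch with made-up per-token probabilities:

```python
import math

# Per-token probabilities a model might assign to a held-out sequence
# (illustrative values, not from a real model).
token_probs = [0.2, 0.05, 0.5, 0.1]
log_probs = [math.log(p) for p in token_probs]

avg_nll = -sum(log_probs) / len(log_probs)   # average negative log-likelihood
perplexity = math.exp(avg_nll)               # lower is better; 1.0 is a perfect model
```

A perplexity of $k$ means the model is, on average, as uncertain as a uniform choice over $k$ tokens. JEPA offers no analogous likelihood-based metric, which is why it is evaluated by probing and transfer instead.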

JEPA wins for learning representations without generation

JEPA avoids predicting every low-level detail of the target. In images or video, predicting exact pixel values forces the model to capture high-frequency noise and irrelevant texture. JEPA's representation-space prediction lets the model focus on semantic content.

I-JEPA (image JEPA) learns strong visual representations for classification without generating images. V-JEPA (video JEPA) learns temporal representations by predicting masked spatiotemporal regions in latent space. These representations transfer well to downstream tasks without fine-tuning on pixel-level generation.

Where Each Fails

Autoregressive fails on continuous, high-dimensional outputs

Predicting the exact next frame of a video pixel-by-pixel is intractable. The output space is too large and the mapping from context to future pixels is many-to-many (many plausible futures exist). This is why diffusion models rather than pure autoregressive models dominate image and video generation. Autoregressive models can work on quantized visual tokens, but the quantization introduces lossy compression.

JEPA fails at generation

JEPA does not model a distribution over observable outputs. It predicts abstract representations, and there is no standard way to decode these representations back to pixels or tokens. JEPA is a representation learning method, not a generative model. If your task requires generating text, images, or video, JEPA alone is insufficient.

JEPA requires careful collapse prevention

Without generating explicit outputs, JEPA risks representation collapse: both encoders can learn to map everything to a constant, achieving zero prediction loss trivially. Preventing collapse requires techniques like the exponential moving average (EMA) target encoder, variance and covariance regularization (VICReg), or asymmetric architectures. This engineering is nontrivial and failure modes are subtle.
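The EMA target encoder mentioned above can be sketched in a few lines. The decay `tau = 0.996` is a commonly used value; the parameter vectors are toy stand-ins for full network weights.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """Exponential-moving-average update of the target encoder. The target
    lags the online encoder, so the pair cannot instantly co-adapt to the
    trivial constant solution."""
    return tau * target_params + (1 - tau) * online_params

online = np.array([1.0, 2.0, 3.0])   # current online-encoder parameters (toy)
target = np.zeros(3)                 # target-encoder parameters
for _ in range(5):
    target = ema_update(target, online)
```

After $n$ updates against fixed online weights, the target equals `online * (1 - tau**n)`: it converges toward the online encoder, but slowly, which is exactly what stabilizes training.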

Key Assumptions That Differ

| | Autoregressive | JEPA |
|---|---|---|
| Prediction space | Observable (tokens, pixels) | Learned representations |
| Output | Generative (samples from $p(x_t \mid x_{<t})$) | Representations only |
| Loss function | Cross-entropy or reconstruction | MSE in representation space |
| Collapse risk | None (discrete targets) | High (requires EMA, regularization) |
| Dominates | Text (LLMs) | Self-supervised vision, video understanding |
| Evaluation | Perplexity, generation quality | Probing accuracy, transfer learning |

The Conceptual Divide: Generative vs. Discriminative World Models

Theorem

Complementary Inductive Biases

Statement

Autoregressive models learn a complete generative model $p(x_1, \ldots, x_T)$ that can be sampled from. This requires modeling all dependencies, including low-level details.

JEPA learns a predictive model in representation space that can capture high-level structure without modeling low-level details. The representation space is learned jointly with the prediction task, discarding information that is not useful for prediction.

These are complementary: AR models capture everything (including noise), while JEPA captures only what is predictable at the right level of abstraction.

Intuition

Consider predicting what happens next in a video of a ball being thrown. An autoregressive model must predict every pixel: the ball's position, the background texture, lighting variations, compression artifacts. JEPA only needs to predict that the ball moves along a parabolic trajectory in some abstract state space. The JEPA representation can discard irrelevant visual details and focus on the physics.

What to Memorize

  1. Autoregressive: Predict next token/pixel in observation space. Generative. Dominates text.

  2. JEPA: Predict representation in latent space. Not generative. Strong for vision understanding.

  3. Why JEPA exists: Predicting raw pixels wastes capacity on irrelevant detail. Abstract prediction focuses on semantics.

  4. Key tradeoff: AR models can generate; JEPA models learn better representations. You cannot easily have both.

  5. Collapse: JEPA must actively prevent trivial solutions. AR models with discrete targets do not have this problem.

When a Researcher Would Use Each

Example

Building a chatbot or text generator

Use autoregressive models. Text is discrete, sequential, and variable-length. Next-token prediction with cross-entropy loss is the natural and dominant approach. JEPA has not shown competitive results for text.

Example

Learning visual representations for robotics

Use JEPA (or a JEPA variant). The robot needs to understand scenes and predict consequences of actions at an abstract level, not generate pixel-perfect images. V-JEPA representations capture temporal structure that transfers to control tasks.

Example

Building a world model for planning

This is the frontier. Generative video models (Genie, Sora) predict future frames, in Genie's case conditioned on actions. JEPA-based world models predict abstract states conditioned on actions. The debate is unresolved: generative world models are more interpretable (you can visualize their predictions), but JEPA world models may be more computationally efficient and robust.

Common Confusions

Watch Out

JEPA is not a masked autoencoder

Masked autoencoders (MAE) predict missing pixels. JEPA predicts missing representations. This distinction matters: MAE forces the model to reconstruct low-level details, while JEPA allows the model to discard information that is irrelevant for the prediction task. Empirically, JEPA produces representations with better downstream transfer than MAE.
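The distinction is easiest to see in where each loss lives. A toy contrast with random linear maps (`Wd` and `We` are illustrative stand-ins for a trained decoder and encoder, not real MAE or JEPA components):

```python
import numpy as np

rng = np.random.default_rng(0)
d_pix, d_rep = 64, 8
x_visible = rng.normal(size=d_pix)   # visible patch (flattened pixels, toy)
x_masked = rng.normal(size=d_pix)    # masked patch the model must account for

Wd = rng.normal(size=(d_pix, d_pix)) * 0.1   # toy MAE decoder back to pixels
We = rng.normal(size=(d_rep, d_pix)) * 0.1   # toy shared encoder / predictor

# MAE: loss lives in pixel space — reconstruct the masked pixels themselves.
mae_loss = np.mean((Wd @ x_visible - x_masked) ** 2)

# JEPA: loss lives in representation space — predict the encoded target.
jepa_loss = np.mean((We @ x_visible - We @ x_masked) ** 2)
```

The MAE loss penalizes every pixel-level mismatch; the JEPA loss only penalizes mismatch in the 8-dimensional encoding, so any detail the encoder discards never enters the objective.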

Watch Out

Autoregressive does not mean transformer

Autoregressive is a factorization of the joint distribution. Transformers are an architecture. You can build autoregressive models with RNNs, SSMs, or other architectures. You can also use transformers for non-autoregressive tasks. The two concepts are orthogonal, even though they frequently co-occur in LLMs.

Watch Out

JEPA and contrastive learning are not the same

Contrastive learning (SimCLR, CLIP) pulls together representations of related inputs and pushes apart unrelated ones. JEPA predicts the representation of one view from another without explicit negative pairs. The VICReg regularization in JEPA prevents collapse through variance and covariance constraints rather than contrastive negatives.
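A sketch of VICReg's variance and covariance terms on a batch of embeddings. The hinge threshold `gamma` and the `eps` stabilizer follow the common formulation, but this is an illustration under those assumptions, not the reference implementation.

```python
import numpy as np

def vicreg_var_cov(embeddings, gamma=1.0, eps=1e-4):
    """Variance and covariance regularizers on a (batch, dim) array.
    The variance term pushes each dimension's std up toward gamma, so
    embeddings cannot collapse to a constant; the covariance term
    decorrelates dimensions. No negative pairs are needed."""
    z = embeddings - embeddings.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))   # hinge on per-dim std
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d               # penalize off-diagonal cov
    return var_loss, cov_loss

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 8))
v, c = vicreg_var_cov(batch)
```

A collapsed batch (every embedding identical) maximizes the variance penalty while a well-spread batch drives it toward zero, which is how the regularizer rules out the trivial constant solution without any contrastive negatives.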