
Comparison

Autoregressive Models vs. JEPA

Two competing paradigms for learning world models: autoregressive models predict raw tokens or pixels sequentially, while JEPA predicts abstract representations in a learned latent space without generating observable outputs.

What Each Does

Both autoregressive (AR) models and Joint Embedding Predictive Architectures (JEPA) learn to predict future information from context. They differ in what they predict and how they represent it.

Autoregressive models predict the next observable token (or pixel, or frame) in a sequence. The output is in the same space as the input. Generation is sequential: each prediction is fed back as input.

JEPA predicts the representation of a target from the representation of a context. The prediction happens in a learned abstract space, not in pixel or token space. JEPA does not generate observable outputs directly.

Side-by-Side Statement

Definition

Autoregressive Prediction

Given a sequence $x_1, \ldots, x_{t-1}$, an autoregressive model parameterizes:

$$p(x_t \mid x_1, \ldots, x_{t-1}) = \text{softmax}(W \cdot f_\theta(x_1, \ldots, x_{t-1}))$$

The model is trained by maximizing the log-likelihood $\sum_t \log p(x_t \mid x_{<t})$. At generation time, tokens are sampled one at a time and appended to the context.
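The factorization above can be sketched in a few lines of NumPy. This is a toy illustration, not a real model: `f_theta` stands in for a trained transformer or RNN, and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                      # toy vocabulary size and hidden width
W = rng.normal(size=(V, d))       # output projection (random, untrained)

def f_theta(context):
    """Toy context encoder: mean of fixed fake embeddings. Stands in for a
    real network summarizing x_1 .. x_{t-1} into a single vector."""
    E = np.cos(np.outer(np.asarray(context, dtype=float), np.arange(d)))
    return E.mean(axis=0)

def next_token_dist(context):
    """p(x_t | x_<t) = softmax(W · f_theta(context))."""
    logits = W @ f_theta(context)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Training objective: sum_t log p(x_t | x_<t) over an observed sequence.
seq = [1, 4, 2, 7]
log_lik = sum(np.log(next_token_dist(seq[:t])[seq[t]]) for t in range(1, len(seq)))

# Generation: pick one token at a time and append it to the context.
ctx = [1]
for _ in range(3):
    ctx.append(int(next_token_dist(ctx).argmax()))
```

Note how generation is inherently sequential: each step's output becomes part of the next step's input.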

Definition

JEPA Prediction

Given a context $x$ and target $y$, JEPA learns:

$$\hat{s}_y = g_\phi(s_x, z)$$

where $s_x = f_\theta(x)$ is the context representation, $z$ specifies what to predict (e.g., a masked region), and $\hat{s}_y$ predicts the representation $s_y = \bar{f}_{\bar{\theta}}(y)$ produced by a target encoder. The loss is $\|g_\phi(s_x, z) - \bar{f}_{\bar{\theta}}(y)\|^2$ in representation space.
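This loss can be sketched with toy linear encoders standing in for real networks. All shapes, the scalar conditioning variable `z`, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_rep = 16, 4

theta = rng.normal(size=(d_rep, d_in))      # online encoder f_theta (toy linear map)
theta_bar = theta.copy()                    # target encoder \bar{f}, typically an EMA copy
phi = rng.normal(size=(d_rep, d_rep + 1))   # predictor g_phi

def encode(params, v):
    return params @ v

def predict(phi, s_x, z):
    # Condition the predictor on z (e.g., which region was masked).
    return phi @ np.concatenate([s_x, [z]])

x, y, z = rng.normal(size=d_in), rng.normal(size=d_in), 0.5
s_x = encode(theta, x)
s_y = encode(theta_bar, y)                  # target representation; no gradient flows here
loss = np.sum((predict(phi, s_x, z) - s_y) ** 2)   # MSE in representation space
```

The key point is that `loss` compares two low-dimensional representations, never raw inputs: no pixel or token is ever reconstructed.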

Where Each Is Stronger

Autoregressive wins for text generation

Autoregressive models dominate language. The discrete token vocabulary makes next-token prediction well-posed: there are finitely many possible outputs at each step, and cross-entropy loss is natural. The entire modern LLM stack (GPT, Llama, Claude) uses autoregressive prediction.

Autoregressive models also provide exact log-likelihoods, which enables principled evaluation via perplexity, and the sequential generation process naturally handles variable-length outputs.
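Because the model assigns an exact probability to every token, perplexity falls out directly: it is the exponential of the average negative log-likelihood per token. A sketch with made-up per-token probabilities:

```python
import math

# Per-token probabilities a model might assign to a held-out sequence
# (illustrative values, not from a real model).
token_probs = [0.2, 0.05, 0.5, 0.1]
log_probs = [math.log(p) for p in token_probs]

avg_nll = -sum(log_probs) / len(log_probs)   # average negative log-likelihood
perplexity = math.exp(avg_nll)               # lower is better; 1.0 is a perfect model
```

A perplexity of $k$ means the model is, on average, as uncertain as a uniform choice over $k$ tokens. JEPA offers no analogous likelihood-based metric, which is why it is evaluated by probing and transfer instead.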

JEPA wins for learning representations without generation

JEPA avoids predicting every low-level detail of the target. In images or video, predicting exact pixel values forces the model to capture high-frequency noise and irrelevant texture. JEPA's representation-space prediction lets the model focus on semantic content.

I-JEPA (image JEPA) learns strong visual representations for classification without generating images. V-JEPA (video JEPA) learns temporal representations by predicting masked spatiotemporal regions in latent space. These representations transfer well to downstream tasks without fine-tuning on pixel-level generation.

Where Each Fails

Autoregressive fails on continuous, high-dimensional outputs

Predicting the exact next frame of a video pixel-by-pixel is intractable. The output space is too large and the mapping from context to future pixels is many-to-many (many plausible futures exist). This is why diffusion models rather than pure autoregressive models dominate image and video generation. Autoregressive models can work on quantized visual tokens, but the quantization introduces lossy compression.

JEPA fails at generation

JEPA does not model a distribution over observable outputs. It predicts abstract representations, and there is no standard way to decode these representations back to pixels or tokens. JEPA is a representation learning method, not a generative model. If your task requires generating text, images, or video, JEPA alone is insufficient.

JEPA requires careful collapse prevention

Without generating explicit outputs, JEPA risks representation collapse: both encoders can learn to map everything to a constant, achieving zero prediction loss trivially. Preventing collapse requires techniques like the exponential moving average (EMA) target encoder, variance and covariance regularization (VICReg), or asymmetric architectures. This engineering is nontrivial and failure modes are subtle.
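The EMA target encoder mentioned above can be sketched in a few lines. The decay `tau = 0.996` is a commonly used value; the parameter vectors are toy stand-ins for full network weights.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """Exponential-moving-average update of the target encoder. The target
    lags the online encoder, so the pair cannot instantly co-adapt to the
    trivial constant solution."""
    return tau * target_params + (1 - tau) * online_params

online = np.array([1.0, 2.0, 3.0])   # current online-encoder parameters (toy)
target = np.zeros(3)                 # target-encoder parameters
for _ in range(5):
    target = ema_update(target, online)
```

After $n$ updates against fixed online weights, the target equals `online * (1 - tau**n)`: it converges toward the online encoder, but slowly, which is exactly what stabilizes training.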

Key Assumptions That Differ

| | Autoregressive | JEPA |
|---|---|---|
| Prediction space | Observable (tokens, pixels) | Learned representations |
| Output | Generative (samples from $p(x_t \mid x_{<t})$) | Representations only |
| Loss function | Cross-entropy or reconstruction | MSE in representation space |
| Collapse risk | None (discrete targets) | High (requires EMA, regularization) |
| Dominates | Text (LLMs) | Self-supervised vision, video understanding |
| Evaluation | Perplexity, generation quality | Probing accuracy, transfer learning |

The Conceptual Divide: Generative vs. Discriminative World Models

Theorem

Complementary Inductive Biases

Statement

Autoregressive models learn a complete generative model $p(x_1, \ldots, x_T)$ that can be sampled from. This requires modeling all dependencies, including low-level details.

JEPA learns a predictive model in representation space that can capture high-level structure without modeling low-level details. The representation space is learned jointly with the prediction task, discarding information that is not useful for prediction.

These are complementary: AR models capture everything (including noise), while JEPA captures only what is predictable at the right level of abstraction.

Intuition

Consider predicting what happens next in a video of a ball being thrown. An autoregressive model must predict every pixel: the ball's position, the background texture, lighting variations, compression artifacts. JEPA only needs to predict that the ball moves along a parabolic trajectory in some abstract state space. The JEPA representation can discard irrelevant visual details and focus on the physics.

What to Memorize

  1. Autoregressive: Predict next token/pixel in observation space. Generative. Dominates text.

  2. JEPA: Predict representation in latent space. Not generative. Strong for vision understanding.

  3. Why JEPA exists: Predicting raw pixels wastes capacity on irrelevant detail. Abstract prediction focuses on semantics.

  4. Key tradeoff: AR models can generate; JEPA models learn better representations. You cannot easily have both.

  5. Collapse: JEPA must actively prevent trivial solutions. AR models with discrete targets do not have this problem.

When a Researcher Would Use Each

Example

Building a chatbot or text generator

Use autoregressive models. Text is discrete, sequential, and variable-length. Next-token prediction with cross-entropy loss is the natural and dominant approach. JEPA has not shown competitive results for text.

Example

Learning visual representations for robotics

Use JEPA (or a JEPA variant). The robot needs to understand scenes and predict consequences of actions at an abstract level, not generate pixel-perfect images. V-JEPA representations capture temporal structure that transfers to control tasks.

Example

Building a world model for planning

This is the frontier. Generative video models (Genie, Sora) predict future frames, in Genie's case conditioned on actions. JEPA-based world models predict abstract states conditioned on actions. The debate is unresolved: generative world models are more interpretable (you can visualize their predictions), but JEPA world models may be more computationally efficient and robust.

Common Confusions

Watch Out

JEPA is not a masked autoencoder

Masked autoencoders (MAE) predict missing pixels. JEPA predicts missing representations. This distinction matters: MAE forces the model to reconstruct low-level details, while JEPA allows the model to discard information that is irrelevant for the prediction task. Empirically, JEPA produces representations with better downstream transfer than MAE.
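The distinction is easiest to see in where each loss lives. A toy contrast with random linear maps (`Wd` and `We` are illustrative stand-ins for a trained decoder and encoder, not real MAE or JEPA components):

```python
import numpy as np

rng = np.random.default_rng(0)
d_pix, d_rep = 64, 8
x_visible = rng.normal(size=d_pix)   # visible patch (flattened pixels, toy)
x_masked = rng.normal(size=d_pix)    # masked patch the model must account for

Wd = rng.normal(size=(d_pix, d_pix)) * 0.1   # toy MAE decoder back to pixels
We = rng.normal(size=(d_rep, d_pix)) * 0.1   # toy shared encoder / predictor

# MAE: loss lives in pixel space — reconstruct the masked pixels themselves.
mae_loss = np.mean((Wd @ x_visible - x_masked) ** 2)

# JEPA: loss lives in representation space — predict the encoded target.
jepa_loss = np.mean((We @ x_visible - We @ x_masked) ** 2)
```

The MAE loss penalizes every pixel-level mismatch; the JEPA loss only penalizes mismatch in the 8-dimensional encoding, so any detail the encoder discards never enters the objective.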

Watch Out

Autoregressive does not mean transformer

Autoregressive is a factorization of the joint distribution. Transformers are an architecture. You can build autoregressive models with RNNs, SSMs, or other architectures. You can also use transformers for non-autoregressive tasks. The two concepts are orthogonal, even though they frequently co-occur in LLMs.

Watch Out

JEPA and contrastive learning are not the same

Contrastive learning (SimCLR, CLIP) pulls together representations of related inputs and pushes apart unrelated ones. JEPA predicts the representation of one view from another without explicit negative pairs. The VICReg regularization in JEPA prevents collapse through variance and covariance constraints rather than contrastive negatives.
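A sketch of VICReg's variance and covariance terms on a batch of embeddings. The hinge threshold `gamma` and the `eps` stabilizer follow the common formulation, but this is an illustration under those assumptions, not the reference implementation.

```python
import numpy as np

def vicreg_var_cov(embeddings, gamma=1.0, eps=1e-4):
    """Variance and covariance regularizers on a (batch, dim) array.
    The variance term pushes each dimension's std up toward gamma, so
    embeddings cannot collapse to a constant; the covariance term
    decorrelates dimensions. No negative pairs are needed."""
    z = embeddings - embeddings.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))   # hinge on per-dim std
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d               # penalize off-diagonal cov
    return var_loss, cov_loss

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 8))
v, c = vicreg_var_cov(batch)
```

A collapsed batch (every embedding identical) maximizes the variance penalty while a well-spread batch drives it toward zero, which is how the regularizer rules out the trivial constant solution without any contrastive negatives.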