
Beyond LLMs

JEPA and Joint Embedding

LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA and V-JEPA implementations, and the connection to contrastive learning and world models.

Advanced · Tier 2 · Frontier · ~55 min

Why This Matters

Large language models learn by predicting the next token: every token, every time. This works for discrete text but is deeply wasteful for continuous data like images and video. Predicting every pixel of the next video frame forces the model to account for irrelevant details (exact lighting, texture noise) rather than high-level structure (an object is moving left).

JEPA (Joint Embedding Predictive Architecture) is Yann LeCun's proposal for how AI systems should learn from the physical world: predict in the space of abstract representations, not raw sensory data. This avoids the generation bottleneck and focuses the model on learning the structure that matters for downstream reasoning and planning.

Mental Model

Imagine two views of the world: a context view and a target view. Instead of reconstructing the target view pixel by pixel from the context, JEPA encodes both views into abstract representations and predicts the target representation from the context representation. The model never needs to generate pixels; it only needs to get the high-level content right.

This is like summarizing a scene rather than painting it. You predict what will happen (the ball will fall) without predicting how it looks at every pixel (the exact shadow under the ball).

Formal Setup

Definition

Joint Embedding Architecture

A joint embedding architecture maps two inputs $x$ and $y$ to representations $s_x = f_\theta(x)$ and $s_y = f_\theta(y)$ (or, with different encoders, $f_\theta$ and $g_\phi$). An energy function $E(s_x, s_y)$ measures compatibility: low energy means $x$ and $y$ are related, high energy means they are not.

The key distinction from generative models: there is no decoder that reconstructs $y$ from $s_x$. The loss operates entirely in representation space.

Definition

Joint Embedding Predictive Architecture (JEPA)

JEPA adds a predictor $p_\psi$ between the context encoder and the target encoder. Given context input $x$ and target input $y$:

  1. Encode context: $s_x = f_\theta(x)$
  2. Encode target: $s_y = g_\phi(y)$ (target encoder, often an EMA of $f_\theta$)
  3. Predict target representation: $\hat{s}_y = p_\psi(s_x, z)$, where $z$ specifies what to predict (e.g., which spatial region)
  4. Minimize prediction error in representation space: $\mathcal{L} = D(\hat{s}_y, s_y)$

The predictor $p_\psi$ operates in abstract space. It never sees or generates pixels.
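The four steps above can be sketched end to end. This is a toy numpy illustration with made-up dimensions, linear encoders, and an identity predictor, not a faithful JEPA implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): 16-dim inputs, 4-dim representations.
d_in, d_rep = 16, 4

# Context encoder f_theta, target encoder g_phi, predictor p_psi,
# all linear here purely for illustration.
W_ctx = rng.normal(size=(d_rep, d_in)) * 0.1
W_tgt = W_ctx.copy()      # g_phi initialized as a copy of f_theta
W_pred = np.eye(d_rep)    # predictor p_psi (identity for the sketch)

x = rng.normal(size=d_in)                 # context view
y = x + 0.05 * rng.normal(size=d_in)      # related target view

s_x = W_ctx @ x                           # 1. encode context
s_y = W_tgt @ y                           # 2. encode target (no gradients flow here)
s_y_hat = W_pred @ s_x                    # 3. predict target representation
loss = np.mean((s_y_hat - s_y) ** 2)      # 4. distance in representation space

print(f"prediction loss: {loss:.6f}")     # small, since the views are related
```

Note that no decoder appears anywhere: the loss compares two 4-dimensional representation vectors, never pixels.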

Definition

Collapse

Representation collapse occurs when the encoders map all inputs to the same representation, trivially achieving zero prediction error. If $f_\theta(x) = c$ for all $x$, then $\hat{s}_y = c = s_y$ regardless of input. Preventing collapse is the central technical challenge of joint embedding methods.
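A tiny numpy sketch makes the failure mode concrete: a constant encoder achieves zero prediction error while carrying zero information. The encoder and dimensions here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
inputs = rng.normal(size=(100, 8))   # 100 toy inputs, 8 dims each

def collapsed_encoder(x):
    # Maps every input to the same constant vector c.
    return np.ones(4)

reps = np.stack([collapsed_encoder(x) for x in inputs])

# Predicting any representation from any other is now trivially perfect...
pred_error = np.mean((reps - reps.mean(axis=0)) ** 2)
# ...but the representations vary not at all, so they encode nothing.
rep_variance = reps.var(axis=0).sum()

print(pred_error == 0.0, rep_variance == 0.0)   # → True True
```

Zero loss with zero variance is exactly the degenerate solution the EMA and predictor asymmetry are designed to avoid.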

Main Theorems

Proposition

JEPA as Energy-Based Learning

Statement

The JEPA training objective can be written as an energy-based model:

$$\mathcal{L}(\theta, \phi, \psi) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ D\big(p_\psi(f_\theta(x), z), \; \text{sg}[g_\phi(y)]\big) \right]$$

where $D$ is a distance metric (e.g., $\ell_2$ or cosine distance), $\text{sg}[\cdot]$ denotes stop-gradient (no gradients flow through the target encoder), and $g_\phi$ is updated via exponential moving average: $\phi \leftarrow \tau \phi + (1-\tau) \theta$.
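The EMA update is simple to state in code. A minimal numpy sketch with toy parameter vectors and an assumed momentum of 0.996 (stop-gradient is implicit here: the target parameters are only ever changed by this rule, never by backpropagation):

```python
import numpy as np

def ema_update(phi, theta, tau=0.996):
    """phi <- tau * phi + (1 - tau) * theta, applied element-wise."""
    return tau * phi + (1.0 - tau) * theta

theta = np.ones(3)    # context-encoder parameters (toy)
phi = np.zeros(3)     # target-encoder parameters (toy)

# The target encoder drifts slowly toward the context encoder.
for _ in range(1000):
    phi = ema_update(phi, theta)

print(np.allclose(phi, theta, atol=0.05))   # → True
```

With $\tau$ close to 1, each step moves $\phi$ only slightly, so the target representations change slowly even while $\theta$ is being trained.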

The energy landscape has two failure modes:

  1. Collapse: energy is uniformly low (all representations identical)
  2. Fragmentation: energy is uniformly high (representations carry no information)

A well-trained JEPA has low energy for compatible $(x, y)$ pairs and high energy for incompatible pairs, with the energy surface shaped by the architecture and training dynamics rather than by an explicit contrastive loss.

Intuition

The JEPA objective pulls the predicted representation $\hat{s}_y$ toward the actual target representation $s_y$, but only through the predictor and context encoder. The target encoder is a slowly moving anchor. This asymmetry prevents the trivial solution where both encoders collapse to a constant. The EMA update ensures the target representations change slowly, giving the predictor a stable learning signal.

Why It Matters

Unlike contrastive learning (which requires explicit negative pairs), JEPA uses architectural asymmetry and EMA to prevent collapse. This avoids the need for large batch sizes, hard negative mining, or memory banks. The energy-based framing also connects JEPA to a broader theory of self-supervised learning that LeCun argues is necessary for building world models.

Proposition

Collapse Prevention via Asymmetry

Statement

Consider the loss $\mathcal{L} = \|p_\psi(f_\theta(x)) - \text{sg}[g_\phi(y)]\|^2$ with $\phi \leftarrow \tau\phi + (1-\tau)\theta$. The collapsed solution $f_\theta(x) = c$ for all $x$ is a fixed point of the gradient dynamics, but empirically it is unstable when:

  1. The predictor $p_\psi$ is sufficiently expressive but not trivially so
  2. The EMA momentum $\tau$ is close to 1 (slow target updates)
  3. The data distribution has sufficient diversity

Formal analysis of this non-collapse phenomenon in the linear case shows that the encoder learns the top principal components of the data covariance, and collapse corresponds to an unstable equilibrium.

Intuition

If the target encoder moves slowly (high $\tau$), it provides a stable target that changes only gradually. The predictor and context encoder must do real work to match these targets. If the context encoder collapses, the predictor cannot distinguish between different targets, so the loss remains high. The EMA creates an implicit information-theoretic pressure: the encoder must retain enough information about the input to predict the (diverse) target representations.

Why It Matters

Understanding collapse prevention is essential for practical JEPA training. BYOL and DINO demonstrated that collapse can be avoided without negative pairs. JEPA inherits these techniques and adds structured prediction in representation space as a further source of non-trivial learning signal.

I-JEPA: Images

I-JEPA (Image JEPA) applies the JEPA framework to images using Vision Transformers. The key design choices:

  1. Context: a set of visible patches from an image
  2. Target: representations of masked patches (the patches the model must predict)
  3. Masking strategy: large, semantically meaningful blocks are masked (not random individual patches)
  4. Prediction: the predictor takes context patch representations plus positional information about the target patches, and outputs predictions in representation space

Critical difference from Masked Autoencoders (MAE): MAE reconstructs pixels of masked patches. I-JEPA predicts abstract representations of masked patches. This means I-JEPA does not need a pixel decoder and focuses on high-level structure rather than texture details.

I-JEPA produces representations that transfer well to downstream tasks (classification, detection) without the need for data augmentation, which is a significant departure from contrastive methods like SimCLR and DINO that rely heavily on augmentation-invariance.
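The block-masking idea can be sketched as follows. This is a rough single-block illustration with assumed hyperparameters (a 14×14 patch grid, a block covering 15-20% of the patches, aspect ratio in [0.75, 1.5]); the actual I-JEPA recipe samples multiple target blocks per image:

```python
import numpy as np

def sample_block_mask(grid=14, scale=(0.15, 0.2), rng=None):
    """Sample one rectangular target block on a grid x grid patch grid.
    Returns a boolean mask: True = masked patch whose representation
    the predictor must produce. A sketch of I-JEPA-style block masking."""
    rng = rng if rng is not None else np.random.default_rng()
    area = rng.uniform(*scale) * grid * grid       # target block area in patches
    aspect = rng.uniform(0.75, 1.5)                # block aspect ratio
    h = min(grid, max(1, int(round(np.sqrt(area / aspect)))))
    w = min(grid, max(1, int(round(np.sqrt(area * aspect)))))
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + h, left:left + w] = True        # one contiguous block
    return mask

mask = sample_block_mask(rng=np.random.default_rng(0))
print(mask.shape, int(mask.sum()))   # a contiguous block of masked patches
```

Masking a large contiguous block, rather than scattered individual patches, is what forces the model to reason about semantic content: neighboring visible pixels cannot simply be interpolated.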

V-JEPA: Video

V-JEPA extends the framework to video by masking spatiotemporal blocks. Given a video clip, a large fraction of space-time patches are masked, and the model predicts their representations from the visible patches.

The temporal dimension is what makes V-JEPA promising: predicting what will happen next (in representation space) is a form of learning physics and dynamics. The model must understand that objects persist, fall under gravity, and interact, all without predicting specific pixel values.

JEPA vs. Other Self-Supervised Methods

| Method | Prediction space | Collapse prevention | Requires augmentations | Requires negatives |
| --- | --- | --- | --- | --- |
| MAE | Pixel space | Not applicable (reconstruction) | No | No |
| SimCLR | Representation space | Contrastive loss (negatives) | Yes (heavy) | Yes |
| BYOL | Representation space | EMA + predictor | Yes (heavy) | No |
| DINO | Representation space | EMA + centering | Yes (heavy) | No |
| I-JEPA | Representation space | EMA + predictor + masking | No | No |

The JEPA design removes both augmentation dependence and the need for negative pairs. This is significant because hand-crafted augmentations (random crops, color jitter) inject domain-specific inductive biases that may not transfer across modalities.

Connection to Contrastive Learning

Contrastive learning (SimCLR, MoCo) can be seen as a special case of joint embedding where collapse is prevented by explicitly pushing apart negative pairs. The InfoNCE loss:

$$\mathcal{L}_{\text{NCE}} = -\log \frac{\exp(\text{sim}(s_x, s_y)/\tau)}{\sum_{k} \exp(\text{sim}(s_x, s_k)/\tau)}$$

requires a set of negative examples $\{s_k\}$. JEPA achieves the same goal (informative representations that do not collapse) without negatives, through the EMA and predictor mechanism. In this sense, JEPA subsumes contrastive learning as a special case of energy-based learning with a particular choice of energy shaping.
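For comparison, here is a minimal numpy sketch of the InfoNCE computation. The helper name `info_nce`, the temperature value, and the toy vectors are all illustrative:

```python
import numpy as np

def info_nce(s_x, candidates, pos_idx=0, temp=0.1):
    """InfoNCE: cosine similarity of s_x to each candidate, softmax over
    candidates, negative log-probability of the positive at pos_idx."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(s_x, c) for c in candidates]) / temp
    logits = sims - sims.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_idx])

rng = np.random.default_rng(0)
s_x = rng.normal(size=8)
positive = s_x + 0.01 * rng.normal(size=8)        # nearly aligned pair
negatives = [rng.normal(size=8) for _ in range(7)]

loss = info_nce(s_x, [positive] + negatives)
print(f"InfoNCE loss: {loss:.4f}")   # small: the positive dominates the softmax
```

The key structural difference from JEPA is visible in the signature: InfoNCE needs the list of negatives in every loss evaluation, while the JEPA objective touches only one (context, target) pair.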

Common Confusions

Watch Out

JEPA is not a replacement for LLMs

JEPA is a self-supervised learning framework for continuous data (images, video, audio). It does not generate text, answer questions, or reason verbally. It is a research direction toward building world models: systems that understand physical dynamics, which could complement language models, not replace them. The hype around JEPA sometimes obscures this distinction.

Watch Out

Predicting representations is not easier than predicting pixels

Predicting in representation space avoids the generation bottleneck (no need to model irrelevant details) but introduces a new challenge: the representation space is learned simultaneously with the predictor. If the encoder learns trivial representations, prediction is easy but useless. The difficulty shifts from modeling pixels to preventing collapse and ensuring the representations capture meaningful structure.

Watch Out

JEPA is not just BYOL with masking

BYOL predicts the representation of the same input under a different augmentation. I-JEPA predicts the representation of a different part of the input (masked patches). This is a structural prediction task (what is in the missing region?) rather than an invariance task (the representation should not change under crops and color jitter). The masking formulation generalizes naturally to video and spatiotemporal prediction.

Summary

  • JEPA predicts in representation space, not pixel space, avoiding the generation bottleneck
  • The architecture: context encoder, target encoder (EMA), predictor
  • Collapse prevention via EMA, stop-gradient, and the predictor bottleneck
  • I-JEPA: masking image patches, predicting their representations
  • V-JEPA: masking spatiotemporal video patches
  • No augmentations and no negative pairs required, a departure from contrastive methods
  • JEPA is a framework for building world models, not a replacement for LLMs

Exercises

ExerciseCore

Problem

Explain why Masked Autoencoders (MAE) must model low-level details like texture while I-JEPA does not. What architectural difference causes this?

ExerciseAdvanced

Problem

Why is the EMA momentum parameter $\tau$ critical for preventing collapse in JEPA? What happens if $\tau = 0$ (no momentum, target encoder equals context encoder), and what happens if $\tau = 1$ (target encoder never updates)?

ExerciseResearch

Problem

JEPA predicts representations of masked patches using positional information about which patches to predict. Could you extend this to predict representations of future video frames given past frames and a proposed action? Describe the architecture and explain how this relates to world models for planning.


References

Canonical:

  • LeCun, "A Path Towards Autonomous Machine Intelligence" (2022). The JEPA proposal
  • Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (CVPR 2023). I-JEPA

Current:

  • Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video" (2024). V-JEPA
  • Grill et al., "Bootstrap Your Own Latent" (NeurIPS 2020). BYOL, precursor to JEPA-style methods
  • Chen et al., "Exploring Simple Siamese Representation Learning" (CVPR 2021). Analysis of collapse prevention

Next Topics

The natural next steps from JEPA:

Last reviewed: April 2026
