Beyond LLMs
JEPA and Joint Embedding
LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA and V-JEPA implementations, and the connection to contrastive learning and world models.
Why This Matters
Large language models learn by predicting the next token: every token, every time. This works for discrete text but is deeply wasteful for continuous data like images and video. Predicting every pixel of the next video frame forces the model to account for irrelevant details (exact lighting, texture noise) rather than high-level structure (an object is moving left).
JEPA (Joint Embedding Predictive Architecture) is Yann LeCun's proposal for how AI systems should learn from the physical world: predict in the space of abstract representations, not raw sensory data. This avoids the generation bottleneck and focuses the model on learning the structure that matters for downstream reasoning and planning.
Mental Model
Imagine two views of the world: a context view and a target view. Instead of reconstructing the target view pixel by pixel from the context, JEPA encodes both views into abstract representations and predicts the target representation from the context representation. The model never needs to generate pixels; it only needs to get the high-level content right.
This is like summarizing a scene rather than painting it. You predict what will happen (the ball will fall) without predicting how it looks at every pixel (the exact shadow under the ball).
Formal Setup
Joint Embedding Architecture
A joint embedding architecture maps two inputs $x$ and $y$ to representations $s_x = f_\theta(x)$ and $s_y = f_\theta(y)$ (or different encoders $f_\theta$, $f_\phi$). An energy function $E(x, y)$ measures compatibility: low energy means $x$ and $y$ are related, high energy means they are not.
The key distinction from generative models: there is no decoder that reconstructs $y$ from $s_y$. The loss operates entirely in representation space.
Joint Embedding Predictive Architecture (JEPA)
JEPA adds a predictor between the context encoder and the target encoder. Given context input $x$ and target input $y$:
- Encode context: $s_x = f_\theta(x)$
- Encode target: $s_y = \bar{f}_{\bar\theta}(y)$ (target encoder, often an EMA of $f_\theta$)
- Predict target representation: $\hat{s}_y = g_\phi(s_x, z)$, where $z$ specifies what to predict (e.g., which spatial region)
- Minimize prediction error in representation space: $\mathcal{L} = D(\hat{s}_y, s_y)$, e.g., $\|\hat{s}_y - s_y\|_2^2$
The predictor operates in abstract space. It never sees or generates pixels.
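The steps above can be sketched with toy linear maps in place of the real networks (an illustrative simplification; actual JEPAs use Vision Transformer encoders and a transformer predictor):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_rep = 16, 8

W_ctx = rng.normal(size=(d_rep, d_in)) * 0.1    # context encoder f_theta
W_tgt = W_ctx.copy()                            # target encoder (EMA copy of f_theta)
W_pred = rng.normal(size=(d_rep, d_rep)) * 0.1  # predictor g_phi

x = rng.normal(size=d_in)  # context view (e.g. visible patches)
y = rng.normal(size=d_in)  # target view (e.g. masked patches)

s_x = W_ctx @ x            # context representation
s_y = W_tgt @ y            # target representation (stop-gradient in training)
s_y_hat = W_pred @ s_x     # predicted target representation

# The loss lives entirely in representation space: no pixels are generated.
loss = float(np.mean((s_y_hat - s_y) ** 2))
```

Note that nothing in this pipeline maps a representation back to input space, which is exactly the decoder-free property described above.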
Collapse
Representation collapse occurs when the encoders map all inputs to the same representation, trivially achieving zero prediction error. If $f_\theta(x) = c$ for all $x$, then $\mathcal{L} = 0$ regardless of input. Preventing collapse is the central technical challenge of joint embedding methods.
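The trivial solution is easy to demonstrate: a degenerate constant encoder achieves zero loss on any pair of inputs (a toy illustration, not a training scenario):

```python
import numpy as np

rng = np.random.default_rng(1)

def collapsed_encoder(x):
    # Degenerate encoder: every input maps to the same constant vector.
    return np.ones(4)

# Under collapse, prediction error is zero for ANY pair of inputs,
# so the objective is minimized while the representation is useless.
x, y = rng.normal(size=16), rng.normal(size=16)
loss = float(np.mean((collapsed_encoder(x) - collapsed_encoder(y)) ** 2))
```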
Main Theorems
JEPA as Energy-Based Learning
Statement
The JEPA training objective can be written as an energy-based model:

$$E(x, y) = D\big(g_\phi(f_\theta(x), z),\ \mathrm{sg}[\bar{f}_{\bar\theta}(y)]\big)$$

where $D$ is a distance metric (e.g., $\ell_2$ or cosine distance), $\mathrm{sg}[\cdot]$ denotes stop-gradient (no gradients flow through the target encoder), and $\bar\theta$ is updated via exponential moving average: $\bar\theta \leftarrow \tau\bar\theta + (1 - \tau)\theta$.
The energy landscape has two failure modes:
- Collapse: energy is uniformly low (all representations identical)
- Fragmentation: energy is uniformly high (representations carry no information)
A well-trained JEPA has low energy for compatible pairs and high energy for incompatible pairs, with the energy surface shaped by the architecture and training dynamics rather than by an explicit contrastive loss.
Intuition
The JEPA objective pulls the predicted representation toward the actual target representation , but only through the predictor and context encoder. The target encoder is a slowly moving anchor. This asymmetry prevents the trivial solution where both encoders collapse to a constant. The EMA update ensures the target representations change slowly, giving the predictor a stable learning signal.
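The slowly moving anchor can be made concrete with a numerical sketch of the EMA update (assuming the common form $\bar\theta \leftarrow \tau\bar\theta + (1-\tau)\theta$ with a typical momentum $\tau = 0.996$):

```python
import numpy as np

def ema_update(theta_bar, theta, tau=0.996):
    # One EMA step: theta_bar <- tau * theta_bar + (1 - tau) * theta
    return tau * theta_bar + (1.0 - tau) * theta

theta = np.array([1.0, 2.0])  # context-encoder parameters (updated by SGD)
theta_bar = np.zeros(2)       # target-encoder parameters (slow EMA copy)

for _ in range(100):          # 100 training steps with theta held fixed
    theta_bar = ema_update(theta_bar, theta)

# With tau = 0.996, the target closes only ~33% of the gap after 100 steps:
# theta_bar = theta * (1 - 0.996**100) ~= 0.33 * theta
```

Even after 100 updates the target encoder has barely moved, which is why it provides a stable prediction target.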
Why It Matters
Unlike contrastive learning (which requires explicit negative pairs), JEPA uses architectural asymmetry and EMA to prevent collapse. This avoids the need for large batch sizes, hard negative mining, or memory banks. The energy-based framing also connects JEPA to a broader theory of self-supervised learning that LeCun argues is necessary for building world models.
Collapse Prevention via Asymmetry
Statement
Consider the loss $\mathcal{L} = \|g_\phi(f_\theta(x), z) - \bar{f}_{\bar\theta}(y)\|_2^2$ with stop-gradient on the target. The collapsed solution $f_\theta(x) = c$ for all $x$ is a fixed point of the gradient dynamics, but empirically it is unstable when:
- The predictor is sufficiently expressive but not trivially so
- The EMA momentum $\tau$ is close to 1 (slow target updates)
- The data distribution has sufficient diversity
Formal analysis of this non-collapse phenomenon in the linear case shows that the encoder learns the top principal components of the data covariance, and collapse corresponds to an unstable equilibrium.
Intuition
If the target encoder moves slowly (high $\tau$), it provides a stable target that changes only gradually. The predictor and context encoder must do real work to match these targets. If the context encoder collapses, the predictor cannot distinguish between different targets, so the loss remains high. The EMA creates an implicit information-theoretic pressure: the encoder must retain enough information about the input to predict the (diverse) target representations.
Why It Matters
Understanding collapse prevention is essential for practical JEPA training. BYOL and DINO demonstrated that collapse can be avoided without negative pairs. JEPA inherits these techniques and adds structured prediction in representation space as a further source of non-trivial learning signal.
I-JEPA: Images
I-JEPA (Image JEPA) applies the JEPA framework to images using Vision Transformers. The key design choices:
- Context: a set of visible patches from an image
- Target: representations of masked patches (the patches the model must predict)
- Masking strategy: large, semantically meaningful blocks are masked (not random individual patches)
- Prediction: the predictor takes context patch representations plus positional information about the target patches, and outputs predictions in representation space
Critical difference from Masked Autoencoders (MAE): MAE reconstructs pixels of masked patches. I-JEPA predicts abstract representations of masked patches. This means I-JEPA does not need a pixel decoder and focuses on high-level structure rather than texture details.
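The block-masking idea can be sketched as follows (a simplified single-block version; the actual I-JEPA samples several target blocks with randomized scales and aspect ratios, so the numbers here are illustrative):

```python
import numpy as np

def sample_block_mask(grid=14, block=6, rng=None):
    # Mask one large contiguous block of patches on a grid x grid patch grid
    # (14 x 14 is the patch grid of a ViT on a 224x224 image with 16x16 patches).
    rng = rng or np.random.default_rng()
    top = int(rng.integers(0, grid - block + 1))
    left = int(rng.integers(0, grid - block + 1))
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + block, left:left + block] = True
    return mask

mask = sample_block_mask(rng=np.random.default_rng(0))
context_idx = np.flatnonzero(~mask.ravel())  # visible patches -> context encoder
target_idx = np.flatnonzero(mask.ravel())    # masked patches -> representations to predict
```

Masking a large contiguous block (rather than scattered patches) forces the predictor to infer semantic content instead of interpolating from immediate neighbors.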
I-JEPA produces representations that transfer well to downstream tasks (classification, detection) without the need for data augmentation, which is a significant departure from contrastive methods like SimCLR and DINO that rely heavily on augmentation-invariance.
V-JEPA: Video
V-JEPA extends the framework to video by masking spatiotemporal blocks. Given a video clip, a large fraction of space-time patches are masked, and the model predicts their representations from the visible patches.
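One simple spatiotemporal masking pattern masks the same spatial block across every frame, a space-time "tube" (a hypothetical simplification; V-JEPA's actual masking strategy differs in detail):

```python
import numpy as np

def sample_tube_mask(frames=8, grid=14, block=6, rng=None):
    # Mask the same spatial block in all frames: a space-time "tube".
    rng = rng or np.random.default_rng()
    top = int(rng.integers(0, grid - block + 1))
    left = int(rng.integers(0, grid - block + 1))
    mask = np.zeros((frames, grid, grid), dtype=bool)
    mask[:, top:top + block, left:left + block] = True
    return mask

mask = sample_tube_mask(rng=np.random.default_rng(0))
masked_fraction = float(mask.mean())  # fraction of space-time patches to predict
```

Tube-style masks prevent the model from trivially copying a masked region from an adjacent frame where it happens to be visible.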
The temporal dimension is what makes V-JEPA promising: predicting what will happen next (in representation space) is a form of learning physics and dynamics. The model must understand that objects persist, fall under gravity, and interact, all without predicting specific pixel values.
JEPA vs. Other Self-Supervised Methods
| Method | Prediction space | Collapse prevention | Requires augmentations | Requires negatives |
|---|---|---|---|---|
| MAE | Pixel space | Not applicable (reconstruction) | No | No |
| SimCLR | Representation space | Contrastive loss (negatives) | Yes (heavy) | Yes |
| BYOL | Representation space | EMA + predictor | Yes (heavy) | No |
| DINO | Representation space | EMA + centering | Yes (heavy) | No |
| I-JEPA | Representation space | EMA + predictor + masking | No | No |
The JEPA design removes both augmentation dependence and the need for negative pairs. This is significant because hand-crafted augmentations (random crops, color jitter) inject domain-specific inductive biases that may not transfer across modalities.
Connection to Contrastive Learning
Contrastive learning (SimCLR, MoCo) can be seen as a special case of joint embedding where collapse is prevented by explicitly pushing apart negative pairs. The InfoNCE loss:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(s_x, s_y)/\tau)}{\exp(\mathrm{sim}(s_x, s_y)/\tau) + \sum_{y^- \in \mathcal{N}} \exp(\mathrm{sim}(s_x, s_{y^-})/\tau)}$$

requires a set of negative examples $\mathcal{N}$. JEPA achieves the same goal (informative representations that do not collapse) without negatives, through the EMA and predictor mechanism. In this sense, JEPA subsumes contrastive learning as a special case of energy-based learning with a particular choice of energy shaping.
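A numerically stable InfoNCE computation can be sketched with hand-picked 3-d vectors and cosine similarity (a toy illustration, not a training loop):

```python
import numpy as np

def info_nce(s_x, s_y, negatives, temp=0.1):
    # InfoNCE: a cross-entropy that ranks the positive s_y above every negative.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(s_x, s_y)] + [cos(s_x, n) for n in negatives]) / temp
    logits -= logits.max()  # subtract max for numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

s_x = np.array([1.0, 0.0, 0.0])
negatives = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]

loss_good = info_nce(s_x, np.array([1.0, 0.0, 0.0]), negatives)  # aligned positive
loss_bad = info_nce(s_x, np.array([0.0, 1.0, 0.0]), negatives)   # orthogonal positive
# loss_good is near zero; loss_bad is log(3) because all three logits tie.
```

The loss only decreases when the positive pair is more similar than the negatives, which is exactly the explicit push-apart mechanism that JEPA replaces with EMA and a predictor.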
Common Confusions
JEPA is not a replacement for LLMs
JEPA is a self-supervised learning framework for continuous data (images, video, audio). It does not generate text, answer questions, or reason verbally. It is a research direction toward building world models: systems that understand physical dynamics, which could complement language models, not replace them. The hype around JEPA sometimes obscures this distinction.
Predicting representations is not easier than predicting pixels
Predicting in representation space avoids the generation bottleneck (no need to model irrelevant details) but introduces a new challenge: the representation space is learned simultaneously with the predictor. If the encoder learns trivial representations, prediction is easy but useless. The difficulty shifts from modeling pixels to preventing collapse and ensuring the representations capture meaningful structure.
JEPA is not just BYOL with masking
BYOL predicts the representation of the same input under a different augmentation. I-JEPA predicts the representation of a different part of the input (masked patches). This is a structural prediction task (what is in the missing region?) rather than an invariance task (the representation should not change under crops and color jitter). The masking formulation generalizes naturally to video and spatiotemporal prediction.
Summary
- JEPA predicts in representation space, not pixel space, avoiding the generation bottleneck
- The architecture: context encoder, target encoder (EMA), predictor
- Collapse prevention via EMA, stop-gradient, and the predictor bottleneck
- I-JEPA: masking image patches, predicting their representations
- V-JEPA: masking spatiotemporal video patches
- No augmentations and no negative pairs required, a departure from contrastive methods
- JEPA is a framework for building world models, not a replacement for LLMs
Exercises
Problem
Explain why Masked Autoencoders (MAE) must model low-level details like texture while I-JEPA does not. What architectural difference causes this?
Problem
Why is the EMA momentum parameter $\tau$ critical for preventing collapse in JEPA? What happens if $\tau = 0$ (no momentum, target encoder equals context encoder) and what happens if $\tau = 1$ (target encoder never updates)?
Problem
JEPA predicts representations of masked patches using positional information about which patches to predict. Could you extend this to predict representations of future video frames given past frames and a proposed action? Describe the architecture and explain how this relates to world models for planning.
References
Canonical:
- LeCun, "A Path Towards Autonomous Machine Intelligence" (2022). The JEPA proposal
- Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (CVPR 2023). I-JEPA
Current:
- Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video" (2024). V-JEPA
- Grill et al., "Bootstrap Your Own Latent" (NeurIPS 2020). BYOL, precursor to JEPA-style methods
- Chen et al., "Exploring Simple Siamese Representation Learning" (CVPR 2021). Analysis of collapse prevention
Next Topics
The natural next steps from JEPA:
- World models and planning: using learned representations for planning in imagination
- Vision transformer lineage: the ViT architectures that serve as encoders in I-JEPA and V-JEPA
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rⁿ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Variational Autoencoders (Layer 3)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)