Self-Supervised Vision
Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.
Why This Matters
Labeling images is expensive. A single ImageNet label costs a few cents, but labeling a medical image requires a specialist and can cost hundreds of dollars. Self-supervised vision methods learn strong visual representations from unlabeled images, then transfer these representations to downstream tasks with minimal labels.
The practical impact is enormous: DINOv2 features trained on curated unlabeled data match or exceed supervised ImageNet features on classification, segmentation, depth estimation, and retrieval, without any labels during pretraining. Self-supervised vision is how the field is moving beyond the labeled-data bottleneck.
Mental Model
Supervised learning says: here is an image, here is its label, learn features that predict the label. Self-supervised learning says: here is an image, learn features that capture its structure, without any label.
Three families of approaches define the field:
- Contrastive: pull together different views of the same image, push apart views of different images. The model learns what makes two crops "the same image" versus "different images."
- Self-distillation: a student network learns to match a slowly-updated teacher network. No negative pairs needed. The asymmetry between student and teacher prevents collapse.
- Masked modeling: hide parts of the image, predict the hidden parts. The model learns to understand visual structure by reconstruction.
Contrastive Methods
InfoNCE and Mutual Information
Statement
The InfoNCE contrastive loss for a positive pair $(z_i, z_j)$ with $N-1$ negatives $\{z_k\}$ is:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \, \lVert v \rVert)$ is cosine similarity and $\tau$ is a temperature parameter.

InfoNCE is a lower bound on the mutual information between the two views: $I(v_1; v_2) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$. Minimizing the loss maximizes a lower bound on mutual information.
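As a concrete reference, here is a minimal PyTorch sketch of this loss for a batch of paired embeddings. The function name and the simplification that negatives come only from the opposite view's batch are ours, not from any particular paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce(z_i, z_j, tau=0.1):
    """InfoNCE over a batch: row k of z_i and row k of z_j are two
    views of the same image; all other rows act as negatives."""
    z_i = F.normalize(z_i, dim=1)  # unit norm, so dot product = cosine sim
    z_j = F.normalize(z_j, dim=1)
    logits = z_i @ z_j.T / tau     # (N, N) similarity matrix
    labels = torch.arange(z_i.size(0), device=z_i.device)
    # Softmax classification: identify the positive among N candidates.
    return F.cross_entropy(logits, labels)
```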
Intuition
InfoNCE is a softmax classifier that tries to identify the positive pair among $N$ candidates. If the model produces embeddings where the positive pair is more similar than all negative pairs, the loss is low. The temperature $\tau$ controls the sharpness: small $\tau$ makes the model focus on hard negatives (pairs that are similar but should be pushed apart); large $\tau$ treats all negatives more equally.
The mutual information interpretation means the model learns representations that preserve information shared between the two views; since the views differ only by augmentation, this shared information is the semantic content of the image.
Why It Matters
InfoNCE is the foundational objective for contrastive self-supervised learning. SimCLR, MoCo, and CLIP all use variants of this loss. Understanding it reveals why contrastive methods need: (1) strong augmentations to create informative positive pairs, (2) many negatives for a tight MI bound, and (3) careful temperature tuning.
Failure Mode
The MI bound is loose when $N$ is small (few negatives). With $N$ candidates, the bound is at most $\log_2 N$ bits regardless of the true MI. This is why contrastive methods benefit from large batch sizes (SimCLR) or memory banks (MoCo). Also, the method can learn shortcuts: if the augmentations are too weak, the model can match views based on low-level statistics (color histograms) rather than semantic content.
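To make the looseness concrete, here is a quick check of the $\log_2 N$ ceiling for typical negative-set sizes (small batch, SimCLR-scale batch, MoCo-style queue):

```python
import math

for n in (256, 4096, 65536):  # small batch, SimCLR batch, MoCo queue
    print(f"N = {n:5d}: MI bound <= {math.log2(n):.0f} bits")
# N =   256: MI bound <= 8 bits
# N =  4096: MI bound <= 12 bits
# N = 65536: MI bound <= 16 bits
```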
SimCLR
SimCLR (Simple Framework for Contrastive Learning of Visual Representations, Chen et al. 2020) implements contrastive learning with four components:
- Augmentation: given an image $x$, generate two augmented views $\tilde{x}_i$ and $\tilde{x}_j$ using random crops, color jitter, Gaussian blur, and horizontal flips
- Encoder: a ResNet or ViT backbone $f$ maps each view to a representation $h = f(\tilde{x})$
- Projection head: a small MLP $g$ maps the representation to the embedding $z = g(h)$, the space where the contrastive loss is computed
- Contrastive loss: InfoNCE over the batch, treating the other view of the same image as positive and all other images as negatives
Key finding: SimCLR requires large batch sizes (4096 or more) to provide enough negatives. Performance degrades significantly with small batches.
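The augmentation pipeline can be sketched with torchvision; the parameter values below are illustrative, loosely following the recipe reported in the SimCLR paper:

```python
import torchvision.transforms as T

# Two independent samples from this pipeline yield the two views of an image.
simclr_augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(23, (0.1, 2.0))], p=0.5),  # kernel ~10% of image
    T.ToTensor(),
])

def two_views(img):
    return simclr_augment(img), simclr_augment(img)
```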
MoCo
MoCo (Momentum Contrast, He et al. 2020) decouples the need for large batches by maintaining a queue of negative embeddings from previous batches:
- A query encoder processes the current view
- A momentum encoder (updated via EMA: $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$) processes the other view
- The queue stores embeddings from the momentum encoder across recent batches
- Contrastive loss uses the queue as the negative set
MoCo achieves strong results with standard batch sizes (256) because the queue provides thousands of negatives without requiring them in a single batch.
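A minimal sketch of the two MoCo-specific pieces, the EMA update and the FIFO queue. The helper names are ours, and we assume the queue length $K$ is a multiple of the batch size:

```python
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def enqueue(queue, ptr, keys):
    """FIFO update: overwrite the oldest keys with this batch's keys.
    queue: (K, d) buffer of negatives, keys: (B, d), ptr: write position."""
    B, K = keys.size(0), queue.size(0)
    queue[ptr:ptr + B] = keys  # assumes K % B == 0
    return (ptr + B) % K
```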
Self-Distillation: DINO and DINOv2
DINO Collapse Prevention via Centering and Sharpening
Statement
DINO trains a student to match a teacher via cross-entropy on softmax outputs. Without safeguards, this objective has trivial solutions (collapse): the teacher outputs a uniform distribution or a constant.
DINO prevents collapse through two mechanisms:
- Centering: subtract the running mean of teacher outputs before the softmax: $g_t(x) \leftarrow g_t(x) - c$, where $c$ is an exponential moving average of teacher outputs. This prevents collapse to a single dominant dimension.
- Sharpening: use a low teacher temperature $\tau_t$ (lower than the student temperature $\tau_s$), making the teacher distribution peaked. This prevents collapse to the uniform distribution.
The teacher is updated via EMA: $\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$ with $\lambda$ close to 1 (e.g., 0.996-0.9995).
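Both mechanisms fit in a few lines. The sketch below follows the training pseudocode published with the DINO paper, with `C` the running center:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, C, tau_s=0.1, tau_t=0.04):
    """Cross-entropy from a centered, sharpened teacher to the student.
    student_out, teacher_out: (B, D) pre-softmax outputs; C: (D,) center."""
    t = F.softmax((teacher_out - C) / tau_t, dim=1).detach()  # center + sharpen
    log_s = F.log_softmax(student_out / tau_s, dim=1)
    return -(t * log_s).sum(dim=1).mean()

@torch.no_grad()
def update_center(C, teacher_out, m=0.9):
    # EMA of teacher outputs: C <- m * C + (1 - m) * batch mean
    return m * C + (1 - m) * teacher_out.mean(dim=0)
```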
Intuition
Without centering, the teacher could collapse to always outputting the same vector for every image. The student would trivially match this by also outputting a constant. Centering forces the mean output to be zero, so different images must produce different (centered) representations.
Without sharpening, the teacher could output a uniform distribution over all dimensions. The student would trivially match this without learning anything. Low temperature forces the teacher distribution to be peaked, requiring the student to identify which dimensions are most active for each image.
Together, centering and sharpening ensure the teacher provides non-trivial, image-specific targets.
Why It Matters
DINO showed that self-supervised ViTs learn remarkable emergent properties: the attention maps of the [CLS] token segment objects in images without any segmentation supervision. This suggests that object-level understanding emerges naturally when the model is trained to produce consistent representations across crops. DINOv2 scaled this approach with curated data and distillation, producing features that serve as universal vision backbones.
Failure Mode
The centering mechanism uses an exponential moving average of teacher outputs, which means it adapts slowly. If the distribution of images in a batch is systematically different from previous batches (e.g., due to a non-random data loader), centering can lag behind and fail to prevent collapse temporarily. The multi-crop strategy (student sees local crops, teacher sees global crops) is critical: removing it significantly degrades performance because the student no longer needs to infer global structure from local information.
DINOv2
DINOv2 (Oquab et al. 2023) extends DINO with three key improvements:
- Curated pretraining data: an automated pipeline retrieves and deduplicates a high-quality dataset of 142M images from a larger pool, without using any labels
- Combined objectives: DINOv2 uses both the DINO self-distillation loss and an iBOT masked image modeling loss, getting the benefits of both paradigms
- Distillation: a large ViT-g model is trained first, then smaller models (ViT-S, ViT-B, ViT-L) are distilled from it
DINOv2 features achieve strong performance on classification, segmentation, depth estimation, and retrieval without fine-tuning: just a linear probe or $k$-NN on frozen features. This makes DINOv2 a "foundation feature extractor" for vision.
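In practice, "frozen features plus a linear probe" looks like the sketch below. The torch.hub entry point is the one published by the DINOv2 repository; the dummy tensors stand in for a real labeled dataset:

```python
import torch
from sklearn.linear_model import LogisticRegression

# Frozen backbone: no gradients, no fine-tuning.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(images):
    # images: (B, 3, 224, 224), ImageNet-normalized -> (B, 384) for ViT-S/14
    return backbone(images)

# Dummy stand-ins for a small labeled dataset.
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 10, (32,))

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(images).numpy(), labels.numpy())
```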
Masked Image Modeling
Masked Autoencoder (MAE) Objective
Statement
MAE (Masked Autoencoder, He et al. 2022) masks a large fraction (75%) of image patches and trains the model to reconstruct them:
- Partition the image into $N$ patches $\{x_p\}_{p=1}^{N}$
- Randomly mask 75% of patches: visible set $V$, masked set $M$
- Encode only the visible patches with a ViT encoder: $h = f_\theta(\{x_p : p \in V\})$
- Decode all patches (visible + mask tokens) with a lightweight decoder
- Reconstruction loss is computed only on masked patches:

$$\mathcal{L} = \frac{1}{|M|} \sum_{p \in M} \lVert \hat{x}_p - x_p \rVert^2$$

where $\hat{x}_p$ is the reconstructed patch (in pixel space or normalized pixel space).
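A shape-level sketch of the masking and loss, assuming pre-patchified inputs and hypothetical `encoder`/`decoder` modules:

```python
import torch

def mae_step(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) flattened pixel patches; returns the MAE loss.
    encoder sees only visible patches; decoder predicts the masked ones."""
    B, N, D = patches.shape
    n_vis = int(N * (1 - mask_ratio))

    # Per-image random permutation; the first n_vis indices stay visible.
    perm = torch.rand(B, N).argsort(dim=1)
    vis_idx, mask_idx = perm[:, :n_vis], perm[:, n_vis:]

    gather = lambda idx: torch.gather(
        patches, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(gather(vis_idx), vis_idx)   # only 25% of patches
    pred = decoder(latent, vis_idx, mask_idx)    # (B, N - n_vis, D)

    # MSE on masked patches only.
    return ((pred - gather(mask_idx)) ** 2).mean()
```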
Intuition
MAE is the visual analogue of masked language modeling (BERT). By hiding 75% of the image, the model must understand spatial layout, object structure, and texture patterns to fill in the missing pieces. The high masking ratio is critical: if only 10% is masked, the model can "cheat" by interpolating from nearby visible patches without deep understanding. At 75%, the visible patches are sparse enough that reconstruction requires genuine visual reasoning.
The asymmetric design (heavy encoder on visible patches only, lightweight decoder on all patches) makes training efficient: the encoder processes only 25% of patches, giving a several-fold speedup over processing all patches.
Why It Matters
MAE demonstrated that masked modeling, hugely successful for language (BERT, GPT), also works for vision. It produces strong features with efficient training and shows that ViTs can learn useful representations from reconstruction alone. The 75% masking ratio was a surprising finding. It suggested that images have much higher redundancy than text, requiring more aggressive masking to create a challenging pretext task.
Failure Mode
MAE features are weaker than contrastive or self-distillation features for linear probing (using frozen features with a linear classifier). The reconstruction objective encourages the model to retain low-level details (textures, colors) that are useful for reconstruction but less useful for semantic classification. Fine-tuning MAE features on downstream tasks works well, but the frozen features underperform DINO/DINOv2. This suggests that reconstruction and discrimination learn different aspects of visual representation.
Why Self-Supervised Vision Matters
- Labels are expensive: medical imaging, satellite imagery, industrial inspection, and scientific domains have abundant unlabeled data but very few labels. Self-supervised pretraining uses the unlabeled data.
- Pretrained features transfer well: DINOv2 features trained on natural images transfer to medical, satellite, and other domains, often matching domain-specific supervised models with far less labeled data.
- Beyond classification: self-supervised features encode spatial structure, depth, and material properties that supervised classification features miss (supervised models are biased toward the specific label set).
- Scaling: self-supervised methods can exploit internet-scale unlabeled data, which is far more abundant than labeled data.
Contrastive vs. Generative vs. Self-Distillation
| Property | Contrastive (SimCLR, MoCo) | Self-Distillation (DINO) | Masked (MAE) |
|---|---|---|---|
| Objective | Pull same-image views together, push different apart | Student matches teacher | Reconstruct masked patches |
| Negative pairs | Required (many) | Not required | Not required |
| Batch size sensitivity | High (SimCLR) or mitigated (MoCo) | Low | Low |
| Linear probe quality | Strong | Strongest | Weaker |
| Fine-tuning quality | Strong | Strong | Strong |
| Computational cost | Moderate | Moderate | Low (processes 25% of patches) |
| Emergent properties | Moderate | Strong (object segmentation) | Weaker |
Common Confusions
Self-supervised does not mean unsupervised clustering
Self-supervised learning learns representations using a pretext task (contrastive matching, reconstruction, distillation). It does not cluster or label the data. The representations are then used for downstream tasks with a small amount of labeled data (linear probing, fine-tuning) or zero-shot transfer. Self-supervised is a pretraining strategy, not a complete learning pipeline.
Contrastive methods are sensitive to augmentation choices
The augmentations define what information the model preserves versus discards. If color jitter is included, the model learns color-invariant features (good for object recognition but bad for color-dependent tasks). If it is excluded, the model may use color as a shortcut. The choice of augmentations encodes domain knowledge and should be tuned for the target application.
MAE features are not worse than DINO; they are different
MAE features retain more low-level, spatially detailed information (good for reconstruction, dense prediction after fine-tuning). DINO features are more semantically abstracted (good for classification, retrieval with frozen features). The right choice depends on the downstream task and whether you plan to fine-tune or use features frozen.
Summary
- Contrastive methods (SimCLR, MoCo) learn by pulling together augmented views of the same image and pushing apart different images via InfoNCE
- InfoNCE is a lower bound on mutual information; more negatives give a tighter bound
- DINO uses self-distillation with EMA teacher; centering and sharpening prevent collapse; emergent object segmentation appears in attention maps
- DINOv2 produces universal frozen features competitive with supervised pretraining across diverse tasks
- MAE masks 75% of patches and reconstructs them; efficient training but weaker linear-probe features than DINO
- Self-supervised vision matters because labels are expensive and pretrained features transfer broadly
Exercises
Problem
In SimCLR with batch size $B$, each image produces two augmented views, giving $2B$ total views. For a given view, how many positive pairs and how many negative pairs does it have? What is the effective number of negatives in the InfoNCE denominator?
Problem
MAE masks 75% of patches and encodes only the visible 25%. For a $224 \times 224$ image with patch size 16, compute: (a) the total number of patches, (b) the number of visible patches, and (c) the encoder speedup factor compared to processing all patches. Then explain (d) why the decoder processes all patches but can still be lightweight.
Problem
DINO learns object segmentation without segmentation labels: the [CLS] token's attention map highlights objects. Explain mechanistically why this emerges from the self-distillation objective with multi-crop augmentation. What would happen if you used only global crops (no local crops)?
References
Canonical:
- Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (ICML 2020). SimCLR
- He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (CVPR 2020). MoCo
- Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (ICCV 2021). DINO
Current:
- Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" (2023)
- He et al., "Masked Autoencoders Are Scalable Vision Learners" (CVPR 2022). MAE
- van den Oord et al., "Representation Learning with Contrastive Predictive Coding" (2018). InfoNCE
Next Topics
The natural next steps from self-supervised vision:
- JEPA and joint embedding: architectures that predict in latent space rather than pixel space
- Data augmentation theory: why augmentation choices determine what self-supervised models learn
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Vision Transformer Lineage (Layer 4)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Convolutional Neural Networks (Layer 3)
- Vectors, Matrices, and Linear Maps (Layer 0A)