Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Self-Supervised Vision

Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.


Why This Matters

Labeling images is expensive. A single ImageNet label costs a few cents, but labeling a medical image requires a specialist and can cost hundreds of dollars. Self-supervised vision methods learn strong visual representations from unlabeled images, then transfer these representations to downstream tasks with minimal labels.

The practical impact is enormous: DINOv2 features trained on curated unlabeled data match or exceed supervised ImageNet features on classification, segmentation, depth estimation, and retrieval, without any labels during pretraining. Self-supervised vision is how the field is moving beyond the labeled-data bottleneck.

Mental Model

Supervised learning says: here is an image, here is its label, learn features that predict the label. Self-supervised learning says: here is an image, learn features that capture its structure, without any label.

Three families of approaches define the field:

  1. Contrastive: pull together different views of the same image, push apart views of different images. The model learns what makes two crops "the same image" versus "different images."
  2. Self-distillation: a student network learns to match a slowly-updated teacher network. No negative pairs needed. The asymmetry between student and teacher prevents collapse.
  3. Masked modeling: hide parts of the image, predict the hidden parts. The model learns to understand visual structure by reconstruction.

Contrastive Methods

Proposition

InfoNCE and Mutual Information

Statement

The InfoNCE contrastive loss for a positive pair $(z_i, z_i^+)$ with $N-1$ negative pairs $\{z_j^-\}$ is:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(z_i, z_i^+) / \tau)}{\exp(\text{sim}(z_i, z_i^+) / \tau) + \sum_{j=1}^{N-1} \exp(\text{sim}(z_i, z_j^-) / \tau)}$$

where $\text{sim}(a, b) = a^\top b / (\|a\| \|b\|)$ is cosine similarity and $\tau > 0$ is a temperature parameter.

InfoNCE is a lower bound on the mutual information between the two views: $I(z_i; z_i^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$. Minimizing the loss maximizes a lower bound on the mutual information.

Intuition

InfoNCE is a softmax classifier that tries to identify the positive pair among $N$ candidates. If the model produces embeddings where the positive pair is more similar than all negative pairs, the loss is low. The temperature $\tau$ controls the sharpness: small $\tau$ makes the model focus on hard negatives (pairs that are similar but should be pushed apart); large $\tau$ treats all negatives more equally.

The mutual information interpretation means the model learns representations that preserve information shared between the two views. Since the views differ only by augmentation, this shared information is the semantic content of the image.
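To make the objective concrete, here is a minimal NumPy sketch of InfoNCE for a single anchor. The function name and array shapes are illustrative, not from any library:

```python
import numpy as np

def info_nce(z_i, z_pos, z_negs, tau=0.1):
    """InfoNCE loss for one anchor (illustrative, not a library API).

    z_i:    anchor embedding, shape (d,)
    z_pos:  positive embedding, shape (d,)
    z_negs: negative embeddings, shape (N-1, d)
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # logits: positive similarity first, then the N-1 negatives
    logits = np.array([cos(z_i, z_pos)] + [cos(z_i, z) for z in z_negs]) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# A well-separated positive yields near-zero loss; tau sharpens the softmax.
loss = info_nce(np.array([1.0, 0.0]),
                np.array([1.0, 0.0]),
                np.array([[0.0, 1.0], [-1.0, 0.0]]))
```

Because the loss is a softmax cross-entropy with the positive in slot 0, it is always non-negative, and $\log N - \mathcal{L}$ gives the MI lower bound from the proposition.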

Why It Matters

InfoNCE is the foundational objective for contrastive self-supervised learning. SimCLR, MoCo, and CLIP all use variants of this loss. Understanding it reveals why contrastive methods need: (1) strong augmentations to create informative positive pairs, (2) many negatives for a tight MI bound, and (3) careful temperature tuning.

Failure Mode

The MI bound is loose when $N$ is small (few negatives). With $N = 2$, the bound is at most $\log 2 \approx 0.69$ nats regardless of the true MI. This is why contrastive methods benefit from large batch sizes (SimCLR) or memory banks (MoCo). Also, the method can learn shortcuts: if the augmentations are too weak, the model can match views based on low-level statistics (color histograms) rather than semantic content.

SimCLR

SimCLR (Simple Framework for Contrastive Learning of Visual Representations, Chen et al. 2020) implements contrastive learning with four components:

  1. Augmentation: given an image $x$, generate two augmented views $\tilde{x}_i$ and $\tilde{x}_j$ using random crops, color jitter, Gaussian blur, and horizontal flips
  2. Encoder: a ResNet or ViT backbone $f$ maps each view to a representation
  3. Projection head: a small MLP $g$ maps the representation to the space where the contrastive loss is computed
  4. Contrastive loss: InfoNCE over the batch, treating the other view of the same image as positive and all other images as negatives

Key finding: SimCLR requires large batch sizes ($N = 4096$ or more) to provide enough negatives. Performance degrades significantly with small batches.
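The batch form of this loss (NT-Xent in the SimCLR paper) can be sketched in a few lines. The layout convention here, rows $2k$ and $2k+1$ holding the two views of image $k$, is an assumption of this sketch:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """SimCLR-style NT-Xent loss over 2N views (illustrative sketch).

    z: (2N, d) projected embeddings; rows 2k and 2k+1 are views of image k.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize
    sim = z @ z.T / tau                               # pairwise cosine / tau
    np.fill_diagonal(sim, -np.inf)                    # a view is not its own negative
    n = z.shape[0]
    pos = np.arange(n) ^ 1                            # partner index: 2k <-> 2k+1
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    # cross-entropy of each view's softmax against its positive partner
    return np.mean(logsumexp - sim[np.arange(n), pos])

# Matched pairs on orthogonal directions give near-zero loss.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss = nt_xent(z, tau=0.1)
```

Every other view in the batch serves as a negative, which is why the effective negative count, and hence the MI bound, grows with batch size.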

MoCo

MoCo (Momentum Contrast, He et al. 2020) decouples the need for large batches by maintaining a queue of negative embeddings from previous batches:

  1. A query encoder $f_q$ processes the current view
  2. A momentum encoder $f_k$ (updated via EMA: $\theta_k \leftarrow m\theta_k + (1-m)\theta_q$) processes the other view
  3. The queue stores embeddings from the momentum encoder across recent batches
  4. Contrastive loss uses the queue as the negative set

MoCo achieves strong results with standard batch sizes (256) because the queue provides thousands of negatives without requiring them in a single batch.
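The two MoCo ingredients, the EMA parameter update and the FIFO negative queue, can be sketched as below. Treating the encoder parameters as flat arrays is a simplification for illustration:

```python
import numpy as np
from collections import deque

class MoCoQueue:
    """Momentum-encoder update and negative queue (illustrative sketch)."""

    def __init__(self, dim, queue_size=4096, m=0.999):
        self.m = m
        self.theta_q = np.zeros(dim)           # query encoder "parameters"
        self.theta_k = self.theta_q.copy()     # key (momentum) encoder params
        self.queue = deque(maxlen=queue_size)  # FIFO of past key embeddings

    def momentum_update(self):
        # theta_k <- m * theta_k + (1 - m) * theta_q (no gradient flows to f_k)
        self.theta_k = self.m * self.theta_k + (1 - self.m) * self.theta_q

    def enqueue(self, keys):
        # keys: (B, dim) embeddings from the momentum encoder for this batch
        for k in keys:
            self.queue.append(k)

    def negatives(self):
        # negative set for the contrastive loss, spanning many past batches
        return np.stack(self.queue)
```

Because `maxlen` evicts the oldest keys, the queue always holds the most recent `queue_size` embeddings, so a batch of 256 can still contrast against thousands of negatives.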

Self-Distillation: DINO and DINOv2

Proposition

DINO Collapse Prevention via Centering and Sharpening

Statement

DINO trains a student $f_{\theta_s}$ to match a teacher $f_{\theta_t}$ via cross-entropy on softmax outputs. Without safeguards, this objective has trivial solutions (collapse): the teacher outputs a uniform distribution or a constant.

DINO prevents collapse through two mechanisms:

  1. Centering: subtract the running mean $c$ of teacher outputs before the softmax: $P_t(x) = \text{softmax}((f_{\theta_t}(x) - c) / \tau_t)$. This prevents collapse to a single dominant dimension.
  2. Sharpening: use a low teacher temperature $\tau_t < \tau_s$, making the teacher distribution peaked. This prevents collapse to the uniform distribution.

The teacher is updated via EMA: $\theta_t \leftarrow \lambda\theta_t + (1 - \lambda)\theta_s$ with $\lambda$ close to 1 (e.g., 0.996-0.9995).

Intuition

Without centering, the teacher could collapse to always outputting the same vector for every image. The student would trivially match this by also outputting a constant. Centering forces the mean output to be zero, so different images must produce different (centered) representations.

Without sharpening, the teacher could output a uniform distribution over all dimensions. The student would trivially match this without learning anything. Low temperature forces the teacher distribution to be peaked, requiring the student to identify which dimensions are most active for each image.

Together, centering and sharpening ensure the teacher provides non-trivial, image-specific targets.
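Both mechanisms amount to a few lines of code. A NumPy sketch, with hyperparameter values chosen as typical defaults rather than the paper's exact configuration:

```python
import numpy as np

def softmax(x, tau):
    x = x / tau
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class DinoTeacher:
    """Centering + sharpening on the teacher side (illustrative sketch)."""

    def __init__(self, dim, tau_t=0.04, center_momentum=0.9):
        self.c = np.zeros(dim)     # running center of teacher outputs
        self.tau_t = tau_t         # sharp (low) teacher temperature
        self.cm = center_momentum

    def targets(self, teacher_logits):
        # center, then sharpen with the low teacher temperature
        p = softmax(teacher_logits - self.c, self.tau_t)
        # EMA update of the center using the batch mean of teacher outputs
        self.c = self.cm * self.c + (1 - self.cm) * teacher_logits.mean(axis=0)
        return p

def dino_loss(teacher_probs, student_logits, tau_s=0.1):
    # cross-entropy of the student's (warmer) softmax against teacher targets
    log_q = np.log(softmax(student_logits, tau_s))
    return -(teacher_probs * log_q).sum(axis=-1).mean()
```

Note the asymmetry: only the teacher is centered and sharpened, and only the student receives gradients; the teacher follows via EMA.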

Why It Matters

DINO showed that self-supervised ViTs learn remarkable emergent properties: the attention maps of the [CLS] token segment objects in images without any segmentation supervision. This suggests that object-level understanding emerges naturally when the model is trained to produce consistent representations across crops. DINOv2 scaled this approach with curated data and distillation, producing features that serve as universal vision backbones.

Failure Mode

The centering mechanism uses an exponential moving average of teacher outputs, which means it adapts slowly. If the distribution of images in a batch is systematically different from previous batches (e.g., due to a non-random data loader), centering can lag behind and fail to prevent collapse temporarily. The multi-crop strategy (student sees local crops, teacher sees global crops) is critical: removing it significantly degrades performance because the student no longer needs to infer global structure from local information.

DINOv2

DINOv2 (Oquab et al. 2023) extends DINO with three key improvements:

  1. Curated pretraining data: an automated pipeline retrieves and deduplicates a high-quality dataset of 142M images from a larger pool, without using any labels
  2. Combined objectives: DINOv2 uses both the DINO self-distillation loss and an iBOT masked image modeling loss, getting the benefits of both paradigms
  3. Distillation: a large ViT-g model is trained first, then smaller models (ViT-S, ViT-B, ViT-L) are distilled from it

DINOv2 features achieve strong performance on classification, segmentation, depth estimation, and retrieval without fine-tuning: just a linear probe or $k$-NN on frozen features. This makes DINOv2 a "foundation feature extractor" for vision.
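The frozen-feature evaluation is simple enough to sketch: a $k$-NN classifier over cosine similarity in feature space. The feature extractor itself is assumed to exist; only the probe is shown:

```python
import numpy as np

def knn_probe(train_feats, train_labels, test_feats, k=5):
    """k-NN classification on frozen features via cosine similarity (sketch)."""
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    te = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = te @ tr.T                          # (n_test, n_train) cosine sims
    nn = np.argsort(-sim, axis=1)[:, :k]     # indices of the k nearest neighbors
    votes = train_labels[nn]                 # neighbor labels, (n_test, k)
    return np.array([np.bincount(v).argmax() for v in votes])  # majority vote
```

No gradients and no fine-tuning are involved: if the frozen features cluster by class, the probe is accurate.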

Masked Image Modeling

Proposition

Masked Autoencoder (MAE) Objective

Statement

MAE (Masked Autoencoder, He et al. 2022) masks a large fraction (75%) of image patches and trains the model to reconstruct them:

  1. Partition the image into patches $\{p_1, \ldots, p_N\}$
  2. Randomly mask 75% of patches: visible set $\mathcal{V}$, masked set $\mathcal{M}$
  3. Encode only the visible patches with a ViT encoder: $\{h_i\}_{i \in \mathcal{V}} = f_\text{enc}(\{p_i\}_{i \in \mathcal{V}})$
  4. Decode all patches (visible + mask tokens) with a lightweight decoder
  5. Reconstruction loss is computed only on masked patches:

$$\mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \|p_i - \hat{p}_i\|^2$$

where $\hat{p}_i$ is the reconstructed patch (in pixel space or normalized pixel space).

Intuition

MAE is the visual analogue of masked language modeling (BERT). By hiding 75% of the image, the model must understand spatial layout, object structure, and texture patterns to fill in the missing pieces. The high masking ratio is critical: if only 10% is masked, the model can "cheat" by interpolating from nearby visible patches without deep understanding. At 75%, the visible patches are sparse enough that reconstruction requires genuine visual reasoning.

The asymmetric design (heavy encoder on visible patches only, lightweight decoder on all patches) makes training efficient: the encoder processes only 25% of patches, giving a $4\times$ speedup over processing all patches.
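The masking logic and masked-only loss can be sketched end to end; the "decoder" here is a trivial stand-in (predicting the mean visible patch) so only the objective's structure is shown:

```python
import numpy as np

def mae_loss(patches, mask_ratio=0.75, seed=0):
    """MAE-style masking and masked-patch reconstruction loss (sketch).

    patches: (N, p) flattened patches. A real MAE runs a ViT encoder on the
    visible set and a lightweight decoder over all positions; here the
    prediction is a placeholder so only the loss structure is exercised.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_masked = int(round(n * mask_ratio))
    perm = rng.permutation(n)                     # random patch order
    masked, visible = perm[:n_masked], perm[n_masked:]
    # placeholder "reconstruction": mean of the visible patches
    pred = np.broadcast_to(patches[visible].mean(axis=0), patches.shape)
    # the loss is averaged over masked patches only (visible ones are free)
    return np.mean((patches[masked] - pred[masked]) ** 2)

# Identical patches are perfectly "reconstructed" by the placeholder decoder.
loss = mae_loss(np.ones((16, 4)))
```

The key structural points survive even in this toy version: the random split into visible and masked sets, and the loss restricted to $\mathcal{M}$.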

Why It Matters

MAE demonstrated that masked modeling, hugely successful for language (BERT, GPT), also works for vision. It produces strong features with efficient training and shows that ViTs can learn useful representations from reconstruction alone. The 75% masking ratio was a surprising finding. It suggested that images have much higher redundancy than text, requiring more aggressive masking to create a challenging pretext task.

Failure Mode

MAE features are weaker than contrastive or self-distillation features for linear probing (using frozen features with a linear classifier). The reconstruction objective encourages the model to retain low-level details (textures, colors) that are useful for reconstruction but less useful for semantic classification. Fine-tuning MAE features on downstream tasks works well, but the frozen features underperform DINO/DINOv2. This suggests that reconstruction and discrimination learn different aspects of visual representation.

Why Self-Supervised Vision Matters

  1. Labels are expensive: medical imaging, satellite imagery, industrial inspection, and scientific domains have abundant unlabeled data but very few labels. Self-supervised pretraining uses the unlabeled data.
  2. Pretrained features transfer well: DINOv2 features trained on natural images transfer to medical, satellite, and other domains, often matching domain-specific supervised models with far less labeled data.
  3. Beyond classification: self-supervised features encode spatial structure, depth, and material properties that supervised classification features miss (supervised models are biased toward the specific label set).
  4. Scaling: self-supervised methods can exploit internet-scale unlabeled data, which is far more abundant than labeled data.

Contrastive vs. Generative vs. Self-Distillation

| Property | Contrastive (SimCLR, MoCo) | Self-Distillation (DINO) | Masked (MAE) |
| --- | --- | --- | --- |
| Objective | Pull same-image views together, push different apart | Student matches teacher | Reconstruct masked patches |
| Negative pairs | Required (many) | Not required | Not required |
| Batch size sensitivity | High (SimCLR) or mitigated (MoCo) | Low | Low |
| Linear probe quality | Strong | Strongest | Weaker |
| Fine-tuning quality | Strong | Strong | Strong |
| Computational cost | Moderate | Moderate | Low (encoder processes 25% of patches) |
| Emergent properties | Moderate | Strong (object segmentation) | Weaker |

Common Confusions

Watch Out

Self-supervised does not mean unsupervised clustering

Self-supervised learning learns representations using a pretext task (contrastive matching, reconstruction, distillation). It does not cluster or label the data. The representations are then used for downstream tasks with a small amount of labeled data (linear probing, fine-tuning) or zero-shot transfer. Self-supervised is a pretraining strategy, not a complete learning pipeline.

Watch Out

Contrastive methods are sensitive to augmentation choices

The augmentations define what information the model preserves versus discards. If color jitter is included, the model learns color-invariant features (good for object recognition but bad for color-dependent tasks). If it is excluded, the model may use color as a shortcut. The choice of augmentations encodes domain knowledge and should be tuned for the target application.

Watch Out

MAE features are not worse than DINO; they are different

MAE features retain more low-level, spatially detailed information (good for reconstruction, dense prediction after fine-tuning). DINO features are more semantically abstracted (good for classification, retrieval with frozen features). The right choice depends on the downstream task and whether you plan to fine-tune or use features frozen.

Summary

  • Contrastive methods (SimCLR, MoCo) learn by pulling together augmented views of the same image and pushing apart different images via InfoNCE
  • InfoNCE is a lower bound on mutual information; more negatives give a tighter bound
  • DINO uses self-distillation with EMA teacher; centering and sharpening prevent collapse; emergent object segmentation appears in attention maps
  • DINOv2 produces universal frozen features competitive with supervised pretraining across diverse tasks
  • MAE masks 75% of patches and reconstructs them; efficient training but weaker linear-probe features than DINO
  • Self-supervised vision matters because labels are expensive and pretrained features transfer broadly

Exercises

ExerciseCore

Problem

In SimCLR with batch size $N = 256$, each image produces two augmented views, giving $2N = 512$ total views. For a given view, how many positive pairs and how many negative pairs does it have? What is the effective number of negatives in the InfoNCE denominator?

ExerciseAdvanced

Problem

MAE masks 75% of patches and encodes only the visible 25%. For a $224 \times 224$ image with patch size 16, compute: (a) total patches, (b) visible patches, (c) the speedup factor for the encoder compared to processing all patches, and (d) why the decoder processes all patches but can still be lightweight.

ExerciseResearch

Problem

DINO learns object segmentation without segmentation labels: the [CLS] token's attention map highlights objects. Explain mechanistically why this emerges from the self-distillation objective with multi-crop augmentation. What would happen if you used only global crops (no local crops)?


References

Canonical:

  • Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (ICML 2020). SimCLR
  • He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (CVPR 2020). MoCo
  • Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (ICCV 2021). DINO

Current:

  • Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" (2023)
  • He et al., "Masked Autoencoders Are Scalable Vision Learners" (CVPR 2022). MAE
  • Oord, Li, Vinyals, "Representation Learning with Contrastive Predictive Coding" (2018). InfoNCE

Last reviewed: April 2026
