Self-Supervised Vision
Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.
Why This Matters
Labeling images is expensive. A single ImageNet label costs a few cents, but labeling a medical image requires a specialist and can cost hundreds of dollars. Self-supervised vision methods learn strong visual representations from unlabeled images, then transfer these representations to downstream tasks with minimal labels.
The practical impact is enormous: DINOv2 features trained on curated unlabeled data match or exceed supervised ImageNet features on classification, segmentation, depth estimation, and retrieval, without any labels during pretraining. Self-supervised vision is how the field is moving beyond the labeled-data bottleneck.
Mental Model
Supervised learning says: here is an image, here is its label, learn features that predict the label. Self-supervised learning says: here is an image, learn features that capture its structure, without any label.
Three families of approaches define the field:
- Contrastive: pull together different views of the same image, push apart views of different images. The model learns what makes two crops "the same image" versus "different images."
- Self-distillation: a student network learns to match a slowly-updated teacher network. No negative pairs needed. The asymmetry between student and teacher prevents collapse.
- Masked modeling: hide parts of the image, predict the hidden parts. The model learns to understand visual structure by reconstruction.
Contrastive Methods
InfoNCE and Mutual Information
Statement
The InfoNCE contrastive loss for a positive pair $(z_i, z_j)$ with $N-1$ negatives $\{z_k\}$ is:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \, \lVert v \rVert)$ is cosine similarity and $\tau$ is a temperature parameter.

InfoNCE is a lower bound on the mutual information between the two views: $I(v_1; v_2) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$. Minimizing the loss maximizes a lower bound on mutual information.
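As a concrete reference, here is a minimal PyTorch sketch of this loss for a batch of paired embeddings. The function name and the simplification that negatives come only from the opposite view's batch are ours, not from any particular paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce(z_i, z_j, tau=0.1):
    """InfoNCE over a batch: row k of z_i and row k of z_j are two
    views of the same image; all other rows act as negatives."""
    z_i = F.normalize(z_i, dim=1)  # unit norm, so dot product = cosine sim
    z_j = F.normalize(z_j, dim=1)
    logits = z_i @ z_j.T / tau     # (N, N) similarity matrix
    labels = torch.arange(z_i.size(0), device=z_i.device)
    # Softmax classification: identify the positive among N candidates.
    return F.cross_entropy(logits, labels)
```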
Intuition
InfoNCE is a softmax classifier that tries to identify the positive pair among $N$ candidates. If the model produces embeddings where the positive pair is more similar than all negative pairs, the loss is low. The temperature $\tau$ controls the sharpness: small $\tau$ makes the model focus on hard negatives (pairs that are similar but should be pushed apart); large $\tau$ treats all negatives more equally.
The mutual information interpretation means the model learns representations that preserve information shared between the two views; since the views differ only by augmentation, this shared information is the semantic content of the image.
Why It Matters
InfoNCE is the foundational objective for contrastive self-supervised learning. SimCLR, MoCo, and CLIP all use variants of this loss. Understanding it reveals why contrastive methods need: (1) strong augmentations to create informative positive pairs, (2) many negatives for a tight MI bound, and (3) careful temperature tuning.
Failure Mode
The MI bound is loose when $N$ is small (few negatives). With $N$ candidates, the bound is at most $\log_2 N$ bits regardless of the true MI. This is why contrastive methods benefit from large batch sizes (SimCLR) or memory banks (MoCo). Also, the method can learn shortcuts: if the augmentations are too weak, the model can match views based on low-level statistics (color histograms) rather than semantic content.
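To make the looseness concrete, here is a quick check of the $\log_2 N$ ceiling for typical negative-set sizes (small batch, SimCLR-scale batch, MoCo-style queue):

```python
import math

for n in (256, 4096, 65536):  # small batch, SimCLR batch, MoCo queue
    print(f"N = {n:5d}: MI bound <= {math.log2(n):.0f} bits")
# N =   256: MI bound <= 8 bits
# N =  4096: MI bound <= 12 bits
# N = 65536: MI bound <= 16 bits
```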
SimCLR
SimCLR (Simple Framework for Contrastive Learning of Visual Representations, Chen et al. 2020) implements contrastive learning with four components:
- Augmentation: given an image $x$, generate two augmented views $\tilde{x}_i$ and $\tilde{x}_j$ using random crops, color jitter, Gaussian blur, and horizontal flips
- Encoder: a ResNet or ViT backbone $f$ maps each view to a representation $h = f(\tilde{x})$
- Projection head: a small MLP $g$ maps the representation to the embedding $z = g(h)$, the space where the contrastive loss is computed
- Contrastive loss: InfoNCE over the batch, treating the other view of the same image as positive and all other images as negatives
Key finding: SimCLR requires large batch sizes (4096 or more) to provide enough negatives. Performance degrades significantly with small batches.
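The augmentation pipeline can be sketched with torchvision; the parameter values below are illustrative, loosely following the recipe reported in the SimCLR paper:

```python
import torchvision.transforms as T

# Two independent samples from this pipeline yield the two views of an image.
simclr_augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(23, (0.1, 2.0))], p=0.5),  # kernel ~10% of image
    T.ToTensor(),
])

def two_views(img):
    return simclr_augment(img), simclr_augment(img)
```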
MoCo
MoCo (Momentum Contrast, He et al. 2020) decouples the need for large batches by maintaining a queue of negative embeddings from previous batches:
- A query encoder processes the current view
- A momentum encoder (updated via EMA: $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$) processes the other view
- The queue stores embeddings from the momentum encoder across recent batches
- Contrastive loss uses the queue as the negative set
MoCo achieves strong results with standard batch sizes (256) because the queue provides thousands of negatives without requiring them in a single batch.
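A minimal sketch of the two MoCo-specific pieces, the EMA update and the FIFO queue. The helper names are ours, and we assume the queue length $K$ is a multiple of the batch size:

```python
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def enqueue(queue, ptr, keys):
    """FIFO update: overwrite the oldest keys with this batch's keys.
    queue: (K, d) buffer of negatives, keys: (B, d), ptr: write position."""
    B, K = keys.size(0), queue.size(0)
    queue[ptr:ptr + B] = keys  # assumes K % B == 0
    return (ptr + B) % K
```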
Self-Distillation: DINO and DINOv2
DINO Collapse Prevention via Centering and Sharpening
Statement
DINO trains a student to match a teacher via cross-entropy on softmax outputs. Without safeguards, this objective has trivial solutions (collapse): the teacher outputs a uniform distribution or a constant.
DINO prevents collapse through two mechanisms:
- Centering: subtract the running mean of teacher outputs before the softmax: $g_t(x) \leftarrow g_t(x) - c$, where $c$ is an exponential moving average of teacher outputs. This prevents collapse to a single dominant dimension.
- Sharpening: use a low teacher temperature $\tau_t$ (lower than the student temperature $\tau_s$), making the teacher distribution peaked. This prevents collapse to the uniform distribution.
The teacher is updated via EMA: $\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$ with $\lambda$ close to 1 (e.g., 0.996-0.9995).
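Both mechanisms fit in a few lines. The sketch below follows the training pseudocode published with the DINO paper, with `C` the running center:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, C, tau_s=0.1, tau_t=0.04):
    """Cross-entropy from a centered, sharpened teacher to the student.
    student_out, teacher_out: (B, D) pre-softmax outputs; C: (D,) center."""
    t = F.softmax((teacher_out - C) / tau_t, dim=1).detach()  # center + sharpen
    log_s = F.log_softmax(student_out / tau_s, dim=1)
    return -(t * log_s).sum(dim=1).mean()

@torch.no_grad()
def update_center(C, teacher_out, m=0.9):
    # EMA of teacher outputs: C <- m * C + (1 - m) * batch mean
    return m * C + (1 - m) * teacher_out.mean(dim=0)
```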
Intuition
Without centering, the teacher could collapse to always outputting the same vector for every image. The student would trivially match this by also outputting a constant. Centering forces the mean output to be zero, so different images must produce different (centered) representations.
Without sharpening, the teacher could output a uniform distribution over all dimensions. The student would trivially match this without learning anything. Low temperature forces the teacher distribution to be peaked, requiring the student to identify which dimensions are most active for each image.
Together, centering and sharpening ensure the teacher provides non-trivial, image-specific targets.
Why It Matters
DINO showed that self-supervised ViTs learn remarkable emergent properties: the attention maps of the [CLS] token segment objects in images without any segmentation supervision. This suggests that object-level understanding emerges naturally when the model is trained to produce consistent representations across crops. DINOv2 scaled this approach with curated data and distillation, producing features that serve as universal vision backbones.
Failure Mode
The centering mechanism uses an exponential moving average of teacher outputs, which means it adapts slowly. If the distribution of images in a batch is systematically different from previous batches (e.g., due to a non-random data loader), centering can lag behind and fail to prevent collapse temporarily. The multi-crop strategy (student sees local crops, teacher sees global crops) is critical: removing it significantly degrades performance because the student no longer needs to infer global structure from local information.
DINOv2
DINOv2 (Oquab et al. 2023) extends DINO with three key improvements:
- Curated pretraining data: an automated pipeline retrieves and deduplicates a high-quality dataset of 142M images from a larger pool, without using any labels
- Combined objectives: DINOv2 uses both the DINO self-distillation loss and an iBOT masked image modeling loss, getting the benefits of both paradigms
- Distillation: a large ViT-g model is trained first, then smaller models (ViT-S, ViT-B, ViT-L) are distilled from it
DINOv2 features achieve strong performance on classification, segmentation, depth estimation, and retrieval without fine-tuning: just a linear probe or $k$-NN on frozen features. This makes DINOv2 a "foundation feature extractor" for vision.
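In practice, "frozen features plus a linear probe" looks like the sketch below. The torch.hub entry point is the one published by the DINOv2 repository; the dummy tensors stand in for a real labeled dataset:

```python
import torch
from sklearn.linear_model import LogisticRegression

# Frozen backbone: no gradients, no fine-tuning.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(images):
    # images: (B, 3, 224, 224), ImageNet-normalized -> (B, 384) for ViT-S/14
    return backbone(images)

# Dummy stand-ins for a small labeled dataset.
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 10, (32,))

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(images).numpy(), labels.numpy())
```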
Masked Image Modeling
Masked Autoencoder (MAE) Objective
Statement
MAE (Masked Autoencoder, He et al. 2022) masks a large fraction (75%) of image patches and trains the model to reconstruct them:
- Partition the image into $N$ patches $\{x_p\}_{p=1}^{N}$
- Randomly mask 75% of patches: visible set $V$, masked set $M$
- Encode only the visible patches with a ViT encoder: $h = f_\theta(\{x_p : p \in V\})$
- Decode all patches (visible + mask tokens) with a lightweight decoder
- Reconstruction loss is computed only on masked patches:

$$\mathcal{L} = \frac{1}{|M|} \sum_{p \in M} \lVert \hat{x}_p - x_p \rVert^2$$

where $\hat{x}_p$ is the reconstructed patch (in pixel space or normalized pixel space).
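A shape-level sketch of the masking and loss, assuming pre-patchified inputs and hypothetical `encoder`/`decoder` modules:

```python
import torch

def mae_step(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) flattened pixel patches; returns the MAE loss.
    encoder sees only visible patches; decoder predicts the masked ones."""
    B, N, D = patches.shape
    n_vis = int(N * (1 - mask_ratio))

    # Per-image random permutation; the first n_vis indices stay visible.
    perm = torch.rand(B, N).argsort(dim=1)
    vis_idx, mask_idx = perm[:, :n_vis], perm[:, n_vis:]

    gather = lambda idx: torch.gather(
        patches, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(gather(vis_idx), vis_idx)   # only 25% of patches
    pred = decoder(latent, vis_idx, mask_idx)    # (B, N - n_vis, D)

    # MSE on masked patches only.
    return ((pred - gather(mask_idx)) ** 2).mean()
```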
Intuition
MAE is the visual analogue of masked language modeling (BERT). By hiding 75% of the image, the model must understand spatial layout, object structure, and texture patterns to fill in the missing pieces. The high masking ratio is critical: if only 10% is masked, the model can "cheat" by interpolating from nearby visible patches without deep understanding. At 75%, the visible patches are sparse enough that reconstruction requires genuine visual reasoning.
The asymmetric design (heavy encoder on visible patches only, lightweight decoder on all patches) makes training efficient: the encoder processes only 25% of patches, giving a several-fold speedup over processing all patches.
Why It Matters
MAE demonstrated that masked modeling, hugely successful for language (BERT, GPT), also works for vision. It produces strong features with efficient training and shows that ViTs can learn useful representations from reconstruction alone. The 75% masking ratio was a surprising finding. It suggested that images have much higher redundancy than text, requiring more aggressive masking to create a challenging pretext task.
Failure Mode
MAE features are weaker than contrastive or self-distillation features for linear probing (using frozen features with a linear classifier). The reconstruction objective encourages the model to retain low-level details (textures, colors) that are useful for reconstruction but less useful for semantic classification. Fine-tuning MAE features on downstream tasks works well, but the frozen features underperform DINO/DINOv2. This suggests that reconstruction and discrimination learn different aspects of visual representation.
Why Self-Supervised Vision Matters
- Labels are expensive: medical imaging, satellite imagery, industrial inspection, and scientific domains have abundant unlabeled data but very few labels. Self-supervised pretraining uses the unlabeled data.
- Pretrained features transfer well: DINOv2 features trained on natural images transfer to medical, satellite, and other domains, often matching domain-specific supervised models with far less labeled data.
- Beyond classification: self-supervised features encode spatial structure, depth, and material properties that supervised classification features miss (supervised models are biased toward the specific label set).
- Scaling: self-supervised methods can exploit internet-scale unlabeled data, which is far more abundant than labeled data.
Contrastive vs. Generative vs. Self-Distillation
| Property | Contrastive (SimCLR, MoCo) | Self-Distillation (DINO) | Masked (MAE) |
|---|---|---|---|
| Objective | Pull same-image views together, push different apart | Student matches teacher | Reconstruct masked patches |
| Negative pairs | Required (many) | Not required | Not required |
| Batch size sensitivity | High (SimCLR) or mitigated (MoCo) | Low | Low |
| Linear probe quality | Strong | Strongest | Weaker |
| Fine-tuning quality | Strong | Strong | Strong |
| Computational cost | Moderate | Moderate | Low (processes 25% of patches) |
| Emergent properties | Moderate | Strong (object segmentation) | Weaker |
Common Confusions
Self-supervised does not mean unsupervised clustering
Self-supervised learning learns representations using a pretext task (contrastive matching, reconstruction, distillation). It does not cluster or label the data. The representations are then used for downstream tasks with a small amount of labeled data (linear probing, fine-tuning) or zero-shot transfer. Self-supervised is a pretraining strategy, not a complete learning pipeline.
Contrastive methods are sensitive to augmentation choices
The augmentations define what information the model preserves versus discards. If color jitter is included, the model learns color-invariant features (good for object recognition but bad for color-dependent tasks). If it is excluded, the model may use color as a shortcut. The choice of augmentations encodes domain knowledge and should be tuned for the target application.
MAE features are not worse than DINO; they are different
MAE features retain more low-level, spatially detailed information (good for reconstruction, dense prediction after fine-tuning). DINO features are more semantically abstracted (good for classification, retrieval with frozen features). The right choice depends on the downstream task and whether you plan to fine-tune or use features frozen.
Summary
- Contrastive methods (SimCLR, MoCo) learn by pulling together augmented views of the same image and pushing apart different images via InfoNCE
- InfoNCE is a lower bound on mutual information; more negatives give a tighter bound
- DINO uses self-distillation with EMA teacher; centering and sharpening prevent collapse; emergent object segmentation appears in attention maps
- DINOv2 produces universal frozen features competitive with supervised pretraining across diverse tasks
- MAE masks 75% of patches and reconstructs them; efficient training but weaker linear-probe features than DINO
- Self-supervised vision matters because labels are expensive and pretrained features transfer broadly
Exercises
Problem
In SimCLR with batch size $B$, each image produces two augmented views, giving $2B$ total views. For a given view, how many positive pairs and how many negative pairs does it have? What is the effective number of negatives in the InfoNCE denominator?
Problem
MAE masks 75% of patches and encodes only the visible 25%. For a $224 \times 224$ image with patch size 16, compute: (a) the total number of patches, (b) the number of visible patches, and (c) the encoder speedup factor compared to processing all patches. Then explain (d) why the decoder processes all patches but can still be lightweight.
Problem
DINO learns object segmentation without segmentation labels: the [CLS] token's attention map highlights objects. Explain mechanistically why this emerges from the self-distillation objective with multi-crop augmentation. What would happen if you used only global crops (no local crops)?
References
Canonical:
- Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (ICML 2020). SimCLR
- He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (CVPR 2020). MoCo
- Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (ICCV 2021). DINO
Current:
- Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" (2023)
- He et al., "Masked Autoencoders Are Scalable Vision Learners" (CVPR 2022). MAE
- van den Oord et al., "Representation Learning with Contrastive Predictive Coding" (2018). InfoNCE
Next Topics
The natural next steps from self-supervised vision:
- JEPA and joint embedding: architectures that predict in latent space rather than pixel space
- Data augmentation theory: why augmentation choices determine what self-supervised models learn
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Vision Transformer Lineage (Layer 4)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Convolutional Neural Networks (Layer 3)
- Vectors, Matrices, and Linear Maps (Layer 0A)