Contrastive Learning
Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.
Why This Matters
Contrastive learning is the dominant paradigm for self-supervised representation learning. It produces representations competitive with supervised pretraining while using no labels. The key models in this space (CLIP, SimCLR, MoCo) all use contrastive objectives. Understanding the theory behind contrastive learning explains why augmentation choices matter, why large batch sizes help, and what the loss function actually optimizes.
The Contrastive Setup
Given an input $x$, create two views $v_1, v_2$ (a positive pair) by applying random augmentations. Sample $K$ other inputs as negatives. Learn an encoder $f$ such that $f(v_1)$ and $f(v_2)$ are close while $f(v_1)$ and the encoded negatives are far apart.
Positive Pair
Two views form a positive pair if they are derived from the same underlying data point (e.g., two augmentations of the same image). The contrastive objective trains the encoder to map positive pairs to nearby points in representation space.
InfoNCE Loss
For anchor $x$ with representation $z = f(x)$, positive $x^+$ with $z^+ = f(x^+)$, and negatives $x^-_i$ with $z^-_i = f(x^-_i)$ for $i = 1, \dots, K$:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z, z^+)/\tau)}{\exp(\mathrm{sim}(z, z^+)/\tau) + \sum_{i=1}^{K} \exp(\mathrm{sim}(z, z^-_i)/\tau)}$$

Here $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity and $\tau$ is a temperature parameter. This is a softmax cross-entropy over $K+1$ options where the correct answer is the positive pair.
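As a concrete sketch, the loss can be computed directly from raw vectors. The function names here (`cosine_sim`, `info_nce`) are illustrative, not from any library:

```python
# Minimal InfoNCE in NumPy: softmax cross-entropy over K+1 candidates,
# where index 0 is the positive. Illustrative sketch, not a library API.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def info_nce(anchor, positive, negatives, tau=0.1):
    """Negative log-probability of picking the positive among K+1 candidates."""
    candidates = np.vstack([positive[None, :], negatives])     # (K+1, d), row 0 = positive
    logits = cosine_sim(anchor[None, :], candidates)[0] / tau  # (K+1,)
    m = logits.max()                                           # stabilize the log-sum-exp
    log_prob_pos = logits[0] - (m + np.log(np.exp(logits - m).sum()))
    return -log_prob_pos
```

With a well-aligned positive and orthogonal negatives the loss is near zero; swapping the positive for one of the negatives drives it up.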
Core Theory
InfoNCE as Mutual Information Lower Bound
Statement
The InfoNCE loss with $K$ negative samples satisfies:

$$I(v_1; v_2) \geq \log(K+1) - \mathcal{L}_{\text{InfoNCE}}$$

where $I(v_1; v_2)$ is the mutual information between the anchor and positive views. Minimizing InfoNCE maximizes a lower bound on mutual information.
Intuition
InfoNCE is a $(K+1)$-way classification problem: identify the positive among $K+1$ candidates. Perfect classification (loss $\to 0$) implies the representations capture $\log(K+1)$ bits of shared information. The bound is tight when the critic (similarity function) is optimal.
Proof Sketch
Write the InfoNCE objective as a density ratio estimation problem. The optimal critic is proportional to $p(x^+ \mid x)/p(x^+)$. Apply the variational bound on mutual information from Barber and Agakov (2003). The $\log(K+1)$ term appears because the bound saturates: you cannot extract more than $\log(K+1)$ bits from a $(K+1)$-way classification.
Why It Matters
This explains two empirical observations. First, larger batch sizes (more negatives) give better representations because they raise the ceiling on recoverable mutual information. Second, the quality of augmentations matters because they determine what information is shared between positive pairs and thus what $I(v_1; v_2)$ contains.
Failure Mode
The bound becomes loose when $\log(K+1)$ is small relative to the true mutual information. With $K$ negatives, you can recover at most $\log(K+1)$ bits. If the task requires more shared information, increasing the batch size (and hence $K$) is necessary. Also, the bound says nothing about which information is captured; bad augmentations can make the model learn shortcuts.
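A quick numeric reading of the ceiling, taking the logarithm base 2 so that it is expressed in bits (the choice of base is a convention, not part of the bound):

```python
# The log(K+1) ceiling on recoverable mutual information, in bits,
# for a few negative-set sizes. Direct reading of the InfoNCE bound.
import numpy as np

def mi_ceiling_bits(num_negatives):
    """Maximum mutual information (bits) recoverable with this many negatives."""
    return np.log2(num_negatives + 1)

for K in (255, 4095, 65535):
    print(f"K = {K:>6}: at most {mi_ceiling_bits(K):.1f} bits")
```

Even a 65,536-element negative queue caps the bound at 16 bits, which is why the bound is loose whenever the views truly share more information than that.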
Alignment and Uniformity Decomposition
Statement
The contrastive loss decomposes into two competing objectives:
Alignment: $\mathcal{L}_{\text{align}} = \mathbb{E}_{(x, x^+) \sim p_{\text{pos}}} \left[ \| f(x) - f(x^+) \|_2^2 \right]$

Uniformity: $\mathcal{L}_{\text{uniform}} = \log \, \mathbb{E}_{x, y \,\overset{\text{i.i.d.}}{\sim}\, p_{\text{data}}} \left[ e^{-t \| f(x) - f(y) \|_2^2} \right]$ for a scale $t > 0$

Good contrastive representations minimize the alignment loss (positive pairs are close) while also minimizing the uniformity loss (all representations spread evenly on the unit hypersphere).
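Both quantities are cheap to compute as diagnostics. A NumPy sketch, assuming L2-normalized representations and the Gaussian-potential form above (the scale `t=2` is a common default, not a requirement):

```python
# Alignment and uniformity metrics in the style of Wang & Isola (2020).
# Sketch for diagnostics; assumes rows are L2-normalized representations.
import numpy as np

def alignment(z1, z2):
    """Mean squared distance between positive pairs (matching rows of z1, z2)."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

def uniformity(z, t=2.0):
    """Log mean Gaussian potential over all distinct pairs; lower = more uniform."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(z.shape[0], k=1)      # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

A fully collapsed representation scores a perfect alignment of 0 but the worst possible uniformity of 0; spreading points over the sphere drives the uniformity term negative.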
Intuition
Alignment alone leads to collapse (map everything to one point). Uniformity alone leads to random representations. The contrastive loss balances both: pull positives together, spread everything else uniformly.
Proof Sketch
Decompose the InfoNCE gradient into a term pulling positive pairs together (alignment) and a term pushing random pairs apart (uniformity). On the unit sphere, the uniform distribution maximizes entropy, and the repulsive term in InfoNCE pushes toward this maximum entropy distribution.
Why It Matters
This decomposition explains why contrastive methods avoid representation collapse without explicit mechanisms like stop-gradients (used in BYOL/SimSiam). The negative samples provide the uniformity pressure that prevents collapse.
Failure Mode
If the number of negatives is too small, the uniformity pressure is weak and representations can partially collapse (cluster into a few modes instead of spreading uniformly). Temperature also mediates this tradeoff: a very small $\tau$ overweights hard negatives at the expense of uniformity.
Key Architectures
SimCLR
SimCLR (Chen et al., 2020) applies two random augmentations to each image in a batch of size $N$, producing $2N$ views. Every other augmented view in the batch serves as a negative. The loss uses cosine similarity with temperature $\tau$ and a 2-layer MLP projection head on top of the encoder.
Key finding: the projection head is critical. Representations before the projection head transfer better than those after it. The projection head discards information about augmentations that is useful for downstream tasks.
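The batch-wise form of the loss (often called NT-Xent) can be sketched as follows, assuming rows `i` and `i+N` of `z` are the two L2-normalized views of example `i`; the layout is one common convention, not the only one:

```python
# NT-Xent (SimCLR-style batch-wise InfoNCE) in NumPy. Assumes z holds 2N
# L2-normalized projections with rows i and i+N being views of example i.
import numpy as np

def nt_xent(z, tau=0.5):
    n2 = z.shape[0]                      # 2N views in total
    n = n2 // 2
    sim = (z @ z.T) / tau                # pairwise cosine similarities (z normalized)
    np.fill_diagonal(sim, -np.inf)       # a view is never its own negative
    # index of each view's positive: i <-> i+N
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))   # exp(-inf) = 0 drops the diagonal
    return np.mean(logsumexp - sim[np.arange(n2), pos])
```

When each view sits next to its true partner the loss is small; pairing each anchor with the wrong view inflates it, which is exactly the $(K+1)$-way classification at work.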
MoCo
MoCo (He et al., 2020) decouples the number of negatives from the batch size using a momentum-updated encoder and a queue of past representations. The query encoder is updated by gradient descent; the key encoder tracks it via exponential moving average: $\theta_k \leftarrow m \theta_k + (1 - m)\theta_q$ with $m = 0.999$. This allows a large, consistent negative set (a queue of 65,536 negatives) with normal batch sizes.
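The two mechanisms reduce to a few lines. In this sketch `theta_q` and `theta_k` stand in for flattened encoder parameters, and the queue is a plain FIFO of past key representations:

```python
# Momentum update and negative queue in the style of MoCo; a sketch,
# with theta_q / theta_k standing in for query/key encoder parameters.
import numpy as np
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA step: the key encoder slowly tracks the query encoder."""
    return m * theta_k + (1 - m) * theta_q

# Fixed-size FIFO: once full, the newest keys evict the oldest ones,
# so the negative set stays large and roughly consistent over time.
queue = deque(maxlen=65536)

def enqueue(keys):
    for k in keys:
        queue.append(k)
```

Because $m$ is close to 1, the key encoder changes slowly, keeping old queue entries approximately comparable with fresh queries.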
CLIP: Image-Text Contrastive
CLIP (Radford et al., 2021) applies contrastive learning across modalities. Given a batch of $N$ (image, text) pairs, pull the $N$ matching pairs together and push the $N^2 - N$ non-matching pairs apart. The loss is symmetric InfoNCE applied in both the image-to-text and text-to-image directions. CLIP learns representations that enable zero-shot transfer via natural language prompts.
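A sketch of the symmetric objective over a batch of embeddings. The temperature 0.07 matches CLIP's reported initialization (it is learned in the actual model), and the function name is illustrative:

```python
# Symmetric InfoNCE over a batch of (image, text) pairs, CLIP-style.
# NumPy sketch; matching pairs sit on the diagonal of the logit matrix.
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau                 # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])

    def xent(l):
        """Mean cross-entropy with the diagonal as the correct class."""
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(logp[idx, idx])

    # image-to-text direction + text-to-image direction
    return 0.5 * (xent(logits) + xent(logits.T))
```

Shuffling the text embeddings against the images moves the matching pairs off the diagonal and the loss rises, which is what makes the diagonal labels work.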
Common Confusions
More negatives is not always better in practice
The MI bound says $\log(K+1)$ is the ceiling. But beyond a certain $K$, the practical gains diminish and the compute cost grows quadratically (all pairwise similarities). Diminishing returns typically set in at a few thousand negatives for vision tasks.
Contrastive learning does not maximize all mutual information
It maximizes MI between views as filtered by the augmentation distribution. If augmentations destroy color information, the learned representation will not encode color. The augmentation policy implicitly defines what information is task-relevant.
The projection head is not the final representation
SimCLR trains with a projection head, but you discard it at evaluation time. The representation before the projection head generalizes better because the head absorbs augmentation-specific information that hurts transfer.
Key Takeaways
- Contrastive learning optimizes a lower bound on mutual information between views, capped at $\log(K+1)$ bits for $K$ negatives
- The loss balances alignment (positive pairs close) and uniformity (representations spread on the hypersphere)
- Augmentation choice determines what information is preserved in the learned representation
- Large negative sets improve the bound; MoCo achieves this with a momentum encoder and queue
- CLIP extends the paradigm to cross-modal (image-text) contrastive learning
Exercises
Problem
You train SimCLR with batch size $N$, producing $2N$ augmented views. How many negative pairs does each anchor have? What is the upper bound on the mutual information recoverable from the InfoNCE loss?
Problem
Explain why the projection head in SimCLR improves downstream performance even though it is discarded. Specifically: if color jitter is used as an augmentation, what information does the projection head learn to discard, and why is that harmful for downstream tasks?
Related Comparisons
References
Canonical:
- Oord et al., "Representation Learning with Contrastive Predictive Coding" (2018), Section 2
- Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR, 2020)
Current:
- He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (MoCo, 2020)
- Wang & Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity" (ICML 2020)
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
Next Topics
- Data augmentation theory: why augmentation choices determine representation quality
- Diffusion models: an alternative generative paradigm that displaced contrastive pretraining in some domains
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)