
Contrastive Learning

Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.


Why This Matters

Contrastive learning is the dominant paradigm for self-supervised representation learning. It produces representations competitive with supervised pretraining while using no labels. The key models in this space (CLIP, SimCLR, MoCo) all use contrastive objectives. Understanding the theory behind contrastive learning explains why augmentation choices matter, why large batch sizes help, and what the loss function actually optimizes.

The Contrastive Setup

Given an input $x$, create a positive view $x^+$ by applying random augmentations. Sample $N-1$ other inputs $\{x_1^-, \ldots, x_{N-1}^-\}$ as negatives. Learn an encoder $f$ such that $f(x)$ and $f(x^+)$ are close while $f(x)$ and each $f(x_i^-)$ are far apart.

Definition

Positive Pair

Two views $(x_i, x_j)$ form a positive pair if they are derived from the same underlying data point (e.g., two augmentations of the same image). The contrastive objective trains the encoder to map positive pairs to nearby points in representation space.

Definition

InfoNCE Loss

For anchor $z$, positive $z^+$, and $N-1$ negatives $\{z_1^-, \ldots, z_{N-1}^-\}$, where $z = f(x)$:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z, z^+)/\tau)}{\exp(\mathrm{sim}(z, z^+)/\tau) + \sum_{j=1}^{N-1} \exp(\mathrm{sim}(z, z_j^-)/\tau)}$$

Here $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity and $\tau > 0$ is a temperature parameter. This is a softmax cross-entropy over $N$ options where the correct answer is the positive pair.
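The definition above translates directly into code. A minimal NumPy sketch of the loss for a single anchor (the encoder is omitted; inputs are assumed to be precomputed representation vectors):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE for a single anchor, with cosine similarity as sim(.,.)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    # Softmax cross-entropy with the positive (index 0) as the correct class.
    return np.log(np.exp(logits).sum()) - logits[0]

d = 8
basis = np.eye(d)
# Perfectly aligned positive, orthogonal negatives: loss near 0.
loss_aligned = info_nce(basis[0], basis[0], basis[1:])
# Uninformative encoder (all similarities 0): chance-level loss log(7),
# since there are 7 candidates (1 positive + 6 negatives).
loss_chance = info_nce(basis[0], basis[1], basis[2:])
```

The two extremes bracket the loss: a perfect encoder drives it toward zero, while an encoder that carries no information about the pairing sits at the chance level $\log N$.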

Core Theory

Theorem

InfoNCE as Mutual Information Lower Bound

Statement

The InfoNCE loss with $N-1$ negative samples (so $N$ candidates including the positive) satisfies:

$$I(X; X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$

where $I(X; X^+)$ is the mutual information between the anchor and positive views. Minimizing InfoNCE therefore maximizes a lower bound on mutual information.

Intuition

InfoNCE is an $N$-way classification problem: identify the positive among $N$ candidates. Perfect classification (loss $= 0$) implies the representations capture $\log N$ nats of shared information. The bound is tight when the critic (similarity function) is optimal.

Proof Sketch

Write the InfoNCE objective as a density-ratio estimation problem. The optimal critic satisfies $\mathrm{sim}(z, z^+)/\tau \propto \log \frac{p(z^+ \mid z)}{p(z^+)}$. Apply the variational bound on mutual information from Barber and Agakov (2003). The $\log N$ term appears because the bound saturates: you cannot extract more than $\log N$ nats from an $N$-way classification.

Why It Matters

This explains two empirical observations. First, larger batch sizes (more negatives) give better representations because they raise the $\log N$ ceiling on recoverable mutual information. Second, the quality of augmentations matters because they determine what information is shared between positive pairs and thus what $I(X; X^+)$ contains.

Failure Mode

The bound becomes loose when $N$ is small relative to the true mutual information. With $N = 256$, you can recover at most $\log 256 \approx 5.5$ nats (8 bits). If the task requires more shared information, increasing the batch size is necessary. Also, the bound says nothing about which information is captured; bad augmentations can make the model learn shortcuts.
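As a quick numeric check of the ceiling (assuming, as is standard, that the loss uses the natural logarithm, so the bound is in nats):

```python
import numpy as np

# The log N ceiling on recoverable mutual information, in nats and in bits.
for n in (256, 4096, 65536):
    print(f"N={n:6d}: ceiling = {np.log(n):5.2f} nats = {np.log2(n):5.2f} bits")
```

Note the conversion: $\log_2 N = \log N / \log 2$, so $N = 256$ gives about 5.5 nats, which is exactly 8 bits.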

Proposition

Alignment and Uniformity Decomposition

Statement

The contrastive loss decomposes into two competing objectives:

Alignment: $\mathbb{E}_{(x,x^+) \sim p_{\text{pos}}} \|f(x) - f(x^+)\|^2$

Uniformity: $\log \mathbb{E}_{(x,x') \sim p_{\text{data}}^2}\, e^{-2\|f(x) - f(x')\|^2}$

Good contrastive representations minimize the alignment loss (positive pairs are close) while also minimizing the uniformity loss (representations spread evenly over the hypersphere).

Intuition

Alignment alone leads to collapse (map everything to one point). Uniformity alone leads to random representations. The contrastive loss balances both: pull positives together, spread everything else uniformly.

Proof Sketch

Decompose the InfoNCE gradient into a term pulling positive pairs together (alignment) and a term pushing random pairs apart (uniformity). On the unit sphere, the uniform distribution maximizes entropy, and the repulsive term in InfoNCE pushes toward this maximum entropy distribution.

Why It Matters

This decomposition explains why contrastive methods avoid representation collapse without explicit mechanisms like stop-gradients (used in BYOL/SimSiam). The negative samples provide the uniformity pressure that prevents collapse.

Failure Mode

If the number of negatives is too small, uniformity pressure is weak and representations can partially collapse (cluster into a few modes instead of spreading uniformly). Temperature $\tau$ also mediates this tradeoff: very small $\tau$ overweights hard negatives at the expense of uniformity.
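The two quantities can be computed directly. A NumPy sketch, assuming the Wang & Isola form with $t = 2$ in the uniformity term, contrasting a spread-out embedding with a fully collapsed one:

```python
import numpy as np

def alignment_loss(z1, z2):
    """Mean squared distance between matched positive pairs (row i with row i)."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

def uniformity_loss(z, t=2.0):
    """log E[exp(-t ||z_i - z_j||^2)] over distinct pairs of representations."""
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(z), k=1)
    return np.log(np.mean(np.exp(-t * sq[i, j])))

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(0)
spread = normalize(rng.normal(size=(128, 16)))    # roughly uniform on the sphere
collapsed = np.tile(normalize(rng.normal(size=(1, 16))), (128, 1))  # full collapse
u_spread, u_collapsed = uniformity_loss(spread), uniformity_loss(collapsed)
```

Collapse achieves perfect closeness but the worst possible uniformity loss (exactly 0, since every pairwise distance is 0), while the spread embedding drives the uniformity loss strongly negative.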

Key Architectures

SimCLR

SimCLR (Chen et al., 2020) applies two random augmentations to each image in a batch of size $B$, producing $2B$ views. Every other augmented view in the batch serves as a negative. The loss uses cosine similarity with temperature $\tau = 0.5$ and a 2-layer MLP projection head on top of the encoder.

Key finding: the projection head is critical. Representations taken before the projection head transfer better than those taken after it, because the head discards augmentation-related information that downstream tasks still need.
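The batched SimCLR objective (often called NT-Xent) can be sketched in NumPy as follows; this is an illustrative implementation over precomputed, L2-normalized view embeddings, not the reference code:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2B L2-normalized rows; rows 2i and 2i+1 are a positive pair."""
    n = len(z)                          # n = 2B views
    sim = z @ z.T / tau                 # cosine similarities (rows are unit norm)
    np.fill_diagonal(sim, -np.inf)      # exclude self-similarity from candidates
    partner = np.arange(n) ^ 1          # partner index: 0<->1, 2<->3, ...
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - sim[np.arange(n), partner])

# Toy batch B=4: the two views in each pair are identical, different
# pairs are orthogonal, so the loss sits well below the chance level
# log(2B - 1) = log(7) of an uninformative encoder.
z = np.repeat(np.eye(4), 2, axis=0)    # shape (8, 4), already unit norm
loss = nt_xent(z)
```

Each of the $2B$ views acts as an anchor in turn, with its partner as the positive and the remaining $2B-2$ views as negatives, matching the description above.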

MoCo

MoCo (He et al., 2020) decouples the number of negatives from the batch size using a momentum-updated encoder and a queue of past representations. The query encoder is updated by gradient descent; the key (momentum) encoder tracks it via an exponential moving average: $\theta_k \leftarrow m \theta_k + (1-m) \theta_q$ with $m = 0.999$. This allows a large, consistent negative set (65,536 queued keys) with normal batch sizes.
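The two mechanisms are simple to state in code. A toy NumPy sketch (parameters reduced to a single vector, and a random array standing in for encoded keys; not MoCo's actual training loop):

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update: the key encoder slowly tracks the query encoder."""
    return m * theta_k + (1 - m) * theta_q

def enqueue(queue, new_keys):
    """FIFO queue of past key representations: drop oldest, append newest."""
    return np.concatenate([queue[len(new_keys):], new_keys])

rng = np.random.default_rng(0)
dim, queue_size, batch = 16, 1024, 32
theta_q = rng.normal(size=dim)          # query-encoder params (toy: one vector)
theta_k = theta_q.copy()                # key encoder initialized as a copy
queue = rng.normal(size=(queue_size, dim))

theta_q = theta_q + 0.1 * rng.normal(size=dim)       # stand-in for a gradient step
theta_k = momentum_update(theta_k, theta_q)          # key encoder barely moves
queue = enqueue(queue, rng.normal(size=(batch, dim)))  # size stays fixed
```

Because $m$ is close to 1, consecutive key encoders differ only slightly, which keeps the queued negatives consistent with the current keys.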

CLIP: Image-Text Contrastive

CLIP (Radford et al., 2021) applies contrastive learning across modalities. Given a batch of (image, text) pairs, pull matching pairs together and push all non-matching pairs apart. The loss is symmetric InfoNCE applied in both the image-to-text and text-to-image directions. CLIP learns representations that enable zero-shot transfer via natural language prompts.
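The symmetric loss can be sketched over precomputed embeddings; this is an illustrative NumPy version (the encoders and CLIP's learned temperature are omitted, $\tau = 0.07$ is assumed):

```python
import numpy as np

def clip_loss(img, txt, tau=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau          # (B, B); diagonal entries are matches
    idx = np.arange(len(img))

    def xent(l):                        # cross-entropy with diagonal labels
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(logp[idx, idx])

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

matched = np.eye(4)                     # toy batch: pair i maps to embedding e_i
loss_matched = clip_loss(matched, matched)
loss_shuffled = clip_loss(matched, matched[[1, 2, 3, 0]])  # wrong pairings
```

With perfectly matched embeddings the loss is near zero; shuffling the text rows misaligns every pair and the loss becomes large.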

Common Confusions

Watch Out

More negatives is not always better in practice

The MI bound says $\log N$ is the ceiling. But beyond a certain $N$, the practical gains diminish and compute cost grows quadratically (all pairwise similarities). Diminishing returns set in around $N \sim 4096$ for vision tasks.

Watch Out

Contrastive learning does not maximize all mutual information

It maximizes MI between views as filtered by the augmentation distribution. If augmentations destroy color information, the learned representation will not encode color. The augmentation policy implicitly defines what information is task-relevant.
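A toy illustration of this point, using a hypothetical color-destroying augmentation (not from the source):

```python
import numpy as np

def grayscale(img):
    """Toy color-destroying augmentation: average the RGB channels."""
    return img.mean(axis=-1, keepdims=True)

red = np.zeros((4, 4, 3)); red[..., 0] = 0.9    # a red patch
blue = np.zeros((4, 4, 3)); blue[..., 2] = 0.9  # a blue patch
# After augmentation the two patches are identical, so no contrastive
# objective defined on the augmented views can learn to encode color:
identical = np.allclose(grayscale(red), grayscale(blue))
```

Any information the augmentation destroys is absent from both views, so it contributes nothing to $I(X; X^+)$ and cannot be recovered by the encoder.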

Watch Out

The projection head is not the final representation

SimCLR trains with a projection head, but you discard it at evaluation time. The representation before the projection head generalizes better because the head absorbs augmentation-specific information that hurts transfer.

Key Takeaways

  • Contrastive learning optimizes a lower bound on mutual information between views, capped at $\log N$ nats for $N$ candidates ($N-1$ negatives)
  • The loss balances alignment (positive pairs close) and uniformity (representations spread on the hypersphere)
  • Augmentation choice determines what information is preserved in the learned representation
  • Large negative sets improve the bound; MoCo achieves this with a momentum encoder and queue
  • CLIP extends the paradigm to cross-modal (image-text) contrastive learning

Exercises

ExerciseCore

Problem

You train SimCLR with batch size $B = 128$, producing $2B = 256$ views. How many negative pairs does each anchor have? What is the upper bound on mutual information recoverable from the InfoNCE loss?

ExerciseAdvanced

Problem

Explain why the projection head in SimCLR improves downstream performance even though it is discarded. Specifically: if color jitter is used as an augmentation, what information does the projection head learn to discard, and why is that harmful for downstream tasks?


References

Canonical:

  • Oord et al., "Representation Learning with Contrastive Predictive Coding" (2018), Section 2
  • Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR, 2020)

Current:

  • He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (MoCo, 2020)
  • Wang & Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity" (ICML 2020)
  • Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)


Last reviewed: April 2026
