Contrastive Learning
Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.
Why This Matters
Contrastive learning is the dominant paradigm for self-supervised representation learning. It produces representations competitive with supervised pretraining while using no labels. The key models in this space (CLIP, SimCLR, MoCo) all use contrastive objectives. Understanding the theory behind contrastive learning explains why augmentation choices matter, why large batch sizes help, and what the loss function actually optimizes.
The Contrastive Setup
Given an input $x$, create two views $v_1, v_2$ (a positive pair) by applying random augmentations. Sample $K$ other inputs as negatives. Learn an encoder $f$ such that $f(v_1)$ and $f(v_2)$ are close while $f(v_1)$ and the encoded negatives are far apart.
Positive Pair
Two views form a positive pair if they are derived from the same underlying data point (e.g., two augmentations of the same image). The contrastive objective trains the encoder to map positive pairs to nearby points in representation space.
InfoNCE Loss
For anchor $x$ with representation $z = f(x)$, positive $x^+$ with $z^+ = f(x^+)$, and negatives $x^-_i$ with $z^-_i = f(x^-_i)$ for $i = 1, \dots, K$:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z, z^+)/\tau)}{\exp(\mathrm{sim}(z, z^+)/\tau) + \sum_{i=1}^{K} \exp(\mathrm{sim}(z, z^-_i)/\tau)}$$

Here $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity and $\tau$ is a temperature parameter. This is a softmax cross-entropy over $K+1$ options where the correct answer is the positive pair.
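As a concrete sketch, the loss can be computed directly from raw vectors. The function names here (`cosine_sim`, `info_nce`) are illustrative, not from any library:

```python
# Minimal InfoNCE in NumPy: softmax cross-entropy over K+1 candidates,
# where index 0 is the positive. Illustrative sketch, not a library API.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def info_nce(anchor, positive, negatives, tau=0.1):
    """Negative log-probability of picking the positive among K+1 candidates."""
    candidates = np.vstack([positive[None, :], negatives])     # (K+1, d), row 0 = positive
    logits = cosine_sim(anchor[None, :], candidates)[0] / tau  # (K+1,)
    m = logits.max()                                           # stabilize the log-sum-exp
    log_prob_pos = logits[0] - (m + np.log(np.exp(logits - m).sum()))
    return -log_prob_pos
```

With a well-aligned positive and orthogonal negatives the loss is near zero; swapping the positive for one of the negatives drives it up.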
Core Theory
InfoNCE as Mutual Information Lower Bound
Statement
The InfoNCE loss with $K$ negative samples satisfies:

$$I(v_1; v_2) \geq \log(K+1) - \mathcal{L}_{\text{InfoNCE}}$$

where $I(v_1; v_2)$ is the mutual information between the anchor and positive views. Minimizing InfoNCE maximizes a lower bound on mutual information.
Intuition
InfoNCE is a $(K+1)$-way classification problem: identify the positive among $K+1$ candidates. Perfect classification (loss $\to 0$) implies the representations capture $\log(K+1)$ bits of shared information. The bound is tight when the critic (similarity function) is optimal.
Proof Sketch
Write the InfoNCE objective as a density ratio estimation problem. The optimal critic is proportional to $p(x^+ \mid x)/p(x^+)$. Apply the variational bound on mutual information from Barber and Agakov (2003). The $\log(K+1)$ term appears because the bound saturates: you cannot extract more than $\log(K+1)$ bits from a $(K+1)$-way classification.
Why It Matters
This explains two empirical observations. First, larger batch sizes (more negatives) give better representations because they raise the ceiling on recoverable mutual information. Second, the quality of augmentations matters because they determine what information is shared between positive pairs and thus what $I(v_1; v_2)$ contains.
Failure Mode
The bound becomes loose when $\log(K+1)$ is small relative to the true mutual information. With $K$ negatives, you can recover at most $\log(K+1)$ bits. If the task requires more shared information, increasing the batch size (and hence $K$) is necessary. Also, the bound says nothing about which information is captured; bad augmentations can make the model learn shortcuts.
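A quick numeric reading of the ceiling, taking the logarithm base 2 so that it is expressed in bits (the choice of base is a convention, not part of the bound):

```python
# The log(K+1) ceiling on recoverable mutual information, in bits,
# for a few negative-set sizes. Direct reading of the InfoNCE bound.
import numpy as np

def mi_ceiling_bits(num_negatives):
    """Maximum mutual information (bits) recoverable with this many negatives."""
    return np.log2(num_negatives + 1)

for K in (255, 4095, 65535):
    print(f"K = {K:>6}: at most {mi_ceiling_bits(K):.1f} bits")
```

Even a 65,536-element negative queue caps the bound at 16 bits, which is why the bound is loose whenever the views truly share more information than that.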
Alignment and Uniformity Decomposition
Statement
The contrastive loss decomposes into two competing objectives:
Alignment: $\mathcal{L}_{\text{align}} = \mathbb{E}_{(x, x^+) \sim p_{\text{pos}}} \left[ \| f(x) - f(x^+) \|_2^2 \right]$

Uniformity: $\mathcal{L}_{\text{uniform}} = \log \, \mathbb{E}_{x, y \,\overset{\text{i.i.d.}}{\sim}\, p_{\text{data}}} \left[ e^{-t \| f(x) - f(y) \|_2^2} \right]$ for a scale $t > 0$

Good contrastive representations minimize the alignment loss (positive pairs are close) while also minimizing the uniformity loss (all representations spread evenly on the unit hypersphere).
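Both quantities are cheap to compute as diagnostics. A NumPy sketch, assuming L2-normalized representations and the Gaussian-potential form above (the scale `t=2` is a common default, not a requirement):

```python
# Alignment and uniformity metrics in the style of Wang & Isola (2020).
# Sketch for diagnostics; assumes rows are L2-normalized representations.
import numpy as np

def alignment(z1, z2):
    """Mean squared distance between positive pairs (matching rows of z1, z2)."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

def uniformity(z, t=2.0):
    """Log mean Gaussian potential over all distinct pairs; lower = more uniform."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(z.shape[0], k=1)      # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

A fully collapsed representation scores a perfect alignment of 0 but the worst possible uniformity of 0; spreading points over the sphere drives the uniformity term negative.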
Intuition
Alignment alone leads to collapse (map everything to one point). Uniformity alone leads to random representations. The contrastive loss balances both: pull positives together, spread everything else uniformly.
Proof Sketch
Decompose the InfoNCE gradient into a term pulling positive pairs together (alignment) and a term pushing random pairs apart (uniformity). On the unit sphere, the uniform distribution maximizes entropy, and the repulsive term in InfoNCE pushes toward this maximum entropy distribution.
Why It Matters
This decomposition explains why contrastive methods avoid representation collapse without explicit mechanisms like stop-gradients (used in BYOL/SimSiam). The negative samples provide the uniformity pressure that prevents collapse.
Failure Mode
If the number of negatives is too small, the uniformity pressure is weak and representations can partially collapse (cluster into a few modes instead of spreading uniformly). Temperature also mediates this tradeoff: a very small $\tau$ overweights hard negatives at the expense of uniformity.
Key Architectures
SimCLR
SimCLR (Chen et al., 2020) applies two random augmentations to each image in a batch of size $N$, producing $2N$ views. Every other augmented view in the batch serves as a negative. The loss uses cosine similarity with temperature $\tau$ and a 2-layer MLP projection head on top of the encoder.
Key finding: the projection head is critical. Representations before the projection head transfer better than those after it. The projection head discards information about augmentations that is useful for downstream tasks.
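The batch-wise form of the loss (often called NT-Xent) can be sketched as follows, assuming rows `i` and `i+N` of `z` are the two L2-normalized views of example `i`; the layout is one common convention, not the only one:

```python
# NT-Xent (SimCLR-style batch-wise InfoNCE) in NumPy. Assumes z holds 2N
# L2-normalized projections with rows i and i+N being views of example i.
import numpy as np

def nt_xent(z, tau=0.5):
    n2 = z.shape[0]                      # 2N views in total
    n = n2 // 2
    sim = (z @ z.T) / tau                # pairwise cosine similarities (z normalized)
    np.fill_diagonal(sim, -np.inf)       # a view is never its own negative
    # index of each view's positive: i <-> i+N
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))   # exp(-inf) = 0 drops the diagonal
    return np.mean(logsumexp - sim[np.arange(n2), pos])
```

When each view sits next to its true partner the loss is small; pairing each anchor with the wrong view inflates it, which is exactly the $(K+1)$-way classification at work.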
MoCo
MoCo (He et al., 2020) decouples the number of negatives from the batch size using a momentum-updated encoder and a queue of past representations. The query encoder is updated by gradient descent; the key encoder tracks it via exponential moving average: $\theta_k \leftarrow m \theta_k + (1 - m)\theta_q$ with $m = 0.999$. This allows a large, consistent negative set (a queue of 65,536 negatives) with normal batch sizes.
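The two mechanisms reduce to a few lines. In this sketch `theta_q` and `theta_k` stand in for flattened encoder parameters, and the queue is a plain FIFO of past key representations:

```python
# Momentum update and negative queue in the style of MoCo; a sketch,
# with theta_q / theta_k standing in for query/key encoder parameters.
import numpy as np
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA step: the key encoder slowly tracks the query encoder."""
    return m * theta_k + (1 - m) * theta_q

# Fixed-size FIFO: once full, the newest keys evict the oldest ones,
# so the negative set stays large and roughly consistent over time.
queue = deque(maxlen=65536)

def enqueue(keys):
    for k in keys:
        queue.append(k)
```

Because $m$ is close to 1, the key encoder changes slowly, keeping old queue entries approximately comparable with fresh queries.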
CLIP: Image-Text Contrastive
CLIP (Radford et al., 2021) applies contrastive learning across modalities. Given a batch of $N$ (image, text) pairs, pull the $N$ matching pairs together and push the $N^2 - N$ non-matching pairs apart. The loss is symmetric InfoNCE applied in both the image-to-text and text-to-image directions. CLIP learns representations that enable zero-shot transfer via natural language prompts.
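A sketch of the symmetric objective over a batch of embeddings. The temperature 0.07 matches CLIP's reported initialization (it is learned in the actual model), and the function name is illustrative:

```python
# Symmetric InfoNCE over a batch of (image, text) pairs, CLIP-style.
# NumPy sketch; matching pairs sit on the diagonal of the logit matrix.
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau                 # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])

    def xent(l):
        """Mean cross-entropy with the diagonal as the correct class."""
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(logp[idx, idx])

    # image-to-text direction + text-to-image direction
    return 0.5 * (xent(logits) + xent(logits.T))
```

Shuffling the text embeddings against the images moves the matching pairs off the diagonal and the loss rises, which is what makes the diagonal labels work.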
Common Confusions
More negatives is not always better in practice
The MI bound says $\log(K+1)$ is the ceiling. But beyond a certain $K$, the practical gains diminish and the compute cost grows quadratically (all pairwise similarities). Diminishing returns typically set in at a few thousand negatives for vision tasks.
Contrastive learning does not maximize all mutual information
It maximizes MI between views as filtered by the augmentation distribution. If augmentations destroy color information, the learned representation will not encode color. The augmentation policy implicitly defines what information is task-relevant.
The projection head is not the final representation
SimCLR trains with a projection head, but you discard it at evaluation time. The representation before the projection head generalizes better because the head absorbs augmentation-specific information that hurts transfer.
Key Takeaways
- Contrastive learning optimizes a lower bound on mutual information between views, capped at $\log(K+1)$ bits for $K$ negatives
- The loss balances alignment (positive pairs close) and uniformity (representations spread on the hypersphere)
- Augmentation choice determines what information is preserved in the learned representation
- Large negative sets improve the bound; MoCo achieves this with a momentum encoder and queue
- CLIP extends the paradigm to cross-modal (image-text) contrastive learning
Exercises
Problem
You train SimCLR with batch size $N$, producing $2N$ augmented views. How many negative pairs does each anchor have? What is the upper bound on the mutual information recoverable from the InfoNCE loss?
Problem
Explain why the projection head in SimCLR improves downstream performance even though it is discarded. Specifically: if color jitter is used as an augmentation, what information does the projection head learn to discard, and why is that harmful for downstream tasks?
Related Comparisons
References
Canonical:
- Oord et al., "Representation Learning with Contrastive Predictive Coding" (2018), Section 2
- Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR, 2020)
Current:
- He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (MoCo, 2020)
- Wang & Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity" (ICML 2020)
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
Next Topics
- Data augmentation theory: why augmentation choices determine representation quality
- Diffusion models: an alternative generative paradigm that displaced contrastive pretraining in some domains
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)