Modern Generalization
Representation Learning Theory
What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.
Why This Matters
A neural network's intermediate layers compute a representation $Z$ of the input $X$. The quality of this representation determines transfer learning performance, robustness, and generalization. Two models with identical test accuracy can have wildly different representations, and the one with better representations will transfer better to new tasks, be more robust to distribution shift, and require less data for fine-tuning.
The theory of representation learning asks: given an input $X$ and a task $Y$, what properties should the representation $Z$ have? Multiple frameworks give partial answers, but no single theory is complete.
What Makes a Good Representation
Three desirable properties appear across frameworks:
Sufficiency. $Z$ should contain all information in $X$ that is relevant to $Y$. Formally, $Y \to X \to Z$ forms a Markov chain such that $I(Z; Y) = I(X; Y)$, where $I(\cdot\,;\cdot)$ denotes mutual information.
Minimality. $Z$ should discard information in $X$ that is irrelevant to $Y$. This means $I(X; Z)$ should be as small as possible, subject to the sufficiency constraint.
Disentanglement. Different generative factors of the data should map to independent components of $Z$. If $X$ is an image of a face, then pose, lighting, and identity should be encoded in separate dimensions of $Z$.
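Sufficiency and minimality can be checked numerically for small discrete distributions. The sketch below uses a toy parity task (invented here for illustration): $X$ is uniform on $\{0,1,2,3\}$, $Y = X \bmod 2$, and the candidate representation $Z = X \bmod 2$ turns out to be both sufficient and minimal.

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in nats from a joint probability table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

# Toy task: X uniform on {0,1,2,3}, Y = X mod 2 (only parity matters).
p_xy = np.zeros((4, 2))
for x in range(4):
    p_xy[x, x % 2] = 0.25          # joint p(x, y)

# Candidate representation Z = X mod 2.
p_zy = np.zeros((2, 2))
p_xz = np.zeros((4, 2))
for x in range(4):
    p_zy[x % 2, x % 2] += 0.25     # joint p(z, y)
    p_xz[x, x % 2] = 0.25          # joint p(x, z)

print(mutual_info(p_xy))           # I(X;Y) = log 2 ≈ 0.693
print(mutual_info(p_zy))           # I(Z;Y) = I(X;Y): Z is sufficient
print(mutual_info(p_xz))           # I(X;Z) = log 2, not log 4 = H(X): minimal
```

The identity representation $Z = X$ would also be sufficient, but with $I(X; Z) = \log 4$: it keeps a bit that the task never uses.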
Information Bottleneck
The information bottleneck (Tishby et al., 2000) finds a representation $Z$ that solves:

$$\min_{p(z \mid x)} \; I(X; Z) - \beta\, I(Z; Y)$$

where $\beta$ controls the trade-off between compression ($I(X; Z)$ small) and prediction ($I(Z; Y)$ large).
Information Bottleneck Optimality
Statement
The optimal IB representation satisfies:

$$p(z \mid x) = \frac{p(z)}{Z(x, \beta)} \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid z)\big)\Big)$$

where $Z(x, \beta)$ is a normalizing constant. The IB curve (plotting $I(Z; Y)$ vs. $I(X; Z)$ as $\beta$ varies) is concave and traces the Pareto frontier of the compression-prediction trade-off.
Intuition
The optimal representation keeps information about $X$ only to the extent that it helps predict $Y$. The KL divergence term $D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid z)\big)$ measures how much predictive power about $Y$ is lost by compressing $x$ into $z$. The parameter $\beta$ controls how aggressively you compress.
Proof Sketch
This is a constrained optimization problem solvable via Lagrange multipliers. The Lagrangian is $\mathcal{L} = I(X; Z) - \beta\, I(Z; Y)$. Taking functional derivatives with respect to $p(z \mid x)$ and setting them to zero yields the self-consistent equation above. The concavity of the IB curve follows from the concavity of mutual information in the conditional distribution $p(z \mid x)$.
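The self-consistent equation suggests an alternating-minimization procedure for discrete $X, Y$, analogous to Blahut-Arimoto: recompute $p(z)$ and $p(y \mid z)$ from the current encoder, then apply the update rule. Below is a minimal sketch under the assumption of a tabular joint $p(x, y)$; it is an illustration, not a reference implementation.

```python
import numpy as np

def ib_fixed_point(p_xy, n_z, beta, iters=300, seed=0, eps=1e-12):
    """Iterate the IB self-consistent equations for a tabular joint p(x, y).
    Returns the encoder p(z|x)."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    p_z_given_x = rng.random((n_x, n_z))               # random soft init
    p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    for _ in range(iters):
        p_z = p_x @ p_z_given_x                        # marginal p(z)
        p_y_given_z = (p_z_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_z /= p_z[:, None] + eps              # decoder p(y|z)
        # kl[x, z] = D_KL(p(y|x) || p(y|z))
        ratio = (p_y_given_x[:, None, :] + eps) / (p_y_given_z[None, :, :] + eps)
        kl = np.einsum('xy,xzy->xz', p_y_given_x, np.log(ratio))
        p_z_given_x = p_z[None, :] * np.exp(-beta * kl)  # the update rule
        p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    return p_z_given_x

# Parity task: Y = X mod 2 for X uniform on {0,1,2,3}.
p_xy = np.zeros((4, 2))
for x in range(4):
    p_xy[x, x % 2] = 0.25
enc = ib_fixed_point(p_xy, n_z=2, beta=5.0)
print(np.round(enc, 3))   # rows for x=0,2 match; likewise x=1,3
```

Because the update depends on $x$ only through $p(y \mid x)$, inputs with identical conditionals receive identical encoders after one iteration; with large $\beta$ the encoder tends toward a hard clustering by parity, while small $\beta$ trades prediction for compression.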
Why It Matters
The IB framework formalizes the intuition that good representations compress irrelevant information while preserving task-relevant information. Shwartz-Ziv and Tishby (2017) conjectured that deep learning implicitly performs IB optimization during training (the "compression phase" hypothesis). This claim generated significant debate.
Failure Mode
The IB framework requires knowing the joint distribution $p(x, y)$, which is unavailable in practice. Variational approximations (VIB; Alemi et al., 2017) replace exact mutual information with tractable bounds, but the bounds can be loose. The "compression phase" hypothesis has also been shown to depend on the activation function: networks with ReLU activations do not always exhibit compression (Saxe et al., 2018).
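Concretely, the VIB objective of Alemi et al. replaces each mutual-information term with a tractable bound: a variational decoder $q(y \mid z)$ lower-bounds $I(Z; Y)$, and a fixed prior $r(z)$ upper-bounds $I(X; Z)$, giving the per-example loss

$$\mathcal{L}_{\text{VIB}} = \mathbb{E}_{p(z \mid x)}\!\left[-\log q(y \mid z)\right] + \beta\, D_{\mathrm{KL}}\!\big(p(z \mid x)\,\|\,r(z)\big)$$

which is minimized over the encoder $p(z \mid x)$ and decoder $q(y \mid z)$ by stochastic gradients.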
Contrastive Learning
Contrastive learning learns representations by pulling together representations of "positive pairs" (semantically similar inputs) and pushing apart "negative pairs" (dissimilar inputs).
InfoNCE Loss
Given a query $x$, a positive example $x^+$, and $N - 1$ negative examples $x_1^-, \dots, x_{N-1}^-$, the InfoNCE loss is:

$$\mathcal{L}_{\text{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{\exp\!\big(f(x)^\top f(x^+)/\tau\big)}{\exp\!\big(f(x)^\top f(x^+)/\tau\big) + \sum_{i=1}^{N-1} \exp\!\big(f(x)^\top f(x_i^-)/\tau\big)}\right]$$

where $f$ is the encoder and $\tau$ is a temperature parameter.
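A direct NumPy transcription for a single query makes the definition concrete. This is a sketch with dot-product similarity between unit-norm embeddings (an assumed but common choice); the second print shows the mutual-information estimate $\log N - \mathcal{L}$ implied by the bound below.

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.1):
    """InfoNCE loss (nats) for one query, one positive, N-1 negatives.
    query, positive: (d,); negatives: (N-1, d)."""
    scores = np.concatenate(([query @ positive], negatives @ query)) / tau
    scores -= scores.max()              # numerical stability
    return -scores[0] + np.log(np.exp(scores).sum())

rng = np.random.default_rng(0)
d, n_neg = 128, 63                      # N = 64 candidates in total
z = rng.standard_normal(d)
z /= np.linalg.norm(z)
negatives = rng.standard_normal((n_neg, d))
negatives /= np.linalg.norm(negatives, axis=1, keepdims=True)

loss = info_nce(z, z, negatives)        # positive is an exact match
print(loss)                             # small: the positive is easy to spot
print(np.log(n_neg + 1) - loss)         # MI estimate, capped at log 64 ≈ 4.16
```

When all scores are equal (the encoder is uninformative), the loss is exactly $\log N$ and the mutual-information estimate is zero.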
InfoNCE Bounds Mutual Information
Statement
The InfoNCE loss provides a lower bound on the mutual information between $x$ and $x^+$:

$$I(x; x^+) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$$

The bound becomes tighter as $N \to \infty$, but is capped at $\log N$ regardless of the true mutual information.
Intuition
InfoNCE is an $N$-way classification problem: identify the positive from $N - 1$ negatives. A perfect classifier achieves loss $0$, giving the bound $\log N$, which is the capacity of the classification task. You cannot estimate mutual information higher than $\log N$ with only $N - 1$ negatives, no matter how good the encoder is.
Proof Sketch
The InfoNCE loss is the cross-entropy for the $N$-way classification problem. The Bayes-optimal classifier uses the density ratio $p(x^+ \mid x)/p(x^+)$. Substituting the optimal classifier into the loss gives $\mathcal{L}_{\text{InfoNCE}} \approx \log N - I(x; x^+)$ when $N$ is large, yielding the bound after rearranging.
Why It Matters
This result explains the empirical observation that contrastive learning needs large batch sizes (large ). SimCLR uses batches of 4096-8192. With small , the bound is loose and the encoder cannot distinguish fine-grained features. With large , the bound tightens and the encoder must learn more detailed representations to discriminate positives from negatives.
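These batch sizes translate directly into caps on the estimable mutual information (treating batch size as $N$), which a two-line check makes concrete:

```python
import numpy as np

# Bound cap log N in nats for common contrastive batch sizes.
for n in (256, 1024, 4096, 8192):
    print(f"N = {n:5d}  cap = {np.log(n):.2f} nats")  # 5.55, 6.93, 8.32, 9.01
```

Even the largest SimCLR batches cap the estimate at roughly 9 nats, far below the information content of a natural image.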
Failure Mode
The bound saturates at $\log N$. If the true mutual information is much larger (e.g., for high-resolution images), you need exponentially many negatives to get a tight bound. Also, the bound says nothing about which information is captured. The encoder might learn features that distinguish positives from negatives without capturing task-relevant information.
Sufficient Statistics View
Sufficient Representation
A representation $Z$ is sufficient for task $Y$ if $I(Z; Y) = I(X; Y)$, equivalently if $X \perp Y \mid Z$. A sufficient representation captures all task-relevant information and discards nothing useful.
A sufficient representation is the ideal. In practice, you want a representation that is approximately sufficient for a family of downstream tasks, not just one. This connects to the notion of a universal representation: one that is sufficient for many tasks simultaneously.
Rate-distortion connection. Rate-distortion theory (Shannon, 1959) asks: what is the minimum number of bits needed to describe $X$ such that the reconstruction error is at most $D$? The information bottleneck extends this by replacing reconstruction error with prediction error. The IB curve is a generalization of the rate-distortion curve from compression theory to supervised learning.
Common Confusions
Disentanglement is not well-defined without inductive bias
Locatello et al. (2019) proved that unsupervised disentanglement is impossible without inductive biases: for any disentangled representation, there exists a generative model consistent with the data that maps to an entangled representation. You need assumptions about the data-generating process (e.g., independence of factors) or supervision to achieve disentanglement.
Maximizing mutual information is not always good
A representation that maximizes $I(X; Z)$ is just the identity function $Z = X$. You want to maximize $I(Z; Y)$ while keeping $I(X; Z)$ small. Representations that capture too much about $X$ encode noise and spurious correlations that hurt generalization.
Contrastive learning does not require labels
Contrastive learning constructs positive pairs from data augmentation (two views of the same image) rather than from labels. This is self-supervised. However, the quality of the representation depends heavily on the augmentation strategy: the augmentations implicitly define what information the representation should be invariant to.
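To make the last point concrete, here is a minimal sketch of augmentation-defined positive pairs: two random crops of the same array stand in for two "views". The image shape and crop size are arbitrary choices for illustration; the key point is that whatever the crops do not share (here, absolute position) is exactly the information the encoder is encouraged to discard.

```python
import numpy as np

def two_views(image, rng, crop=24):
    """Return two random crops of the same image: a positive pair.
    Crops of *different* images serve as negatives."""
    h, w = image.shape
    views = []
    for _ in range(2):
        top = int(rng.integers(0, h - crop + 1))
        left = int(rng.integers(0, w - crop + 1))
        views.append(image[top:top + crop, left:left + crop])
    return views

rng = np.random.default_rng(0)
image = rng.random((32, 32))
v1, v2 = two_views(image, rng)
print(v1.shape, v2.shape)   # (24, 24) (24, 24)
```

Swapping the crop for a color jitter or blur changes the induced invariance, and hence which downstream tasks the representation serves well.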
Exercises
Problem
In the InfoNCE loss with $N - 1$ negative samples, what is the maximum mutual information you can estimate? If your dataset has images with true mutual information of 20 nats between two augmented views, approximately how many negatives do you need for a tight bound?
Problem
Prove that a minimal sufficient statistic for $Y$ achieves the endpoint of the information bottleneck curve (maximum $I(Z; Y)$ with minimum $I(X; Z)$). Why does this imply that the IB objective at $\beta \to \infty$ recovers the minimal sufficient statistic?
References
Canonical:
- Tishby, Pereira, Bialek, "The Information Bottleneck Method" (2000)
- Oord, Li, Vinyals, "Representation Learning with Contrastive Predictive Coding" (2018)
Current:
- Locatello et al., "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (ICML 2019)
- Alemi et al., "Deep Variational Information Bottleneck" (ICLR 2017)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)
- Variational Autoencoders (Layer 3)
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)