
Modern Generalization

Representation Learning Theory

What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.

Advanced · Tier 2 · Current · ~60 min

Why This Matters

A neural network's intermediate layers compute a representation Z = f(X) of the input X. The quality of this representation determines transfer learning performance, robustness, and generalization. Two models with identical test accuracy can have wildly different representations, and the one with better representations will transfer better to new tasks, be more robust to distribution shift, and require less data for fine-tuning.

The theory of representation learning asks: given an input X and a task Y, what properties should Z = f(X) have? Multiple frameworks give partial answers, but no single theory is complete.

What Makes a Good Representation

Three desirable properties appear across frameworks:

Sufficiency. Z should contain all information in X that is relevant to Y. Formally, X → Z → Y forms a Markov chain such that I(Z; Y) = I(X; Y), where I denotes mutual information.

Minimality. Z should discard information in X that is irrelevant to Y. This means I(Z; X) should be as small as possible, subject to the sufficiency constraint.

Disentanglement. Different generative factors of the data should map to independent components of Z. If X is an image of a face, then pose, lighting, and identity should be encoded in separate dimensions of Z.

Information Bottleneck

Definition

Information Bottleneck

The information bottleneck (Tishby et al., 2000) finds a representation Z that solves:

\min_{p(z|x)} I(X; Z) - \beta \, I(Z; Y)

where β > 0 controls the trade-off between compression (I(X; Z) small) and prediction (I(Z; Y) large).
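For discrete X, Y, and Z, the objective can be evaluated directly from probability tables. The sketch below (plain NumPy; the toy joint `p_xy` and the helper names are made up for illustration) scores a stochastic encoder p(z|x) under the IB objective:

```python
import numpy as np

def mutual_information(p_ab):
    """I(A; B) in nats, computed from a joint probability table p(a, b)."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0                      # skip zero cells: 0 * log 0 = 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a * p_b)[mask])))

def ib_objective(p_xy, p_z_given_x, beta):
    """Evaluate I(X; Z) - beta * I(Z; Y) for a stochastic encoder p(z|x)."""
    p_x = p_xy.sum(axis=1)
    p_xz = p_z_given_x * p_x[:, None]    # joint p(x, z)
    p_zy = p_z_given_x.T @ p_xy          # joint p(z, y), via the chain Z - X - Y
    return mutual_information(p_xz) - beta * mutual_information(p_zy)

# Toy joint over 4 input symbols and 2 labels (made-up numbers).
p_xy = np.array([[0.20, 0.05],
                 [0.05, 0.20],
                 [0.15, 0.10],
                 [0.10, 0.15]])
identity_encoder = np.eye(4)             # Z = X: fully predictive, zero compression
print(ib_objective(p_xy, identity_encoder, beta=1.0))
```

The identity encoder pays the full cost I(X; Z) = H(X) while gaining I(Z; Y) = I(X; Y); larger β makes that trade more attractive, smaller β favors compressing Z further.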

Theorem

Information Bottleneck Optimality

Statement

The optimal IB representation Z* satisfies:

p(z|x) = \frac{p(z)}{Z(\beta, x)} \exp\left(-\beta \, D_{\text{KL}}[p(y|x) \| p(y|z)]\right)

where Z(β, x) is a normalizing constant. The IB curve (plotting I(Z; Y) vs. I(Z; X) as β varies) is concave and traces the Pareto frontier of the compression-prediction trade-off.
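For small discrete problems the self-consistent equation can be iterated directly, alternating updates of p(z), p(y|z), and p(z|x) (a Blahut-Arimoto-style scheme). This is an illustrative sketch under made-up data, not a tuned implementation:

```python
import numpy as np

def ib_iterate(p_xy, n_z, beta, n_iter=200, seed=0):
    """Iterate the IB self-consistent equations for discrete X, Y."""
    rng = np.random.default_rng(seed)
    n_x = p_xy.shape[0]
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    # random stochastic initialization of the encoder p(z|x)
    p_z_given_x = rng.random((n_x, n_z))
    p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_z = p_z_given_x.T @ p_x                          # marginal p(z)
        p_zy = p_z_given_x.T @ p_xy                        # joint p(z, y)
        p_y_given_z = p_zy / np.maximum(p_z[:, None], 1e-12)
        # KL[p(y|x) || p(y|z)] for every (x, z) pair
        log_ratio = np.log(np.maximum(p_y_given_x[:, None, :], 1e-12)) \
                  - np.log(np.maximum(p_y_given_z[None, :, :], 1e-12))
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # the self-consistent update: p(z|x) ∝ p(z) exp(-beta * KL)
        p_z_given_x = p_z[None, :] * np.exp(-beta * kl)
        p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    return p_z_given_x

# Toy joint where Y is determined by which half of the alphabet X falls in.
p_xy = np.array([[0.25, 0.00],
                 [0.25, 0.00],
                 [0.00, 0.25],
                 [0.00, 0.25]])
encoder = ib_iterate(p_xy, n_z=2, beta=5.0)
```

Inputs with identical conditionals p(y|x) receive identical KL penalties, so the iteration maps them to the same distribution over z, which is exactly the compression the theorem predicts.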

Intuition

The optimal representation keeps information about X only to the extent that it helps predict Y. The KL divergence term measures how much predictive power about Y is lost by compressing X into Z. The parameter β controls how aggressively you compress.

Proof Sketch

This is a constrained optimization problem solvable via Lagrange multipliers. The Lagrangian is I(X; Z) − β I(Z; Y). Taking functional derivatives with respect to p(z|x) and setting them to zero yields the self-consistent equation above. The concavity of the IB curve follows from the concavity of mutual information in the conditional distribution.

Why It Matters

The IB framework formalizes the intuition that good representations compress irrelevant information while preserving task-relevant information. Shwartz-Ziv and Tishby (2017) conjectured that deep learning implicitly performs IB optimization during training (the "compression phase" hypothesis). This claim generated significant debate.

Failure Mode

The IB framework requires knowing p(X, Y), which is unavailable in practice. Variational approximations (VIB; Alemi et al., 2017) replace exact mutual information with tractable bounds, but the bounds can be loose. The "compression phase" hypothesis has been shown to depend on the activation function: networks with ReLU do not always exhibit compression (Saxe et al., 2018).

Contrastive Learning

Contrastive learning learns representations by pulling together representations of "positive pairs" (semantically similar inputs) and pushing apart "negative pairs" (dissimilar inputs).

Definition

InfoNCE Loss

Given a query x, a positive example x⁺, and K negative examples x₁⁻, …, x_K⁻, the InfoNCE loss is:

\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(f(x)^T f(x^+) / \tau)}{\exp(f(x)^T f(x^+) / \tau) + \sum_{k=1}^{K} \exp(f(x)^T f(x_k^-) / \tau)}

where f is the encoder and τ is a temperature parameter.
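A minimal NumPy sketch of this loss for a single query may help. Following common practice (e.g., SimCLR), the embeddings are L2-normalized before the dot products, which is a design choice on top of the formula above; all names and data here are illustrative stand-ins for encoder outputs:

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.1):
    """InfoNCE loss for a single query, with K = len(negatives)."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    q, pos, neg = unit(query), unit(positive), unit(negatives)
    # (K+1) similarity scores; index 0 is the positive pair
    logits = np.concatenate(([q @ pos], neg @ q)) / tau
    logits -= logits.max()               # stabilize the softmax
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(0)
d, K = 64, 255
x = rng.normal(size=d)                   # stand-in embedding f(x)
x_pos = x + 0.1 * rng.normal(size=d)     # f(x+): a slightly perturbed "view"
x_neg = rng.normal(size=(K, d))          # f(x_k^-): unrelated samples
loss = info_nce(x, x_pos, x_neg)
```

Because the positive is closely aligned with the query while the negatives are random, the loss here lands near zero; shrinking τ or making the views less similar pushes it up toward the log(K+1) ceiling of the classification task.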

Theorem

InfoNCE Bounds Mutual Information

Statement

The InfoNCE loss provides a lower bound on the mutual information between X and X⁺:

I(X; X^+) \geq \log(K+1) - \mathcal{L}_{\text{InfoNCE}}

The bound becomes tighter as K → ∞, but is capped at log(K+1) regardless of the true mutual information.

Intuition

InfoNCE is a (K+1)-way classification problem: identify the positive from K negatives. A perfect classifier achieves loss 0, giving the bound I ≥ log(K+1), which is the capacity of the classification task. You cannot estimate mutual information higher than log(K+1) with only K negatives, no matter how good the encoder is.
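A quick numeric illustration of the cap: since the MI estimate is log(K+1) minus the achieved loss, even a zero-loss critic reports at most log(K+1) nats, and the number of negatives needed grows exponentially in the target MI (the target value below is illustrative):

```python
import numpy as np

# Even a perfect (zero-loss) critic can never report more than log(K+1) nats.
for K in [15, 255, 4095, 65535]:
    cap = np.log(K + 1)
    print(f"K = {K:5d}: MI estimate capped at {cap:.2f} nats")

target_mi = 10.0                    # nats (illustrative target)
k_needed = np.exp(target_mi) - 1    # K must be at least e^I - 1 to reach it
print(f"negatives needed for {target_mi} nats: about {k_needed:,.0f}")
```

Roughly e¹⁰ ≈ 22,000 negatives are required just to make 10 nats reachable, before accounting for how close a trained critic actually gets to the optimum.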

Proof Sketch

The InfoNCE loss is the cross-entropy for the (K+1)-way classification problem. The Bayes-optimal classifier uses the density ratio p(x⁺|x) / p(x⁺). Substituting the optimal classifier into the loss gives log(K+1) − I(X; X⁺) when I(X; X⁺) ≤ log(K+1), yielding the bound after rearranging.

Why It Matters

This result explains the empirical observation that contrastive learning needs large batch sizes (large K). SimCLR uses batches of 4096-8192. With small K, the bound is loose and the encoder cannot distinguish fine-grained features. With large K, the bound tightens and the encoder must learn more detailed representations to discriminate positives from negatives.

Failure Mode

The bound saturates at log(K+1). If the true mutual information is much larger (e.g., for high-resolution images), you need exponentially many negatives to get a tight bound. Also, the bound says nothing about which information is captured. The encoder might learn features that distinguish positives from negatives without capturing task-relevant information.

Sufficient Statistics View

Definition

Sufficient Representation

A representation Z = f(X) is sufficient for task Y if I(Z; Y) = I(X; Y), equivalently if Y ⊥ X | Z. A sufficient representation captures all task-relevant information and discards nothing useful.

A sufficient representation is the ideal. In practice, you want a representation that is approximately sufficient for a family of downstream tasks, not just one. This connects to the notion of a universal representation: one that is sufficient for many tasks simultaneously.

Rate-distortion connection. Rate-distortion theory (Shannon, 1959) asks: what is the minimum number of bits needed to describe X such that the reconstruction error is at most D? The information bottleneck extends this by replacing reconstruction error with prediction error. The IB curve is a generalization of the rate-distortion curve from compression theory to supervised learning.
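Sufficiency is easy to test numerically for discrete toy problems: compare I(Z; Y) against I(X; Y). Below, a hypothetical joint is constructed so that p(y|x) depends only on the parity of x, making parity a sufficient two-valued representation (all numbers are illustrative):

```python
import numpy as np

def mi(p_ab):
    """Mutual information (nats) of a discrete joint distribution table."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log(p_ab[m] / (p_a * p_b)[m])))

# X in {0,1,2,3}, Y in {0,1}; p(y|x) depends only on x mod 2,
# so the parity of X carries all the task-relevant information.
p_xy = np.array([[0.20, 0.05],
                 [0.05, 0.20],
                 [0.20, 0.05],
                 [0.05, 0.20]])
parity = np.array([0, 1, 0, 1])              # candidate representation Z = X mod 2
p_zy = np.zeros((2, 2))
for x_val, z_val in enumerate(parity):
    p_zy[z_val] += p_xy[x_val]               # marginalize X within each Z cell

constant = p_xy.sum(axis=0, keepdims=True)   # Z = const: discards everything
```

Here I(Z; Y) equals I(X; Y) exactly, so parity is sufficient despite halving the alphabet, while the constant representation achieves I(Z; Y) = 0: minimal in the useless sense, because it violates sufficiency.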

Common Confusions

Watch Out

Disentanglement is not well-defined without inductive bias

Locatello et al. (2019) proved that unsupervised disentanglement is impossible without inductive biases: for any disentangled representation, there exists a generative model consistent with the data that maps to an entangled representation. You need assumptions about the data-generating process (e.g., independence of factors) or supervision to achieve disentanglement.

Watch Out

Maximizing mutual information is not always good

A representation that maximizes I(Z; X) is achieved by the identity function (or any invertible map). You want to maximize I(Z; Y) while keeping I(Z; X) small. Representations that capture too much about X encode noise and spurious correlations that hurt generalization.

Watch Out

Contrastive learning does not require labels

Contrastive learning constructs positive pairs from data augmentation (two views of the same image) rather than from labels. This is self-supervised. However, the quality of the representation depends heavily on the augmentation strategy: the augmentations implicitly define what information the representation should be invariant to.

Exercises

ExerciseCore

Problem

In the InfoNCE loss with K = 255 negative samples, what is the maximum mutual information you can estimate? If your dataset has images with true mutual information of 20 nats between two augmented views, approximately how many negatives do you need for a tight bound?

ExerciseAdvanced

Problem

Prove that a minimal sufficient statistic T(X) for Y achieves the endpoint of the information bottleneck curve (maximum I(T; Y) with minimum I(T; X)). Why does this imply that the IB objective at β → ∞ recovers the minimal sufficient statistic?

References

Canonical:

  • Tishby, Pereira, Bialek, "The Information Bottleneck Method" (2000)
  • Oord, Li, Vinyals, "Representation Learning with Contrastive Predictive Coding" (2018)

Current:

  • Locatello et al., "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (ICML 2019)
  • Alemi et al., "Deep Variational Information Bottleneck" (ICLR 2017)

Last reviewed: April 2026
