Representation Learning Theory

Sneiderman, Robby

Modern Generalization

Representation Learning Theory

What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.

ImportantAdvancedTier 2CurrentSupporting~60 min

For:ML

Prerequisites

Information Theory Foundations Variational Autoencoders Equivariant Deep Learning Hyperbolic Embeddings for Graphs

Start 8-question practice · 1 available Prereq Map

Learning position

Read this page in the graph.

modern-generalization | layer 3 | tier 2. This page has 5 direct prerequisites and 0 published dependents.

Open Atlas Prerequisites Leads to

What next

Take the diagnostic

No published continuation is declared yet, so the diagnostic is the clean next route.

Evidence badge

Source-grounded page

This page has no public Lean mapping yet. Use the evidence page to inspect how claim status labels work.

Show the backing system

AtlasOpen the full prerequisite graph and run grounding traces.EvidenceInspect source support, claim labels, and public trust status.LeanReview the checked declaration list, scopes, and axiom profile.

Why This Matters

Representation learning as a statistical bottleneck: keep task signal, discard unstable shortcuts.

A neural network's intermediate layers compute a representation $Z = f(X)$ of the input $X$ . The quality of this representation determines transfer learning performance, robustness, and generalization. Two models with identical test accuracy can have wildly different representations, and the one with better representations will transfer better to new tasks, be more robust to distribution shift, and require less data for fine-tuning.

The theory of representation learning asks: given an input $X$ and a task $Y$ , what properties should $Z = f(X)$ have? Multiple frameworks give partial answers, but no single theory is complete.

What Makes a Good Representation

Three desirable properties appear across frameworks:

Sufficiency. $Z$ should contain all information in $X$ that is relevant to $Y$ . Formally, $X \to Z \to Y$ forms a Markov chain such that $I(Z; Y) = I(X; Y)$ , where $I$ denotes mutual information.

Minimality. $Z$ should discard information in $X$ that is irrelevant to $Y$ . This means $I(Z; X)$ should be as small as possible, subject to the sufficiency constraint.

Disentanglement. Different generative factors of the data should map to independent components of $Z$ . If $X$ is an image of a face, then pose, lighting, and identity should be encoded in separate dimensions of $Z$ .

These properties pull against each other. A representation that is perfectly sufficient for one supervised task can throw away information needed for a new task. A representation that keeps every detail of $X$ transfers broadly but may keep nuisance variation, shortcut features, and private attributes. Representation learning is hard because the training signal rarely tells you which invariances will be valid after deployment.

The diagram separates four objects that are often merged in loose ML writing: the raw input $X$ , the encoder $f$ , the representation $Z$ , and the downstream task. This separation is not pedantry. A loss can improve $Z$ for one task while destroying information needed for another. A representation can be minimal for a supervised label while being too compressed for transfer. A contrastive objective can recover density-ratio information for a positive-pair task while learning a geometry that is brittle under shift.

Lens	Object being controlled	Typical objective	Main risk
Information bottleneck	$I(Z;X)$ and $I(Z;Y)$	compress $X$ while preserving predictive information	variational bounds may be loose
Contrastive learning	similarity of positive and negative pairs	InfoNCE, triplet loss, pairwise sigmoid losses	negatives define the wrong task
Reconstruction	recover enough of $X$ from $Z$	autoencoder or likelihood objective	nuisance detail can dominate
Equivariance and invariance	response to transformations	augmentation, group actions, architecture	invalid invariances erase signal

This gives a practical rule for reading papers: do not ask whether the representation is "good" in isolation. Ask what family of tasks it is approximately sufficient for, what nuisance variation it is supposed to ignore, and what evidence shows that the learned geometry supports that claim.

Research Reading Checklist

When reading a representation-learning paper, separate four claims before accepting the headline result:

Claim type	What to check	Common weak point
Objective claim	What loss is optimized, and what population quantity does it bound?	The surrogate bound may be loose or capped.
Invariance claim	Which transformations should leave the label or downstream task unchanged?	Augmentations can remove information that later matters.
Transfer claim	Which target tasks improve, and are they close to the pretraining distribution?	Linear probes can hide inaccessible nonlinear information.
Causal claim	Does the representation actually mediate the model behavior?	Probe accuracy alone is correlational.

For ML research readiness, this checklist matters because representation papers often mix theorem language, empirical transfer results, and design intuition. The theorem may justify only the contrastive objective; the transfer claim still depends on augmentations, architecture, data scale, and evaluation protocol.

Information Bottleneck

Definition

Information Bottleneck $I B$

The information bottleneck (Tishby et al., 2000) finds a representation $Z$ that solves:

$\min_{p(z|x)} I(X; Z) - \beta \, I(Z; Y)$

where $\beta > 0$ controls the trade-off between compression ( $I(X; Z)$ small) and prediction ( $I(Z; Y)$ large).

Theorem

Information Bottleneck Optimality

Statement

The optimal IB representation $Z^*$ satisfies:

$p(z|x) = \frac{p(z)}{Z(\beta, x)} \exp\left(-\beta \, D_{\text{KL}}[p(y|x) \| p(y|z)]\right)$

where $Z(\beta, x)$ is a normalizing constant. The IB curve (plotting $I(Z;Y)$ vs $I(Z;X)$ as $\beta$ varies) is concave and traces the Pareto frontier of the compression-prediction trade-off.

Intuition

The optimal representation keeps information about $X$ only to the extent that it helps predict $Y$ . The KL divergence term measures how much predictive power about $Y$ is lost by compressing $X$ into $Z$ . The parameter $\beta$ controls how aggressively you compress.

Proof Sketch

This is a constrained optimization problem solvable via Lagrange multipliers. The Lagrangian is $I(X; Z) - \beta I(Z; Y)$ . Taking functional derivatives with respect to $p(z|x)$ and setting to zero yields the self-consistent equation above. The concavity of the IB curve follows from the concavity of mutual information in the conditional distribution.

Why It Matters

The IB framework formalizes the intuition that good representations compress irrelevant information while preserving task-relevant information. Tishby et al. (2017) conjectured that deep learning implicitly performs IB optimization during training (the "compression phase" hypothesis). This claim generated significant debate.

Failure Mode

The IB framework requires knowing $p(X, Y)$ , which is unavailable in practice. Variational approximations (VIB, Alemi et al., 2017) replace exact mutual information with tractable bounds, but the bounds can be loose. The "compression phase" hypothesis has been shown to depend on the activation function: networks with ReLU do not always exhibit compression (Saxe et al., 2018).

report a correction →

Contrastive Learning

Contrastive learning learns representations by pulling together representations of "positive pairs" (semantically similar inputs) and pushing apart "negative pairs" (dissimilar inputs).

Definition

InfoNCE Loss

Given a query $x$ , a positive example $x^+$ , and $K$ negative examples $\{x_1^-, \ldots, x_K^-\}$ , the InfoNCE loss is:

$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(f(x)^T f(x^+) / \tau)}{\exp(f(x)^T f(x^+) / \tau) + \sum_{k=1}^{K} \exp(f(x)^T f(x_k^-) / \tau)}$

where $f$ is the encoder and $\tau$ is a temperature parameter.

Theorem

InfoNCE Bounds Mutual Information

Statement

The InfoNCE loss provides a lower bound on the mutual information between $X$ and $X^+$ :

$I(X; X^+) \geq \log(K+1) - \mathcal{L}_{\text{InfoNCE}}$

The bound becomes tighter as $K \to \infty$ , but is capped at $\log(K+1)$ regardless of the true mutual information.

Intuition

InfoNCE is a $(K+1)$ -way classification problem: identify the positive from $K$ negatives. A perfect classifier achieves loss $0$ , giving the bound $I \geq \log(K+1)$ , which is the capacity of the classification task. You cannot estimate mutual information higher than $\log(K+1)$ with only $K$ negatives, no matter how good the encoder is.

Proof Sketch

The InfoNCE loss is the cross-entropy for the $(K+1)$ -way classification problem. The Bayes-optimal classifier uses the density ratio $p(x^+|x) / p(x^+)$ . Substituting the optimal classifier into the loss gives $\log(K+1) - I(X; X^+)$ when $I(X;X^+) \leq \log(K+1)$ , yielding the bound after rearranging.

Why It Matters

This result explains the empirical observation that contrastive learning needs large batch sizes (large $K$ ). SimCLR uses batches of 4096-8192. With small $K$ , the bound is loose and the encoder cannot distinguish fine-grained features. With large $K$ , the bound tightens and the encoder must learn more detailed representations to discriminate positives from negatives.

Failure Mode

The bound saturates at $\log(K+1)$ . If the true mutual information is much larger (e.g., for high-resolution images), you need exponentially many negatives to get a tight bound. Also, the bound says nothing about which information is captured. The encoder might learn features that distinguish positives from negatives without capturing task-relevant information.

report a correction →

Sufficient Statistics View

Definition

Sufficient Representation

A representation $Z = f(X)$ is sufficient for task $Y$ if and only if $I(Z; Y) = I(X; Y)$ , equivalently if $Y \perp X \mid Z$ . A sufficient representation captures all task-relevant information and discards nothing useful.

A sufficient representation is the ideal. In practice, you want a representation that is approximately sufficient for a family of downstream tasks rather than a single target. This connects to the notion of a universal representation: one that is sufficient for many tasks simultaneously.

Rate-distortion connection. Rate-distortion theory (Shannon, 1959) asks: what is the minimum number of bits needed to describe $X$ such that the reconstruction error is at most $D$ ? The information bottleneck extends this by replacing reconstruction error with prediction error. The IB curve is a generalization of the rate-distortion curve from compression theory to supervised learning.

Example

Transfer failure from the wrong invariance

Suppose a self-supervised vision model is trained with heavy color jitter. For object classification, this may be a good inductive bias: the label "dog" should not change because the image hue shifts slightly. For a medical imaging task where color intensity encodes stain concentration, the same augmentation can erase signal. The representation may score well on a generic linear probe while becoming insufficient for the clinical downstream label.

Augmentation is not bad by default; invariance is a statistical assumption. You should be able to say which nuisance group the representation is quotienting out and why that quotient is valid for the target task.

Failure Modes in Practice

Example

A representation can be predictive but not useful

A high-dimensional representation can contain label information in a brittle direction that a large nonlinear head can recover, while a small linear probe cannot. Conversely, a representation can score well under a linear probe by encoding dataset artifacts rather than causal factors. Probe performance is a diagnostic, not a full theory of representation quality.

Example

Contrastive negatives can define the wrong task

In InfoNCE, the negatives define what counts as a confusable alternative. If negatives are too easy, the encoder can solve the task using superficial features. If negatives are sampled across semantically related instances, the loss can push apart examples that should share a representation. This is why batch construction, memory banks, and hard-negative mining are not engineering details; they change the statistical task.

Watch Out

A sufficient representation for one task is not universal

If $Z$ is sufficient for $Y$ , then $I(Z;Y)=I(X;Y)$ . That says nothing about a different label $Y'$ . A representation that discards background information may be sufficient for object identity and insufficient for scene recognition. Universal representation learning requires assumptions about a family of tasks rather than a single supervised label.

What to Know Cold

You should be able to state three boundaries without looking them up:

IB boundary. The information bottleneck is a population objective over stochastic encoders. Neural VIB uses variational bounds and finite samples, so the practical objective is not the exact IB curve.
InfoNCE boundary. The mutual-information lower bound is capped by $\log(K+1)$ . More negatives can tighten the bound, but the bound still does not tell you whether the captured information is task-relevant.
Disentanglement boundary. Unsupervised disentanglement is not identified from observations alone. You need inductive bias, supervision, or explicit assumptions about the generative factors.

How This Page Connects to the ML Spine

Representation learning sits between model architecture and generalization theory. It includes generalization bounds, sufficiency conditions, and geometric constraints that self-supervised losses alone do not capture.

From transformers: attention and residual streams define where representations live. A transformer block is a sequence of representation updates; each block refines the latent geometry before the final linear head reads it off.
From information theory: mutual information, KL divergence, and rate-distortion supply the language for compression and relevance. The hard part is estimating or bounding these quantities in high dimension.
From variational autoencoders: latent variables show how a representation can be trained through reconstruction or likelihood bounds rather than contrastive discrimination.
From neural tangent kernel: the lazy regime gives a boundary case where the learned representation barely moves. Feature learning begins when training changes the representation itself.
From implicit bias: optimizer, initialization, normalization, and architecture affect which representation is selected among many training-loss minimizers.
From mechanistic interpretability: a learned representation is useful only if we can say what information is encoded and whether the model actually uses it.

This is why representation-learning claims should be read at two levels. The loss-level question is: what objective is being optimized, and what theorem or bound explains it? The model-level question is: what geometry did the network actually learn, and does that geometry support transfer, stress tests, or causal interpretation? A strong paper makes both levels explicit. A weak paper reports linear-probe accuracy and leaves the representation claim underspecified.

Minimal Research Note Template

For a reading note, write four short paragraphs:

Object. Define $X$ , $Y$ , $Z$ , the encoder, and the evaluation task.
Training signal. State whether supervision, augmentation, reconstruction, contrastive negatives, or a variational bound supplies the learning signal.
Claim. Separate theorem, empirical transfer result, and interpretive story. Do not merge these into one sentence.
Failure mode. Name the most likely way the representation could look good in the reported metric while failing on a new task.

For example, a CLIP-style model learns an image-text representation from paired captions and in-batch negatives. The theorem-level story is not that the representation is universally semantic; it is that the contrastive task learns density-ratio information useful for distinguishing matched pairs from negatives. The empirical story is zero-shot and transfer performance on named benchmarks. The failure-mode story is dataset bias, caption ambiguity, and the fact that similarity in embedding space need not match causal or safety relevant similarity.

Common Confusions

Watch Out

Disentanglement is not well-defined without inductive bias

Locatello et al. (2019) proved that unsupervised disentanglement is impossible without inductive biases: for any disentangled representation, there exists a generative model consistent with the data that maps to an entangled representation. You need assumptions about the data-generating process (e.g., independence of factors) or supervision to achieve disentanglement.

Watch Out

Maximizing mutual information is not always good

A representation that maximizes $I(Z; X)$ is just the identity function. You want to maximize $I(Z; Y)$ while keeping $I(Z; X)$ small. Representations that capture too much about $X$ encode noise and spurious correlations that hurt generalization.

Watch Out

Contrastive learning does not require labels

Contrastive learning constructs positive pairs from data augmentation (two views of the same image) rather than from labels. This is self-supervised. However, the quality of the representation depends heavily on the augmentation strategy: the augmentations implicitly define what information the representation should be invariant to.

Exercises

ExerciseCore

Problem

In the InfoNCE loss with $K = 255$ negative samples, what is the maximum mutual information you can estimate? If your dataset has images with true mutual information of 20 nats between two augmented views, approximately how many negatives do you need for a tight bound?

ExerciseAdvanced

Problem

Show that, in the idealized population information bottleneck problem (joint distribution $p(x,y)$ known exactly, encoder ranging over all stochastic maps), a minimal sufficient statistic $T(X)$ for $Y$ achieves the endpoint of the IB curve (maximum $I(T; Y)$ with minimum $I(T; X)$ ). Argue that this gives a population-level intuition that large $\beta$ emphasizes sufficiency, with minimality appearing as a tie-break among representations achieving maximal predictive information. Then explain why this does not automatically imply that finite- $\beta$ neural IB trained with variational bounds, finite samples, and gradient-based optimization will actually recover a minimal sufficient statistic.

References

Canonical:

Bengio, Courville, Vincent, "Representation Learning: A Review and New Perspectives" (IEEE TPAMI 2013), Sections 2-4 for distributed representations, priors, and autoencoder-style representation learning.
Tishby, Pereira, Bialek, "The Information Bottleneck Method" (2000), Sections 2-4 for relevant quantization, the IB variational equations, and the Blahut-Arimoto-style algorithm.
Oord, Li, Vinyals, "Representation Learning with Contrastive Predictive Coding" (2018), Sections 2-3 for CPC and the InfoNCE objective.

Current:

Alemi, Fischer, Dillon, Murphy, "Deep Variational Information Bottleneck" (ICLR 2017), Sections 2-3 for the variational IB objective and neural stochastic encoder setup.
Locatello et al., "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (ICML 2019), Sections 2-4 for the impossibility result and empirical protocol.
Hjelm et al., "Learning Deep Representations by Mutual Information Estimation and Maximization" (ICLR 2019), Section 3.1 for the local Deep InfoMax objective and Jensen-Shannon mutual-information estimator.
Chen, Kornblith, Norouzi, Hinton, "A Simple Framework for Contrastive Learning of Visual Representations" (ICML 2020), Section 3.1 for the NT-Xent loss and Section 4 for projection-head and augmentation ablations.
Wang, Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere" (ICML 2020), Theorem 1 for the asymptotic contrastive-loss decomposition into alignment and uniformity terms.
Grill et al., "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning" (NeurIPS 2020), Section 2 for the online/target-network update rule and stop-gradient mechanism.

Last reviewed: July 3, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.

Full prerequisite chain All derived topics

Required prerequisites

5

Variational Autoencoderslayer 3 · tier 1
Information Theory Foundationslayer 0B · tier 2
Hyperbolic Embeddings for Graphslayer 2 · tier 2
Equivariant Deep Learninglayer 4 · tier 2
Representation Learning in Cosmologylayer 4 · tier 3

Derived topics

0

No published topic currently declares this as a prerequisite.