Modern Generalization
Information Bottleneck
The information bottleneck principle: compress the input X into a representation T that preserves information about the target Y. The Lagrangian formulation, connection to deep learning, Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.
Why This Matters
The information bottleneck (IB) principle asks: what is the optimal way to compress an input $X$ into a representation $T$ while preserving as much information as possible about a target $Y$? This is a precise mathematical formulation of what representation learning should do.
Tishby and collaborators proposed that deep neural networks implicitly perform information bottleneck optimization: each layer compresses the input while retaining task-relevant information. If true, this would provide a principled explanation for why deep networks generalize. The claim generated significant excitement, followed by significant pushback. Understanding both the theory and the criticisms is necessary for evaluating what IB does and does not explain about deep learning.
Mental Model
You have a signal $X$ and a label $Y$. You want to summarize $X$ into a shorter description $T$. The description should throw away everything about $X$ that is irrelevant to predicting $Y$, keeping only the relevant bits.
The information bottleneck formalizes "relevant" as the mutual information $I(T;Y)$ and "short" as the mutual information $I(X;T)$. You want $I(T;Y)$ large and $I(X;T)$ small.
Formal Setup
Information Bottleneck Objective
Given random variables $X$ (input) and $Y$ (target) with joint distribution $p(x, y)$, find a stochastic mapping $p(t \mid x)$ from $X$ to a representation $T$ that solves:

$$\min_{p(t \mid x)} \; I(X; T) - \beta\, I(T; Y)$$

where $\beta \geq 0$ is a Lagrange multiplier controlling the tradeoff between compression and prediction. The Markov chain is $Y \to X \to T$: the representation $T$ depends on $Y$ only through $X$.
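For discrete variables the objective can be evaluated directly from the joint distribution. A minimal sketch (the toy distribution and the hard encoder below are made up for illustration):

```python
import numpy as np

def mutual_information(pab):
    """I(A;B) in nats, computed from a joint distribution array p[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])).sum())

def ib_objective(p_t_given_x, pxy, beta):
    """I(X;T) - beta * I(T;Y) under the Markov chain Y -> X -> T."""
    px = pxy.sum(axis=1)
    pxt = p_t_given_x * px[:, None]   # joint p(x, t)
    pty = p_t_given_x.T @ pxy         # joint p(t, y): T is independent of Y given X
    return mutual_information(pxt) - beta * mutual_information(pty)

# Made-up toy: 4 inputs, 2 labels; the encoder merges inputs with equal p(y|x).
pxy = np.array([[0.20, 0.05],
                [0.20, 0.05],
                [0.05, 0.20],
                [0.05, 0.20]])
enc = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # p(t|x)
print(ib_objective(enc, pxy, beta=2.0))
```

Larger $\beta$ shifts the optimum toward prediction; smaller $\beta$ toward compression.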
Information Plane
The information plane is a 2D plot with $I(X;T)$ on the horizontal axis and $I(T;Y)$ on the vertical axis. Each representation $T$ corresponds to a point. The IB curve traces the Pareto frontier: the maximum achievable $I(T;Y)$ for each value of $I(X;T)$.
In the deep learning context, each layer $\ell$ of a neural network defines a representation $T_\ell$, giving a point in the information plane.
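For small discrete "networks" the information-plane coordinates can be computed exactly; a sketch with made-up stochastic layers:

```python
import numpy as np

def mi(pab):
    """I(A;B) in nats from a joint distribution array p[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa @ pb)[m])).sum())

rng = np.random.default_rng(0)
pxy = rng.random((6, 2)); pxy /= pxy.sum()     # joint p(x, y)
px = pxy.sum(axis=1)

# Two stochastic "layers": row-stochastic channels p(t1|x) and p(t2|t1).
layer1 = rng.random((6, 4)); layer1 /= layer1.sum(axis=1, keepdims=True)
layer2 = rng.random((4, 3)); layer2 /= layer2.sum(axis=1, keepdims=True)

# Information-plane coordinates (I(X;T_l), I(T_l;Y)) for each layer.
pxt1 = px[:, None] * layer1                    # p(x, t1)
pxt2 = pxt1 @ layer2                           # p(x, t2)
pt1y = layer1.T @ pxy                          # p(t1, y)
pt2y = layer2.T @ pt1y                         # p(t2, y)
for name, a, b in [("layer 1", pxt1, pt1y), ("layer 2", pxt2, pt2y)]:
    print(name, mi(a), mi(b))
```

Plotting these pairs over training steps is exactly the information-plane trajectory studied in the deep learning debate below.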
Main Theorems
Information Bottleneck Optimality Conditions
Statement
The optimal encoder $p(t \mid x)$ for the IB objective satisfies the self-consistent equations:

$$p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\!\big(-\beta\, D_{\mathrm{KL}}[\,p(y \mid x) \,\|\, p(y \mid t)\,]\big), \qquad p(t) = \sum_x p(t \mid x)\, p(x)$$

where $Z(x, \beta)$ is a normalization constant and $p(y \mid t) = \frac{1}{p(t)} \sum_x p(y \mid x)\, p(t \mid x)\, p(x)$. The solution maps inputs $x$ that have similar conditional distributions $p(y \mid x)$ to the same representation $t$.
Intuition
Inputs that predict $Y$ in the same way should get the same representation. The KL divergence $D_{\mathrm{KL}}[\,p(y \mid x) \,\|\, p(y \mid t)\,]$ measures how much information about $Y$ is lost by encoding $x$ as $t$. The exponential form says: representations that lose little information are preferred, with $\beta$ controlling how much you care about information loss vs. compression.
Proof Sketch
Take the functional derivative of the Lagrangian with respect to $p(t \mid x)$, subject to the normalization constraint $\sum_t p(t \mid x) = 1$. Setting the derivative to zero and solving gives the self-consistent equation. This is analogous to the Blahut-Arimoto algorithm for rate-distortion theory.
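The self-consistent update can be iterated directly for discrete variables, in the spirit of Blahut-Arimoto (the toy distribution, cluster count, and iteration budget below are illustrative):

```python
import numpy as np

def ib_iterate(pxy, beta, n_t, n_iter=200, seed=0):
    """Iterate the IB self-consistent equations for discrete X, Y.

    pxy: joint distribution p(x, y), shape (n_x, n_y).
    Returns the encoder p(t|x), shape (n_x, n_t).
    """
    rng = np.random.default_rng(seed)
    n_x, _ = pxy.shape
    px = pxy.sum(axis=1)
    py_given_x = pxy / px[:, None]
    eps = 1e-12

    enc = rng.random((n_x, n_t))                  # random soft initialization
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pt = enc.T @ px                           # p(t)
        pty = (enc * px[:, None]).T @ py_given_x  # p(t, y)
        py_given_t = pty / (pty.sum(axis=1, keepdims=True) + eps)
        # D_KL[p(y|x) || p(y|t)] for every (x, t) pair
        kl = (py_given_x[:, None, :]
              * np.log((py_given_x[:, None, :] + eps)
                       / (py_given_t[None, :, :] + eps))).sum(axis=2)
        enc = pt[None, :] * np.exp(-beta * kl)    # unnormalized update
        enc /= enc.sum(axis=1, keepdims=True)     # divide by Z(x, beta)
    return enc

# Toy joint: x in {0,1} predicts y one way, x in {2,3} the other way.
pxy = np.array([[0.20, 0.05],
                [0.20, 0.05],
                [0.05, 0.20],
                [0.05, 0.20]])
enc = ib_iterate(pxy, beta=5.0, n_t=2)
print(np.round(enc, 3))   # inputs with the same p(y|x) end up in the same cluster
```

At this $\beta$ the fixed point merges the two input pairs that share a conditional $p(y \mid x)$, which is exactly the clustering behavior the optimality conditions predict.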
Why It Matters
The IB solution shows that optimal representations cluster inputs by their relevance to $Y$, not by their similarity in $X$-space. Two images that look completely different but have the same label should get the same representation. This is a precise version of "learn features that matter for the task."
Failure Mode
The self-consistent equations require knowing $p(x, y)$ exactly. In practice, we only have samples. Estimating mutual information from samples is notoriously difficult, especially in high dimensions. The IB solution also assumes a fixed $\beta$, but choosing $\beta$ requires knowing the shape of the IB curve, which requires solving the IB problem for many $\beta$ values.
Data Processing Inequality for Layer Representations
Statement
For a neural network with deterministic layers, the representations form a Markov chain $X \to T_1 \to T_2 \to \cdots \to T_L$. By the data processing inequality:

$$I(X; Y) \geq I(T_1; Y) \geq \cdots \geq I(T_L; Y), \qquad I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_L)$$

Each layer can only lose information, never gain it.
Intuition
Processing data cannot create new information about $Y$. Each layer is a (deterministic or stochastic) function of the previous layer. Information about $Y$ that is discarded at layer $\ell$ is gone forever. The best any subsequent layer can do is preserve what remains.
Proof Sketch
The data processing inequality states: if $A \to B \to C$ is a Markov chain, then $I(A; C) \leq I(A; B)$. Apply this to the chain $Y \to X \to T_1 \to \cdots \to T_\ell$ for each $\ell$ to get the first set of inequalities. Apply it to $X \to T_1 \to \cdots \to T_\ell$ for the second set.
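The DPI is easy to check numerically for discrete chains; a sketch with randomly generated channels (all distributions here are made up):

```python
import numpy as np

def mi(pab):
    """I(A;B) in nats from a joint distribution array p[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa @ pb)[m])).sum())

rng = np.random.default_rng(1)
# Markov chain Y -> X -> T1 -> T2 with random stochastic channels.
pyx = rng.random((3, 4)); pyx /= pyx.sum()                       # joint p(y, x)
ch1 = rng.random((4, 5)); ch1 /= ch1.sum(axis=1, keepdims=True)  # p(t1|x)
ch2 = rng.random((5, 3)); ch2 /= ch2.sum(axis=1, keepdims=True)  # p(t2|t1)

pyt1 = pyx @ ch1          # p(y, t1) = sum_x p(y, x) p(t1|x)
pyt2 = pyt1 @ ch2         # p(y, t2)
assert mi(pyx) >= mi(pyt1) >= mi(pyt2)   # data processing inequality
print(mi(pyx), mi(pyt1), mi(pyt2))
```

Any choice of stochastic channels satisfies the inequality; the random seed only changes how much information each stage loses.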
Why It Matters
This inequality constrains what representations are possible. If a layer discards information about $Y$ (for compression), subsequent layers cannot recover it. The network must decide what to keep and what to discard, and this decision is irreversible. The IB hypothesis says networks learn to make this decision optimally.
Failure Mode
For deterministic functions with continuous inputs, $I(X; T)$ can be infinite (a deterministic invertible function preserves all information). This is the core of the criticism by Saxe et al. (2018): for ReLU networks without added noise, the mutual information between layers does not decrease during training. The compression observed by Shwartz-Ziv and Tishby (2017) was an artifact of the binning procedure used to estimate mutual information for networks with saturating (tanh) activations.
The Deep Learning Claim and Its Criticism
The Shwartz-Ziv and Tishby claim (2017): During DNN training, there are two phases. In the first (fitting) phase, $I(T_\ell; Y)$ increases across layers as the network learns to predict. In the second (compression) phase, $I(X; T_\ell)$ decreases as the network discards irrelevant information. This compression phase is driven by SGD noise and is responsible for generalization.
The criticism (Saxe et al., 2018):
- ReLU networks do not compress. For networks with ReLU activations (which are piecewise linear and invertible on their support), the deterministic mapping means $I(X; T)$ is constant (equal to $H(X)$ for discrete inputs, infinite for continuous ones), so it does not decrease during training. Compression was observed only for tanh networks.
- Binning artifact. The apparent compression in tanh networks was due to the binning estimator used to compute $I(X; T)$. As tanh activations saturate during training, the binned estimates of $I(X; T)$ decrease not because information is discarded but because the representations become concentrated in a few bins.
- Generalization without compression. ReLU networks generalize well without exhibiting the compression phase, contradicting the claim that compression is necessary for generalization.
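The binning artifact is easy to reproduce. For any $w > 0$, $\tanh(wx)$ is invertible in $x$, so no information is actually lost; yet the binned entropy of the output (which is what a binned estimate of $I(X;T)$ reduces to for a deterministic map) collapses as $w$ grows and the activation saturates. A sketch with made-up data:

```python
import numpy as np

def binned_entropy(z, n_bins=30):
    """Entropy (nats) of samples after discretizing into fixed-width bins."""
    counts, _ = np.histogram(z, bins=n_bins, range=(-1, 1))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# tanh(w * x) is invertible for any w > 0, so no information about x is
# lost -- yet as w grows (saturation), the binned entropy estimate collapses
# because most mass piles up in the two end bins near -1 and +1.
for w in [1.0, 3.0, 10.0]:
    print(w, binned_entropy(np.tanh(w * x)))
```

This mimics what happens to the binned estimator as tanh-layer weights grow during training: the estimate drops even though the map remains invertible.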
What IB Does Explain
Despite the criticism, the IB framework provides useful conceptual tools:
- Sufficient statistics: the optimal IB representation is a minimal sufficient statistic of $X$ for $Y$. This formalizes the intuition that good features capture everything relevant and nothing else.
- Rate-distortion analogy: choosing a representation is like choosing a compression scheme. The IB curve is analogous to the rate-distortion curve in information theory.
- Variational IB (VIB): Alemi et al. (2017) used a variational bound on the IB objective as a regularizer. This adds noise to the representation, making the compression measurable and often improving generalization. VIB is a practical algorithm, regardless of whether DNNs implicitly compress.
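A minimal numpy sketch of the two encoder-side ingredients of VIB under the usual Gaussian-encoder assumption (names and shapes are illustrative, not Alemi et al.'s implementation): the closed-form KL regularizer that upper-bounds the compression term, and the reparameterized sample.

```python
import numpy as np

def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch.

    In the VIB objective this term upper-bounds I(X;T); mu and log_var
    are the outputs of the (hypothetical) encoder network.
    """
    kl_per_dim = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)
    return float(kl_per_dim.sum(axis=1).mean())

def reparameterize(mu, log_var, rng):
    """Sample t = mu + sigma * eps: the explicit noise that makes T stochastic."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
mu = rng.normal(size=(8, 4))    # encoder means for a batch of 8, 4-dim bottleneck
log_var = np.zeros((8, 4))      # unit variance, so the KL reduces to 0.5 * sum(mu^2)
print(vib_kl(mu, log_var))
```

The full training loss would be cross-entropy on predictions from the sampled $t$ plus $\beta$ times this KL term; because the noise is explicit, the compression here is well-defined, unlike in the deterministic-network debate above.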
Common Confusions
IB is not the same as regularization by compression
IB says: minimize $I(X; T)$ subject to preserving $I(T; Y)$. Standard regularization (e.g., weight decay, dropout) limits model complexity but does not directly optimize mutual information. IB is a principle about representations, not about parameters. The connection to standard regularization is indirect and still debated.
Mutual information is hard to estimate in high dimensions
Computing $I(X; T)$ for a high-dimensional continuous $T$ (like the activations of a 512-dimensional hidden layer) is extremely difficult. Naive binning, KDE, and even modern neural estimators (MINE) have high variance or bias in this regime. Claims about $I(X; T)$ during training should be treated skeptically unless the estimation method is carefully validated.
The data processing inequality does not say compression happens
The DPI says $I(T_\ell; Y)$ is non-increasing across layers. For deterministic invertible mappings, $I(T_\ell; Y) = I(X; Y)$ at every layer, so no compression occurs. The DPI sets an upper bound, not a description of what actually happens.
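A quick discrete check of this point: relabeling $X$ by any invertible map leaves the mutual information with $Y$ exactly unchanged. Sketch with a made-up joint distribution:

```python
import numpy as np

def mi(pab):
    """I(A;B) in nats from a joint distribution array p[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa @ pb)[m])).sum())

rng = np.random.default_rng(0)
pxy = rng.random((5, 3)); pxy /= pxy.sum()   # joint p(x, y)

perm = rng.permutation(5)                    # an invertible map t = f(x)
pty = pxy[perm]                              # p(t, y) is just a row relabeling

assert np.isclose(mi(pxy), mi(pty))          # I(T;Y) = I(X;Y): no compression
```

The DPI inequalities hold with equality at every such layer, which is why the bound alone cannot explain compression.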
Exercises
Problem
In the IB objective $\min_{p(t \mid x)} I(X; T) - \beta I(T; Y)$, what happens at the extremes $\beta \to 0$ and $\beta \to \infty$? What representation does the optimizer choose in each case?
Problem
A two-layer neural network maps $x$ to $h_1$ to $h_2$ to $\hat{y}$. All layers use ReLU. Explain why $I(x; h_1)$ does not decrease during training and why this makes the compression phase claim problematic for this network.
References
Canonical:
- Tishby, Pereira, Bialek, The Information Bottleneck Method (2000)
- Shwartz-Ziv & Tishby, Opening the Black Box of Deep Neural Networks via Information (2017)
Current:
- Saxe et al., On the Information Bottleneck Theory of Deep Learning (2018)
- Alemi et al., Deep Variational Information Bottleneck (2017)
Next Topics
- Implicit bias and modern generalization: alternative explanations for why DNNs generalize
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)