

Information Bottleneck

The information bottleneck principle: compress the input $X$ into a representation $T$ that preserves information about the target $Y$. The Lagrangian formulation, the connection to deep learning, the Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.


Why This Matters

The information bottleneck (IB) principle asks: what is the optimal way to compress an input $X$ into a representation $T$ while preserving as much information as possible about a target $Y$? This is a precise mathematical formulation of what representation learning should do.

Tishby and collaborators proposed that deep neural networks implicitly perform information bottleneck optimization: each layer compresses the input while retaining task-relevant information. If true, this would provide a principled explanation for why deep networks generalize. The claim generated significant excitement, followed by significant pushback. Understanding both the theory and the criticisms is necessary for evaluating what IB does and does not explain about deep learning.

Mental Model

You have a signal $X$ and a label $Y$. You want to summarize $X$ into a shorter description $T$. The description should throw away everything about $X$ that is irrelevant to predicting $Y$, keeping only the relevant bits.

The information bottleneck formalizes "relevant" as the mutual information $I(T; Y)$ and "short" as the mutual information $I(X; T)$. You want $I(T; Y)$ large and $I(X; T)$ small.
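For discrete distributions, both quantities are directly computable from the joint distribution. A minimal numpy sketch (the function name is my own) of the mutual information the tradeoff is stated in:

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information of a discrete joint distribution, in nats."""
    p = np.asarray(p_joint, dtype=float)
    pa = p.sum(axis=1, keepdims=True)   # marginal of the first variable
    pb = p.sum(axis=0, keepdims=True)   # marginal of the second variable
    m = p > 0                           # skip zero cells to avoid log(0)
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

# Perfectly correlated binary variables: I = H = log 2.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))  # ≈ 0.693 nats = log 2
```

An independent joint (all cells equal) gives zero, which is a quick sanity check for any estimator of this kind.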

Formal Setup

Definition

Information Bottleneck Objective

Given random variables $X$ (input) and $Y$ (target) with joint distribution $p(x, y)$, find a stochastic mapping $p(t|x)$ from $X$ to a representation $T$ that solves:

$$\min_{p(t|x)} \; I(X; T) - \beta \, I(T; Y)$$

where $\beta > 0$ is a Lagrange multiplier controlling the tradeoff between compression and prediction. The Markov chain is $Y - X - T$: the representation $T$ depends on $Y$ only through $X$.
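For small discrete problems, the objective can be evaluated directly for any candidate encoder, using the Markov chain to get $p(t, y)$ from $p(t|x)$ and $p(x, y)$. A toy sketch (helper names are my own):

```python
import numpy as np

def mi(p_joint):
    """Mutual information of a discrete joint distribution, in nats."""
    p = np.asarray(p_joint, dtype=float)
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

def ib_objective(p_xy, enc, beta):
    """I(X;T) - beta * I(T;Y) for an encoder enc[x, t] = p(t|x).
    Uses the Markov chain Y - X - T: p(t, y) = sum_x p(t|x) p(x, y)."""
    px = p_xy.sum(axis=1)
    p_xt = px[:, None] * enc   # joint p(x, t)
    p_ty = enc.T @ p_xy        # joint p(t, y)
    return mi(p_xt) - beta * mi(p_ty)
```

With the identity encoder ($T = X$), the two terms reduce to $H(X)$ and $I(X; Y)$, so the objective is $H(X) - \beta\, I(X;Y)$: maximal prediction at the price of zero compression.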

Definition

Information Plane

The information plane is a 2D plot with $I(X; T)$ on the horizontal axis and $I(T; Y)$ on the vertical axis. Each representation $T$ corresponds to a point. The IB curve traces the Pareto frontier: the maximum $I(T; Y)$ achievable for each value of $I(X; T)$.

In the deep learning context, each layer $l$ of a neural network defines a representation $T_l$, giving a point $(I(X; T_l), I(T_l; Y))$ in the information plane.

Main Theorems

Theorem

Information Bottleneck Optimality Conditions

Statement

The optimal encoder $p(t|x)$ for the IB objective satisfies the self-consistent equations:

$$p(t|x) = \frac{p(t)}{Z(x, \beta)} \exp\left(-\beta \, D_{\text{KL}}(p(y|x) \,\|\, p(y|t))\right)$$

where $Z(x, \beta)$ is a normalization constant and $p(y|t) = \sum_x p(y|x) \, p(x|t)$. The solution maps inputs $x$ that have similar conditional distributions $p(y|x)$ to the same representation $t$.

Intuition

Inputs that predict $Y$ in the same way should get the same representation. The KL divergence $D_{\text{KL}}(p(y|x) \,\|\, p(y|t))$ measures how much information about $Y$ is lost by encoding $x$ as $t$. The exponential form says: representations that lose little information are preferred, with $\beta$ controlling how much you care about information loss vs. compression.

Proof Sketch

Take the functional derivative of the Lagrangian $I(X;T) - \beta \, I(T;Y)$ with respect to $p(t|x)$, subject to the normalization constraint $\sum_t p(t|x) = 1$. Setting the derivative to zero and solving gives the self-consistent equation. This is analogous to the Blahut-Arimoto algorithm in rate-distortion theory.
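The self-consistent equation can be iterated, Blahut-Arimoto style, on a small discrete problem: update $p(t)$ and $p(y|t)$ from the current encoder, then re-derive the encoder from the exponential formula. A toy sketch (my own function names; an illustration, not a production solver):

```python
import numpy as np

def ib_update(p_xy, enc, beta, eps=1e-12):
    """One pass of the self-consistent IB equations for enc[x, t] = p(t|x)."""
    px = p_xy.sum(axis=1)
    p_y_given_x = p_xy / px[:, None]
    pt = px @ enc                                   # p(t)
    p_ty = enc.T @ p_xy                             # p(t, y)
    p_y_given_t = p_ty / p_ty.sum(axis=1, keepdims=True)
    # KL(p(y|x) || p(y|t)) for every (x, t) pair; eps guards log(0)
    ratio = (p_y_given_x[:, None, :] + eps) / (p_y_given_t[None, :, :] + eps)
    kl = np.einsum('xy,xty->xt', p_y_given_x, np.log(ratio))
    new = pt[None, :] * np.exp(-beta * kl)          # p(t|x) ∝ p(t) e^{-β·KL}
    return new / new.sum(axis=1, keepdims=True)

# Four inputs, two labels: x ∈ {0,1} always gives y=0, x ∈ {2,3} always y=1.
p_xy = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
rng = np.random.default_rng(0)
enc = rng.dirichlet(np.ones(2), size=4)             # random initial p(t|x)
for _ in range(100):
    enc = ib_update(p_xy, enc, beta=10.0)
```

Because the update depends on $x$ only through $p(y|x)$, inputs with identical conditional label distributions receive identical encodings — exactly the clustering behavior the optimality conditions predict.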

Why It Matters

The IB solution shows that optimal representations cluster inputs by their relevance to $Y$, not by their similarity in $X$-space. Two images that look completely different but have the same label should get the same representation. This is a precise version of "learn features that matter for the task."

Failure Mode

The self-consistent equations require knowing $p(x, y)$ exactly. In practice, we only have samples. Estimating mutual information from samples is notoriously difficult, especially in high dimensions. The IB solution also assumes a fixed $\beta$, but choosing $\beta$ requires knowing the shape of the IB curve, which requires solving the IB problem for many values of $\beta$.

Theorem

Data Processing Inequality for Layer Representations

Statement

For a neural network with deterministic layers, the representations form a Markov chain $Y - X - T_1 - T_2 - \cdots - T_L$. By the data processing inequality:

$$I(X; Y) \geq I(T_1; Y) \geq I(T_2; Y) \geq \cdots \geq I(T_L; Y)$$

$$I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_L)$$

Each layer can only lose information, never gain it.

Intuition

Processing data cannot create new information about $Y$. Each layer is a (deterministic or stochastic) function of the previous layer. Information about $Y$ that is discarded at layer $l$ is gone forever. The best any subsequent layer can do is preserve what remains.

Proof Sketch

The data processing inequality states: if $A - B - C$ is a Markov chain, then $I(A; C) \leq I(A; B)$. Apply this to the chain $Y - T_l - T_{l+1}$ for each $l$ to get the first set of inequalities, and to $X - T_l - T_{l+1}$ for the second.
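The chain of inequalities is easy to verify numerically: push a random joint distribution through random stochastic channels and watch the mutual information shrink. A sketch (names are my own; the `mi` helper is the standard plug-in formula):

```python
import numpy as np

def mi(p):
    """Mutual information of a discrete joint distribution, in nats."""
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(12)).reshape(4, 3)   # random joint p(x, y)
chan1 = rng.dirichlet(np.ones(5), size=4)         # stochastic layer p(t1 | x)
chan2 = rng.dirichlet(np.ones(3), size=5)         # stochastic layer p(t2 | t1)

p_t1y = chan1.T @ p_xy                            # p(t1, y), marginalizing out x
p_t2y = chan2.T @ p_t1y                           # p(t2, y), marginalizing out t1
print(mi(p_xy), mi(p_t1y), mi(p_t2y))             # non-increasing, per the DPI
```

This holds for any channels, not just these random ones; that is what makes it a theorem rather than an empirical observation.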

Why It Matters

This inequality constrains what representations are possible. If a layer discards information about $Y$ (for compression), subsequent layers cannot recover it. The network must decide what to keep and what to discard, and this decision is irreversible. The IB hypothesis says networks learn to make this decision optimally.

Failure Mode

For deterministic functions with continuous inputs, $I(X; T_l)$ can be infinite (a deterministic invertible function preserves all information). This is the core of the criticism by Saxe et al. (2018): for ReLU networks without added noise, the mutual information between layers does not decrease during training. The compression observed by Shwartz-Ziv and Tishby (2017) was an artifact of the binning procedure used to estimate mutual information for networks with saturating (tanh) activations.

The Deep Learning Claim and Its Criticism

The Shwartz-Ziv and Tishby claim (2017): during DNN training, there are two phases. In the first (fitting) phase, $I(T_l; Y)$ increases across layers as the network learns to predict. In the second (compression) phase, $I(X; T_l)$ decreases as the network discards irrelevant information. This compression phase is driven by SGD noise and is responsible for generalization.

The criticism (Saxe et al., 2018):

  1. ReLU networks do not compress. For networks with ReLU activations (which are piecewise linear and invertible on their support), the deterministic mapping $T_l = f_l(X)$ means $I(X; T_l) = H(T_l)$, which does not decrease during training. Compression was observed only for tanh networks.

  2. Binning artifact. The apparent compression in tanh networks was due to the binning estimator used to compute $I(X; T_l)$. As tanh activations saturate during training, the binned estimates of $I(X; T_l)$ decrease not because information is discarded but because the representations become concentrated in a few bins.

  3. Generalization without compression. ReLU networks generalize well without exhibiting the compression phase, contradicting the claim that compression is necessary for generalization.
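The binning artifact in point 2 is easy to reproduce without training anything: apply the same invertible tanh map at two gains and compare a plug-in histogram estimate. A sketch (the estimator and names are my own, standing in for the binning procedure the debate is about):

```python
import numpy as np

def binned_mi(x, t, bins=30):
    """Plug-in MI estimate from a 2D histogram, in nats."""
    joint, _, _ = np.histogram2d(x, t, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    pt = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (px @ pt)[m])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)
# tanh(w*x) is invertible for any w > 0, so the true I(X;T) is the same
# (infinite, for continuous X) at both gains -- yet the binned estimate
# shrinks as the activation saturates and mass piles into the edge bins.
mi_unsaturated = binned_mi(x, np.tanh(0.5 * x))
mi_saturated = binned_mi(x, np.tanh(20.0 * x))
print(mi_unsaturated, mi_saturated)  # the saturated estimate is smaller
```

The drop is purely an estimator effect: nothing about the map became less invertible, only the histogram got more degenerate.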

What IB Does Explain

Despite the criticism, the IB framework provides useful conceptual tools:

  • Sufficient statistics: the optimal IB representation is a minimal sufficient statistic of $X$ for $Y$. This formalizes the intuition that good features capture everything relevant and nothing else.

  • Rate-distortion analogy: choosing a representation is like choosing a compression scheme. The IB curve is analogous to the rate-distortion curve in information theory.

  • Variational IB (VIB): Alemi et al. (2017) used a variational bound on the IB objective as a regularizer. This adds noise to the representation, making the compression measurable and often improving generalization. VIB is a practical algorithm, regardless of whether DNNs implicitly compress.
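In the spirit of the VIB objective, here is a one-sample numpy sketch of the loss: cross-entropy plus $\beta$ times the KL from a Gaussian encoder to a standard-normal prior. The linear encoder/decoder and all names are hypothetical stand-ins for a real network; this shows the shape of the objective, not Alemi et al.'s implementation.

```python
import numpy as np

def vib_loss(x, y, w_enc, w_dec, beta=1e-3, rng=None):
    """One-sample variational IB loss:
    cross-entropy  +  beta * KL( q(t|x) || N(0, I) ).
    w_enc maps x to (mu, log-variance); w_dec maps a sampled t to logits."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h = x @ w_enc
    d = h.shape[1] // 2
    mu, logvar = h[:, :d], h[:, d:]
    t = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparameterization
    logits = t @ w_dec
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
    ce = -logp[np.arange(len(y)), y].mean()                    # prediction term
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=1).mean()  # compression term
    return ce + beta * kl

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))
y = rng.integers(0, 3, size=8)
loss = vib_loss(x, y, w_enc=0.1 * rng.normal(size=(6, 8)),
                w_dec=0.1 * rng.normal(size=(4, 3)))
```

The injected Gaussian noise is the point: it makes $I(X; T)$ finite, the KL term a tractable upper bound on it, and the compression measurable — sidestepping the infinite-MI problem of deterministic layers.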

Common Confusions

Watch Out

IB is not the same as regularization by compression

IB says: minimize $I(X; T)$ subject to preserving $I(T; Y)$. Standard regularization (e.g., weight decay, dropout) limits model complexity but does not directly optimize mutual information. IB is a principle about representations, not about parameters. The connection to standard regularization is indirect and still debated.

Watch Out

Mutual information is hard to estimate in high dimensions

Computing $I(X; T)$ for a high-dimensional continuous $T$ (like the activations of a 512-dimensional hidden layer) is extremely difficult. Naive binning, KDE, and even modern neural estimators (MINE) have high variance or bias in this regime. Claims about $I(X; T)$ during training should be treated skeptically unless the estimation method is carefully validated.

Watch Out

The data processing inequality does not say compression happens

The DPI says $I(X; T_l)$ is non-increasing across layers. For deterministic invertible mappings, $I(X; T_l) = I(X; X) = H(X)$ at every layer, so no compression occurs. The DPI sets an upper bound, not a description of what actually happens.

Exercises

ExerciseCore

Problem

In the IB objective $\min I(X; T) - \beta \, I(T; Y)$, what happens at the extremes $\beta \to 0$ and $\beta \to \infty$? What representation does the optimizer choose in each case?

ExerciseAdvanced

Problem

A two-layer neural network maps $X \in \mathbb{R}^{100}$ to $T_1 \in \mathbb{R}^{10}$ to $T_2 \in \mathbb{R}^{2}$ to $\hat{Y}$. All layers use ReLU. Explain why $I(X; T_1) = H(T_1)$ and why this makes the compression-phase claim problematic for this network.

References

Canonical:

  • Tishby, Pereira, Bialek, The Information Bottleneck Method (2000)
  • Shwartz-Ziv & Tishby, Opening the Black Box of Deep Neural Networks via Information (2017)

Current:

  • Saxe et al., On the Information Bottleneck Theory of Deep Learning (2018)
  • Alemi et al., Deep Variational Information Bottleneck (2017)

Last reviewed: April 2026