

Information Bottleneck

The information bottleneck principle: compress the input $X$ into a representation $T$ that preserves information about the target $Y$. The Lagrangian formulation, the connection to deep learning, the Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.


Why This Matters

The information bottleneck (IB) principle asks: what is the optimal way to compress an input $X$ into a representation $T$ while preserving as much information as possible about a target $Y$? This is a precise mathematical formulation of what representation learning should do.

Tishby and collaborators proposed that deep neural networks implicitly perform information bottleneck optimization: each layer compresses the input while retaining task-relevant information. If true, this would provide a principled explanation for why deep networks generalize. The claim generated significant excitement, followed by significant pushback. Understanding both the theory and the criticisms is necessary for evaluating what IB does and does not explain about deep learning.

Mental Model

You have a signal $X$ and a label $Y$. You want to summarize $X$ into a shorter description $T$. The description should throw away everything about $X$ that is irrelevant to predicting $Y$, keeping only the relevant bits.

The information bottleneck formalizes "relevant" as the mutual information $I(T; Y)$ and "short" as the mutual information $I(X; T)$. You want $I(T; Y)$ large and $I(X; T)$ small.
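For discrete distributions, both quantities are directly computable from the joint distribution. A minimal numpy sketch (the function name is my own) of the mutual information the tradeoff is stated in:

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information of a discrete joint distribution, in nats."""
    p = np.asarray(p_joint, dtype=float)
    pa = p.sum(axis=1, keepdims=True)   # marginal of the first variable
    pb = p.sum(axis=0, keepdims=True)   # marginal of the second variable
    m = p > 0                           # skip zero cells to avoid log(0)
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

# Perfectly correlated binary variables: I = H = log 2.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))  # ≈ 0.693 nats = log 2
```

An independent joint (all cells equal) gives zero, which is a quick sanity check for any estimator of this kind.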

Formal Setup

Definition

Information Bottleneck Objective

Given random variables $X$ (input) and $Y$ (target) with joint distribution $p(x, y)$, find a stochastic mapping $p(t|x)$ from $X$ to a representation $T$ that solves:

$$\min_{p(t|x)} \; I(X; T) - \beta \, I(T; Y)$$

where $\beta > 0$ is a Lagrange multiplier controlling the tradeoff between compression and prediction. The Markov chain is $Y - X - T$: the representation $T$ depends on $Y$ only through $X$.
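For small discrete problems, the objective can be evaluated directly for any candidate encoder, using the Markov chain to get $p(t, y)$ from $p(t|x)$ and $p(x, y)$. A toy sketch (helper names are my own):

```python
import numpy as np

def mi(p_joint):
    """Mutual information of a discrete joint distribution, in nats."""
    p = np.asarray(p_joint, dtype=float)
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

def ib_objective(p_xy, enc, beta):
    """I(X;T) - beta * I(T;Y) for an encoder enc[x, t] = p(t|x).
    Uses the Markov chain Y - X - T: p(t, y) = sum_x p(t|x) p(x, y)."""
    px = p_xy.sum(axis=1)
    p_xt = px[:, None] * enc   # joint p(x, t)
    p_ty = enc.T @ p_xy        # joint p(t, y)
    return mi(p_xt) - beta * mi(p_ty)
```

With the identity encoder ($T = X$), the two terms reduce to $H(X)$ and $I(X; Y)$, so the objective is $H(X) - \beta\, I(X;Y)$: maximal prediction at the price of zero compression.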

Definition

Information Plane

The information plane is a 2D plot with $I(X; T)$ on the horizontal axis and $I(T; Y)$ on the vertical axis. Each representation $T$ corresponds to a point. The IB curve traces the Pareto frontier: the maximum $I(T; Y)$ achievable for each value of $I(X; T)$.

In the deep learning context, each layer $l$ of a neural network defines a representation $T_l$, giving a point $(I(X; T_l), I(T_l; Y))$ in the information plane.

Main Theorems

Theorem

Information Bottleneck Optimality Conditions

Statement

The optimal encoder $p(t|x)$ for the IB objective satisfies the self-consistent equations:

$$p(t|x) = \frac{p(t)}{Z(x, \beta)} \exp\left(-\beta \, D_{\text{KL}}(p(y|x) \,\|\, p(y|t))\right)$$

where $Z(x, \beta)$ is a normalization constant and $p(y|t) = \sum_x p(y|x) \, p(x|t)$. The solution maps inputs $x$ that have similar conditional distributions $p(y|x)$ to the same representation $t$.

Intuition

Inputs that predict $Y$ in the same way should get the same representation. The KL divergence $D_{\text{KL}}(p(y|x) \,\|\, p(y|t))$ measures how much information about $Y$ is lost by encoding $x$ as $t$. The exponential form says: representations that lose little information are preferred, with $\beta$ controlling how much you care about information loss vs. compression.

Proof Sketch

Take the functional derivative of the Lagrangian $I(X;T) - \beta \, I(T;Y)$ with respect to $p(t|x)$, subject to the normalization constraint $\sum_t p(t|x) = 1$. Setting the derivative to zero and solving gives the self-consistent equation. This is analogous to the Blahut-Arimoto algorithm in rate-distortion theory.
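The self-consistent equation can be iterated, Blahut-Arimoto style, on a small discrete problem: update $p(t)$ and $p(y|t)$ from the current encoder, then re-derive the encoder from the exponential formula. A toy sketch (my own function names; an illustration, not a production solver):

```python
import numpy as np

def ib_update(p_xy, enc, beta, eps=1e-12):
    """One pass of the self-consistent IB equations for enc[x, t] = p(t|x)."""
    px = p_xy.sum(axis=1)
    p_y_given_x = p_xy / px[:, None]
    pt = px @ enc                                   # p(t)
    p_ty = enc.T @ p_xy                             # p(t, y)
    p_y_given_t = p_ty / p_ty.sum(axis=1, keepdims=True)
    # KL(p(y|x) || p(y|t)) for every (x, t) pair; eps guards log(0)
    ratio = (p_y_given_x[:, None, :] + eps) / (p_y_given_t[None, :, :] + eps)
    kl = np.einsum('xy,xty->xt', p_y_given_x, np.log(ratio))
    new = pt[None, :] * np.exp(-beta * kl)          # p(t|x) ∝ p(t) e^{-β·KL}
    return new / new.sum(axis=1, keepdims=True)

# Four inputs, two labels: x ∈ {0,1} always gives y=0, x ∈ {2,3} always y=1.
p_xy = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
rng = np.random.default_rng(0)
enc = rng.dirichlet(np.ones(2), size=4)             # random initial p(t|x)
for _ in range(100):
    enc = ib_update(p_xy, enc, beta=10.0)
```

Because the update depends on $x$ only through $p(y|x)$, inputs with identical conditional label distributions receive identical encodings — exactly the clustering behavior the optimality conditions predict.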

Why It Matters

The IB solution shows that optimal representations cluster inputs by their relevance to $Y$, not by their similarity in $X$-space. Two images that look completely different but have the same label should get the same representation. This is a precise version of "learn features that matter for the task."

Failure Mode

The self-consistent equations require knowing $p(x, y)$ exactly. In practice, we only have samples. Estimating mutual information from samples is notoriously difficult, especially in high dimensions. The IB solution also assumes a fixed $\beta$, but choosing $\beta$ requires knowing the shape of the IB curve, which requires solving the IB problem for many values of $\beta$.

Theorem

Data Processing Inequality for Layer Representations

Statement

For a neural network with deterministic layers, the representations form a Markov chain $Y - X - T_1 - T_2 - \cdots - T_L$. By the data processing inequality:

$$I(X; Y) \geq I(T_1; Y) \geq I(T_2; Y) \geq \cdots \geq I(T_L; Y)$$

$$I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_L)$$

Each layer can only lose information, never gain it.

Intuition

Processing data cannot create new information about $Y$. Each layer is a (deterministic or stochastic) function of the previous layer. Information about $Y$ that is discarded at layer $l$ is gone forever. The best any subsequent layer can do is preserve what remains.

Proof Sketch

The data processing inequality states: if $A - B - C$ is a Markov chain, then $I(A; C) \leq I(A; B)$. Apply this to the chain $Y - T_l - T_{l+1}$ for each $l$ to get the first set of inequalities, and to $X - T_l - T_{l+1}$ for the second.
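The chain of inequalities is easy to verify numerically: push a random joint distribution through random stochastic channels and watch the mutual information shrink. A sketch (names are my own; the `mi` helper is the standard plug-in formula):

```python
import numpy as np

def mi(p):
    """Mutual information of a discrete joint distribution, in nats."""
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(12)).reshape(4, 3)   # random joint p(x, y)
chan1 = rng.dirichlet(np.ones(5), size=4)         # stochastic layer p(t1 | x)
chan2 = rng.dirichlet(np.ones(3), size=5)         # stochastic layer p(t2 | t1)

p_t1y = chan1.T @ p_xy                            # p(t1, y), marginalizing out x
p_t2y = chan2.T @ p_t1y                           # p(t2, y), marginalizing out t1
print(mi(p_xy), mi(p_t1y), mi(p_t2y))             # non-increasing, per the DPI
```

This holds for any channels, not just these random ones; that is what makes it a theorem rather than an empirical observation.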

Why It Matters

This inequality constrains what representations are possible. If a layer discards information about $Y$ (for compression), subsequent layers cannot recover it. The network must decide what to keep and what to discard, and this decision is irreversible. The IB hypothesis says networks learn to make this decision optimally.

Failure Mode

For deterministic functions with continuous inputs, $I(X; T_l)$ can be infinite (a deterministic invertible function preserves all information). This is the core of the criticism by Saxe et al. (2018): for ReLU networks without added noise, the mutual information between layers does not decrease during training. The compression observed by Shwartz-Ziv and Tishby (2017) was an artifact of the binning procedure used to estimate mutual information for networks with saturating (tanh) activations.

The Deep Learning Claim and Its Criticism

The Shwartz-Ziv and Tishby claim (2017): during DNN training, there are two phases. In the first (fitting) phase, $I(T_l; Y)$ increases across layers as the network learns to predict. In the second (compression) phase, $I(X; T_l)$ decreases as the network discards irrelevant information. This compression phase is driven by SGD noise and is responsible for generalization.

The criticism (Saxe et al., 2018):

  1. ReLU networks do not compress. For networks with ReLU activations (which are piecewise linear and invertible on their support), the deterministic mapping $T_l = f_l(X)$ means $I(X; T_l) = H(T_l)$, which does not decrease during training. Compression was observed only for tanh networks.

  2. Binning artifact. The apparent compression in tanh networks was due to the binning estimator used to compute $I(X; T_l)$. As tanh activations saturate during training, the binned estimates of $I(X; T_l)$ decrease not because information is discarded but because the representations become concentrated in a few bins.

  3. Generalization without compression. ReLU networks generalize well without exhibiting the compression phase, contradicting the claim that compression is necessary for generalization.
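The binning artifact in point 2 is easy to reproduce without training anything: apply the same invertible tanh map at two gains and compare a plug-in histogram estimate. A sketch (the estimator and names are my own, standing in for the binning procedure the debate is about):

```python
import numpy as np

def binned_mi(x, t, bins=30):
    """Plug-in MI estimate from a 2D histogram, in nats."""
    joint, _, _ = np.histogram2d(x, t, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    pt = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (px @ pt)[m])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)
# tanh(w*x) is invertible for any w > 0, so the true I(X;T) is the same
# (infinite, for continuous X) at both gains -- yet the binned estimate
# shrinks as the activation saturates and mass piles into the edge bins.
mi_unsaturated = binned_mi(x, np.tanh(0.5 * x))
mi_saturated = binned_mi(x, np.tanh(20.0 * x))
print(mi_unsaturated, mi_saturated)  # the saturated estimate is smaller
```

The drop is purely an estimator effect: nothing about the map became less invertible, only the histogram got more degenerate.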

What IB Does Explain

Despite the criticism, the IB framework provides useful conceptual tools:

  • Sufficient statistics: the optimal IB representation is a minimal sufficient statistic of $X$ for $Y$. This formalizes the intuition that good features capture everything relevant and nothing else.

  • Rate-distortion analogy: choosing a representation is like choosing a compression scheme. The IB curve is analogous to the rate-distortion curve in information theory.

  • Variational IB (VIB): Alemi et al. (2017) used a variational bound on the IB objective as a regularizer. This adds noise to the representation, making the compression measurable and often improving generalization. VIB is a practical algorithm, regardless of whether DNNs implicitly compress.
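In the spirit of the VIB objective, here is a one-sample numpy sketch of the loss: cross-entropy plus $\beta$ times the KL from a Gaussian encoder to a standard-normal prior. The linear encoder/decoder and all names are hypothetical stand-ins for a real network; this shows the shape of the objective, not Alemi et al.'s implementation.

```python
import numpy as np

def vib_loss(x, y, w_enc, w_dec, beta=1e-3, rng=None):
    """One-sample variational IB loss:
    cross-entropy  +  beta * KL( q(t|x) || N(0, I) ).
    w_enc maps x to (mu, log-variance); w_dec maps a sampled t to logits."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h = x @ w_enc
    d = h.shape[1] // 2
    mu, logvar = h[:, :d], h[:, d:]
    t = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparameterization
    logits = t @ w_dec
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
    ce = -logp[np.arange(len(y)), y].mean()                    # prediction term
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=1).mean()  # compression term
    return ce + beta * kl

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))
y = rng.integers(0, 3, size=8)
loss = vib_loss(x, y, w_enc=0.1 * rng.normal(size=(6, 8)),
                w_dec=0.1 * rng.normal(size=(4, 3)))
```

The injected Gaussian noise is the point: it makes $I(X; T)$ finite, the KL term a tractable upper bound on it, and the compression measurable — sidestepping the infinite-MI problem of deterministic layers.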

Common Confusions

Watch Out

IB is not the same as regularization by compression

IB says: minimize $I(X; T)$ subject to preserving $I(T; Y)$. Standard regularization (e.g., weight decay, dropout) limits model complexity but does not directly optimize mutual information. IB is a principle about representations, not about parameters. The connection to standard regularization is indirect and still debated.

Watch Out

Mutual information is hard to estimate in high dimensions

Computing $I(X; T)$ for a high-dimensional continuous $T$ (like the activations of a 512-dimensional hidden layer) is extremely difficult. Naive binning, KDE, and even modern neural estimators (MINE) have high variance or bias in this regime. Claims about $I(X; T)$ during training should be treated skeptically unless the estimation method is carefully validated.

Watch Out

The data processing inequality does not say compression happens

The DPI says $I(X; T_l)$ is non-increasing across layers. For deterministic invertible mappings, $I(X; T_l) = I(X; X) = H(X)$ at every layer, so no compression occurs. The DPI sets an upper bound, not a description of what actually happens.

Exercises

ExerciseCore

Problem

In the IB objective $\min I(X; T) - \beta \, I(T; Y)$, what happens at the extremes $\beta \to 0$ and $\beta \to \infty$? What representation does the optimizer choose in each case?

ExerciseAdvanced

Problem

A two-layer neural network maps $X \in \mathbb{R}^{100}$ to $T_1 \in \mathbb{R}^{10}$ to $T_2 \in \mathbb{R}^{2}$ to $\hat{Y}$. All layers use ReLU. Explain why $I(X; T_1) = H(T_1)$ and why this makes the compression-phase claim problematic for this network.

References

Canonical:

  • Tishby, Pereira, Bialek, The Information Bottleneck Method (2000)
  • Shwartz-Ziv & Tishby, Opening the Black Box of Deep Neural Networks via Information (2017)

Current:

  • Saxe et al., On the Information Bottleneck Theory of Deep Learning (2018)
  • Alemi et al., Deep Variational Information Bottleneck (2017)

Last reviewed: April 2026