
Learning Theory Core

Kolmogorov Complexity and MDL

Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.



Why This Matters

The connection between compression and learning is deep: a model that compresses data well must have captured its regularities, and those regularities are exactly what you need for prediction. Kolmogorov complexity gives the theoretical ideal for compression. MDL (Minimum Description Length) gives a practical principle for model selection based on the same idea.

This perspective offers an alternative to the classical bias-variance framework. Instead of asking "how complex is the hypothesis class?", you ask "how short is the total description of the data using this model?"

Kolmogorov Complexity

Definition

Kolmogorov Complexity

The Kolmogorov complexity of a string $x$ (with respect to a fixed universal Turing machine $U$) is:

$$K(x) = \min\{|p| : U(p) = x\}$$

where $|p|$ is the length of program $p$ in bits. It is the length of the shortest program that outputs $x$ and halts.

Key properties:

  1. Subadditivity: $K(x, y) \leq K(x) + K(y) + O(\log(\min(K(x), K(y))))$.
  2. Invariance: changing the universal Turing machine changes $K(x)$ by at most an additive constant (independent of $x$).
  3. Most strings are incompressible: there are only $2^{n-c} - 1$ binary programs shorter than $n - c$ bits, so fewer than $2^{n-c}$ strings of length $n$ have $K(x) < n - c$. Hence less than a fraction $2^{-c}$ of strings can be compressed by more than $c$ bits.
Theorem

Incomputability of Kolmogorov Complexity

Statement

There is no algorithm that, given an arbitrary string $x$, computes $K(x)$.

Intuition

If you could compute $K(x)$ naively, you would enumerate programs in order of length and search for the shortest one that outputs $x$. But you cannot tell which of those programs halt, so that search is not effective. The formal argument is a Berry-paradox construction: a computable $K$ would let a short program exhibit a string of provably high complexity, a contradiction.

Proof Sketch

Suppose $K$ is computable. Define a program $P$ that enumerates strings in order, computes $K$ for each, and outputs the first string $x$ with $K(x) > n$. Program $P$ has length $O(\log n)$ (it encodes the constant $n$ plus a fixed-size routine for $K$). But $P$ outputs a string with $K(x) > n$, while $P$ itself is a program of length $O(\log n)$ that produces $x$. For large enough $n$, this gives $K(x) \leq O(\log n) < n < K(x)$, a contradiction.

Why It Matters

Incomputability is not a technicality. It means there is no algorithm that can determine the true complexity of data. Any practical approach to compression-based learning must use an approximation. MDL, BIC, and practical compression algorithms are all such approximations.

Failure Mode

You can compute upper bounds on $K(x)$ (any specific compression of $x$ gives one), but you can never certify that your bound is tight. There is no algorithm that, given $x$ and $k$, decides whether $K(x) \leq k$.
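Any off-the-shelf compressor gives such an upper bound: the compressed output plus the fixed decompressor is a program that reproduces the string. A sketch using zlib (the constant-size decompressor cost is ignored, so these are loose upper bounds, never lower bounds):

```python
import os
import zlib

def k_upper_bound_bits(s: bytes) -> int:
    # Upper bound on K(s): the zlib-compressed length, ignoring the O(1)
    # cost of the fixed decompressor. Says nothing about tightness.
    return 8 * len(zlib.compress(s, 9))

patterned = b"01" * 500        # highly regular: 1000 bytes
random_ish = os.urandom(1000)  # overwhelmingly likely near-incompressible

print(k_upper_bound_bits(patterned))   # far below the raw 8000 bits
print(k_upper_bound_bits(random_ish))  # at or above the raw 8000 bits
```

The asymmetry mirrors the failure mode above: the small number for the patterned string is informative, but the large number for the random-looking string proves nothing, since some short program might still generate it.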

Minimum Description Length

Definition

Minimum Description Length Principle

The MDL principle selects the model $M$ that minimizes the total description length:

$$L(M) + L(D \mid M)$$

where $L(M)$ is the length (in bits) of encoding the model and $L(D \mid M)$ is the length of encoding the data $D$ given the model. This is a two-part code: first describe the model, then describe the data using the model.

The first term penalizes model complexity (more parameters require more bits). The second term penalizes poor fit (if the model does not explain the data well, the residuals require many bits). MDL balances these automatically.

Connection to coding theory. By the Kraft inequality, code lengths correspond to probability distributions: $L(x) = -\log_2 P(x)$. So minimizing description length is equivalent to maximizing a penalized likelihood. MDL with a universal code for the model class recovers BIC asymptotically.
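A toy illustration of the two-part code (a hypothetical setup: a Bernoulli model class with the parameter quantized to a $k$-bit grid, so $L(M) = k$):

```python
import math

def code_length_bits(data, p):
    # L(D | M): Shannon code length -log2 P(D | M) for i.i.d. Bernoulli(p).
    return sum(-math.log2(p if x == 1 else 1 - p) for x in data)

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 8 ones, 2 zeros
for k in (1, 3, 6):
    # Grid of 2^k - 1 parameter values; naming a grid point costs k bits.
    best_fit = min(code_length_bits(data, i / 2 ** k) for i in range(1, 2 ** k))
    print(f"k={k}: total = {k + best_fit:.2f} bits")
```

A one-bit grid forces $p = 1/2$ and pays in the data term; a six-bit grid fits slightly better but pays more for the model. The middle resolution minimizes the total, which is the MDL trade-off in miniature.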

Theorem

MDL Consistency

Statement

Under a two-part MDL code with Kraft-valid code lengths for models, as the sample size $n \to \infty$, the MDL-selected model converges (in probability) to the simplest model in the class that contains the true data-generating distribution.

Intuition

As $n$ grows, the data-fit term $L(D \mid M)$ dominates. Models that do not contain the true distribution pay a per-sample penalty that grows linearly with $n$. Among models that do contain the true distribution, the model complexity term $L(M)$ breaks ties in favor of the simplest one.

Proof Sketch

For any model $M$ that does not contain the true distribution $P^*$, the average code length $\frac{1}{n} L(D \mid M)$ converges to the cross-entropy $H(P^*, P_M) > H(P^*)$. The true model achieves $H(P^*)$ plus a $O(\frac{d}{2n} \log n)$ per-sample penalty term, where $d$ is the number of parameters. For large $n$, the excess cross-entropy of the wrong model exceeds the complexity penalty of the right one.
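The linear per-sample penalty is easy to see numerically; a small simulation sketch with two Bernoulli models (exact values depend on the simulated sample):

```python
import math
import random

random.seed(0)
p_true = 0.8                 # true Bernoulli parameter
n = 100_000
data = [1 if random.random() < p_true else 0 for _ in range(n)]

def per_sample_bits(q):
    # Average code length (bits/sample) when encoding the data with Bernoulli(q).
    return sum(-math.log2(q if x else 1 - q) for x in data) / n

H = -(p_true * math.log2(p_true) + (1 - p_true) * math.log2(1 - p_true))
print(f"entropy H(P*)       = {H:.4f} bits/sample")
print(f"true model  q=0.8   = {per_sample_bits(0.8):.4f}")   # ~ H(P*)
print(f"wrong model q=0.5   = {per_sample_bits(0.5):.4f}")   # cross-entropy
```

The gap between the two models (about 0.28 bits per sample here) accumulates linearly in $n$, so no fixed model-complexity term can rescue the misspecified model.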

Why It Matters

This justifies MDL as a model selection principle: it is consistent (picks the right model eventually) and implements Occam's razor automatically. You do not need to specify a separate regularization parameter.

Failure Mode

Consistency is an asymptotic property. For finite samples, MDL can select a model that is too simple (if the complexity penalty overwhelms the data-fit term) or too complex (if the model class encoding is poorly chosen). The result also requires the true distribution to lie within the model class, which is never exactly true in practice.

Connection to Learning Theory

The compression view of learning: if a learning algorithm can compress the training data (describe it in fewer bits than the raw data), then the patterns it found are likely real and will generalize. More precisely, if a hypothesis $h$ allows encoding the training labels in $L(h) + L(y^n \mid h, x^n)$ bits, and this is less than $n$ bits (the cost of encoding binary labels with no model), then $h$ has captured structure.

This gives generalization bounds: by an Occam's-razor argument, a hypothesis describable in $L(h)$ bits has, with probability at least $1 - \delta$, a generalization gap of at most $\sqrt{(L(h) \ln 2 + \ln(2/\delta)) / (2n)}$. Shorter descriptions, i.e., better compression, give tighter bounds.
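A sketch of that bound as a function (this is the standard Occam's-razor form; the constants and the $\ln(2/\delta)$ term vary slightly between textbook statements):

```python
import math

def occam_gap(description_bits, n, delta=0.05):
    # With probability >= 1 - delta, every hypothesis describable in
    # `description_bits` bits has generalization gap at most this value.
    return math.sqrt((description_bits * math.log(2) + math.log(2 / delta)) / (2 * n))

n = 10_000
for bits in (10, 100, 1000):
    print(f"L(h) = {bits:4d} bits -> gap <= {occam_gap(bits, n):.4f}")
```

Note the bound depends only on the description length of the hypothesis, not on the size of any hypothesis class it came from: the description language plays the role of the class.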

Why MDL Is Hard to Operationalize

The elegance of MDL hides practical difficulties:

  1. Code design: the encoding of models and data given models is not unique. Different encodings lead to different model selections.
  2. Continuous parameters: encoding real-valued parameters requires discretization, and the grid resolution affects the result.
  3. Model class: MDL selects among a given collection of models. It cannot search over all possible models (that would require computing $K(x)$).

In practice, BIC and cross-validation are used more often than explicit MDL, partly because they avoid the coding construction.

Common Confusions

Watch Out

Kolmogorov complexity is not Shannon entropy

Shannon entropy $H(X)$ is a property of a distribution. Kolmogorov complexity $K(x)$ is a property of a specific string. For a random string drawn from distribution $P$, $\mathbb{E}[K(X)] \approx H(X)$ (up to $O(1)$), but individual strings can have $K(x)$ far from $H(X)$.
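A concrete contrast: both strings below are possible outcomes of 10,000 fair coin flips, so both carry the same Shannon code length under $P$, yet their complexities (here only upper-bounded via zlib, an assumption-laden proxy for $K$) differ enormously:

```python
import os
import zlib

n_bits = 10_000
typical = os.urandom(n_bits // 8)   # a typical sample: near-incompressible
atypical = bytes(n_bits // 8)       # all zeros: probability 2^-n, but possible

print(len(zlib.compress(typical)))   # close to the raw n/8 = 1250 bytes
print(len(zlib.compress(atypical)))  # a handful of bytes
```

Entropy averages over the distribution and cannot see that the all-zeros outcome is individually simple; Kolmogorov complexity can.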

Watch Out

MDL is not just penalized likelihood

While MDL can be viewed as penalized likelihood (since code lengths correspond to log-probabilities), the MDL philosophy is different. Penalized likelihood assumes a true model exists and tries to find it. MDL makes no such assumption: it seeks the best compression regardless of whether any model is "true." In the misspecified setting, MDL selects the model that compresses best, which may not be the most "likely" in any generative sense.

Canonical Examples

Example

MDL for polynomial regression

Suppose you fit polynomials of degree 1, 2, 5, and 20 to 50 data points. The degree-20 polynomial fits the training data perfectly (zero residuals, so $L(D \mid M) \approx 0$), but encoding 21 coefficients to high precision costs many bits ($L(M)$ is large). A degree-2 polynomial has small $L(M)$ (3 coefficients) but nonzero residuals that require some bits to encode. MDL selects the degree that minimizes the total: if the true relationship is quadratic, degree 2 wins for large enough $n$.
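The example can be sketched numerically. This is a crude two-part code, not a canonical one: coefficients are charged a flat $b = 16$ bits each, and $L(D \mid M)$ is the Gaussian code length $\frac{n}{2}\log_2(\mathrm{RSS}/n)$ with constants dropped, so the totals are comparable across degrees but are not absolute bit counts:

```python
import warnings
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1, 1, n)
y = 2 - 3 * x + 4 * x ** 2 + rng.normal(0, 0.3, n)   # true model is quadratic

def two_part_length(d, b=16):
    # L(M): (d + 1) coefficients at b bits each.
    # L(D|M): Gaussian code length (n/2) * log2(RSS/n), constants dropped.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")   # degree-20 fits are ill-conditioned
        coeffs = np.polyfit(x, y, d)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    return (d + 1) * b + 0.5 * n * np.log2(rss / n)

for d in (1, 2, 5, 20):
    print(f"degree {d:2d}: {two_part_length(d):8.1f} bits (relative)")
```

With this setup degree 2 minimizes the total: degree 1 pays heavily in residuals, while degrees 5 and 20 spend more bits on coefficients than they save in fit.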

Exercises

ExerciseCore

Problem

The string $x = 0101\ldots01$ (the pattern "01" repeated 500 times, total length 1000 bits) has $K(x) \ll 1000$. Give an upper bound on $K(x)$ and explain why. A random 1000-bit string has $K(x) \approx 1000$ with high probability. Why?

ExerciseAdvanced

Problem

In two-part MDL, you encode a polynomial regression model by specifying the degree $d$ and the $d+1$ coefficients. Suppose each coefficient is quantized to $b$ bits. Write the total description length as a function of $d$, $b$, $n$ (sample size), and the residual sum of squares $\mathrm{RSS}(d)$. Under what conditions does MDL prefer degree 2 over degree 10?

References

Canonical:

  • Li & Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (2008), Chapters 1-2
  • Grünwald, The Minimum Description Length Principle (2007), Chapters 1-3, 17

Current:

  • Rissanen, "Modeling by Shortest Data Description" (1978)

  • Barron, Rissanen, Yu, "The Minimum Description Length Principle in Coding and Modeling" (1998)

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2-6

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-4


Last reviewed: April 2026
