
Learning Theory Core

Kolmogorov Complexity and MDL

Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.



Why This Matters

The connection between compression and learning is deep: a model that compresses data well must have captured its regularities, and those regularities are exactly what you need for prediction. Kolmogorov complexity gives the theoretical ideal for compression. MDL (Minimum Description Length) gives a practical principle for model selection based on the same idea.

This perspective offers an alternative to the classical bias-variance framework. Instead of asking "how complex is the hypothesis class?", you ask "how short is the total description of the data using this model?"

Kolmogorov Complexity

Definition

Kolmogorov Complexity

The Kolmogorov complexity of a string $x$ (with respect to a fixed universal Turing machine $U$) is:

$$K(x) = \min\{|p| : U(p) = x\}$$

where $|p|$ is the length of program $p$ in bits. It is the length of the shortest program that outputs $x$ and halts.

Key properties:

  1. Subadditivity: $K(x, y) \leq K(x) + K(y) + O(\log(\min(K(x), K(y))))$.
  2. Invariance: changing the universal Turing machine changes $K(x)$ by at most an additive constant (independent of $x$).
  3. Most strings are incompressible: there are only $2^{n-c} - 1$ binary programs shorter than $n - c$ bits, so fewer than $2^{n-c}$ strings of length $n$ have $K(x) < n - c$. Hence less than a fraction $2^{-c}$ of strings can be compressed by more than $c$ bits.
Theorem

Incomputability of Kolmogorov Complexity

Statement

There is no algorithm that, given an arbitrary string $x$, computes $K(x)$.

Intuition

If you could compute $K(x)$ naively, you would enumerate programs in order of length and search for the shortest one that outputs $x$. But you cannot tell which of those programs halt, so that search is not effective. The formal argument is a Berry-paradox construction: a computable $K$ would let a short program exhibit a string of provably high complexity, a contradiction.

Proof Sketch

Suppose $K$ is computable. Define a program $P$ that enumerates strings in order, computes $K$ for each, and outputs the first string $x$ with $K(x) > n$. Program $P$ has length $O(\log n)$ (it encodes the constant $n$ plus a fixed-size routine for $K$). But $P$ outputs a string with $K(x) > n$, while $P$ itself is a program of length $O(\log n)$ that produces $x$. For large enough $n$, this gives $K(x) \leq O(\log n) < n < K(x)$, a contradiction.

Why It Matters

Incomputability is not a technicality. It means there is no algorithm that can determine the true complexity of data. Any practical approach to compression-based learning must use an approximation. MDL, BIC, and practical compression algorithms are all such approximations.

Failure Mode

You can compute upper bounds on $K(x)$ (any specific compression of $x$ gives one), but you can never certify that your bound is tight. There is no algorithm that, given $x$ and $k$, decides whether $K(x) \leq k$.
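Any off-the-shelf compressor gives such an upper bound: the compressed output plus the fixed decompressor is a program that reproduces the string. A sketch using zlib (the constant-size decompressor cost is ignored, so these are loose upper bounds, never lower bounds):

```python
import os
import zlib

def k_upper_bound_bits(s: bytes) -> int:
    # Upper bound on K(s): the zlib-compressed length, ignoring the O(1)
    # cost of the fixed decompressor. Says nothing about tightness.
    return 8 * len(zlib.compress(s, 9))

patterned = b"01" * 500        # highly regular: 1000 bytes
random_ish = os.urandom(1000)  # overwhelmingly likely near-incompressible

print(k_upper_bound_bits(patterned))   # far below the raw 8000 bits
print(k_upper_bound_bits(random_ish))  # at or above the raw 8000 bits
```

The asymmetry mirrors the failure mode above: the small number for the patterned string is informative, but the large number for the random-looking string proves nothing, since some short program might still generate it.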

Minimum Description Length

Definition

Minimum Description Length Principle

The MDL principle selects the model $M$ that minimizes the total description length:

$$L(M) + L(D \mid M)$$

where $L(M)$ is the length (in bits) of encoding the model and $L(D \mid M)$ is the length of encoding the data $D$ given the model. This is a two-part code: first describe the model, then describe the data using the model.

The first term penalizes model complexity (more parameters require more bits). The second term penalizes poor fit (if the model does not explain the data well, the residuals require many bits). MDL balances these automatically.

Connection to coding theory. By the Kraft inequality, code lengths correspond to probability distributions: $L(x) = -\log_2 P(x)$. So minimizing description length is equivalent to maximizing a penalized likelihood. MDL with a universal code for the model class recovers BIC asymptotically.
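A toy illustration of the two-part code (a hypothetical setup: a Bernoulli model class with the parameter quantized to a $k$-bit grid, so $L(M) = k$):

```python
import math

def code_length_bits(data, p):
    # L(D | M): Shannon code length -log2 P(D | M) for i.i.d. Bernoulli(p).
    return sum(-math.log2(p if x == 1 else 1 - p) for x in data)

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 8 ones, 2 zeros
for k in (1, 3, 6):
    # Grid of 2^k - 1 parameter values; naming a grid point costs k bits.
    best_fit = min(code_length_bits(data, i / 2 ** k) for i in range(1, 2 ** k))
    print(f"k={k}: total = {k + best_fit:.2f} bits")
```

A one-bit grid forces $p = 1/2$ and pays in the data term; a six-bit grid fits slightly better but pays more for the model. The middle resolution minimizes the total, which is the MDL trade-off in miniature.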

Theorem

MDL Consistency

Statement

Under a two-part MDL code with Kraft-valid code lengths for models, as the sample size $n \to \infty$, the MDL-selected model converges (in probability) to the simplest model in the class that contains the true data-generating distribution.

Intuition

As $n$ grows, the data-fit term $L(D \mid M)$ dominates. Models that do not contain the true distribution pay a per-sample penalty that grows linearly with $n$. Among models that do contain the true distribution, the model complexity term $L(M)$ breaks ties in favor of the simplest one.

Proof Sketch

For any model $M$ that does not contain the true distribution $P^*$, the average code length $\frac{1}{n} L(D \mid M)$ converges to the cross-entropy $H(P^*, P_M) > H(P^*)$. The true model achieves $H(P^*)$ plus a $O(\frac{d}{2n} \log n)$ per-sample penalty term, where $d$ is the number of parameters. For large $n$, the excess cross-entropy of the wrong model exceeds the complexity penalty of the right one.
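The linear per-sample penalty is easy to see numerically; a small simulation sketch with two Bernoulli models (exact values depend on the simulated sample):

```python
import math
import random

random.seed(0)
p_true = 0.8                 # true Bernoulli parameter
n = 100_000
data = [1 if random.random() < p_true else 0 for _ in range(n)]

def per_sample_bits(q):
    # Average code length (bits/sample) when encoding the data with Bernoulli(q).
    return sum(-math.log2(q if x else 1 - q) for x in data) / n

H = -(p_true * math.log2(p_true) + (1 - p_true) * math.log2(1 - p_true))
print(f"entropy H(P*)       = {H:.4f} bits/sample")
print(f"true model  q=0.8   = {per_sample_bits(0.8):.4f}")   # ~ H(P*)
print(f"wrong model q=0.5   = {per_sample_bits(0.5):.4f}")   # cross-entropy
```

The gap between the two models (about 0.28 bits per sample here) accumulates linearly in $n$, so no fixed model-complexity term can rescue the misspecified model.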

Why It Matters

This justifies MDL as a model selection principle: it is consistent (picks the right model eventually) and implements Occam's razor automatically. You do not need to specify a separate regularization parameter.

Failure Mode

Consistency is an asymptotic property. For finite samples, MDL can select a model that is too simple (if the complexity penalty overwhelms the data-fit term) or too complex (if the model class encoding is poorly chosen). The result also requires the true distribution to lie within the model class, which is never exactly true in practice.

Connection to Learning Theory

The compression view of learning: if a learning algorithm can compress the training data (describe it in fewer bits than the raw data), then the patterns it found are likely real and will generalize. More precisely, if a hypothesis $h$ allows encoding the training labels in $L(h) + L(y^n \mid h, x^n)$ bits, and this is less than $n$ bits (the cost of encoding binary labels with no model), then $h$ has captured structure.

This gives generalization bounds: by an Occam's-razor argument, a hypothesis describable in $L(h)$ bits has, with probability at least $1 - \delta$, a generalization gap of at most $\sqrt{(L(h) \ln 2 + \ln(2/\delta)) / (2n)}$. Shorter descriptions, i.e., better compression, give tighter bounds.
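A sketch of that bound as a function (this is the standard Occam's-razor form; the constants and the $\ln(2/\delta)$ term vary slightly between textbook statements):

```python
import math

def occam_gap(description_bits, n, delta=0.05):
    # With probability >= 1 - delta, every hypothesis describable in
    # `description_bits` bits has generalization gap at most this value.
    return math.sqrt((description_bits * math.log(2) + math.log(2 / delta)) / (2 * n))

n = 10_000
for bits in (10, 100, 1000):
    print(f"L(h) = {bits:4d} bits -> gap <= {occam_gap(bits, n):.4f}")
```

Note the bound depends only on the description length of the hypothesis, not on the size of any hypothesis class it came from: the description language plays the role of the class.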

Why MDL Is Hard to Operationalize

The elegance of MDL hides practical difficulties:

  1. Code design: the encoding of models and data given models is not unique. Different encodings lead to different model selections.
  2. Continuous parameters: encoding real-valued parameters requires discretization, and the grid resolution affects the result.
  3. Model class: MDL selects among a given collection of models. It cannot search over all possible models (that would require computing $K(x)$).

In practice, BIC and cross-validation are used more often than explicit MDL, partly because they avoid the coding construction.

Common Confusions

Watch Out

Kolmogorov complexity is not Shannon entropy

Shannon entropy $H(X)$ is a property of a distribution. Kolmogorov complexity $K(x)$ is a property of a specific string. For a random string drawn from distribution $P$, $\mathbb{E}[K(X)] \approx H(X)$ (up to $O(1)$), but individual strings can have $K(x)$ far from $H(X)$.
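A concrete contrast: both strings below are possible outcomes of 10,000 fair coin flips, so both carry the same Shannon code length under $P$, yet their complexities (here only upper-bounded via zlib, an assumption-laden proxy for $K$) differ enormously:

```python
import os
import zlib

n_bits = 10_000
typical = os.urandom(n_bits // 8)   # a typical sample: near-incompressible
atypical = bytes(n_bits // 8)       # all zeros: probability 2^-n, but possible

print(len(zlib.compress(typical)))   # close to the raw n/8 = 1250 bytes
print(len(zlib.compress(atypical)))  # a handful of bytes
```

Entropy averages over the distribution and cannot see that the all-zeros outcome is individually simple; Kolmogorov complexity can.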

Watch Out

MDL is not just penalized likelihood

While MDL can be viewed as penalized likelihood (since code lengths correspond to log-probabilities), the MDL philosophy is different. Penalized likelihood assumes a true model exists and tries to find it. MDL makes no such assumption: it seeks the best compression regardless of whether any model is "true." In the misspecified setting, MDL selects the model that compresses best, which may not be the most "likely" in any generative sense.

Canonical Examples

Example

MDL for polynomial regression

Suppose you fit polynomials of degree 1, 2, 5, and 20 to 50 data points. The degree-20 polynomial fits the training data perfectly (zero residuals, so $L(D \mid M) \approx 0$), but encoding 21 coefficients to high precision costs many bits ($L(M)$ is large). A degree-2 polynomial has small $L(M)$ (3 coefficients) but nonzero residuals that require some bits to encode. MDL selects the degree that minimizes the total: if the true relationship is quadratic, degree 2 wins for large enough $n$.
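The example can be sketched numerically. This is a crude two-part code, not a canonical one: coefficients are charged a flat $b = 16$ bits each, and $L(D \mid M)$ is the Gaussian code length $\frac{n}{2}\log_2(\mathrm{RSS}/n)$ with constants dropped, so the totals are comparable across degrees but are not absolute bit counts:

```python
import warnings
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1, 1, n)
y = 2 - 3 * x + 4 * x ** 2 + rng.normal(0, 0.3, n)   # true model is quadratic

def two_part_length(d, b=16):
    # L(M): (d + 1) coefficients at b bits each.
    # L(D|M): Gaussian code length (n/2) * log2(RSS/n), constants dropped.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")   # degree-20 fits are ill-conditioned
        coeffs = np.polyfit(x, y, d)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    return (d + 1) * b + 0.5 * n * np.log2(rss / n)

for d in (1, 2, 5, 20):
    print(f"degree {d:2d}: {two_part_length(d):8.1f} bits (relative)")
```

With this setup degree 2 minimizes the total: degree 1 pays heavily in residuals, while degrees 5 and 20 spend more bits on coefficients than they save in fit.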

Exercises

ExerciseCore

Problem

The string $x = 0101\ldots01$ (the pattern "01" repeated 500 times, total length 1000 bits) has $K(x) \ll 1000$. Give an upper bound on $K(x)$ and explain why. A random 1000-bit string has $K(x) \approx 1000$ with high probability. Why?

ExerciseAdvanced

Problem

In two-part MDL, you encode a polynomial regression model by specifying the degree $d$ and the $d+1$ coefficients. Suppose each coefficient is quantized to $b$ bits. Write the total description length as a function of $d$, $b$, $n$ (sample size), and the residual sum of squares $\mathrm{RSS}(d)$. Under what conditions does MDL prefer degree 2 over degree 10?

References

Canonical:

  • Li & Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (2008), Chapters 1-2
  • Grünwald, The Minimum Description Length Principle (2007), Chapters 1-3, 17

Current:

  • Rissanen, "Modeling by Shortest Data Description" (1978)

  • Barron, Rissanen, Yu, "The Minimum Description Length Principle in Coding and Modeling" (1998)

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2-6

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-4


Last reviewed: April 2026
