Learning Theory Core
Kolmogorov Complexity and MDL
Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.
Why This Matters
The connection between compression and learning is deep: a model that compresses data well must have captured its regularities, and those regularities are exactly what you need for prediction. Kolmogorov complexity gives the theoretical ideal for compression. MDL (Minimum Description Length) gives a practical principle for model selection based on the same idea.
This perspective offers an alternative to the classical bias-variance framework. Instead of asking "how complex is the hypothesis class?", you ask "how short is the total description of the data using this model?"
Kolmogorov Complexity
Kolmogorov Complexity
The Kolmogorov complexity $K_U(x)$ of a string $x$ (with respect to a fixed universal Turing machine $U$) is:

$$K_U(x) = \min \{ |p| : U(p) = x \}$$

where $|p|$ is the length of program $p$ in bits. It is the length of the shortest program that outputs $x$ and halts.
Key properties:
- Subadditivity: $K(xy) \le K(x) + K(y) + O(\log \min(K(x), K(y)))$.
- Invariance: changing the universal Turing machine changes $K(x)$ by at most an additive constant (independent of $x$): $|K_U(x) - K_V(x)| \le c_{U,V}$.
- Most strings are incompressible: for any length $n$, fewer than $2^{n-c}$ strings of length $n$ have $K(x) < n - c$. So at most a $2^{-c}$ fraction of strings can be compressed by $c$ bits.
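The counting argument behind the last property can be checked directly. A minimal sketch; the choice of $n$ and $c$ here is arbitrary:

```python
# Counting argument: there are fewer than 2^(n-c) binary programs of
# length < n - c, so fewer than a 2^-c fraction of the 2^n strings of
# length n can be compressed by c or more bits.
n, c = 20, 5
num_strings = 2 ** n                 # all binary strings of length n
short_programs = 2 ** (n - c) - 1    # programs of length < n - c
fraction = short_programs / num_strings
print(fraction)                      # just under 2**-5 = 0.03125
assert fraction < 2 ** -c
```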
Incomputability of Kolmogorov Complexity
Statement
There is no algorithm that, given an arbitrary string $x$, computes $K(x)$.
Intuition
If you could compute $K$, you could enumerate programs by length and find the shortest one that produces $x$. But this would let you solve the halting problem: to check whether program $p$ halts, search for the shortest program producing $p$'s output and compare lengths. Since the halting problem is undecidable, $K$ must be incomputable.
Proof Sketch
Suppose $K$ is computable. Define a program $P_c$ that enumerates all strings $x$ with $K(x) > c$ and outputs the first such string found. Program $P_c$ has length $O(\log c)$ (it just encodes the constant $c$). But $P_c$ outputs a string $x$ with $K(x) > c$, while $P_c$ itself is a program of length $O(\log c)$ that produces $x$. For large enough $c$, this gives $c < K(x) \le O(\log c) < c$, a contradiction.
Why It Matters
Incomputability is not a technicality. It means there is no algorithm that can determine the true complexity of data. Any practical approach to compression-based learning must use an approximation. MDL, BIC, and practical compression algorithms are all such approximations.
Failure Mode
You can compute upper bounds on $K(x)$ (any specific compression of $x$ gives one), but you can never certify that your bound is tight. There is no algorithm that, given $x$ and $c$, decides whether $K(x) \le c$.
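A practical illustration of computable upper bounds, using zlib as the (fixed) decompressor; the additive constant for the decompressor itself is ignored here:

```python
import os
import zlib

# Any off-the-shelf compressor gives a computable upper bound on K(x)
# (up to a constant for the decompressor), but can never certify tightness.
structured = b"01" * 500         # 1000 bytes with an obvious regularity
random_data = os.urandom(1000)   # 1000 bytes, almost surely incompressible

print(len(zlib.compress(structured, 9)))    # a few dozen bytes
print(len(zlib.compress(random_data, 9)))   # roughly 1000 bytes or more
```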
Minimum Description Length
Minimum Description Length Principle
The MDL principle selects the model $M$ that minimizes the total description length:

$$L(M) + L(D \mid M)$$

where $L(M)$ is the length (in bits) of encoding the model and $L(D \mid M)$ is the length of encoding the data given the model. This is a two-part code: first describe the model, then describe the data using the model.
The first term penalizes model complexity (more parameters require more bits). The second term penalizes poor fit (if the model does not explain the data well, the residuals require many bits). MDL balances these automatically.
Connection to coding theory. By the Shannon-Kraft inequality, code lengths correspond to probability distributions: $L(x) = -\log_2 P(x)$. So minimizing description length is equivalent to maximizing a penalized likelihood. MDL with a universal code for the model class recovers BIC asymptotically.
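The code-length/probability correspondence can be checked numerically. A sketch with an arbitrary toy (dyadic) distribution:

```python
import math

# Shannon-Kraft correspondence: idealized code lengths L(x) = -log2 P(x).
# Any lengths satisfying the Kraft inequality sum 2^-L(x) <= 1 arise this way.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
L = {x: -math.log2(p) for x, p in P.items()}
print(L)                              # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
kraft_sum = sum(2 ** -l for l in L.values())
assert abs(kraft_sum - 1.0) < 1e-12   # Kraft holds with equality for dyadic P
```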
MDL Consistency
Statement
Under a two-part MDL code with Kraft-valid code lengths for models, as the sample size $n \to \infty$, the MDL-selected model converges (in probability) to the simplest model in the class that contains the true data-generating distribution.
Intuition
As $n$ grows, the data-fit term dominates. Models that do not contain the true distribution pay a per-sample penalty that grows linearly with $n$. Among models that do contain the true distribution, the model complexity term breaks ties in favor of the simplest one.
Proof Sketch
For any model $M$ that does not contain the true distribution $P^*$, the average per-sample code length converges to the cross-entropy $H(P^*, Q_M)$, which strictly exceeds the entropy $H(P^*)$. The true model achieves $H(P^*)$ plus a per-sample penalty term that vanishes as $n \to \infty$. For large $n$, the excess cross-entropy of the wrong model exceeds the complexity penalty of the right one.
Why It Matters
This justifies MDL as a model selection principle: it is consistent (picks the right model eventually) and implements Occam's razor automatically. You do not need to specify a separate regularization parameter.
Failure Mode
Consistency is an asymptotic property. For finite samples, MDL can select a model that is too simple (if the complexity penalty overwhelms the data-fit term) or too complex (if the model class encoding is poorly chosen). The result also requires the true distribution to lie within the model class, which is never exactly true in practice.
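The consistency argument can be simulated with a toy two-part code for coin flips. This is a sketch under an assumed encoding: the Bernoulli model pays the standard asymptotic $\tfrac{1}{2}\log_2 n$ bits for its quantized parameter, while the fixed fair-coin model pays none:

```python
import math

def L_fair(k, n):
    # Fair-coin model: no parameters, 1 bit per flip at p = 1/2.
    return float(n)

def L_bernoulli(k, n):
    # Bernoulli model: data coded at the MLE theta = k/n, plus the
    # asymptotic 0.5*log2(n) parameter cost (an assumed encoding).
    theta = k / n
    data = -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))
    return data + 0.5 * math.log2(n)

n = 1000
# Fair data: both models contain the truth; the complexity term breaks
# the tie in favor of the simpler (fair) model.
assert L_fair(n // 2, n) < L_bernoulli(n // 2, n)
# Biased data (70% heads): the fair model pays a per-sample penalty that
# grows linearly in n, so the richer Bernoulli model eventually wins.
k = round(0.7 * n)
assert L_bernoulli(k, n) < L_fair(k, n)
print("MDL picks the simplest adequate model")
```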
Connection to Learning Theory
The compression view of learning: if a learning algorithm can compress the training data (describe it in fewer bits than the raw data), then the patterns it found are likely real and will generalize. More precisely, if a hypothesis $h$ allows encoding the training labels in $k$ bits, and $k$ is less than $m$ bits (the cost of encoding $m$ binary labels with no model), then $h$ has captured structure.
This gives generalization bounds: a hypothesis describable in $k$ bits that is consistent with $m$ training examples has, with probability $1 - \delta$, generalization error bounded roughly by $(k \ln 2 + \ln(1/\delta))/m$ (an Occam's razor bound). Better compression implies better generalization.
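The standard Occam's razor bound can be evaluated directly; a sketch, with arbitrary illustrative values of $k$ and $m$:

```python
import math

def occam_bound(k_bits, m, delta=0.05):
    # Occam's razor bound: a hypothesis described in k bits that is
    # consistent with m training examples has, with probability 1 - delta,
    # true error at most (k*ln 2 + ln(1/delta)) / m.
    return (k_bits * math.log(2) + math.log(1 / delta)) / m

print(occam_bound(100, 10000))   # short description, many samples: tight
print(occam_bound(5000, 10000))  # long description: weak guarantee
```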
Why MDL Is Hard to Operationalize
The elegance of MDL hides practical difficulties:
- Code design: the encoding of models and data given models is not unique. Different encodings lead to different model selections.
- Continuous parameters: encoding real-valued parameters requires discretization, and the grid resolution affects the result.
- Model class: MDL selects among a given collection of models. It cannot search over all possible models (that would require computing $K$).
In practice, BIC and cross-validation are used more often than explicit MDL, partly because they avoid the coding construction.
Common Confusions
Kolmogorov complexity is not Shannon entropy
Shannon entropy is a property of a distribution. Kolmogorov complexity is a property of a specific string. For a string $x$ of length $n$ drawn i.i.d. from distribution $P$, the expected complexity satisfies $E[K(x)] \approx n H(P)$ (up to $O(\log n)$), but individual strings can have $K(x)$ far from $n H(P)$.
MDL is not just penalized likelihood
While MDL can be viewed as penalized likelihood (since code lengths correspond to log-probabilities), the MDL philosophy is different. Penalized likelihood assumes a true model exists and tries to find it. MDL makes no such assumption: it seeks the best compression regardless of whether any model is "true." In the misspecified setting, MDL selects the model that compresses best, which may not be the most "likely" in any generative sense.
Canonical Examples
MDL for polynomial regression
Suppose you fit polynomials of degree 1, 2, 5, and 20 to 50 data points. The degree-20 polynomial fits the training data perfectly (zero residuals, so $L(D \mid M) \approx 0$), but encoding 21 coefficients to high precision costs many bits ($L(M)$ is large). A degree-2 polynomial has small $L(M)$ (3 coefficients) but nonzero residuals that require some bits to encode. MDL selects the degree that minimizes the total $L(M) + L(D \mid M)$: if the true relationship is quadratic, degree 2 wins for large enough $n$.
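A minimal numerical version of this example, under an assumed crude encoding: 32 bits per stored coefficient for $L(M)$, and a Gaussian residual code $\approx \tfrac{n}{2}\log_2(\mathrm{RSS}/n)$ for $L(D \mid M)$. The encoding choices are illustrative, not canonical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, n)  # truth is quadratic

def description_length(deg):
    # L(M): 32 bits per coefficient (a crude, fixed quantization).
    # L(D|M): Gaussian code for residuals, (n/2)*log2(RSS/n) up to constants.
    coeffs = np.polyfit(x, y, deg)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    return 32 * (deg + 1) + 0.5 * n * np.log2(max(rss / n, 1e-12))

best = min(range(1, 11), key=description_length)
print(best)  # the quadratic minimizes the two-part code length
```

Degree 1 pays heavily in residual bits; degrees above 2 shrink the residuals too slowly to repay the extra 32 bits per coefficient.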
Exercises
Problem
The string $x = (01)^{500}$ (the pattern "01" repeated 500 times, total length 1000 bits) has $K(x) \ll 1000$. Give an upper bound on $K(x)$ and explain why. A random 1000-bit string has $K(x) \approx 1000$ with high probability. Why?
Problem
In two-part MDL, you encode a polynomial regression model by specifying the degree $d$ and the $d + 1$ coefficients. Suppose each coefficient is quantized to $b$ bits. Write the total description length as a function of $d$, $b$, $n$ (sample size), and the residual sum of squares $\mathrm{RSS}$. Under what conditions does MDL prefer degree 2 over degree 10?
References
Canonical:
- Li & Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (2008), Chapters 1-2
- Grünwald, The Minimum Description Length Principle (2007), Chapters 1-3, 17
Current:
- Rissanen, "Modeling by Shortest Data Description" (1978)
- Barron, Rissanen & Yu, "The Minimum Description Length Principle in Coding and Modeling" (1998)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2-6
- Mohri, Rostamizadeh & Talwalkar, Foundations of Machine Learning (2018), Chapters 2-4
Next Topics
- AIC and BIC: alternative model selection criteria with connections to MDL
- Algorithmic stability: another route from simplicity to generalization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- P vs NP (Layer 0A)
- Sorting Algorithms (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)