Foundations
Counting and Combinatorics
Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.
Why This Matters
Combinatorics appears throughout learning theory in unexpected places.
VC dimension theory counts the number of labelings a hypothesis class can produce on $n$ points. The Sauer-Shelah lemma bounds this by $\sum_{i=0}^{d} \binom{n}{i}$, where $d$ is the VC dimension. Bounding this sum requires knowing how binomial coefficients behave.
Union bounds over hypothesis classes require counting. Covering number arguments require counting. The PAC learning sample complexity formula has terms that you simplify using Stirling's approximation.
Counting Principles
Product rule: if task A has $m$ outcomes and task B has $n$ outcomes, the pair (A, B) has $mn$ outcomes. This extends to any finite number of tasks.
Sum rule: if tasks A and B are mutually exclusive, the total number of outcomes is $m + n$.
Permutations: the number of ordered arrangements of $k$ items from $n$ is $n!/(n-k)!$.
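These rules can be checked directly with Python's standard library (`math.perm` requires Python 3.8+):

```python
from math import factorial, perm

# Product rule: 3 choices for task A and 4 for task B give 3 * 4 = 12 pairs.
pairs = 3 * 4

# Permutations: ordered arrangements of k items from n is n! / (n - k)!.
n, k = 5, 3
arrangements = perm(n, k)  # 5 * 4 * 3 = 60
assert arrangements == factorial(n) // factorial(n - k)
```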
Core Definitions
Binomial Coefficient
The binomial coefficient $\binom{n}{k}$ counts the number of ways to choose $k$ items from $n$ without regard to order:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

for $0 \le k \le n$. By convention, $\binom{n}{k} = 0$ if $k < 0$ or $k > n$.
Key properties:
- Symmetry: $\binom{n}{k} = \binom{n}{n-k}$
- Pascal's rule: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$
- Sum: $\sum_{k=0}^{n} \binom{n}{k} = 2^n$
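All three properties can be verified numerically with `math.comb`:

```python
from math import comb

n = 10
# Symmetry: C(n, k) = C(n, n - k)
assert all(comb(n, k) == comb(n, n - k) for k in range(n + 1))
# Pascal's rule: C(n, k) = C(n-1, k-1) + C(n-1, k)
assert all(comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k) for k in range(1, n))
# Sum: the binomial coefficients of n sum to 2^n
assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n
```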
Multinomial Coefficient
The number of ways to partition $n$ objects into $m$ groups of sizes $k_1, \dots, k_m$ (where $k_1 + \dots + k_m = n$) is:

$$\binom{n}{k_1, \dots, k_m} = \frac{n!}{k_1!\, k_2! \cdots k_m!}$$
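A small illustrative helper (the name `multinomial` is our own, not a standard-library function):

```python
from math import factorial

def multinomial(*ks):
    """n! / (k1! * ... * km!), where n = k1 + ... + km."""
    coeff = factorial(sum(ks))
    for k in ks:
        coeff //= factorial(k)
    return coeff

# Partition 10 objects into groups of sizes 5, 3, 2: 10! / (5! 3! 2!) = 2520.
assert multinomial(5, 3, 2) == 2520
```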
Main Theorems
Binomial Theorem
Statement
For any $x, y$ and non-negative integer $n$:

$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$$
Intuition
Expand $(x+y)^n = (x+y)(x+y)\cdots(x+y)$. Each term in the expansion picks either $x$ or $y$ from each of the $n$ factors. The term $x^k y^{n-k}$ appears once for each way to choose $k$ factors to contribute $x$, which is $\binom{n}{k}$.
Proof Sketch
Induction on $n$. Base case $n = 0$: both sides equal 1. Inductive step: $(x+y)^{n+1} = (x+y)(x+y)^n$. Distribute and reindex using Pascal's rule.
Why It Matters
Setting $x = y = 1$ gives $\sum_{k=0}^{n} \binom{n}{k} = 2^n$: a set of size $n$ has $2^n$ subsets. Setting $x = -1, y = 1$ gives $\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0$: the number of even-size subsets equals the number of odd-size subsets. These identities appear in VC theory and randomized arguments.
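Both the theorem and the two identities can be verified numerically:

```python
from math import comb

def binomial_expansion(x, y, n):
    """Sum_{k=0}^{n} C(n, k) * x^k * y^(n-k)."""
    return sum(comb(n, k) * x ** k * y ** (n - k) for k in range(n + 1))

# The expansion agrees with direct evaluation of (x + y)^n.
assert binomial_expansion(2, 3, 7) == (2 + 3) ** 7
# x = y = 1: the 2^n subsets identity.
assert binomial_expansion(1, 1, 10) == 2 ** 10
# x = -1, y = 1: the alternating sum vanishes.
assert binomial_expansion(-1, 1, 10) == 0
```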
Failure Mode
The formula is exact for non-negative integer $n$. For non-integer or negative exponents, use the generalized binomial series $(1+x)^{\alpha} = \sum_{k=0}^{\infty} \binom{\alpha}{k} x^k$, which is an infinite series and converges only for $|x| < 1$.
Inclusion-Exclusion
Inclusion-Exclusion Principle
For finite sets $A_1, \dots, A_n$:

$$\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{i} |A_i| - \sum_{i < j} |A_i \cap A_j| + \sum_{i < j < k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n|$$
This alternating sum corrects for overcounting at each level.
In probability, replace $|\cdot|$ with $\mathbb{P}[\cdot]$:

$$\mathbb{P}\left[\bigcup_{i=1}^{n} A_i\right] = \sum_{i} \mathbb{P}[A_i] - \sum_{i < j} \mathbb{P}[A_i \cap A_j] + \cdots$$
The union bound $\mathbb{P}[\bigcup_i A_i] \le \sum_i \mathbb{P}[A_i]$ is the first term of inclusion-exclusion, which overcounts. The Bonferroni inequalities give alternating upper and lower bounds by truncating at different levels.
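A brute-force check of the alternating sum on three small example sets:

```python
from itertools import combinations

def union_size(sets):
    """|A1 ∪ ... ∪ An| via the alternating inclusion-exclusion sum."""
    total = 0
    for r in range(1, len(sets) + 1):
        for group in combinations(sets, r):
            total += (-1) ** (r + 1) * len(set.intersection(*group))
    return total

A, B, C = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}
# 11 (singles) - 5 (pairwise overlaps) + 1 (triple overlap) = 7
assert union_size([A, B, C]) == len(A | B | C) == 7
```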
Stirling's Approximation
Stirling Approximation
Statement
$$n! \sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$$

The cruder but often sufficient form is:

$$\ln n! = n \ln n - n + O(\ln n)$$
Intuition
The factorial grows faster than any exponential $c^n$ but slower than $n^n$. Stirling's formula pins down the growth rate: $n!$ is approximately $(n/e)^n$ times a polynomial correction $\sqrt{2\pi n}$.
Proof Sketch
Write $\ln n! = \sum_{k=1}^{n} \ln k$ and approximate this sum by the integral $\int_1^n \ln x \, dx = n \ln n - n + 1$. The factor $\sqrt{2\pi n}$ comes from a more refined analysis using the Euler-Maclaurin formula or the Laplace method applied to the Gamma function integral $n! = \int_0^\infty x^n e^{-x} \, dx$.
Why It Matters
Stirling lets you simplify binomial coefficients. For example: $\ln \binom{n}{k} \approx n H(k/n)$, where $H(p) = -p \ln p - (1-p) \ln(1-p)$ is the binary entropy. This connection between counting and entropy appears throughout information theory and learning theory.
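A numerical sketch of the entropy approximation; the neglected correction is only $O(\log n)$, so agreement is close but not exact:

```python
from math import comb, log

def binary_entropy(p):
    """H(p) = -p ln p - (1 - p) ln(1 - p), natural log."""
    return -p * log(p) - (1 - p) * log(1 - p)

n, k = 1000, 300
exact = log(comb(n, k))
approx = n * binary_entropy(k / n)
# The entropy term dominates; the relative gap is under 2% at n = 1000.
assert abs(exact - approx) / exact < 0.02
```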
Failure Mode
The approximation is asymptotic; for small $n$ the relative error can be noticeable. For $n = 1$, Stirling gives $\sqrt{2\pi}/e \approx 0.922$ vs the true value of 1, a relative error of about 8%. For $n \ge 10$, the relative error is below 1%.
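Checking the error claims directly:

```python
from math import exp, factorial, pi, sqrt

def stirling(n):
    """sqrt(2 * pi * n) * (n / e)^n."""
    return sqrt(2 * pi * n) * (n / exp(1)) ** n

# n = 1: relative error of roughly 8%.
assert abs(stirling(1) - 1) > 0.05
# n = 10: relative error already below 1%.
assert abs(stirling(10) - factorial(10)) / factorial(10) < 0.01
```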
Applications in Machine Learning Theory
Hypothesis Counting and VC Theory
The Sauer-Shelah lemma bounds the growth function $\Pi_{\mathcal{H}}(n)$ of a hypothesis class $\mathcal{H}$ with VC dimension $d$:

$$\Pi_{\mathcal{H}}(n) \le \sum_{i=0}^{d} \binom{n}{i}$$
For fixed $d$ and growing $n$, this sum is $O(n^d)$, a polynomial bound. Without this combinatorial bound, the growth function could be as large as $2^n$ (all possible labelings), making uniform convergence arguments impossible. The transition from exponential to polynomial growth is the reason finite VC dimension implies learnability.
To see why: for $n \ge 2d$, $\sum_{i=0}^{d} \binom{n}{i} \le (d+1) \binom{n}{d} \le (d+1) n^d$ (each term is at most $\binom{n}{d}$, and there are $d+1$ terms). The PAC learning sample complexity bound then becomes $m = O\!\left(\frac{d \ln(1/\epsilon) + \ln(1/\delta)}{\epsilon}\right)$, which is polynomial in $d$, $1/\epsilon$, and $1/\delta$.
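The polynomial-vs-exponential gap is easy to see numerically:

```python
from math import comb

def sauer_bound(n, d):
    """Sauer-Shelah bound on the growth function: sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 100, 5
bound = sauer_bound(n, d)
assert bound <= (d + 1) * n ** d   # polynomial in n for fixed d
assert bound < 2 ** n              # vastly fewer than all 2^n labelings
```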
Model Selection and Counting Arguments
When comparing $k$ models via a held-out validation set of size $n$, the probability that the best model on the validation set is also the best model on unseen data depends on $k$. A union bound over the $k$ candidates (via Hoeffding's inequality) gives:

$$\mathbb{P}\left[\max_{i \le k} \left|\hat{R}(h_i) - R(h_i)\right| > \epsilon\right] \le 2k\, e^{-2n\epsilon^2}$$
This means the validation set size must grow as $O(\ln k)$ to maintain the same guarantee. With $k$ hyperparameter configurations, the required validation set grows only by an additive $O(\ln k)$ term compared with a single model. The $\ln k$ dependence comes from the union bound, which is a counting argument.
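A minimal sketch of the required validation size, assuming the Hoeffding-plus-union bound above is set equal to a failure probability $\delta$ and solved for $n$ (the parameter values are illustrative):

```python
from math import ceil, log

def validation_size(k, eps, delta):
    """Smallest n with 2 * k * exp(-2 * n * eps^2) <= delta."""
    return ceil(log(2 * k / delta) / (2 * eps ** 2))

n_single = validation_size(1, 0.05, 0.05)
n_sweep = validation_size(1000, 0.05, 0.05)
# 1000x more candidate models needs well under 3x more validation data.
assert n_sweep < 3 * n_single
```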
Multinomial Coefficients in Text Modeling
In the multinomial naive Bayes model, the probability of a document with word counts $k_1, \dots, k_m$ over a vocabulary of size $m$ is:

$$P(k_1, \dots, k_m \mid c) = \binom{n}{k_1, \dots, k_m} \prod_{j=1}^{m} p_{j \mid c}^{\,k_j}$$

where $n = \sum_j k_j$ and $p_{j \mid c}$ is the probability of word $j$ under class $c$.
The multinomial coefficient counts the number of orderings that produce the same bag of words. In practice, this coefficient cancels when comparing class posteriors (it does not depend on the class label), but understanding its origin clarifies why naive Bayes treats word order as irrelevant.
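A toy illustration of the cancellation; the word distributions `class_a` and `class_b` are made up for the example:

```python
from math import factorial, log

def multinomial_coeff(counts):
    c = factorial(sum(counts))
    for k in counts:
        c //= factorial(k)
    return c

def log_likelihood(counts, probs, include_coeff):
    ll = sum(k * log(p) for k, p in zip(counts, probs))
    if include_coeff:
        ll += log(multinomial_coeff(counts))
    return ll

counts = [3, 1, 2]            # word counts for a toy document
class_a = [0.5, 0.3, 0.2]     # hypothetical per-class word distributions
class_b = [0.2, 0.3, 0.5]

# The coefficient shifts both class scores by the same amount,
# so the difference (and hence the argmax) is unchanged.
diff_with = log_likelihood(counts, class_a, True) - log_likelihood(counts, class_b, True)
diff_without = log_likelihood(counts, class_a, False) - log_likelihood(counts, class_b, False)
assert abs(diff_with - diff_without) < 1e-12
```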
Counting binary classifiers on n points
A binary classifier on $n$ points assigns each point a label in $\{0, 1\}$. The total number of possible labelings is $2^n$. For 20 data points, $2^{20} = 1{,}048{,}576$. A hypothesis class that can produce all of these labelings has VC dimension at least 20 and will overfit with fewer than 20 training points.
A linear classifier in $\mathbb{R}^d$ can shatter at most $d + 1$ points. In $\mathbb{R}^{10}$, the VC dimension is 11, and the number of achievable labelings on 20 points is at most $\sum_{i=0}^{11} \binom{20}{i} = 784{,}626$, about 75% of all $2^{20}$ possible labelings. This restriction is what makes learning possible: the hypothesis class cannot memorize arbitrary patterns.
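Computing the Sauer-Shelah bound for 20 points and VC dimension 11:

```python
from math import comb

n, vc = 20, 11
all_labelings = 2 ** n
achievable = sum(comb(n, i) for i in range(vc + 1))  # Sauer-Shelah bound
fraction = achievable / all_labelings
assert achievable == 784626
assert 0.74 < fraction < 0.76
```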
Common Confusions
Binomial coefficients grow polynomially for fixed k
$\binom{n}{k} = O(n^k)$ when $k$ is fixed and $n$ grows. But when $k$ grows with $n$ (say $k = n/2$), $\binom{n}{k}$ is exponential: $\binom{n}{n/2} \sim 2^n \sqrt{2/(\pi n)}$. The Sauer-Shelah bound is polynomial in $n$ for fixed $d$, which is the whole point.
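A numerical check of the central binomial coefficient's exponential growth:

```python
from math import comb, pi, sqrt

n = 1000
central = comb(n, n // 2)
# Asymptotics: C(n, n/2) ~ 2^n * sqrt(2 / (pi * n)) -- exponential in n.
approx = 2 ** n * sqrt(2 / (pi * n))
assert abs(central - approx) / central < 0.01
# Far larger than any fixed-degree polynomial such as n^5.
assert central > n ** 5
```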
Exercises
Problem
Prove Pascal's rule, $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$, using a combinatorial argument.
Problem
Use Stirling's approximation to show that $\ln \binom{n}{\alpha n} \approx n H(\alpha)$ for fixed $\alpha \in (0, 1)$, where $H(\alpha) = -\alpha \ln \alpha - (1-\alpha) \ln(1-\alpha)$ is the binary entropy (using natural log).
References
Canonical:
- Flajolet & Sedgewick, Analytic Combinatorics (2009), Chapter I
- Graham, Knuth, Patashnik, Concrete Mathematics (1994), Chapters 5-6
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Section 6.5 (Sauer-Shelah and combinatorial arguments)
- Munkres, Topology (2000), Chapter 1 (set theory review)
Last reviewed: April 2026