
Foundations

Counting and Combinatorics

Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.


Why This Matters

Combinatorics appears throughout learning theory in unexpected places.

VC dimension theory counts the number of labelings a hypothesis class can produce on $n$ points. The Sauer-Shelah lemma bounds this by $\sum_{i=0}^{d} \binom{n}{i}$. Bounding this sum requires knowing how binomial coefficients behave.

Union bounds over hypothesis classes require counting. Covering number arguments require counting. The PAC learning sample complexity formula has $\log \binom{n}{k}$ terms that you simplify using Stirling's approximation.

Counting Principles

Product rule: if task A has $m$ outcomes and task B has $n$ outcomes, the pair (A, B) has $mn$ outcomes. This extends to $k$ tasks.

Sum rule: if tasks A and B are mutually exclusive, the total number of outcomes is $m + n$.

Permutations: the number of ordered arrangements of $k$ items from $n$ is $n!/(n-k)! = n(n-1)\cdots(n-k+1)$.
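These quantities are available directly in Python's standard library; a quick illustrative sketch:

```python
from math import comb, factorial, perm

# Product rule: 3 shirts x 4 pairs of pants -> 12 outfits
assert 3 * 4 == 12

# Ordered arrangements of k items from n: n!/(n-k)!
assert perm(10, 3) == factorial(10) // factorial(7)  # 720

# Unordered choices divide out the k! orderings of each selection
assert comb(10, 3) == perm(10, 3) // factorial(3)  # 120
```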

Core Definitions

Definition

Binomial Coefficient

The binomial coefficient counts the number of ways to choose $k$ items from $n$ without regard to order:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

for $0 \leq k \leq n$. By convention, $\binom{n}{k} = 0$ if $k < 0$ or $k > n$.

Key properties:

  • Symmetry: $\binom{n}{k} = \binom{n}{n-k}$
  • Pascal's rule: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$
  • Sum: $\sum_{k=0}^n \binom{n}{k} = 2^n$

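A quick numerical check of the three identities with `math.comb` (the values of $n$ and $k$ are arbitrary):

```python
from math import comb

n, k = 12, 5

# Symmetry: choosing k to keep = choosing n-k to discard
assert comb(n, k) == comb(n, n - k)

# Pascal's rule: condition on whether element n is included
assert comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k)

# Row sum: every subset of an n-element set has some size
assert sum(comb(n, i) for i in range(n + 1)) == 2 ** n
```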
Definition

Multinomial Coefficient

The number of ways to partition $n$ objects into groups of sizes $k_1, \ldots, k_r$ (where $k_1 + \cdots + k_r = n$) is:

$$\binom{n}{k_1, k_2, \ldots, k_r} = \frac{n!}{k_1! \, k_2! \cdots k_r!}$$
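A small helper makes the definition concrete (the function name `multinomial` is my own, not a library API):

```python
from math import factorial

def multinomial(*ks):
    """Number of ways to partition sum(ks) objects into groups of sizes ks."""
    n = sum(ks)
    result = factorial(n)
    for k in ks:
        result //= factorial(k)
    return result

# Distinct anagrams of "MISSISSIPPI": 11 letters with counts M=1, I=4, S=4, P=2
assert multinomial(1, 4, 4, 2) == 34650
```

With $r = 2$ groups this reduces to the ordinary binomial coefficient.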

Main Theorems

Theorem

Binomial Theorem

Statement

For any $a, b \in \mathbb{R}$ and non-negative integer $n$:

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$

Intuition

Expand $(a+b)^n = (a+b)(a+b)\cdots(a+b)$. Each term in the expansion picks either $a$ or $b$ from each of the $n$ factors. The term $a^k b^{n-k}$ appears once for each way to choose $k$ factors to contribute $a$, which is $\binom{n}{k}$.

Proof Sketch

Induction on $n$. Base case $n=0$: both sides equal 1. Inductive step: $(a+b)^{n+1} = (a+b)(a+b)^n = (a+b)\sum_k \binom{n}{k} a^k b^{n-k}$. Distribute and reindex using Pascal's rule.
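The theorem is easy to sanity-check numerically; a minimal sketch:

```python
from math import comb, isclose

def binomial_expansion(a, b, n):
    """Right-hand side of the binomial theorem."""
    return sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))

# (2 + 3)^4 = 5^4 = 625, reproduced term by term
assert binomial_expansion(2, 3, 4) == 5 ** 4

# Also holds for non-integer a, b (e.g., probabilities summing to 1)
assert isclose(binomial_expansion(0.3, 0.7, 10), 1.0)
```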

Why It Matters

Setting $a = b = 1$ gives $2^n = \sum_k \binom{n}{k}$: a set of size $n$ has $2^n$ subsets. Setting $a = 1, b = -1$ gives $0 = \sum_k (-1)^k \binom{n}{k}$: the number of even-size subsets equals the number of odd-size subsets. These identities appear in VC theory and randomized arguments.

Failure Mode

The formula is exact for integer $n \geq 0$. For non-integer or negative $n$, use the generalized binomial series, which is an infinite series and requires convergence conditions.

Inclusion-Exclusion

Definition

Inclusion-Exclusion Principle

For finite sets $A_1, \ldots, A_n$:

$$\left|\bigcup_{i=1}^n A_i\right| = \sum_{i} |A_i| - \sum_{i < j} |A_i \cap A_j| + \sum_{i < j < k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n|$$

This alternating sum corrects for overcounting at each level.

In probability, replace $|A|$ with $P(A)$:

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \cdots$$

The union bound $P(\bigcup A_i) \leq \sum P(A_i)$ is the first term of inclusion-exclusion, which overcounts. The Bonferroni inequalities give alternating upper and lower bounds by truncating at different levels.
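A brute-force check of the alternating sum on three small sets, sketched in Python:

```python
from itertools import combinations

def union_size_by_ie(sets):
    """Inclusion-exclusion: alternating sum over all nonempty subfamilies."""
    total = 0
    for r in range(1, len(sets) + 1):
        sign = (-1) ** (r + 1)
        for family in combinations(sets, r):
            total += sign * len(set.intersection(*family))
    return total

A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6, 7}
assert union_size_by_ie([A, B, C]) == len(A | B | C)  # both equal 7

# Truncating after the first level gives the union bound, an overcount here
assert len(A) + len(B) + len(C) >= len(A | B | C)
```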

Stirling's Approximation

Theorem

Stirling Approximation

Statement

$$n! = \sqrt{2\pi n} \left(\frac{n}{e}\right)^n \left(1 + O\left(\frac{1}{n}\right)\right)$$

A cruder but often sufficient form is:

$$\log(n!) = n \log n - n + O(\log n)$$

Intuition

The factorial grows faster than any exponential but slower than $n^n$. Stirling's formula pins down the growth rate: $n!$ is approximately $(n/e)^n$ times a polynomial correction.

Proof Sketch

Write $\log(n!) = \sum_{k=1}^n \log k$ and approximate this sum by the integral $\int_1^n \log x \, dx = n \log n - n + 1$. The $\sqrt{2\pi n}$ factor comes from a more refined analysis using the Euler-Maclaurin formula or the Laplace method applied to the Gamma function integral.

Why It Matters

Stirling lets you simplify binomial coefficients. For example, $\log \binom{n}{k} \approx n H(k/n)$, where $H(p) = -p \log p - (1-p) \log(1-p)$ is the binary entropy. This connection between counting and entropy appears throughout information theory and learning theory.

Failure Mode

The approximation is asymptotic; for small $n$ the relative error can be noticeable. For $n = 1$, Stirling gives $\sqrt{2\pi} \cdot (1/e) \approx 0.922$ versus the true value of 1. For $n \geq 10$, the relative error is below 1%.
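These error claims are easy to verify directly; a short sketch:

```python
from math import factorial, pi, e, sqrt

def stirling(n):
    """Leading-order Stirling approximation to n!."""
    return sqrt(2 * pi * n) * (n / e) ** n

# n = 1: approximation is about 0.922, roughly 8% relative error
assert abs(stirling(1) - 1) > 0.07

# n = 10: relative error is already below 1%
rel_err = abs(stirling(10) - factorial(10)) / factorial(10)
assert rel_err < 0.01
```

The relative error shrinks like $1/(12n)$, consistent with the $1 + O(1/n)$ correction factor.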

Applications in Machine Learning Theory

Hypothesis Counting and VC Theory

The Sauer-Shelah lemma bounds the growth function of a hypothesis class $\mathcal{H}$ with VC dimension $d$:

$$\Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d} \binom{n}{i}$$

For fixed $d$ and growing $n$, this sum is $O(n^d)$, a polynomial bound. Without this combinatorial bound, the growth function could be as large as $2^n$ (all possible labelings), making uniform convergence arguments impossible. The transition from exponential to polynomial growth is the reason finite VC dimension implies learnability.

To see why: $\sum_{i=0}^{d} \binom{n}{i} \leq (n+1)^d$, since $\binom{n}{i} \leq n^i$ and $(n+1)^d = \sum_{i=0}^{d} \binom{d}{i} n^i \geq \sum_{i=0}^{d} n^i$. The PAC learning sample complexity bound then becomes $n = O((d \log(d/\epsilon) + \log(1/\delta)) / \epsilon)$, which is polynomial in $d$, $1/\epsilon$, and $\log(1/\delta)$.
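The polynomial-versus-exponential gap is concrete even at small sizes; a minimal sketch:

```python
from math import comb

def sauer_bound(n, d):
    """Sauer-Shelah upper bound on the growth function at n points."""
    return sum(comb(n, i) for i in range(min(n, d) + 1))

# VC dimension 11 (e.g., linear classifiers with bias in R^10) on 20 points:
assert sauer_bound(20, 11) == 784626          # vs 2**20 = 1048576 labelings
# The crude (n+1)^d bound also holds
assert sauer_bound(20, 11) <= 21 ** 11

# With d fixed, the bound becomes a vanishing fraction of 2^n as n grows
assert sauer_bound(100, 11) / 2 ** 100 < 1e-10
```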

Model Selection and Counting Arguments

When comparing $M$ models via a held-out validation set, the probability that the best model on the validation set is also the best model on unseen data depends on $M$. Combining Hoeffding's inequality with a union bound over the $M$ candidates gives:

$$P(\text{any model overfits by more than } \epsilon) \leq M \cdot e^{-2n\epsilon^2}$$

This means the validation set size must grow as $\log M$ to maintain the same guarantee. Going from $M = 10$ to $M = 100$ hyperparameter configurations requires only $\log(100)/\log(10) = 2$ times the validation data. The $\log M$ dependence comes from the union bound, which is a counting argument.
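Inverting the bound gives the required validation size; a sketch with illustrative numbers (the helper name is my own):

```python
from math import ceil, log

def validation_size(M, eps, delta):
    """Smallest n with M * exp(-2 n eps^2) <= delta (Hoeffding + union bound)."""
    return ceil(log(M / delta) / (2 * eps ** 2))

# Requirement grows only logarithmically in the number of models M
n_one = validation_size(1, 0.05, 0.05)       # 600 samples
n_hundred = validation_size(100, 0.05, 0.05) # 1521 samples
assert n_hundred < 3 * n_one  # 100x the models, far less than 3x the data
```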

Multinomial Coefficients in Text Modeling

In the multinomial naive Bayes model, the probability of a document of length $n$ with word counts $(k_1, \ldots, k_V)$ is:

$$P(\text{doc}) = \frac{n!}{k_1! \cdots k_V!} \prod_{j=1}^{V} p_j^{k_j}$$

The multinomial coefficient $n!/(k_1! \cdots k_V!)$ counts the number of orderings that produce the same bag of words. In practice, this coefficient cancels when comparing class posteriors (it does not depend on the class label), but understanding its origin clarifies why naive Bayes treats word order as irrelevant.
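The cancellation is easy to see in code. This toy sketch (made-up counts and class probabilities) shows the class ranking is identical with or without the coefficient, since it shifts every class's log-score by the same constant:

```python
from math import factorial, log

def multinomial_coef(ks):
    n = sum(ks)
    c = factorial(n)
    for k in ks:
        c //= factorial(k)
    return c

counts = [3, 1, 2]               # word counts for one document (toy)
class_probs = {                  # per-class word probabilities (toy)
    "spam": [0.6, 0.3, 0.1],
    "ham":  [0.2, 0.3, 0.5],
}

def log_score(ps, with_coef):
    s = sum(k * log(p) for k, p in zip(counts, ps))
    return s + (log(multinomial_coef(counts)) if with_coef else 0.0)

best_with = max(class_probs, key=lambda c: log_score(class_probs[c], True))
best_without = max(class_probs, key=lambda c: log_score(class_probs[c], False))
assert best_with == best_without
```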

Example

Counting binary classifiers on n points

A binary classifier on $n$ points assigns each point a label in $\{0, 1\}$. The total number of possible labelings is $2^n$. For 20 data points, $2^{20} = 1{,}048{,}576$. A hypothesis class that can produce all of these labelings has VC dimension at least 20, and generalization guarantees require on the order of $d/\epsilon$ training points.

A linear classifier in $\mathbb{R}^d$ (with bias) can shatter at most $d+1$ points. In $\mathbb{R}^{10}$, the VC dimension is 11, and by Sauer-Shelah the number of achievable labelings on 20 points is at most $\sum_{i=0}^{11} \binom{20}{i} = 784{,}626$, about 75% of all $2^{20}$ possible labelings. The gap widens rapidly: for fixed $d$, the bound is $O(n^{11})$, a vanishing fraction of $2^n$ as $n$ grows. This restriction is what makes learning possible: the hypothesis class cannot memorize arbitrary patterns.

Common Confusions

Watch Out

Binomial coefficients grow polynomially for fixed k

$\binom{n}{k} = \Theta(n^k / k!)$ when $k$ is fixed and $n$ grows. But when $k$ grows with $n$ (say $k = n/2$), $\binom{n}{k}$ is exponential: $\binom{n}{n/2} \approx 2^n / \sqrt{\pi n / 2}$. The Sauer-Shelah bound $\sum_{i=0}^d \binom{n}{i}$ is polynomial in $n$ for fixed $d$, which is the whole point.
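The central-binomial approximation can be checked numerically; a one-off sketch:

```python
from math import comb, pi, sqrt

n = 100
exact = comb(n, n // 2)
approx = 2 ** n / sqrt(pi * n / 2)

# At n = 100 the approximation is within about 1% (error shrinks like 1/(4n))
assert abs(exact - approx) / exact < 0.01

# Meanwhile comb(n, k) for fixed small k stays polynomial in n
assert comb(n, 3) == n * (n - 1) * (n - 2) // 6
```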

Exercises

ExerciseCore

Problem

Prove that $\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}$ using a combinatorial argument.

ExerciseAdvanced

Problem

Use Stirling's approximation to show that $\log \binom{n}{\lfloor pn \rfloor} \approx nH(p)$ for $p \in (0,1)$, where $H(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy (using natural log).

References

Canonical:

  • Flajolet & Sedgewick, Analytic Combinatorics (2009), Chapter I
  • Graham, Knuth, Patashnik, Concrete Mathematics (1994), Chapters 5-6

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Section 6.5 (Sauer-Shelah and combinatorial arguments)

  • Munkres, Topology (2000), Chapter 1 (set theory review)

Last reviewed: April 2026
