Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.Disclaimer

Foundations

Birthday Paradox

In a group of 23 people, the probability that two share a birthday exceeds 50%. Pairwise collision counting explains why this threshold is so low.

CoreTier 2Stable~25 min

Why This Matters

The birthday paradox is the canonical example of pairwise collision probability being much larger than people expect. The same mathematics governs hash collisions, random sampling overlaps, and the security of cryptographic hash functions. If a hash function has NN possible outputs, you only need about N\sqrt{N} random inputs before a collision becomes likely. This is why 128-bit hashes are not considered collision-resistant: 2642^{64} attempts suffice.

Setup

Assume 365 equally likely birthdays (ignore leap years). Place kk people in a room. What is the probability that at least two share a birthday?

People guess this probability is small for k=23k = 23 because they anchor on the wrong question. They ask "what is the probability someone shares my birthday?" (which is about 22/3656%22/365 \approx 6\%) instead of "what is the probability any pair collides?" There are (232)=253\binom{23}{2} = 253 pairs, and the collision probability accumulates over all of them.

Definition

Birthday Collision Probability

The probability that among kk items drawn uniformly at random from NN categories, at least two items fall in the same category. Computed via the complement: P(k,N)=1i=0k1(1i/N)P(k, N) = 1 - \prod_{i=0}^{k-1}(1 - i/N).

Main Theorems

Theorem

Birthday Collision Threshold

Statement

The probability that at least two of kk items share a category is:

P(k,N)=1i=0k1(1iN)P(k, N) = 1 - \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right)

For N=365N = 365, P(23,365)>0.5P(23, 365) > 0.5. More precisely, P(23,365)0.5073P(23, 365) \approx 0.5073.

Intuition

Each new person must avoid all previously seen birthdays. The first person has no constraint. The second must avoid 1/3651/365, the third must avoid 2/3652/365, and so on. These "small" probabilities compound multiplicatively, and the product (1i/365)\prod(1 - i/365) drops below 0.50.5 faster than intuition suggests because there are (k2)\binom{k}{2} pairs, which grows quadratically.

Proof Sketch

Compute the complement: the probability that all kk birthdays are distinct. Person 1 can have any birthday. Person 2 must avoid 1 birthday: probability (N1)/N(N-1)/N. Person ii must avoid i1i-1 birthdays: probability (Ni+1)/N(N - i + 1)/N. So:

P(all distinct)=NNN1NN2NNk+1N=i=0k1(1iN)P(\text{all distinct}) = \frac{N}{N} \cdot \frac{N-1}{N} \cdot \frac{N-2}{N} \cdots \frac{N-k+1}{N} = \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right)

Using 1xex1 - x \leq e^{-x}:

P(all distinct)exp(i=0k1iN)=exp(k(k1)2N)P(\text{all distinct}) \leq \exp\left(-\sum_{i=0}^{k-1} \frac{i}{N}\right) = \exp\left(-\frac{k(k-1)}{2N}\right)

Setting this to 1/21/2 gives k2Nln21.177Nk \approx \sqrt{2N \ln 2} \approx 1.177\sqrt{N}. For N=365N = 365, this gives k22.5k \approx 22.5, confirming the threshold is 23.

Why It Matters

The N\sqrt{N} scaling is the key insight. It means collision resistance requires squared output space. A hash with 21282^{128} outputs is broken with 2642^{64} attempts, not 21282^{128}. This is the birthday attack in cryptography.

Failure Mode

The result assumes uniform distribution over categories. If birthdays (or hash outputs) are non-uniform, collisions happen sooner than the uniform case predicts. Non-uniformity only increases collision probability. The uniform case is the best case for collision avoidance.

Common Confusions

Watch Out

Matching MY birthday vs matching ANY birthday

The probability that someone in a room of 22 others shares your specific birthday is 1(364/365)225.9%1 - (364/365)^{22} \approx 5.9\%. But the probability that some pair among 23 people shares a birthday is 50.7%50.7\%. The difference is the number of pairs: 2222 vs (232)=253\binom{23}{2} = 253.

Watch Out

Linear vs quadratic growth of pairs

People think of 23 people as a "small" group. But 23 people produce 253 pairs. The paradox is not about the number of people; it is about the number of pairwise comparisons, which grows as O(k2)O(k^2).

Applications Beyond Birthdays

Hash Collisions in Data Structures

Example

Hash collision probability

A hash function produces 32-bit outputs (N=2324.3×109N = 2^{32} \approx 4.3 \times 10^9). After hashing k=77,163k = 77,163 distinct inputs, the collision probability exceeds 50%. This is 2232ln277,163\sqrt{2 \cdot 2^{32} \cdot \ln 2} \approx 77{,}163. For a 64-bit hash, the threshold is about 5.1×1095.1 \times 10^9 inputs.

Hash tables use the birthday paradox to set expectations for collision rates. A hash table with NN buckets and kk entries expects approximately k2/(2N)k^2/(2N) collisions (for small collision probability). Choosing Nk2N \gg k^2 ensures few collisions, but requires more memory. The standard rule of thumb (load factor <0.75< 0.75) is a practical compromise informed by this analysis.

Random Sampling and Coupon Collection

When sampling with replacement from a population of NN items, the birthday paradox tells you that after O(N)O(\sqrt{N}) draws, you will likely see a duplicate. This arises in bootstrapping: a bootstrap sample of size nn drawn with replacement from nn items has approximately 11/e63.2%1 - 1/e \approx 63.2\% unique items, because each item has probability (11/n)ne1(1 - 1/n)^n \approx e^{-1} of never being selected.

Cryptographic Birthday Attacks

A birthday attack finds two inputs with the same hash output. For a hash with bb-bit outputs, the attack requires O(2b/2)O(2^{b/2}) evaluations, not O(2b)O(2^b). This is why secure hash functions use at least 256-bit outputs: 21282^{128} operations is considered computationally infeasible, while 2642^{64} is borderline feasible.

Example

UUID collision probability

Version 4 UUIDs have 122 random bits (N=2122N = 2^{122}). Using the birthday approximation, the number of UUIDs needed for a 50% collision probability is 22122ln2261.52.7×1018\sqrt{2 \cdot 2^{122} \cdot \ln 2} \approx 2^{61.5} \approx 2.7 \times 10^{18}. If you generate 1 billion UUIDs per second, you would need about 86 years to reach this threshold. For practical purposes, UUID collisions are not a concern.

Duplicate Detection in ML Datasets

Web-scraped training datasets frequently contain near-duplicates. The birthday paradox quantifies how likely this is. A corpus of kk documents, each represented by a random NN-bit hash (via MinHash or SimHash), will contain spurious hash collisions at rate k2/(2N)k^2/(2N). Deduplication pipelines use multi-probe hashing to detect these, with the birthday paradox setting the false positive rate.

ExerciseCore

Problem

A database assigns random 64-bit IDs to each record. After inserting 1 million records, what is the approximate probability of an ID collision?

Exercises

ExerciseCore

Problem

How many people do you need in a room for the probability of a shared birthday to exceed 99%?

ExerciseAdvanced

Problem

A system generates random 128-bit session tokens. After how many tokens does the probability of a collision exceed 10610^{-6}?

References

Canonical:

  • Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, Chapter 2
  • Mitzenmacher & Upfal, Probability and Computing (2005), Chapter 1

Current:

  • Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (2022), Section on hash functions

  • Munkres, Topology (2000), Chapter 1 (set theory review)

Next Topics

  • Monty Hall problem: another probability puzzle where intuition fails
  • Base-rate fallacy: conditional probability errors in diagnostic testing

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.

Next Topics