Foundations
Birthday Paradox
In a group of 23 people, the probability that two share a birthday exceeds 50%. Pairwise collision counting explains why this threshold is so low.
Prerequisites
Why This Matters
The birthday paradox is the canonical example of pairwise collision probability being much larger than people expect. The same mathematics governs hash collisions, random sampling overlaps, and the security of cryptographic hash functions. If a hash function has possible outputs, you only need about random inputs before a collision becomes likely. This is why 128-bit hashes are not considered collision-resistant: attempts suffice.
Setup
Assume 365 equally likely birthdays (ignore leap years). Place people in a room. What is the probability that at least two share a birthday?
People guess this probability is small for because they anchor on the wrong question. They ask "what is the probability someone shares my birthday?" (which is about ) instead of "what is the probability any pair collides?" There are pairs, and the collision probability accumulates over all of them.
Birthday Collision Probability
The probability that among items drawn uniformly at random from categories, at least two items fall in the same category. Computed via the complement: .
Main Theorems
Birthday Collision Threshold
Statement
The probability that at least two of items share a category is:
For , . More precisely, .
Intuition
Each new person must avoid all previously seen birthdays. The first person has no constraint. The second must avoid , the third must avoid , and so on. These "small" probabilities compound multiplicatively, and the product drops below faster than intuition suggests because there are pairs, which grows quadratically.
Proof Sketch
Compute the complement: the probability that all birthdays are distinct. Person 1 can have any birthday. Person 2 must avoid 1 birthday: probability . Person must avoid birthdays: probability . So:
Using :
Setting this to gives . For , this gives , confirming the threshold is 23.
Why It Matters
The scaling is the key insight. It means collision resistance requires squared output space. A hash with outputs is broken with attempts, not . This is the birthday attack in cryptography.
Failure Mode
The result assumes uniform distribution over categories. If birthdays (or hash outputs) are non-uniform, collisions happen sooner than the uniform case predicts. Non-uniformity only increases collision probability. The uniform case is the best case for collision avoidance.
Common Confusions
Matching MY birthday vs matching ANY birthday
The probability that someone in a room of 22 others shares your specific birthday is . But the probability that some pair among 23 people shares a birthday is . The difference is the number of pairs: vs .
Linear vs quadratic growth of pairs
People think of 23 people as a "small" group. But 23 people produce 253 pairs. The paradox is not about the number of people; it is about the number of pairwise comparisons, which grows as .
Applications Beyond Birthdays
Hash Collisions in Data Structures
Hash collision probability
A hash function produces 32-bit outputs (). After hashing distinct inputs, the collision probability exceeds 50%. This is . For a 64-bit hash, the threshold is about inputs.
Hash tables use the birthday paradox to set expectations for collision rates. A hash table with buckets and entries expects approximately collisions (for small collision probability). Choosing ensures few collisions, but requires more memory. The standard rule of thumb (load factor ) is a practical compromise informed by this analysis.
Random Sampling and Coupon Collection
When sampling with replacement from a population of items, the birthday paradox tells you that after draws, you will likely see a duplicate. This arises in bootstrapping: a bootstrap sample of size drawn with replacement from items has approximately unique items, because each item has probability of never being selected.
Cryptographic Birthday Attacks
A birthday attack finds two inputs with the same hash output. For a hash with -bit outputs, the attack requires evaluations, not . This is why secure hash functions use at least 256-bit outputs: operations is considered computationally infeasible, while is borderline feasible.
UUID collision probability
Version 4 UUIDs have 122 random bits (). Using the birthday approximation, the number of UUIDs needed for a 50% collision probability is . If you generate 1 billion UUIDs per second, you would need about 86 years to reach this threshold. For practical purposes, UUID collisions are not a concern.
Duplicate Detection in ML Datasets
Web-scraped training datasets frequently contain near-duplicates. The birthday paradox quantifies how likely this is. A corpus of documents, each represented by a random -bit hash (via MinHash or SimHash), will contain spurious hash collisions at rate . Deduplication pipelines use multi-probe hashing to detect these, with the birthday paradox setting the false positive rate.
Problem
A database assigns random 64-bit IDs to each record. After inserting 1 million records, what is the approximate probability of an ID collision?
Exercises
Problem
How many people do you need in a room for the probability of a shared birthday to exceed 99%?
Problem
A system generates random 128-bit session tokens. After how many tokens does the probability of a collision exceed ?
References
Canonical:
- Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, Chapter 2
- Mitzenmacher & Upfal, Probability and Computing (2005), Chapter 1
Current:
-
Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (2022), Section on hash functions
-
Munkres, Topology (2000), Chapter 1 (set theory review)
Next Topics
- Monty Hall problem: another probability puzzle where intuition fails
- Base-rate fallacy: conditional probability errors in diagnostic testing
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability DistributionsLayer 0A
- Sets, Functions, and RelationsLayer 0A
- Basic Logic and Proof TechniquesLayer 0A