Birthday Paradox

The birthday paradox is the canonical example of pairwise collision probability being much larger than people expect. The same mathematics governs hash collisions, random sampling overlaps, and the security of cryptographic hash functions. If a hash function has $N$ possible outputs, you only need about $\sqrt{N}$ random inputs before a collision becomes likely. This is why 128-bit hashes are not considered collision-resistant: $2^{64}$ attempts suffice.

Setup

Assume 365 equally likely birthdays (ignore leap years). Place $k$ people in a room. What is the probability that at least two share a birthday?

People guess this probability is small for $k = 23$ because they anchor on the wrong question. They ask "what is the probability someone shares my birthday?" (which is about $22/365 \approx 6\%$ ) instead of "what is the probability any pair collides?" There are $\binom{23}{2} = 253$ pairs, and the collision probability accumulates over all of them.

Definition

Birthday Collision Probability $P(k, N)$

The probability that among $k$ items drawn uniformly at random from $N$ categories, at least two items fall in the same category. Computed via the complement: $P(k, N) = 1 - \prod_{i=0}^{k-1}(1 - i/N)$ .

Main Theorems

Theorem

Birthday Collision Threshold

Statement

The probability that at least two of $k$ items share a category is:

$P(k, N) = 1 - \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right)$

For $N = 365$ , $P(23, 365) > 0.5$ . More precisely, $P(23, 365) \approx 0.5073$ .
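The product in the statement can be evaluated directly. The following Python sketch is one way to do it; the function name `collision_probability` is an illustrative choice, not something defined in the text.

```python
def collision_probability(k: int, n: int) -> float:
    """Exact P(k, N): chance that k uniform draws from n categories collide."""
    if k > n:
        return 1.0  # pigeonhole: more items than categories forces a collision
    p_distinct = 1.0
    for i in range(k):
        # The (i+1)-th item must avoid the i categories already taken.
        p_distinct *= 1.0 - i / n
    return 1.0 - p_distinct


print(round(collision_probability(23, 365), 4))  # 0.5073
```

Evaluating the same function at $k = 22$ gives a value below $0.5$, which is why 23 is the smallest group size that crosses the threshold.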

Intuition

Each new person must avoid all previously seen birthdays. The first person has no constraint. The second must avoid 1 of the 365 days, the third must avoid 2, and so on. Each individual factor $1 - i/365$ is close to 1, but the factors compound multiplicatively. The product $\prod(1 - i/365)$ drops below $0.5$ faster than intuition suggests because its decay is governed by the number of pairs $\binom{k}{2}$, which grows quadratically in $k$.

Proof Sketch

Compute the complement: the probability that all $k$ birthdays are distinct. Person 1 can have any birthday. Person 2 must avoid 1 birthday: probability $(N-1)/N$ . Person $i$ must avoid $i-1$ birthdays: probability $(N - i + 1)/N$ . So:

$P(\text{all distinct}) = \frac{N}{N} \cdot \frac{N-1}{N} \cdot \frac{N-2}{N} \cdots \frac{N-k+1}{N} = \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right)$

Using $1 - x \leq e^{-x}$ :

$P(\text{all distinct}) \leq \exp\left(-\sum_{i=0}^{k-1} \frac{i}{N}\right) = \exp\left(-\frac{k(k-1)}{2N}\right)$

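Setting this bound equal to $1/2$ and approximating $k(k-1) \approx k^2$ gives $k \approx \sqrt{2N \ln 2} \approx 1.177\sqrt{N}$. For $N = 365$, this gives $k \approx 22.5$. Because the bound is an upper bound on $P(\text{all distinct})$, it shows that $k = 23$ is enough for the collision probability to exceed $1/2$. Evaluating the exact product gives $P(22, 365) \approx 0.476$, so 23 is indeed the threshold.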
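As a rough numerical check of the proof sketch, the exponential bound and the resulting threshold can be computed in a few lines of Python. The function names below are hypothetical helpers introduced only for this illustration.

```python
import math


def approx_collision_probability(k: int, n: int) -> float:
    """Lower bound on P(k, N) obtained from 1 - x <= exp(-x)."""
    return 1.0 - math.exp(-k * (k - 1) / (2 * n))


def approx_threshold(n: int) -> float:
    """Value of k where the bound crosses 1/2, using k(k-1) ~ k^2."""
    return math.sqrt(2 * n * math.log(2))


print(approx_threshold(365))                  # a little under 22.5
print(approx_collision_probability(23, 365))  # just above 0.5
```

The bound slightly underestimates the exact collision probability, so any $k$ that pushes the bound past $1/2$ also pushes the exact value past $1/2$.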

Why It Matters

The $\sqrt{N}$ scaling is the key insight. To make finding a collision cost about $2^{b}$ work, a hash needs roughly $2b$ output bits, that is, an output space of size $2^{2b}$. A hash with $2^{128}$ outputs yields a collision after about $2^{64}$ attempts, not $2^{128}$. This is the birthday attack in cryptography.

Failure Mode

The result assumes uniform distribution over categories. If birthdays (or hash outputs) are non-uniform, collisions happen sooner than the uniform case predicts. Non-uniformity only increases collision probability. The uniform case is the best case for collision avoidance.
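To see the effect of non-uniformity numerically, one option is to compute the collision probability exactly for an arbitrary category distribution. The sketch below uses the identity $P(\text{all distinct}) = k! \cdot e_k(p_1, \dots, p_N)$, where $e_k$ is the $k$-th elementary symmetric polynomial of the category probabilities. The skewed distribution (182 days twice as likely as the other 183) is an invented example, not real birthday data.

```python
import math


def collision_probability_general(probs, k):
    """Exact collision probability for k draws from a given distribution."""
    # e[j] holds the elementary symmetric polynomial e_j of the
    # probabilities processed so far (e_0 = 1).
    e = [1.0] + [0.0] * k
    for p in probs:
        # Update from high j to low j so each p is used at most once per term.
        for j in range(k, 0, -1):
            e[j] += e[j - 1] * p
    return 1.0 - math.factorial(k) * e[k]


uniform = [1 / 365] * 365

# Hypothetical skew: 182 days are twice as likely as the remaining 183.
weights = [2.0] * 182 + [1.0] * 183
total = sum(weights)
skewed = [w / total for w in weights]

print(collision_probability_general(uniform, 23))
print(collision_probability_general(skewed, 23))
```

With the uniform distribution this reproduces the value from the theorem, and the skewed distribution gives a strictly larger collision probability for the same $k$.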


Common Confusions

Watch Out

Matching MY birthday vs matching ANY birthday

The probability that someone in a room of 22 others shares your specific birthday is $1 - (364/365)^{22} \approx 5.9\%$ . But the probability that some pair among 23 people shares a birthday is $50.7\%$ . The difference is the number of pairs: $22$ vs $\binom{23}{2} = 253$ .
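The two questions can be compared side by side with a short calculation. This is a minimal Python sketch under the same assumptions as the text (365 equally likely birthdays); the variable names are illustrative.

```python
k = 23   # people in the room, including you
n = 365  # equally likely birthdays

# Probability that at least one of the other k - 1 people matches YOUR birthday.
p_match_me = 1 - ((n - 1) / n) ** (k - 1)

# Probability that ANY pair among the k people shares a birthday.
p_all_distinct = 1.0
for i in range(k):
    p_all_distinct *= 1 - i / n
p_any_pair = 1 - p_all_distinct

pairs = k * (k - 1) // 2

print(f"match me: {p_match_me:.3f}")
print(f"any pair: {p_any_pair:.3f}")
print(f"pairs:    {pairs}")
```

The first quantity involves 22 comparisons and the second involves 253, which accounts for the gap between the two probabilities.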

Watch Out

Linear vs quadratic growth of pairs

People think of 23 people as a "small" group. But 23 people produce 253 pairs. The paradox is not about the number of people; it is about the number of pairwise comparisons, which grows as $O(k^2)$ .

Applications Beyond Birthdays

Hash Collisions in Data Structures

Example

Hash collision probability

A hash function produces 32-bit outputs ($N = 2^{32} \approx 4.3 \times 10^9$). After hashing $k \approx 77{,}163$ distinct inputs, the collision probability reaches about 50%. This is $\sqrt{2 \cdot 2^{32} \cdot \ln 2} \approx 77{,}163$. For a 64-bit hash, the threshold is about $5.1 \times 10^9$ inputs.

The same analysis sets expectations for collision rates in hash tables. A hash table with $N$ buckets and $k$ entries expects approximately $k^2/(2N)$ colliding pairs. Choosing $N \gg k^2$ makes collisions rare, but requires far more buckets than entries. Practical tables therefore accept collisions and resolve them with chaining or probing; the standard rule of thumb (load factor $< 0.75$) is a compromise between memory use and lookup cost.
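The thresholds and the expected number of colliding pairs quoted above can be reproduced with a short Python sketch. The helper names are hypothetical, and both functions assume hash outputs that are uniform over $N$ values.

```python
import math


def half_collision_threshold(n: int) -> float:
    """Approximate number of inputs for a 50% collision chance: sqrt(2 N ln 2)."""
    return math.sqrt(2 * n * math.log(2))


def expected_colliding_pairs(k: int, n: int) -> float:
    """Expected number of pairs that land in the same bucket: C(k, 2) / N."""
    return k * (k - 1) / (2 * n)


print(round(half_collision_threshold(2**32)))   # 77163
print(f"{half_collision_threshold(2**64):.2e}")
print(expected_colliding_pairs(1000, 10**6))
```

For example, 1000 entries spread over $10^6$ buckets already give about half an expected colliding pair, which illustrates why avoiding collisions entirely needs $N$ on the order of $k^2$.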

Random Sampling and Coupon Collection

When sampling with replacement from a population of $N$ items, the birthday paradox tells you that after on the order of $\sqrt{N}$ draws, you will likely see a duplicate. Duplicates from sampling with replacement also show up in bootstrapping: a bootstrap sample of size $n$ drawn with replacement from $n$ items has approximately $1 - 1/e \approx 63.2\%$ unique items, because each item has probability $(1 - 1/n)^n \approx e^{-1}$ of never being selected.
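The bootstrap figure can be checked both from the formula and by simulation. The sketch below is illustrative only: the function names and the fixed seed are assumptions made for reproducibility, not part of any particular bootstrapping library.

```python
import random


def expected_unique_fraction(n: int) -> float:
    """Expected fraction of distinct items in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, seed: int = 0) -> float:
    """Draw one bootstrap sample of indices and measure its distinct fraction."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


n = 10_000
print(expected_unique_fraction(n))
print(simulated_unique_fraction(n))
```

Both numbers should land close to $1 - 1/e$, with the simulated value fluctuating slightly from seed to seed.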

Cryptographic Birthday Attacks

A birthday attack finds two inputs with the same hash output. For a hash with $b$ -bit outputs, the attack requires $O(2^{b/2})$ evaluations, not $O(2^b)$ . This is why secure hash functions use at least 256-bit outputs: $2^{128}$ operations is considered computationally infeasible, while $2^{64}$ is borderline feasible.
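A toy version of the attack can be run against a deliberately weakened hash. The sketch below truncates SHA-256 to a small number of bits purely for demonstration; the truncation, the counter-based messages, and the function names are all assumptions of this example, not a description of any real attack tool.

```python
import hashlib


def truncated_hash(data: bytes, bits: int) -> int:
    """Top `bits` bits of SHA-256, standing in for a weak b-bit hash."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int):
    """Hash distinct messages until two share a truncated hash value."""
    seen = {}  # hash value -> message that produced it
    counter = 0
    while True:
        message = str(counter).encode()
        value = truncated_hash(message, bits)
        if value in seen:
            # Two different messages with the same truncated hash.
            return seen[value], message, counter + 1
        seen[value] = message
        counter += 1


first, second, evaluations = find_collision(24)
print(first, second, evaluations)
```

With 24-bit outputs the search typically ends after a few thousand evaluations, on the order of $2^{12}$, rather than the roughly $2^{24}$ needed to match one fixed target value. The dictionary of seen values is the memory cost of this simple approach.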

Example

UUID collision probability

Version 4 UUIDs have 122 random bits ($N = 2^{122}$). Using the birthday approximation, the number of UUIDs needed for a 50% collision probability is $\sqrt{2 \cdot 2^{122} \cdot \ln 2} \approx 2^{61.2} \approx 2.7 \times 10^{18}$. If you generate 1 billion UUIDs per second, you would need about 86 years to reach this threshold. For practical purposes, UUID collisions are not a concern.
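The arithmetic in this example can be reproduced as follows. This is a small Python sketch; the generation rate of one billion UUIDs per second is the figure from the example, and the year length of 365.25 days is an assumption of the calculation.

```python
import math

random_bits = 122
n = 2**random_bits

# Birthday approximation for a 50% collision probability.
k_half = math.sqrt(2 * n * math.log(2))

rate_per_second = 1e9                 # UUIDs generated per second
seconds_per_year = 365.25 * 24 * 3600
years = k_half / rate_per_second / seconds_per_year

print(f"k for 50%: {k_half:.2e} (about 2^{math.log2(k_half):.1f})")
print(f"years at 1e9 per second: {years:.0f}")
```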

Duplicate Detection in ML Datasets

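Web-scraped training datasets frequently contain near-duplicates, and deduplication pipelines typically find them by comparing document hashes or sketches (for example MinHash or SimHash). The birthday paradox quantifies how often two unrelated documents receive the same hash value by chance. If each of $k$ documents is represented by an effectively random hash with $N$ possible values (so $N = 2^b$ for a $b$-bit hash), the expected number of spurious colliding pairs is about $k^2/(2N)$. Deduplication pipelines use multi-probe hashing to detect near-duplicates, and the birthday bound sets the rate of false positives from such chance collisions.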

Exercise (Core)

Problem

A database assigns random 64-bit IDs to each record. After inserting 1 million records, what is the approximate probability of an ID collision?

Exercises

Exercise (Core)

Problem

How many people do you need in a room for the probability of a shared birthday to exceed 99%?

Exercise (Advanced)

Problem

A system generates random 128-bit session tokens. After how many tokens does the probability of a collision exceed $10^{-6}$ ?


Next Topics

Monty Hall problem: another probability puzzle where intuition fails
Base-rate fallacy: conditional probability errors in diagnostic testing

Last reviewed: April 13, 2026
