Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.Disclaimer

Applied Math

Cryptographic Hash Functions

One-way compression of arbitrary data to fixed-length digests. Collision resistance, preimage resistance, the birthday attack bound, Merkle-Damgard construction, and applications from digital signatures to ML model fingerprinting.

CoreTier 2Stable~40 min

Why This Matters

Every time you verify a software download, check a Git commit, or validate an ML model checkpoint, you rely on cryptographic hash functions. A hash function compresses arbitrary-length input to a fixed-length output (the digest) in a way that is easy to compute forward but infeasible to invert. The security of digital signatures, commitment schemes, proof-of-work systems, and data integrity checks all reduce to the hardness of finding collisions or preimages in hash functions. The birthday paradox determines the actual security margin: an nn-bit hash provides only n/2n/2 bits of collision resistance, not nn.

Core Definitions

Definition

Hash Function

A cryptographic hash function maps arbitrary-length binary strings to fixed-length nn-bit digests. It must be efficiently computable: given input mm, computing H(m)H(m) takes time polynomial in m|m|.

Definition

Preimage Resistance (One-Wayness)

Given a hash output yy, it is computationally infeasible to find any mm such that H(m)=yH(m) = y. Formally, for any probabilistic polynomial-time adversary A\mathcal{A}:

Pr[H(A(y))=y]negl(n)\Pr[H(\mathcal{A}(y)) = y] \leq \text{negl}(n)

where y=H(m)y = H(m) for a randomly chosen mm.

Definition

Second Preimage Resistance

Given an input m1m_1, it is computationally infeasible to find a distinct m2m1m_2 \neq m_1 such that H(m1)=H(m2)H(m_1) = H(m_2). This is stronger than preimage resistance: the adversary has a specific target input, not just a target output.

Definition

Collision Resistance

It is computationally infeasible to find any pair (m1,m2)(m_1, m_2) with m1m2m_1 \neq m_2 such that H(m1)=H(m2)H(m_1) = H(m_2). The adversary gets to choose both inputs freely. Collision resistance implies second preimage resistance (for most practical parameter ranges), but not vice versa.

The Birthday Bound

The fundamental limit on collision resistance comes from the birthday paradox. If a hash function has N=2nN = 2^n possible outputs and behaves like a random function, the expected number of evaluations to find a collision is O(N)=O(2n/2)O(\sqrt{N}) = O(2^{n/2}).

Theorem

Birthday Bound on Hash Collisions

Statement

For a hash function H:{0,1}{0,1}nH: \{0,1\}^* \to \{0,1\}^n modeled as a random oracle, an adversary making qq queries finds a collision with probability:

P(collision)=1i=0q1(1i2n)1eq2/(22n)P(\text{collision}) = 1 - \prod_{i=0}^{q-1}\left(1 - \frac{i}{2^n}\right) \approx 1 - e^{-q^2/(2 \cdot 2^n)}

A collision becomes likely (probability exceeding 1/21/2) when q1.1772n/2q \approx 1.177 \cdot 2^{n/2}.

Intuition

Each new hash evaluation must avoid colliding with all previous outputs. There are (q2)\binom{q}{2} pairs, and each pair collides with probability 2n2^{-n}. The number of pairs grows quadratically in qq, so collisions appear after roughly 2n\sqrt{2^n} evaluations, not 2n2^n. This is why a 256-bit hash provides 128 bits of collision resistance, not 256.

Proof Sketch

The probability that all qq outputs are distinct is:

i=0q1(1i2n)\prod_{i=0}^{q-1}\left(1 - \frac{i}{2^n}\right)

Using 1xex1 - x \leq e^{-x} for x0x \geq 0:

i=0q1(1i2n)exp(i=0q1i2n)=exp(q(q1)2n+1)\prod_{i=0}^{q-1}\left(1 - \frac{i}{2^n}\right) \leq \exp\left(-\sum_{i=0}^{q-1}\frac{i}{2^n}\right) = \exp\left(-\frac{q(q-1)}{2^{n+1}}\right)

Setting this to 1/21/2 and solving: q(q1)/(2n+1)=ln2q(q-1)/(2^{n+1}) = \ln 2, giving q2n+1ln2=1.1772n/2q \approx \sqrt{2^{n+1} \ln 2} = 1.177 \cdot 2^{n/2}.

Why It Matters

The birthday bound determines minimum hash output sizes for security. For 128-bit collision resistance, you need at least 256-bit output (SHA-256). MD5 (128-bit output) provides only 64-bit collision resistance, and practical collision attacks exist. SHA-1 (160-bit output) provides at most 80-bit collision resistance and was broken in 2017 (SHAttered attack required about 2632^{63} evaluations).

Failure Mode

The birthday bound assumes the hash behaves as a random oracle. Real hash functions have structure that attackers exploit. MD5 collisions were found in 2242^{24} operations (far below the 2642^{64} birthday bound) because of differential cryptanalysis on its internal structure. The random oracle model gives an upper bound on security, not a guarantee.

The Random Oracle Model

Definition

Random Oracle

A random oracle is an idealized hash function that maps each unique input to a uniformly random output, with the constraint that the same input always produces the same output. Formally, it is a function chosen uniformly at random from all functions {0,1}{0,1}n\{0,1\}^* \to \{0,1\}^n. Security proofs in the random oracle model assume the hash function behaves this way. Real hash functions (SHA-256, SHA-3) are deterministic constructions that do not literally satisfy this model, but the model provides useful security arguments when instantiated with well-designed functions.

Merkle-Damgard Construction

Most classical hash functions (MD5, SHA-1, SHA-256) follow the Merkle-Damgard construction.

Definition

Merkle-Damgard Construction

Given a fixed-length compression function f:{0,1}n+b{0,1}nf: \{0,1\}^{n+b} \to \{0,1\}^n that maps (n+b)(n+b) bits to nn bits, the Merkle-Damgard construction builds a hash of arbitrary-length input:

  1. Pad the message mm to a multiple of bb bits (with length encoding in the padding)
  2. Split into blocks: m1,m2,,mLm_1, m_2, \ldots, m_L
  3. Set h0=IVh_0 = IV (fixed initialization vector)
  4. For i=1,,Li = 1, \ldots, L: compute hi=f(hi1mi)h_i = f(h_{i-1} \| m_i)
  5. Output hLh_L

The key property: if ff is collision-resistant, then the full hash HH is collision-resistant.

Theorem

Merkle-Damgard Security Reduction

Statement

If the compression function f:{0,1}n+b{0,1}nf: \{0,1\}^{n+b} \to \{0,1\}^n is collision-resistant, then the Merkle-Damgard hash function HH constructed from ff is also collision-resistant. Formally, any collision in HH implies a collision in ff.

Intuition

Suppose you find mmm \neq m' with H(m)=H(m)H(m) = H(m'). Process both messages through the chain. Since the final outputs match (hL=hLh_L = h'_{L'}), either the last compression inputs match (and the collision propagates backward) or they differ (giving a direct collision in ff). The length-encoding in the padding ensures messages of different lengths produce different final blocks, preventing trivial length-extension collisions.

Proof Sketch

Let h1,,hLh_1, \ldots, h_L and h1,,hLh'_1, \ldots, h'_{L'} be the intermediate states for mm and mm'. Since hL=hLh_L = h'_{L'}, consider the last step: f(hL1mL)=f(hL1mL)f(h_{L-1} \| m_L) = f(h'_{L'-1} \| m'_{L'}). If the inputs differ, we have a collision in ff. If they are equal, then hL1=hL1h_{L-1} = h'_{L'-1} and mL=mLm_L = m'_{L'}, so we recurse. Since mmm \neq m', at some step the inputs must differ, yielding a collision in ff.

Why It Matters

This reduction means hash function designers only need to build a secure compression function for fixed-length inputs. The Merkle-Damgard framework handles arbitrary-length messages. SHA-256 uses this pattern with a Davies-Meyer compression function based on a block cipher.

Failure Mode

Merkle-Damgard is vulnerable to length extension attacks. Given H(m)H(m) without knowing mm, an attacker can compute H(mpaddingm)H(m \| \text{padding} \| m') for any suffix mm', because H(m)H(m) equals the internal state hLh_L and the attacker can continue the chain. This breaks naive MAC constructions like H(keymessage)H(\text{key} \| \text{message}). SHA-3 (sponge construction) does not have this vulnerability.

Real Hash Functions

SHA-256 (SHA-2 family): 256-bit output, Merkle-Damgard construction with Davies-Meyer compression function using 64 rounds of mixing. Provides 128-bit collision resistance under the birthday bound. No practical attacks below the birthday bound as of 2026. Used in Bitcoin, TLS, Git.

SHA-3 (Keccak): 256-bit output (SHA3-256), sponge construction instead of Merkle-Damgard. Absorbs input blocks into internal state, then squeezes output. Immune to length extension attacks. Designed with a wide internal state (1600 bits) to provide a large security margin.

MD5: 128-bit output, broken. Practical collision attacks in seconds on commodity hardware. Do not use for any security purpose. Still found in legacy systems.

SHA-1: 160-bit output, broken. The SHAttered attack (2017) produced a collision with 2632^{63} computations. Deprecated by NIST and major browsers.

Length Extension Attacks

Watch Out

H(key || message) is NOT a secure MAC

With Merkle-Damgard hashes, knowing H(keym)H(\text{key} \| m) lets an attacker compute H(keympadm)H(\text{key} \| m \| \text{pad} \| m') for any mm' without knowing the key. The attacker uses H(keym)H(\text{key} \| m) as the intermediate state and continues hashing. Use HMAC instead: HMAC(k,m)=H((kopad)H((kipad)m))\text{HMAC}(k, m) = H((k \oplus \text{opad}) \| H((k \oplus \text{ipad}) \| m)), which is provably secure.

Applications

Digital signatures: Sign H(m)H(m) instead of mm directly. The hash compresses a large message to a fixed-size digest for the signature algorithm. Collision resistance ensures an adversary cannot find mmm' \neq m with H(m)=H(m)H(m') = H(m) to forge a signature.

Commitment schemes: To commit to a value vv, publish H(vr)H(v \| r) where rr is a random nonce. Later reveal vv and rr. Preimage resistance ensures you cannot change vv after committing. The nonce rr hides vv (hiding property).

Proof-of-work: Bitcoin mining requires finding a nonce such that H(blocknonce)H(\text{block} \| \text{nonce}) has kk leading zeros. Preimage resistance ensures the only strategy is brute-force search, requiring O(2k)O(2^k) expected evaluations.

ML model checksums: Model registries (Hugging Face, PyTorch Hub) publish SHA-256 hashes of model weights. Users verify downloaded weights match the published hash. Second preimage resistance ensures an adversary cannot produce different weights with the same hash.

Locality-sensitive hashing (LSH): A separate concept from cryptographic hashing. LSH functions are designed so that similar inputs produce similar (or colliding) outputs, the exact opposite of cryptographic hashes. LSH is used for approximate nearest neighbor search, not for security.

Common Confusions

Watch Out

Collision resistance does not imply preimage resistance in general

For pathological constructions, a function can be collision-resistant but not preimage-resistant (though this is contrived). In practice, well-designed hash functions provide all three properties simultaneously. The theoretical separations matter for understanding the definitional hierarchy, not for choosing practical hash functions.

Watch Out

128-bit hash does not mean 128-bit security against collisions

A hash with nn-bit output provides at most n/2n/2 bits of collision resistance due to the birthday bound. MD5 (128-bit output) has at most 64-bit collision resistance, and in practice much less due to structural attacks. For 128-bit collision security, use a 256-bit hash.

Watch Out

Cryptographic hashes vs checksums

CRC32 and Adler-32 are checksums for detecting accidental corruption. They are not collision-resistant: finding collisions is trivial. Cryptographic hashes resist deliberate adversarial collision finding. Use CRCs for error detection in noisy channels, cryptographic hashes for integrity against adversaries.

Exercises

ExerciseCore

Problem

A system uses a 64-bit hash function. Approximately how many inputs must an adversary hash before finding a collision with probability exceeding 50%?

ExerciseCore

Problem

Explain why H(keymessage)H(\text{key} \| \text{message}) is insecure as a MAC when HH is a Merkle-Damgard hash, but HMAC is secure. What property of Merkle-Damgard enables the attack?

ExerciseAdvanced

Problem

You are designing a content-addressable storage system. Each object is identified by its SHA-256 hash. The system stores 101210^{12} objects. What is the approximate probability of an accidental collision? Is this acceptable?

ExerciseAdvanced

Problem

Prove that collision resistance implies second preimage resistance (for hash functions with output shorter than input). Specifically, show that an efficient second preimage finder can be converted into an efficient collision finder.

References

Canonical:

  • Katz & Lindell, Introduction to Modern Cryptography (3rd ed., 2020), Chapters 6-7
  • Menezes, Oorschot & Vanstone, Handbook of Applied Cryptography (1996), Chapter 9

Current:

  • Rogaway & Shrimpton, "Cryptographic Hash-Function Basics" (2004), Sections 2-4
  • NIST FIPS 180-4, "Secure Hash Standard" (2015), SHA-256 specification
  • Bertoni et al., "The Keccak Reference" (2011), SHA-3 sponge construction
  • Stevens et al., "The First Collision for Full SHA-1" (CRYPTO 2017)
  • Mitzenmacher & Upfal, Probability and Computing (2nd ed., 2017), Chapter 5 on hashing

Next Topics

  • Public-key cryptography: hash functions are building blocks for signatures, commitments, and authenticated data structures

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.

Next Topics