Applied Math
Cryptographic Hash Functions
One-way compression of arbitrary data to fixed-length digests. Collision resistance, preimage resistance, the birthday attack bound, Merkle-Damgard construction, and applications from digital signatures to ML model fingerprinting.
Why This Matters
Every time you verify a software download, check a Git commit, or validate an ML model checkpoint, you rely on cryptographic hash functions. A hash function compresses arbitrary-length input to a fixed-length output (the digest) in a way that is easy to compute forward but infeasible to invert. The security of digital signatures, commitment schemes, proof-of-work systems, and data integrity checks all reduce to the hardness of finding collisions or preimages in hash functions. The birthday paradox determines the actual security margin: an -bit hash provides only bits of collision resistance, not .
Core Definitions
Hash Function
A cryptographic hash function maps arbitrary-length binary strings to fixed-length -bit digests. It must be efficiently computable: given input , computing takes time polynomial in .
Preimage Resistance (One-Wayness)
Given a hash output , it is computationally infeasible to find any such that . Formally, for any probabilistic polynomial-time adversary :
where for a randomly chosen .
Second Preimage Resistance
Given an input , it is computationally infeasible to find a distinct such that . This is stronger than preimage resistance: the adversary has a specific target input, not just a target output.
Collision Resistance
It is computationally infeasible to find any pair with such that . The adversary gets to choose both inputs freely. Collision resistance implies second preimage resistance (for most practical parameter ranges), but not vice versa.
The Birthday Bound
The fundamental limit on collision resistance comes from the birthday paradox. If a hash function has possible outputs and behaves like a random function, the expected number of evaluations to find a collision is .
Birthday Bound on Hash Collisions
Statement
For a hash function modeled as a random oracle, an adversary making queries finds a collision with probability:
A collision becomes likely (probability exceeding ) when .
Intuition
Each new hash evaluation must avoid colliding with all previous outputs. There are pairs, and each pair collides with probability . The number of pairs grows quadratically in , so collisions appear after roughly evaluations, not . This is why a 256-bit hash provides 128 bits of collision resistance, not 256.
Proof Sketch
The probability that all outputs are distinct is:
Using for :
Setting this to and solving: , giving .
Why It Matters
The birthday bound determines minimum hash output sizes for security. For 128-bit collision resistance, you need at least 256-bit output (SHA-256). MD5 (128-bit output) provides only 64-bit collision resistance, and practical collision attacks exist. SHA-1 (160-bit output) provides at most 80-bit collision resistance and was broken in 2017 (SHAttered attack required about evaluations).
Failure Mode
The birthday bound assumes the hash behaves as a random oracle. Real hash functions have structure that attackers exploit. MD5 collisions were found in operations (far below the birthday bound) because of differential cryptanalysis on its internal structure. The random oracle model gives an upper bound on security, not a guarantee.
The Random Oracle Model
Random Oracle
A random oracle is an idealized hash function that maps each unique input to a uniformly random output, with the constraint that the same input always produces the same output. Formally, it is a function chosen uniformly at random from all functions . Security proofs in the random oracle model assume the hash function behaves this way. Real hash functions (SHA-256, SHA-3) are deterministic constructions that do not literally satisfy this model, but the model provides useful security arguments when instantiated with well-designed functions.
Merkle-Damgard Construction
Most classical hash functions (MD5, SHA-1, SHA-256) follow the Merkle-Damgard construction.
Merkle-Damgard Construction
Given a fixed-length compression function that maps bits to bits, the Merkle-Damgard construction builds a hash of arbitrary-length input:
- Pad the message to a multiple of bits (with length encoding in the padding)
- Split into blocks:
- Set (fixed initialization vector)
- For : compute
- Output
The key property: if is collision-resistant, then the full hash is collision-resistant.
Merkle-Damgard Security Reduction
Statement
If the compression function is collision-resistant, then the Merkle-Damgard hash function constructed from is also collision-resistant. Formally, any collision in implies a collision in .
Intuition
Suppose you find with . Process both messages through the chain. Since the final outputs match (), either the last compression inputs match (and the collision propagates backward) or they differ (giving a direct collision in ). The length-encoding in the padding ensures messages of different lengths produce different final blocks, preventing trivial length-extension collisions.
Proof Sketch
Let and be the intermediate states for and . Since , consider the last step: . If the inputs differ, we have a collision in . If they are equal, then and , so we recurse. Since , at some step the inputs must differ, yielding a collision in .
Why It Matters
This reduction means hash function designers only need to build a secure compression function for fixed-length inputs. The Merkle-Damgard framework handles arbitrary-length messages. SHA-256 uses this pattern with a Davies-Meyer compression function based on a block cipher.
Failure Mode
Merkle-Damgard is vulnerable to length extension attacks. Given without knowing , an attacker can compute for any suffix , because equals the internal state and the attacker can continue the chain. This breaks naive MAC constructions like . SHA-3 (sponge construction) does not have this vulnerability.
Real Hash Functions
SHA-256 (SHA-2 family): 256-bit output, Merkle-Damgard construction with Davies-Meyer compression function using 64 rounds of mixing. Provides 128-bit collision resistance under the birthday bound. No practical attacks below the birthday bound as of 2026. Used in Bitcoin, TLS, Git.
SHA-3 (Keccak): 256-bit output (SHA3-256), sponge construction instead of Merkle-Damgard. Absorbs input blocks into internal state, then squeezes output. Immune to length extension attacks. Designed with a wide internal state (1600 bits) to provide a large security margin.
MD5: 128-bit output, broken. Practical collision attacks in seconds on commodity hardware. Do not use for any security purpose. Still found in legacy systems.
SHA-1: 160-bit output, broken. The SHAttered attack (2017) produced a collision with computations. Deprecated by NIST and major browsers.
Length Extension Attacks
H(key || message) is NOT a secure MAC
With Merkle-Damgard hashes, knowing lets an attacker compute for any without knowing the key. The attacker uses as the intermediate state and continues hashing. Use HMAC instead: , which is provably secure.
Applications
Digital signatures: Sign instead of directly. The hash compresses a large message to a fixed-size digest for the signature algorithm. Collision resistance ensures an adversary cannot find with to forge a signature.
Commitment schemes: To commit to a value , publish where is a random nonce. Later reveal and . Preimage resistance ensures you cannot change after committing. The nonce hides (hiding property).
Proof-of-work: Bitcoin mining requires finding a nonce such that has leading zeros. Preimage resistance ensures the only strategy is brute-force search, requiring expected evaluations.
ML model checksums: Model registries (Hugging Face, PyTorch Hub) publish SHA-256 hashes of model weights. Users verify downloaded weights match the published hash. Second preimage resistance ensures an adversary cannot produce different weights with the same hash.
Locality-sensitive hashing (LSH): A separate concept from cryptographic hashing. LSH functions are designed so that similar inputs produce similar (or colliding) outputs, the exact opposite of cryptographic hashes. LSH is used for approximate nearest neighbor search, not for security.
Common Confusions
Collision resistance does not imply preimage resistance in general
For pathological constructions, a function can be collision-resistant but not preimage-resistant (though this is contrived). In practice, well-designed hash functions provide all three properties simultaneously. The theoretical separations matter for understanding the definitional hierarchy, not for choosing practical hash functions.
128-bit hash does not mean 128-bit security against collisions
A hash with -bit output provides at most bits of collision resistance due to the birthday bound. MD5 (128-bit output) has at most 64-bit collision resistance, and in practice much less due to structural attacks. For 128-bit collision security, use a 256-bit hash.
Cryptographic hashes vs checksums
CRC32 and Adler-32 are checksums for detecting accidental corruption. They are not collision-resistant: finding collisions is trivial. Cryptographic hashes resist deliberate adversarial collision finding. Use CRCs for error detection in noisy channels, cryptographic hashes for integrity against adversaries.
Exercises
Problem
A system uses a 64-bit hash function. Approximately how many inputs must an adversary hash before finding a collision with probability exceeding 50%?
Problem
Explain why is insecure as a MAC when is a Merkle-Damgard hash, but HMAC is secure. What property of Merkle-Damgard enables the attack?
Problem
You are designing a content-addressable storage system. Each object is identified by its SHA-256 hash. The system stores objects. What is the approximate probability of an accidental collision? Is this acceptable?
Problem
Prove that collision resistance implies second preimage resistance (for hash functions with output shorter than input). Specifically, show that an efficient second preimage finder can be converted into an efficient collision finder.
References
Canonical:
- Katz & Lindell, Introduction to Modern Cryptography (3rd ed., 2020), Chapters 6-7
- Menezes, Oorschot & Vanstone, Handbook of Applied Cryptography (1996), Chapter 9
Current:
- Rogaway & Shrimpton, "Cryptographic Hash-Function Basics" (2004), Sections 2-4
- NIST FIPS 180-4, "Secure Hash Standard" (2015), SHA-256 specification
- Bertoni et al., "The Keccak Reference" (2011), SHA-3 sponge construction
- Stevens et al., "The First Collision for Full SHA-1" (CRYPTO 2017)
- Mitzenmacher & Upfal, Probability and Computing (2nd ed., 2017), Chapter 5 on hashing
Next Topics
- Public-key cryptography: hash functions are building blocks for signatures, commitments, and authenticated data structures
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability DistributionsLayer 0A
- Sets, Functions, and RelationsLayer 0A
- Basic Logic and Proof TechniquesLayer 0A