Cryptographic Hash Functions

Sneiderman, Robby

Applied Math

Cryptographic Hash Functions

One-way compression of arbitrary data to fixed-length digests. Collision resistance, preimage resistance, the birthday attack bound, Merkle-Damgard construction, and applications from digital signatures to ML model fingerprinting.

CoreTier 2StableReference~40 min

Prerequisites

Common Probability Distributions Sets Functions and Relations

Start 8-question practice · 4 available Prereq Map

Learning position

Read this page in the graph.

applied-math | layer 2 | tier 2. This page has 2 direct prerequisites and 1 published dependent.

Open Atlas Prerequisites Leads to

What next

Public-Key Cryptography

This is the first curated or graph-derived continuation from the current page.

Evidence badge

Claim status

This page has no public Lean mapping yet. Use the evidence page to inspect how claim status labels work.

Show the backing system

AtlasOpen the full prerequisite graph and run grounding traces.EvidenceInspect source support, claim labels, and public trust status.LeanReview the checked declaration list, scopes, and axiom profile.

Why This Matters

Every time you verify a software download, check a Git commit, or validate an ML model checkpoint, you rely on cryptographic hash functions. A hash function compresses arbitrary-length input to a fixed-length output (the digest) in a way that is easy to compute forward but infeasible to invert. The security of digital signatures, commitment schemes, proof-of-work systems, and data integrity checks all reduce to the hardness of finding collisions or preimages in hash functions. The birthday paradox determines the actual security margin: an $n$ -bit hash provides only $n/2$ bits of collision resistance, not $n$ .

Core Definitions

Definition

Hash Function $H : {0, 1}^{*} \to {0, 1}^{n}$

A cryptographic hash function maps arbitrary-length binary strings to fixed-length $n$ -bit digests. It must be efficiently computable: given input $m$ , computing $H(m)$ takes time polynomial in $|m|$ .

Definition

Preimage Resistance (One-Wayness)

Given a hash output $y$ , it is computationally infeasible to find any $m$ such that $H(m) = y$ . Formally, for any probabilistic polynomial-time adversary $\mathcal{A}$ :

$\Pr[H(\mathcal{A}(y)) = y] \leq \text{negl}(n)$

where $y = H(m)$ for a randomly chosen $m$ .

Definition

Second Preimage Resistance

Given an input $m_1$ , it is computationally infeasible to find a distinct $m_2 \neq m_1$ such that $H(m_1) = H(m_2)$ . This is stronger than preimage resistance: the adversary has a specific target input, not just a target output.

Definition

Collision Resistance

It is computationally infeasible to find any pair $(m_1, m_2)$ with $m_1 \neq m_2$ such that $H(m_1) = H(m_2)$ . The adversary gets to choose both inputs freely. Collision resistance implies second preimage resistance (for most practical parameter ranges), but not vice versa.

The Birthday Bound

The fundamental limit on collision resistance comes from the birthday paradox. If a hash function has $N = 2^n$ possible outputs and behaves like a random function, the expected number of evaluations to find a collision is $O(\sqrt{N}) = O(2^{n/2})$ .

Theorem

Birthday Bound on Hash Collisions

Statement

For a hash function $H: \{0,1\}^* \to \{0,1\}^n$ modeled as a random oracle, an adversary making $q$ queries finds a collision with probability:

$P(\text{collision}) = 1 - \prod_{i=0}^{q-1}\left(1 - \frac{i}{2^n}\right) \approx 1 - e^{-q^2/(2 \cdot 2^n)}$

A collision becomes likely (probability exceeding $1/2$ ) when $q \approx 1.177 \cdot 2^{n/2}$ .

Intuition

Each new hash evaluation must avoid colliding with all previous outputs. There are $\binom{q}{2}$ pairs, and each pair collides with probability $2^{-n}$ . The number of pairs grows quadratically in $q$ , so collisions appear after roughly $\sqrt{2^n}$ evaluations, not $2^n$ . This is why a 256-bit hash provides 128 bits of collision resistance, not 256.

Proof Sketch

The probability that all $q$ outputs are distinct is:

$\prod_{i=0}^{q-1}\left(1 - \frac{i}{2^n}\right)$

Using $1 - x \leq e^{-x}$ for $x \geq 0$ :

$\prod_{i=0}^{q-1}\left(1 - \frac{i}{2^n}\right) \leq \exp\left(-\sum_{i=0}^{q-1}\frac{i}{2^n}\right) = \exp\left(-\frac{q(q-1)}{2^{n+1}}\right)$

Setting this to $1/2$ and solving: $q(q-1)/(2^{n+1}) = \ln 2$ , giving $q \approx \sqrt{2^{n+1} \ln 2} = 1.177 \cdot 2^{n/2}$ .

Why It Matters

The birthday bound determines minimum hash output sizes for security. For 128-bit collision resistance, you need at least 256-bit output (SHA-256). MD5 (128-bit output) provides only 64-bit collision resistance, and practical collision attacks exist. SHA-1 (160-bit output) provides at most 80-bit collision resistance and was broken in 2017 (SHAttered attack required about $2^{63}$ evaluations).

Failure Mode

The birthday bound assumes the hash behaves as a random oracle. Real hash functions have structure that attackers exploit. MD5 collisions were found in $2^{24}$ operations (far below the $2^{64}$ birthday bound) because of differential cryptanalysis on its internal structure. The random oracle model gives an upper bound on security, not a guarantee.

report a correction →

The Random Oracle Model

Definition

Random Oracle $O$

A random oracle is an idealized hash function that maps each unique input to a uniformly random output, with the constraint that the same input always produces the same output. Formally, it is a function chosen uniformly at random from all functions $\{0,1\}^* \to \{0,1\}^n$ . Security proofs in the random oracle model assume the hash function behaves this way. Real hash functions (SHA-256, SHA-3) are deterministic constructions that do not literally satisfy this model, but the model provides useful security arguments when instantiated with well-designed functions.

Merkle-Damgard Construction

Most classical hash functions (MD5, SHA-1, SHA-256) follow the Merkle-Damgard construction.

Definition

Merkle-Damgard Construction

Given a fixed-length compression function $f: \{0,1\}^{n+b} \to \{0,1\}^n$ that maps $(n+b)$ bits to $n$ bits, the Merkle-Damgard construction builds a hash of arbitrary-length input:

Pad the message $m$ to a multiple of $b$ bits (with length encoding in the padding)
Split into blocks: $m_1, m_2, \ldots, m_L$
Set $h_0 = IV$ (fixed initialization vector)
For $i = 1, \ldots, L$ : compute $h_i = f(h_{i-1} \| m_i)$
Output $h_L$

The key property: if $f$ is collision-resistant, then the full hash $H$ is collision-resistant.

Theorem

Merkle-Damgard Security Reduction

Statement

If the compression function $f: \{0,1\}^{n+b} \to \{0,1\}^n$ is collision-resistant, then the Merkle-Damgard hash function $H$ constructed from $f$ is also collision-resistant. Formally, any collision in $H$ implies a collision in $f$ .

Intuition

Suppose you find $m \neq m'$ with $H(m) = H(m')$ . Process both messages through the chain. Since the final outputs match ( $h_L = h'_{L'}$ ), either the last compression inputs match (and the collision propagates backward) or they differ (giving a direct collision in $f$ ). The length-encoding in the padding ensures messages of different lengths produce different final blocks, preventing trivial length-extension collisions.

Proof Sketch

Let $h_1, \ldots, h_L$ and $h'_1, \ldots, h'_{L'}$ be the intermediate states for $m$ and $m'$ . Since $h_L = h'_{L'}$ , consider the last step: $f(h_{L-1} \| m_L) = f(h'_{L'-1} \| m'_{L'})$ . If the inputs differ, we have a collision in $f$ . If they are equal, then $h_{L-1} = h'_{L'-1}$ and $m_L = m'_{L'}$ , so we recurse. Since $m \neq m'$ , at some step the inputs must differ, yielding a collision in $f$ .

Why It Matters

This reduction means hash function designers only need to build a secure compression function for fixed-length inputs. The Merkle-Damgard framework handles arbitrary-length messages. SHA-256 uses this pattern with a Davies-Meyer compression function based on a block cipher.

Failure Mode

Merkle-Damgard is vulnerable to length extension attacks. Given $H(m)$ without knowing $m$ , an attacker can compute $H(m \| \text{padding} \| m')$ for any suffix $m'$ , because $H(m)$ equals the internal state $h_L$ and the attacker can continue the chain. This breaks naive MAC constructions like $H(\text{key} \| \text{message})$ . SHA-3 (sponge construction) does not have this vulnerability.

report a correction →

Real Hash Functions

SHA-256 (SHA-2 family): 256-bit output, Merkle-Damgard construction with Davies-Meyer compression function using 64 rounds of mixing. Provides 128-bit collision resistance under the birthday bound. No practical attacks below the birthday bound as of 2026. Used in Bitcoin, TLS, Git.

SHA-3 (Keccak): 256-bit output (SHA3-256), sponge construction instead of Merkle-Damgard. Absorbs input blocks into internal state, then squeezes output. Immune to length extension attacks. Designed with a wide internal state (1600 bits) to provide a large security margin.

MD5: 128-bit output, broken. Practical collision attacks in seconds on commodity hardware. Do not use for any security purpose. Still found in legacy systems.

SHA-1: 160-bit output, broken. The SHAttered attack (2017) produced a collision with $2^{63}$ computations. Deprecated by NIST and major browsers.

Length Extension Attacks

Watch Out

H(key || message) is NOT a secure MAC

With Merkle-Damgard hashes, knowing $H(\text{key} \| m)$ lets an attacker compute $H(\text{key} \| m \| \text{pad} \| m')$ for any $m'$ without knowing the key. The attacker uses $H(\text{key} \| m)$ as the intermediate state and continues hashing. Use HMAC instead: $\text{HMAC}(k, m) = H((k \oplus \text{opad}) \| H((k \oplus \text{ipad}) \| m))$ , which is provably secure.

Applications

Digital signatures: Sign $H(m)$ instead of $m$ directly. The hash compresses a large message to a fixed-size digest for the signature algorithm. Collision resistance ensures an adversary cannot find $m' \neq m$ with $H(m') = H(m)$ to forge a signature.

Commitment schemes: To commit to a value $v$ , publish $H(v \| r)$ where $r$ is a random nonce. Later reveal $v$ and $r$ . Preimage resistance ensures you cannot change $v$ after committing. The nonce $r$ hides $v$ (hiding property).

Proof-of-work: Bitcoin mining requires finding a nonce such that $H(\text{block} \| \text{nonce})$ has $k$ leading zeros. Preimage resistance ensures the only strategy is brute-force search, requiring $O(2^k)$ expected evaluations.

ML model checksums: Model registries (Hugging Face, PyTorch Hub) publish SHA-256 hashes of model weights. Users verify downloaded weights match the published hash. Second preimage resistance ensures an adversary cannot produce different weights with the same hash.

Locality-sensitive hashing (LSH): A separate concept from cryptographic hashing. LSH functions are designed so that similar inputs produce similar (or colliding) outputs, the exact opposite of cryptographic hashes. LSH is used for approximate nearest neighbor search, not for security.

Common Confusions

Watch Out

Collision resistance does not imply preimage resistance in general

For pathological constructions, a function can be collision-resistant but not preimage-resistant (though this is contrived). In practice, well-designed hash functions provide all three properties simultaneously. The theoretical separations matter for understanding the definitional hierarchy, not for choosing practical hash functions.

Watch Out

128-bit hash does not mean 128-bit security against collisions

A hash with $n$ -bit output provides at most $n/2$ bits of collision resistance due to the birthday bound. MD5 (128-bit output) has at most 64-bit collision resistance, and in practice much less due to structural attacks. For 128-bit collision security, use a 256-bit hash.

Watch Out

Cryptographic hashes vs checksums

CRC32 and Adler-32 are checksums for detecting accidental corruption. They are not collision-resistant: finding collisions is trivial. Cryptographic hashes resist deliberate adversarial collision finding. Use CRCs for error detection in noisy channels, cryptographic hashes for integrity against adversaries.

Exercises

ExerciseCore

Problem

A system uses a 64-bit hash function. Approximately how many inputs must an adversary hash before finding a collision with probability exceeding 50%?

ExerciseCore

Problem

Explain why $H(\text{key} \| \text{message})$ is insecure as a MAC when $H$ is a Merkle-Damgard hash, but HMAC is secure. What property of Merkle-Damgard enables the attack?

ExerciseAdvanced

Problem

You are designing a content-addressable storage system. Each object is identified by its SHA-256 hash. The system stores $10^{12}$ objects. What is the approximate probability of an accidental collision? Is this acceptable?

ExerciseAdvanced

Problem

Prove that collision resistance implies second preimage resistance (for hash functions with output shorter than input). Specifically, show that an efficient second preimage finder can be converted into an efficient collision finder.

References

Canonical:

Katz & Lindell, Introduction to Modern Cryptography (3rd ed., 2020), Chapters 6-7
Menezes, Oorschot & Vanstone, Handbook of Applied Cryptography (1996), Chapter 9

Current:

Rogaway & Shrimpton, "Cryptographic Hash-Function Basics" (2004), Sections 2-4
NIST FIPS 180-4, "Secure Hash Standard" (2015), SHA-256 specification
Bertoni et al., "The Keccak Reference" (2011), SHA-3 sponge construction
Stevens et al., "The First Collision for Full SHA-1" (CRYPTO 2017)
Mitzenmacher & Upfal, Probability and Computing (2nd ed., 2017), Chapter 5 on hashing

Next Topics

Public-key cryptography: hash functions are building blocks for signatures, commitments, and authenticated data structures

Last reviewed: April 15, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.

Full prerequisite chain All derived topics

Required prerequisites

2

Common Probability Distributionslayer 0A · tier 1
Sets, Functions, and Relationslayer 0A · tier 1

Derived topics

1

Public-Key Cryptographylayer 2 · tier 2

Graph-backed continuations

Public-Key Cryptography