Disclaimer: Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Flash Attention

IO-aware exact attention: tile QKV matrices into SRAM-sized blocks, compute attention without materializing the full attention matrix in HBM, reducing memory reads/writes from quadratic to linear.

Advanced · Tier 2 · ~55 min

Why This Matters

[Figure: Flash Attention tiled computation. Standard (slow): the N \times N attention matrix is materialized in HBM, O(N^2) memory. Flash Attention (fast): Q, K, V are tiled into blocks; each Q_i K_j^\top product, softmax, and multiply by V_j is computed in on-chip SRAM, and the N \times N matrix is never materialized in off-chip HBM: O(N) extra memory, same result. Standard: \Theta(Nd + N^2) HBM accesses; Flash: \Theta(N^2 d^2 / M). Dao 2022 reports up to 7.6\times speedup on GPT-2 attention kernels; the speedup grows with sequence length.]

Standard transformer attention computes \text{softmax}(QK^\top / \sqrt{d})V by materializing the full N \times N attention matrix in GPU high-bandwidth memory (HBM). For sequence length N = 128{,}000, this matrix has 16 billion entries: roughly 32 GB in FP16. This does not fit in SRAM (tens of MB) and barely fits in HBM.

Flash Attention computes the exact same output without ever forming the full attention matrix. It does this by tiling the computation so that each block fits in SRAM, performing all the work there, and writing only the final result back to HBM. The output is mathematically identical. The speedup comes entirely from reducing memory traffic.

Mental Model

Think of attention as a matrix multiply with a softmax in the middle. Standard implementations write the intermediate N \times N matrix to HBM, then read it back to apply softmax, then write the softmax output to HBM, then read it back to multiply by V. Each of these read/write round-trips is expensive.

Flash Attention keeps intermediate results in SRAM (fast, small memory) and never writes the N \times N matrix to HBM at all.
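To make the round-trips concrete, here is a minimal NumPy sketch of standard attention (illustrative only; on a GPU, the two N x N intermediates S and P are what force the HBM round-trips):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes two full N x N intermediates."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                     # N x N scores (round-trip 1)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)           # N x N softmax (round-trip 2)
    return P @ V                                 # N x d output

rng = np.random.default_rng(0)
N, d = 256, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O = naive_attention(Q, K, V)
print(O.shape)  # (256, 64), but S and P each held N * N = 65,536 floats
```

The output is only N x d; everything quadratic in N is intermediate state, which is exactly what Flash Attention refuses to write out.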

The IO Bottleneck

Definition

Arithmetic Intensity

The ratio of floating-point operations to bytes transferred between HBM and compute units. When arithmetic intensity is low, the operation is memory-bound: the GPU spends most of its time waiting for data, not computing.

Standard attention has arithmetic intensity O(1) for the softmax step: each element requires a few FLOPs but must be read from and written to HBM. The matrix multiply steps (QK^\top and \text{softmax} \cdot V) have higher arithmetic intensity, but the softmax bottleneck dominates wall-clock time.
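A back-of-envelope comparison makes the gap visible. This is a sketch: the FLOP counts (e.g. 5 FLOPs per softmax element) are rough conventions, not exact hardware numbers:

```python
N, d, fp16 = 4096, 128, 2  # sequence length, head dim, bytes per element

# QK^T matmul: 2*N^2*d FLOPs; streams Q and K in, the N x N scores out.
matmul_intensity = (2 * N * N * d) / ((2 * N * d + N * N) * fp16)

# Softmax: a few FLOPs per score; reads and writes the whole N x N matrix.
softmax_intensity = (5 * N * N) / (2 * N * N * fp16)

print(f"matmul:  {matmul_intensity:.1f} FLOPs/byte")
print(f"softmax: {softmax_intensity:.2f} FLOPs/byte")  # 1.25: memory-bound
```

Two orders of magnitude separate the two steps, which is why the softmax pass, not the matmuls, sets the wall-clock time in an unfused implementation.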

The Tiling Algorithm

The key idea: partition Q, K, V into blocks that fit in SRAM.

  1. Divide Q into blocks Q_1, \ldots, Q_{T_r} of size B_r \times d.
  2. Divide K and V into blocks K_1, \ldots, K_{T_c} and V_1, \ldots, V_{T_c} of size B_c \times d.
  3. For each query block Q_i: iterate over all key-value blocks (K_j, V_j), computing partial attention scores in SRAM.
  4. Accumulate the output using the online softmax trick (see below).

The block sizes B_r and B_c are chosen so that Q_i, K_j, V_j, and the partial output all fit simultaneously in SRAM.
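For concreteness, the paper's block-size choice (B_c = M/(4d), B_r = min(B_c, d)) can be evaluated numerically. The 228 KB figure is the H100's per-SM shared memory; all numbers are illustrative:

```python
sram_bytes = 228 * 1024   # H100 shared memory per SM (illustrative)
fp16 = 2                  # bytes per element
d = 128

M = sram_bytes // fp16    # SRAM capacity in elements
B_c = M // (4 * d)        # key/value block rows
B_r = min(B_c, d)         # query block rows
print(B_c, B_r)           # 228 128
```

So a single tile covers a 128 x 228 patch of the attention matrix; the full N x N matrix is swept patch by patch without ever existing in HBM.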

Online Softmax

The challenge with tiling is that softmax requires the full row of scores s_i = Q_i K^\top / \sqrt{d} to compute the normalizer. You cannot compute \text{softmax}(s_i) from blocks independently.

Definition

Online Softmax

Maintain running statistics (m, \ell), where m is the running row-wise maximum and \ell is the running sum of exponentials. When processing a new block of scores s_{ij}:

m_{\text{new}} = \max(m_{\text{old}}, \max(s_{ij}))

\ell_{\text{new}} = \ell_{\text{old}} \cdot e^{m_{\text{old}} - m_{\text{new}}} + \sum_k e^{s_{ijk} - m_{\text{new}}}

The output accumulator is rescaled by e^{m_{\text{old}} - m_{\text{new}}} at each step to account for the updated maximum.

This produces the exact same result as computing softmax over the entire row at once.

Main Theorems

Theorem

Flash Attention IO Complexity

Statement

Standard attention requires \Theta(Nd + N^2) HBM accesses. Flash Attention requires \Theta(N^2 d^2 / M) HBM accesses, where M is the SRAM size.

Intuition

Each key-value tile is loaded into SRAM once and fully processed before moving on, but the query blocks and output accumulator must be streamed past every key-value tile. With \Theta(Nd/M) key-value passes, each moving \Theta(Nd) elements, the product gives the total IO.

Proof Sketch

There are T_r = N/B_r query blocks and T_c = N/B_c key-value blocks. The outer loop runs over the key-value blocks: each iteration loads K_j and V_j (2 B_c d elements) into SRAM once, then streams every query block and its output accumulator through it, moving \Theta(Nd) elements per pass. Total HBM accesses: \Theta(Nd \cdot T_c) = \Theta(N^2 d / B_c). Setting B_c = M/(4d) and B_r = \min(M/(4d), d) gives the stated \Theta(N^2 d^2 / M) bound.
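Evaluating the theorem's two expressions numerically (constants dropped, M measured in elements; a rough sketch, not a profiler measurement) shows the ratio settling at the constant factor rather than growing with N:

```python
def io_ratio(N, d, M):
    """Ratio of the theorem's two Theta-expressions (constants dropped)."""
    standard = N * d + N * N       # Theta(Nd + N^2)
    flash = N * N * d * d / M      # Theta(N^2 d^2 / M)
    return standard / flash

M = 228 * 1024 // 2                # H100-ish SRAM per SM, in FP16 elements
for N in (1_024, 8_192, 131_072):
    print(f"N={N:>7}: standard/flash ~ {io_ratio(N, 128, M):.1f}")
```

With M counted in elements the ratio plateaus near M/d^2 (about 7 here); the exact constant depends on units and on how many round-trips the unfused baseline actually makes, which is why reported reductions span roughly an order of magnitude.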

Why It Matters

In the operational regime, M is fixed by hardware (for example, the H100 has about 228 KB of SRAM per SM) while N grows. For practical long-context workloads, Nd is far larger than M. At N = 100{,}000, d = 128, FP16, a single Nd-element array occupies about 25 MB, roughly 100\times the SRAM budget. Both standard attention and Flash Attention therefore remain \Omega(N^2) in HBM traffic as N \to \infty with M fixed. The win is a large constant-factor reduction of memory traffic by \Theta(N^2 / (N^2 d^2 / M)) = \Theta(M / d^2). For M = 228{,}000 bytes and d = 128 in FP16, this constant factor is on the order of 10 to 30\times fewer HBM bytes moved, which is enough to shift attention from memory-bound to closer to compute-bound on current accelerators.

Failure Mode

The IO advantage diminishes when d is very large relative to M, because fewer elements fit in SRAM per tile. Also, if the attention pattern is sparse (most entries near zero), sparse attention methods from the efficient transformers survey may achieve even lower IO by skipping blocks entirely; Flash Attention computes exact dense attention and cannot exploit sparsity. A final pitfall: claims of "linear in N" HBM traffic require M = \Theta(Nd), which is physically false on current GPUs once N exceeds a few thousand.

Proposition

Online Softmax Equivalence

Statement

The online softmax algorithm with running maximum and denominator produces the same output as the two-pass softmax algorithm (first pass to find max and sum, second pass to normalize), up to floating-point rounding.

Intuition

The rescaling factor e^{m_{\text{old}} - m_{\text{new}}} exactly compensates for having used a stale maximum in earlier blocks. The algebra telescopes: each partial sum, when rescaled, equals what it would have been if computed with the global maximum from the start.

Proof Sketch

By induction on the number of blocks. The base case (one block) is trivial. For the inductive step, the rescaled normalizer after k+1 blocks equals \sum_{j=1}^{k+1} \sum_i e^{s_{ji} - m_{k+1}}, which is the same as the full-row computation with maximum m_{k+1}; the output accumulator telescopes identically.

Why It Matters

Without online softmax, tiled attention would produce approximate results. This proposition guarantees exact equivalence, which means Flash Attention is a pure systems optimization with zero accuracy cost.

Failure Mode

Floating-point rounding order differs between the tiled and standard implementations, so outputs may differ at the level of machine epsilon. This is not a mathematical failure but can cause bitwise non-determinism in practice.
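A quick FP32 sketch of the effect (illustrative; whether the low bits actually differ depends on the data and the block layout):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
m = np.float32(x.max())

# One pass over the full row.
full = np.exp(x - m).sum(dtype=np.float32)

# Tiled: per-block partial sums combined afterwards (different rounding order).
tiled = np.float32(0.0)
for block in np.split(x, 8):
    tiled += np.exp(block - m).sum(dtype=np.float32)

# Mathematically equal; the low bits may differ between the two orders.
print(abs(full - tiled) / full)
```

The relative discrepancy is at the level of machine epsilon, which is harmless for accuracy but matters if a pipeline assumes bitwise-reproducible activations.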

Flash Attention 2 and 3

Flash Attention 2 (Dao 2023) improves work partitioning. The key changes: swap the loop order so the outer loop is over query blocks (better parallelism across warps), reduce non-matmul FLOPs, and partition work across warps within a thread block more evenly. Result: roughly 2\times speedup over Flash Attention 1.

Flash Attention 3 (Shah et al. 2024) targets Hopper GPUs (H100). Key ideas: asynchronous memory copies via TMA (tensor memory accelerator), FP8 quantized attention for further throughput gains, and warp specialization for overlapping computation with data movement. Shah et al. report about 740 TFLOP/s in FP16 on H100, roughly 75% of the 989 TFLOP/s FP16 peak. This is a large jump over Flash Attention 2 (about 35% utilization on the same hardware) but still below what dense GEMM kernels reach (typically 90%+ of peak). Framing FA3 as "near peak" overstates it. Framing it as "75% of peak, nearly double FA2 on H100" is accurate.

Common Confusions

Watch Out

Flash Attention does not approximate attention

Flash Attention computes exact standard attention. It is not an approximation scheme like Linformer, Performer, or random feature attention. The output is identical (up to floating-point rounding) to naive attention. The improvement is purely in IO efficiency.

Watch Out

Flash Attention reduces memory, not FLOPs

Flash Attention actually performs more total FLOPs than standard attention (due to recomputation in the backward pass). It is faster because wall-clock time for attention is dominated by memory access time, not compute time. Reducing IO is what matters.

Watch Out

"SRAM" here means per-SM shared memory, not L2 cache

On NVIDIA GPUs, SRAM refers to the shared memory within each streaming multiprocessor. Its size (typically 48-228 KB per SM) determines the maximum block size in Flash Attention. This is distinct from L2 cache, which is larger but not directly addressable by the programmer.

Summary

  • Standard attention is memory-bound: the bottleneck is HBM reads/writes, not FLOPs
  • Flash Attention tiles Q, K, V into SRAM-sized blocks and never materializes the full N \times N attention matrix in HBM
  • Online softmax enables exact tiled computation by maintaining running statistics
  • IO traffic drops from \Theta(N^2 + Nd) to \Theta(N^2 d^2 / M) HBM accesses, a constant-factor \Theta(M / d^2) reduction for fixed hardware M
  • Flash Attention 2 improves parallelism; Flash Attention 3 adds asynchrony and FP8

Exercises

Exercise (Core)

Problem

A GPU has 192 KB of SRAM per SM and uses FP16 (2 bytes per element). The head dimension is d = 128. What is the maximum block size B_c for key and value blocks, assuming we need to fit K_j (B_c \times d) and V_j (B_c \times d) simultaneously in SRAM, with half the SRAM reserved for other use?

Exercise (Advanced)

Problem

Prove that the online softmax rescaling is exact. Specifically, show that after processing blocks 1, \ldots, k, the accumulated output O_k equals \sum_{j=1}^{k} \text{softmax}(Q_i [K_1, \ldots, K_k]^\top / \sqrt{d})_{:, \text{block}_j} V_j, where the softmax is computed over all k blocks jointly.


References

Canonical:

  • Dao, Fu, Ermon, Rudra, Re, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022, arXiv:2205.14135), Sections 3-4 for the IO analysis and tiling algorithm
  • Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, arXiv:2307.08691), Sections 3.1-3.2 for the loop-order and warp-partitioning changes

Current:

  • Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (2024, arXiv:2407.08608), Section 3 for warp specialization and 4 for FP8
  • Milakov & Gimelshein, Online normalizer calculation for softmax (2018, arXiv:1805.02867) for the online softmax recursion
  • Rabe & Staats, Self-attention Does Not Need O(n^2) Memory (2021, arXiv:2112.05682), the memory-efficient attention algorithm that precedes FlashAttention
  • NVIDIA, H100 Tensor Core GPU Architecture Whitepaper (2022), Sections on SM SRAM, TMA, and FP16/FP8 tensor cores, for the hardware numbers cited above
  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 9 for the transformer attention background used by this page

Next Topics

  • Fused kernels: Flash Attention is itself a fused kernel; understanding kernel fusion explains why tiling helps

Last reviewed: April 2026
