Algorithms Foundations
Matrix Multiplication Algorithms
From naive O(n^3) to Strassen's O(n^{2.807}) to the open question of the true exponent omega. What we know, what we do not, and why it matters for ML.
Why This Matters
Every forward pass through a neural network is a sequence of matrix multiplications. Every backward pass is another sequence. Attention in transformers computes QK^T and then multiplies the result by V. Training a model on n examples with d features requires repeatedly multiplying matrices whose dimensions are on the order of n and d.
The cost of matrix multiplication dominates the compute budget of modern ML. Any improvement to the exponent of matrix multiplication would propagate through every layer of every model.
Formal Setup
Matrix Multiplication
Given A ∈ R^{n×n} and B ∈ R^{n×n}, the product C = AB has entries:

C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

The naive algorithm computes each of the n^2 entries using n multiplications, giving O(n^3) total arithmetic operations.
Matrix Multiplication Exponent
The matrix multiplication exponent ω is the infimum of all τ such that two n×n matrices can be multiplied using O(n^τ) arithmetic operations. Formally:

\omega = \inf \{ \tau \in \mathbb{R} : M(n) = O(n^\tau) \}

where M(n) denotes the number of arithmetic operations needed to multiply two n×n matrices. The naive algorithm gives ω ≤ 3. The question is how much lower ω can go.
The Naive Algorithm
The schoolbook method directly computes each C_{ij} from the definition. It uses n^3 multiplications and n^2(n − 1) additions, for a total of O(n^3) operations. This is exactly what BLAS libraries implement (with heavy optimization for cache and SIMD), and it remains the practical standard.
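The schoolbook method can be sketched in a few lines of plain Python (illustrative only — production code should call an optimized GEMM such as `numpy.dot`):

```python
# Naive (schoolbook) matrix multiplication: a direct transcription of
# C[i][j] = sum_k A[i][k] * B[k][j]. Three nested loops of length n
# give the n^3 multiplications counted in the text.

def matmul_naive(A, B):
    n = len(A)  # assume square n x n inputs
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):  # n multiplications per output entry
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_naive(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```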
Strassen's Algorithm
In 1969, Volker Strassen showed that you do not need 8 multiplications to multiply two 2×2 matrices. Seven suffice.
Partition A and B into 2×2 block matrices:

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}

Strassen defined 7 products M_1, …, M_7 of specific linear combinations of these blocks, then recovered all four blocks of C = AB from additions and subtractions of the M_i.
Strassen Algorithm Complexity
Statement
Two n×n matrices can be multiplied using O(n^{log_2 7}) arithmetic operations, where log_2 7 ≈ 2.807.
Intuition
The key idea: multiplying two 2×2 block matrices normally requires 8 recursive block multiplications. Strassen found a way to do it with 7, at the cost of more additions. Recursing on blocks of size n/2 gives the recurrence T(n) = 7T(n/2) + O(n^2), which solves to T(n) = O(n^{log_2 7}).
Proof Sketch
Write T(n) = 7T(n/2) + cn^2 for some constant c. By the Master theorem, since log_2 7 ≈ 2.807 > 2, the recursive work dominates and the solution is T(n) = Θ(n^{log_2 7}). The correctness of the 7 products follows by direct algebraic verification.
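The recursion can be sketched in NumPy. This is one possible implementation, restricted to n a power of two; the crossover size 64 is an arbitrary choice for illustration, not a tuned value:

```python
import numpy as np

def strassen(A, B):
    """Recursive Strassen multiplication for n x n matrices, n a power of 2.
    Small blocks fall back to ordinary multiplication."""
    n = A.shape[0]
    if n <= 64:  # below the crossover, the naive method is faster
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Strassen's seven products of linear combinations of the blocks
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)

    # Recover the four output blocks using only additions/subtractions
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))
assert np.allclose(strassen(A, B), A @ B)
```

Note the trade: one block multiplication saved, but 18 block additions/subtractions instead of the naive method's 4 block additions — this is where the larger constant factor comes from.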
Why It Matters
Strassen's result was the first proof that ω < 3. It shattered the assumption that Θ(n^3) was optimal and launched the entire field of algebraic complexity theory for matrix multiplication.
Failure Mode
Strassen's algorithm is numerically less stable than the naive algorithm because it involves subtractions of similar-magnitude quantities. The constant factor is also larger, so it only wins for matrices with dimension roughly in the hundreds to a few thousand or beyond (depending on implementation). In practice, GPU hardware is optimized for regular memory access patterns that favor the naive algorithm, and Strassen's irregular access pattern incurs significant overhead.
Beyond Strassen
The race to lower continued after 1969:
| Year | Authors | Exponent |
|---|---|---|
| 1969 | Strassen | 2.807 |
| 1978 | Pan | 2.796 |
| 1981 | Coppersmith, Winograd | 2.496 |
| 1990 | Coppersmith, Winograd | 2.376 |
| 2012 | Williams | 2.3729 |
| 2024 | Alman, Williams | 2.371552 |
All post-Strassen improvements use tensor rank methods and increasingly sophisticated algebraic techniques. The Coppersmith-Winograd approach and its extensions analyze the tensor structure of matrix multiplication and use the "laser method" to derive upper bounds on ω.
The Lower Bound
Lower Bound on Matrix Multiplication Exponent
Statement
Any algorithm that computes the product of two n×n matrices must perform at least n^2 operations. Therefore ω ≥ 2.
Intuition
The output matrix has n^2 entries, and each must be computed. You cannot produce n^2 outputs with fewer than n^2 operations.
Proof Sketch
This is an information-theoretic argument. The output has n^2 entries. Any algorithm must at minimum read or write each entry at least once, requiring Ω(n^2) work.
Why It Matters
This tells us ω ≥ 2. The gap between this lower bound and the best upper bound (currently about 2.3716) is the central open problem in algebraic complexity theory.
Failure Mode
The lower bound only counts the number of operations, not the depth of the computation or memory access costs. Real hardware constraints (cache hierarchy, memory bandwidth) may impose higher practical lower bounds even if ω = 2 in the algebraic model.
The Open Question: Is ω = 2?
Nobody knows. There is no proof that ω > 2, and there is no algorithm achieving exponent 2. The conjecture that ω = 2 is widely discussed but far from settled. Arguments in favor: the current upper bound keeps decreasing. Arguments against: all known improvements use the same family of techniques (the laser method), and there are barrier results suggesting these techniques alone cannot reach ω = 2.
If ω = 2, then matrix multiplication costs no more than matrix addition up to lower-order factors. This would have profound implications for all of numerical linear algebra.
Practical Reality
Despite decades of theoretical progress, the algorithms that actually run on hardware are:
- Naive (GEMM): O(n^3), but with extraordinary constant-factor optimization. cuBLAS achieves near-peak FLOPS on GPUs.
- Strassen (rare): Used occasionally for very large dense matrices on CPUs. Almost never on GPUs due to irregular memory access.
- Winograd-class algorithms: Not implemented in any production system. The constants and numerical instability are prohibitive.
The gap between theory and practice exists because: (a) the big-O notation hides enormous constants; (b) modern hardware is optimized for regular, predictable memory access; (c) numerical stability matters for real computations.
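To see how much constants and hardware matter, compare a pure-Python triple loop against the BLAS-backed `@` operator. Both perform the same O(n^3) arithmetic; the timings below are a rough, machine-dependent micro-benchmark, not a rigorous measurement:

```python
import time
import numpy as np

def matmul_loops(A, B):
    # Same O(n^3) arithmetic as BLAS, but with interpreter overhead
    # on every scalar operation and no cache blocking or SIMD.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

n = 150
rng = np.random.default_rng(0)
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

t0 = time.perf_counter(); C_slow = matmul_loops(A, B); t_loops = time.perf_counter() - t0
t0 = time.perf_counter(); C_fast = A @ B;              t_blas = time.perf_counter() - t0

assert np.allclose(C_slow, C_fast)  # identical result, wildly different speed
print(f"pure Python: {t_loops:.3f}s  BLAS: {t_blas:.6f}s  "
      f"ratio ~{t_loops / t_blas:.0f}x")
```

The asymptotic exponent is identical in both cases; everything that separates them is "constant factor" in the theory.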
Connection to ML
Matrix multiplication cost directly determines:
- Forward pass cost: each linear layer computes Y = XW, where X ∈ R^{B×d_in} and W ∈ R^{d_in×d_out} (B the batch size)
- Attention cost: computing QK^T for sequence length n and head dimension d costs O(n^2 d)
- Backward pass cost: gradient computation requires transposed matrix multiplications of the same dimensions
- Training cost: scales linearly with the number of matrix multiplications per step, times the number of steps
A reduction in ω from 3 to 2 would, in principle, reduce the cost of the matrix multiplications in a forward pass from cubic to roughly quadratic in the matrix dimension. In practice, the wall-clock benefit would be smaller due to constants and hardware constraints.
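These costs are easy to estimate by hand. The sketch below counts the FLOPs of the matrix multiplications in one transformer layer; the sequence length and hidden size are assumed values for illustration, not any particular model's configuration:

```python
# Back-of-the-envelope FLOP counts for the matmuls in one hypothetical
# transformer layer. Shapes below are illustrative assumptions.

def matmul_flops(m, k, n):
    # Multiplying (m x k) by (k x n): m*n output entries, each a
    # length-k dot product (k mults + k adds), so ~2*m*k*n FLOPs.
    return 2 * m * k * n

seq_len, d_model = 2048, 4096  # assumed sequence length and hidden size

qkt    = matmul_flops(seq_len, d_model, seq_len)     # Q K^T: quadratic in seq_len
attn_v = matmul_flops(seq_len, seq_len, d_model)     # (QK^T) V: also quadratic
proj   = 4 * matmul_flops(seq_len, d_model, d_model) # Q, K, V, output projections

print(f"QK^T:        {qkt / 1e9:.1f} GFLOPs")
print(f"scores @ V:  {attn_v / 1e9:.1f} GFLOPs")
print(f"projections: {proj / 1e9:.1f} GFLOPs")
```

Doubling the sequence length quadruples the attention terms but only doubles the projection terms, which is why attention dominates at long context lengths.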
Common Confusions
Strassen is not used in practice for deep learning
You might expect that a faster algorithm would be adopted immediately. But GPU hardware (tensor cores, warp-level operations) is designed for dense, regular matrix multiplications. Strassen's irregular access pattern and additional memory requirements make it slower on real hardware for the matrix sizes typical in ML. The exponent wins asymptotically, but "asymptotically" here means matrices larger than those encountered in practice.
The exponent omega describes arithmetic operations, not wall-clock time
When we say an algorithm runs in O(n^ω) time, we mean the number of scalar multiplications and additions. Real execution time depends on memory bandwidth, cache behavior, parallelism, and numerical precision. An algorithm with exponent 2.5 but good cache behavior may outperform one with exponent 2.4 but terrible locality.
Key Takeaways
- Naive matrix multiplication: O(n^3). This is what runs in practice.
- Strassen (1969): O(n^{2.807}). First proof that ω < 3.
- Current best upper bound: ω < 2.371552 (Alman-Williams, 2024).
- Lower bound: ω ≥ 2 (information-theoretic).
- Whether ω = 2 is a major open problem in computer science.
- Theory and practice diverge: the fastest theoretical algorithms are not the fastest in practice.
- Theory and practice diverge: the fastest theoretical algorithms are not the fastest in practice.
Exercises
Problem
Verify that the naive matrix multiplication algorithm for two n×n matrices uses exactly n^3 multiplications and n^2(n − 1) additions.
Problem
Strassen's algorithm reduces 8 multiplications to 7 for 2×2 blocks. Apply the Master theorem to the recurrence T(n) = 7T(n/2) + O(n^2) to derive the exponent log_2 7.
Problem
Suppose someone discovers a way to multiply 3×3 block matrices using only 21 multiplications (instead of the naive 27). What exponent would this yield via recursive application?
References
Canonical:
- Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS), Chapter 4.2 (Strassen)
- Burgisser, Clausen, Shokrollahi, Algebraic Complexity Theory, Chapters 14-15
Current:
- Alman & Williams, "A Refined Laser Method and Faster Matrix Multiplication" (2024)
- Blaser, "Fast Matrix Multiplication" (2013), survey in Theory of Computing
Next Topics
- Open problems in matrix computation: the broader landscape of unsolved questions
- Gram matrices and kernel matrices: where matrix multiplication meets ML directly
Last reviewed: April 2026