
Algorithms Foundations

Matrix Multiplication Algorithms

From the naive $O(n^3)$ algorithm, to Strassen's $O(n^{2.807})$, to the open question of the true exponent $\omega$. What we know, what we do not, and why it matters for ML.


Why This Matters

Every forward pass through a neural network is a sequence of matrix multiplications. Every backward pass is another sequence. Attention in transformers computes $QK^T$ and then multiplies by $V$. Training a model on $n$ examples with $d$ features requires multiplying matrices of size $n \times d$ and $d \times k$ repeatedly.

The cost of matrix multiplication dominates the compute budget of modern ML. Any improvement to the exponent of matrix multiplication would propagate through every layer of every model.

Formal Setup

Definition

Matrix Multiplication

Given $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times n}$, the product $C = AB$ has entries:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

The naive algorithm computes each of the $n^2$ entries using $n$ multiplications, giving $O(n^3)$ total arithmetic operations.

Definition

Matrix Multiplication Exponent

The matrix multiplication exponent $\omega$ is the infimum of all $\alpha$ such that two $n \times n$ matrices can be multiplied using $O(n^\alpha)$ arithmetic operations. Formally:

$$\omega = \inf\{\alpha : \text{two } n \times n \text{ matrices can be multiplied in } O(n^\alpha) \text{ operations}\}$$

The naive algorithm gives $\omega \leq 3$. The question is how much lower $\omega$ can go.

The Naive Algorithm

The schoolbook method directly computes each $C_{ij}$ from the definition. It uses $n^3$ multiplications and $n^3 - n^2$ additions, for a total of $O(n^3)$ operations. This is exactly what BLAS libraries implement (with heavy optimization for cache and SIMD), and it remains the practical standard.
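As a reference point, the schoolbook loop can be written out directly and its multiplication count verified (a minimal NumPy sketch; `naive_matmul` is a name chosen here for illustration):

```python
import numpy as np

def naive_matmul(A, B):
    """Schoolbook matrix multiplication: computes each C[i, j] from the
    definition, counting the scalar multiplications along the way."""
    n = A.shape[0]
    C = np.zeros((n, n))
    mults = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
                mults += 1
    return C, mults

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
C, mults = naive_matmul(A, B)
assert np.allclose(C, A @ B)  # agrees with the BLAS-backed product
assert mults == 4 ** 3        # exactly n^3 scalar multiplications
```

The addition count is $n^3 - n^2$ because each of the $n^2$ accumulators starts at zero, so its first product is a store rather than an add.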

Strassen's Algorithm

In 1969, Volker Strassen showed that you do not need 8 multiplications to multiply two $2 \times 2$ matrices. Seven suffice.

Partition $A$ and $B$ into $2 \times 2$ blocks:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

Strassen defined 7 products $M_1, \ldots, M_7$ of specific linear combinations of these blocks, then recovered all four blocks of $C$ from additions and subtractions of the $M_i$.
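The seven products and the reconstruction can be checked at a single level of blocking (a sketch; the blocks are multiplied with NumPy's `@` rather than recursively, so this verifies only the algebraic identity):

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's scheme: 7 block products instead of 8.
    Assumes A and B are n x n with n even."""
    m = A.shape[0] // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

    # Strassen's seven products of block linear combinations.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recover the four blocks of C by additions and subtractions.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(strassen_one_level(A, B), A @ B)
```

Note the trade: one block multiplication saved, at the cost of 18 block additions/subtractions.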

Theorem

Strassen Algorithm Complexity

Statement

Two $n \times n$ matrices can be multiplied using $O(n^{\log_2 7})$ arithmetic operations, where $\log_2 7 \approx 2.807$.

Intuition

The key idea: multiplying two $2 \times 2$ block matrices normally requires 8 recursive multiplications. Strassen found a way to do it with 7, at the cost of more additions. Recursing on blocks of size $n/2$ gives the recurrence $T(n) = 7T(n/2) + O(n^2)$, which solves to $T(n) = O(n^{\log_2 7})$.

Proof Sketch

Write $T(n) = 7T(n/2) + cn^2$ for some constant $c$. By the Master theorem, since $7 > 2^2 = 4$, the solution is $T(n) = \Theta(n^{\log_2 7})$. The correctness of the 7 products follows by direct algebraic verification.
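The recurrence's solution can also be seen numerically (a small sketch; `T` simply evaluates the recurrence exactly for powers of two): the growth rate $\log_2(T(2n)/T(n))$ approaches $\log_2 7 \approx 2.807$.

```python
import math

def T(n, c=1.0):
    """Evaluate T(n) = 7*T(n/2) + c*n^2 for n a power of two, with T(1) = 1."""
    if n == 1:
        return 1.0
    return 7 * T(n // 2, c) + c * n * n

# The empirical growth exponent log2(T(2n)/T(n)) approaches log2(7).
for k in (8, 10, 12):
    n = 2 ** k
    print(n, round(math.log2(T(2 * n) / T(n)), 4))

print(round(math.log2(7), 4))  # 2.8074
```

The $cn^2$ term is subdominant, so doubling $n$ multiplies the cost by a factor approaching 7.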

Why It Matters

Strassen's result was the first proof that $\omega < 3$. It shattered the assumption that $O(n^3)$ was optimal and launched the entire field of algebraic complexity theory for matrix multiplication.

Failure Mode

Strassen's algorithm is numerically less stable than the naive algorithm because it involves subtractions of similar-magnitude quantities. The constant factor is also larger, so it only wins for matrices larger than roughly $n = 500$ to $1000$ (depending on implementation). In practice, GPU hardware is optimized for regular memory access patterns that favor the naive algorithm, and Strassen's irregular access pattern incurs significant overhead.
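Implementations that do use Strassen typically recurse only above a size cutoff and fall back to tuned GEMM below it. A minimal sketch of that structure (assuming $n$ is a power of two; the function name and cutoff value are illustrative choices, not a tuned implementation):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Recursive Strassen multiplication for n x n matrices, n a power of two.
    Falls back to the BLAS-backed '@' operator below the cutoff."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
err = np.max(np.abs(strassen(A, B) - A @ B))
print(err)  # forward error relative to the BLAS result
```

The cutoff is exactly where the constant-factor trade-off lives: the extra additions and temporaries only pay off once the saved multiplication dominates.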

Beyond Strassen

The race to lower $\omega$ continued after 1969:

| Year | Authors | Exponent |
|------|---------|----------|
| 1969 | Strassen | 2.807 |
| 1978 | Pan | 2.796 |
| 1981 | Coppersmith, Winograd | 2.496 |
| 1990 | Coppersmith, Winograd | 2.376 |
| 2012 | Vassilevska Williams | 2.3729 |
| 2024 | Vassilevska Williams, Xu, Xu, Zhou | 2.371552 |

All post-Strassen improvements use tensor rank methods and increasingly sophisticated algebraic techniques. The Coppersmith-Winograd approach and its extensions analyze the tensor structure of matrix multiplication and use the "laser method" to derive upper bounds.

The Lower Bound

Theorem

Lower Bound on Matrix Multiplication Exponent

Statement

Any algorithm that computes the product of two $n \times n$ matrices must perform $\Omega(n^2)$ operations. Therefore $\omega \geq 2$.

Intuition

The output matrix has $n^2$ entries, and each must be computed. You cannot produce $n^2$ outputs with fewer than $n^2$ operations.

Proof Sketch

This is an information-theoretic argument. The output $C$ has $n^2$ entries, and any algorithm must write each of them at least once, requiring $\Omega(n^2)$ work.

Why It Matters

This tells us $2 \leq \omega \leq 2.371552$. The gap between the lower bound and the best upper bound is the central open problem in algebraic complexity theory.

Failure Mode

The $\Omega(n^2)$ lower bound only counts the number of operations, not the depth of the computation or memory access costs. Real hardware constraints (cache hierarchy, memory bandwidth) may impose higher practical lower bounds even if $\omega = 2$ in the algebraic model.

The Open Question: Is $\omega = 2$?

Nobody knows. There is no proof that $\omega > 2$, and there is no algorithm achieving $O(n^2)$. The conjecture that $\omega = 2$ is widely discussed but far from settled. Arguments in favor: the upper bound has kept decreasing. Arguments against: all known improvements use the same family of techniques (the laser method), and there are known barrier results suggesting these techniques alone cannot reach $\omega = 2$.

If $\omega = 2$, then matrix multiplication would cost little more than matrix addition, up to lower-order ($n^{o(1)}$) factors. This would have profound implications for all of numerical linear algebra.

Practical Reality

Despite decades of theoretical progress, the algorithms that actually run on hardware are:

  1. Naive (GEMM): $O(n^3)$, but with extraordinary constant-factor optimization. cuBLAS achieves near-peak FLOPS on GPUs.
  2. Strassen (rare): used occasionally for very large dense matrices on CPUs, and almost never on GPUs due to irregular memory access.
  3. Coppersmith–Winograd-class algorithms: not implemented in any production system; the constants and numerical instability are prohibitive.

The gap between theory and practice exists because: (a) the $O(\cdot)$ notation hides enormous constants; (b) modern hardware is optimized for regular, predictable memory access; (c) numerical stability matters for real computations.

Connection to ML

Matrix multiplication cost directly determines:

  • Forward pass cost: each linear layer computes $Y = XW$ where $X \in \mathbb{R}^{b \times d_{\text{in}}}$ and $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$
  • Attention cost: computing $QK^T$ for sequence length $s$ costs $O(s^2 d)$
  • Backward pass cost: gradient computation requires transposed matrix multiplications of the same dimensions
  • Training cost: scales linearly with the number of matrix multiplications per step, times the number of steps

A reduction in $\omega$ from 3 to 2 would, in principle, reduce the cost of a forward pass from $O(d^3)$ to $O(d^2)$ per layer (for square $d \times d$ weight matrices). In practice, the wall-clock benefit would be smaller due to constants and hardware constraints.
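These costs can be made concrete with a couple of throwaway counting helpers (a sketch; the function names and sizes are illustrative, not taken from any framework):

```python
def linear_layer_flops(batch, d_in, d_out):
    """FLOPs for Y = X @ W with X of shape (batch, d_in), W of shape (d_in, d_out).
    Each of the batch * d_out outputs needs d_in multiplies and ~d_in adds."""
    return 2 * batch * d_in * d_out

def attention_score_flops(seq_len, d_head):
    """FLOPs for Q @ K^T with Q, K of shape (seq_len, d_head): ~2 * s^2 * d."""
    return 2 * seq_len * seq_len * d_head

# Illustrative sizes: a 4096 -> 4096 linear layer applied to one token,
# and attention scores for a 1024-token sequence with head dimension 64.
print(linear_layer_flops(1, 4096, 4096))  # 33554432
print(attention_score_flops(1024, 64))    # 134217728
```

The quadratic dependence on sequence length in the second helper is exactly why long-context attention is expensive.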

Common Confusions

Watch Out

Strassen is not used in practice for deep learning

You might expect that a faster algorithm would be adopted immediately. But GPU hardware (tensor cores, warp-level operations) is designed for dense, regular matrix multiplications. Strassen's irregular access pattern and additional memory requirements make it slower on real hardware for the matrix sizes typical in ML. The $O(n^{2.807})$ exponent wins asymptotically, but "asymptotically" here means matrices larger than those encountered in practice.

Watch Out

The exponent $\omega$ describes arithmetic operations, not wall-clock time

When we say $\omega \leq 2.371$, we mean the number of scalar multiplications and additions. Real execution time depends on memory bandwidth, cache behavior, parallelism, and numerical precision. An algorithm with exponent 2.5 but good cache behavior may outperform one with exponent 2.4 but terrible locality.

Key Takeaways

  • Naive matrix multiplication: $O(n^3)$. This is what runs in practice.
  • Strassen (1969): $O(n^{2.807})$. First proof that $\omega < 3$.
  • Current best upper bound: $\omega \leq 2.371552$ (Vassilevska Williams–Xu–Xu–Zhou, 2024).
  • Lower bound: $\omega \geq 2$ (information-theoretic).
  • Whether $\omega = 2$ is a major open problem in computer science.
  • Theory and practice diverge: the fastest theoretical algorithms are not the fastest in practice.

Exercises

ExerciseCore

Problem

Verify that the naive matrix multiplication algorithm for two $n \times n$ matrices uses exactly $n^3$ multiplications and $n^2(n-1)$ additions.

ExerciseCore

Problem

Strassen's algorithm reduces 8 multiplications to 7 for $2 \times 2$ blocks. Apply the Master theorem to the recurrence $T(n) = 7T(n/2) + O(n^2)$ to derive the exponent.

ExerciseAdvanced

Problem

Suppose someone discovers a way to multiply $3 \times 3$ block matrices using only 21 multiplications (instead of the naive 27). What exponent would this yield via recursive application?

References

Canonical:

  • Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS), Chapter 4.2 (Strassen)
  • Bürgisser, Clausen, Shokrollahi, Algebraic Complexity Theory, Chapters 14–15

Current:

  • Alman & Williams, "A Refined Laser Method and Faster Matrix Multiplication" (SODA 2021)
  • Vassilevska Williams, Xu, Xu & Zhou, "New Bounds for Matrix Multiplication: from Alpha to Omega" (SODA 2024)
  • Bläser, "Fast Matrix Multiplication" (2013), survey in Theory of Computing


Last reviewed: April 2026
