
Algorithms Foundations

Matrix Multiplication Algorithms

From the naive $O(n^3)$ algorithm, to Strassen's $O(n^{2.807})$, to the open question of the true exponent $\omega$. What we know, what we do not, and why it matters for ML.


Why This Matters

Every forward pass through a neural network is a sequence of matrix multiplications. Every backward pass is another sequence. Attention in transformers computes $QK^T$ and then multiplies by $V$. Training a model on $n$ examples with $d$ features requires multiplying matrices of size $n \times d$ and $d \times k$ repeatedly.

The cost of matrix multiplication dominates the compute budget of modern ML. Any improvement to the exponent of matrix multiplication would propagate through every layer of every model.

Formal Setup

Definition

Matrix Multiplication

Given $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times n}$, the product $C = AB$ has entries:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

The naive algorithm computes each of the $n^2$ entries using $n$ multiplications, giving $O(n^3)$ total arithmetic operations.

Definition

Matrix Multiplication Exponent

The matrix multiplication exponent $\omega$ is the infimum of all $\alpha$ such that two $n \times n$ matrices can be multiplied using $O(n^\alpha)$ arithmetic operations. Formally:

$$\omega = \inf\{\alpha : \text{two } n \times n \text{ matrices can be multiplied in } O(n^\alpha) \text{ operations}\}$$

The naive algorithm gives $\omega \leq 3$. The question is how much lower $\omega$ can go.

The Naive Algorithm

The schoolbook method directly computes each $C_{ij}$ from the definition. It uses $n^3$ multiplications and $n^3 - n^2$ additions, for a total of $O(n^3)$ operations. This is exactly what BLAS libraries implement (with heavy optimization for cache and SIMD), and it remains the practical standard.
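As a reference point, the schoolbook loop can be written out directly and its multiplication count verified (a minimal NumPy sketch; `naive_matmul` is a name chosen here for illustration):

```python
import numpy as np

def naive_matmul(A, B):
    """Schoolbook matrix multiplication: computes each C[i, j] from the
    definition, counting the scalar multiplications along the way."""
    n = A.shape[0]
    C = np.zeros((n, n))
    mults = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
                mults += 1
    return C, mults

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
C, mults = naive_matmul(A, B)
assert np.allclose(C, A @ B)  # agrees with the BLAS-backed product
assert mults == 4 ** 3        # exactly n^3 scalar multiplications
```

The addition count is $n^3 - n^2$ because each of the $n^2$ accumulators starts at zero, so its first product is a store rather than an add.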

Strassen's Algorithm

In 1969, Volker Strassen showed that you do not need 8 multiplications to multiply two $2 \times 2$ matrices. Seven suffice.

Partition $A$ and $B$ into $2 \times 2$ blocks:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

Strassen defined 7 products $M_1, \ldots, M_7$ of specific linear combinations of these blocks, then recovered all four blocks of $C$ from additions and subtractions of the $M_i$.
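The seven products and the reconstruction can be checked at a single level of blocking (a sketch; the blocks are multiplied with NumPy's `@` rather than recursively, so this verifies only the algebraic identity):

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's scheme: 7 block products instead of 8.
    Assumes A and B are n x n with n even."""
    m = A.shape[0] // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

    # Strassen's seven products of block linear combinations.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recover the four blocks of C by additions and subtractions.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(strassen_one_level(A, B), A @ B)
```

Note the trade: one block multiplication saved, at the cost of 18 block additions/subtractions.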

Theorem

Strassen Algorithm Complexity

Statement

Two $n \times n$ matrices can be multiplied using $O(n^{\log_2 7})$ arithmetic operations, where $\log_2 7 \approx 2.807$.

Intuition

The key idea: multiplying two $2 \times 2$ block matrices normally requires 8 recursive multiplications. Strassen found a way to do it with 7, at the cost of more additions. Recursing on blocks of size $n/2$ gives the recurrence $T(n) = 7T(n/2) + O(n^2)$, which solves to $T(n) = O(n^{\log_2 7})$.

Proof Sketch

Write $T(n) = 7T(n/2) + cn^2$ for some constant $c$. By the Master theorem, since $7 > 2^2 = 4$, the solution is $T(n) = \Theta(n^{\log_2 7})$. The correctness of the 7 products follows by direct algebraic verification.
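The recurrence's solution can also be seen numerically (a small sketch; `T` simply evaluates the recurrence exactly for powers of two): the growth rate $\log_2(T(2n)/T(n))$ approaches $\log_2 7 \approx 2.807$.

```python
import math

def T(n, c=1.0):
    """Evaluate T(n) = 7*T(n/2) + c*n^2 for n a power of two, with T(1) = 1."""
    if n == 1:
        return 1.0
    return 7 * T(n // 2, c) + c * n * n

# The empirical growth exponent log2(T(2n)/T(n)) approaches log2(7).
for k in (8, 10, 12):
    n = 2 ** k
    print(n, round(math.log2(T(2 * n) / T(n)), 4))

print(round(math.log2(7), 4))  # 2.8074
```

The $cn^2$ term is subdominant, so doubling $n$ multiplies the cost by a factor approaching 7.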

Why It Matters

Strassen's result was the first proof that $\omega < 3$. It shattered the assumption that $O(n^3)$ was optimal and launched the entire field of algebraic complexity theory for matrix multiplication.

Failure Mode

Strassen's algorithm is numerically less stable than the naive algorithm because it involves subtractions of similar-magnitude quantities. The constant factor is also larger, so it only wins for matrices larger than roughly $n = 500$ to $1000$ (depending on implementation). In practice, GPU hardware is optimized for regular memory access patterns that favor the naive algorithm, and Strassen's irregular access pattern incurs significant overhead.
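Implementations that do use Strassen typically recurse only above a size cutoff and fall back to tuned GEMM below it. A minimal sketch of that structure (assuming $n$ is a power of two; the function name and cutoff value are illustrative choices, not a tuned implementation):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Recursive Strassen multiplication for n x n matrices, n a power of two.
    Falls back to the BLAS-backed '@' operator below the cutoff."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
err = np.max(np.abs(strassen(A, B) - A @ B))
print(err)  # forward error relative to the BLAS result
```

The cutoff is exactly where the constant-factor trade-off lives: the extra additions and temporaries only pay off once the saved multiplication dominates.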

Beyond Strassen

The race to lower $\omega$ continued after 1969:

| Year | Authors | Exponent |
|------|---------|----------|
| 1969 | Strassen | 2.807 |
| 1978 | Pan | 2.796 |
| 1981 | Coppersmith, Winograd | 2.496 |
| 1990 | Coppersmith, Winograd | 2.376 |
| 2012 | Vassilevska Williams | 2.3729 |
| 2024 | Vassilevska Williams, Xu, Xu, Zhou | 2.371552 |

All post-Strassen improvements use tensor rank methods and increasingly sophisticated algebraic techniques. The Coppersmith-Winograd approach and its extensions analyze the tensor structure of matrix multiplication and use the "laser method" to derive upper bounds.

The Lower Bound

Theorem

Lower Bound on Matrix Multiplication Exponent

Statement

Any algorithm that computes the product of two $n \times n$ matrices must perform $\Omega(n^2)$ operations. Therefore $\omega \geq 2$.

Intuition

The output matrix has $n^2$ entries, and each must be computed. You cannot produce $n^2$ outputs with fewer than $n^2$ operations.

Proof Sketch

This is an information-theoretic argument. The output $C$ has $n^2$ entries, and any algorithm must write each of them at least once, requiring $\Omega(n^2)$ work.

Why It Matters

This tells us $2 \leq \omega \leq 2.371552$. The gap between the lower bound and the best upper bound is the central open problem in algebraic complexity theory.

Failure Mode

The $\Omega(n^2)$ lower bound only counts the number of operations, not the depth of the computation or memory access costs. Real hardware constraints (cache hierarchy, memory bandwidth) may impose higher practical lower bounds even if $\omega = 2$ in the algebraic model.

The Open Question: Is $\omega = 2$?

Nobody knows. There is no proof that $\omega > 2$, and there is no algorithm achieving $O(n^2)$. The conjecture that $\omega = 2$ is widely discussed but far from settled. Arguments in favor: the upper bound has kept decreasing. Arguments against: all known improvements use the same family of techniques (the laser method), and there are known barrier results suggesting these techniques alone cannot reach $\omega = 2$.

If $\omega = 2$, then matrix multiplication would cost little more than matrix addition, up to lower-order ($n^{o(1)}$) factors. This would have profound implications for all of numerical linear algebra.

Practical Reality

Despite decades of theoretical progress, the algorithms that actually run on hardware are:

  1. Naive (GEMM): $O(n^3)$, but with extraordinary constant-factor optimization. cuBLAS achieves near-peak FLOPS on GPUs.
  2. Strassen (rare): used occasionally for very large dense matrices on CPUs, and almost never on GPUs due to irregular memory access.
  3. Coppersmith–Winograd-class algorithms: not implemented in any production system; the constants and numerical instability are prohibitive.

The gap between theory and practice exists because: (a) the $O(\cdot)$ notation hides enormous constants; (b) modern hardware is optimized for regular, predictable memory access; (c) numerical stability matters for real computations.

Connection to ML

Matrix multiplication cost directly determines:

  • Forward pass cost: each linear layer computes $Y = XW$ where $X \in \mathbb{R}^{b \times d_{\text{in}}}$ and $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$
  • Attention cost: computing $QK^T$ for sequence length $s$ costs $O(s^2 d)$
  • Backward pass cost: gradient computation requires transposed matrix multiplications of the same dimensions
  • Training cost: scales linearly with the number of matrix multiplications per step, times the number of steps

A reduction in $\omega$ from 3 to 2 would, in principle, reduce the cost of a forward pass from $O(d^3)$ to $O(d^2)$ per layer (for square $d \times d$ weight matrices). In practice, the wall-clock benefit would be smaller due to constants and hardware constraints.
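These costs can be made concrete with a couple of throwaway counting helpers (a sketch; the function names and sizes are illustrative, not taken from any framework):

```python
def linear_layer_flops(batch, d_in, d_out):
    """FLOPs for Y = X @ W with X of shape (batch, d_in), W of shape (d_in, d_out).
    Each of the batch * d_out outputs needs d_in multiplies and ~d_in adds."""
    return 2 * batch * d_in * d_out

def attention_score_flops(seq_len, d_head):
    """FLOPs for Q @ K^T with Q, K of shape (seq_len, d_head): ~2 * s^2 * d."""
    return 2 * seq_len * seq_len * d_head

# Illustrative sizes: a 4096 -> 4096 linear layer applied to one token,
# and attention scores for a 1024-token sequence with head dimension 64.
print(linear_layer_flops(1, 4096, 4096))  # 33554432
print(attention_score_flops(1024, 64))    # 134217728
```

The quadratic dependence on sequence length in the second helper is exactly why long-context attention is expensive.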

Common Confusions

Watch Out

Strassen is not used in practice for deep learning

You might expect that a faster algorithm would be adopted immediately. But GPU hardware (tensor cores, warp-level operations) is designed for dense, regular matrix multiplications. Strassen's irregular access pattern and additional memory requirements make it slower on real hardware for the matrix sizes typical in ML. The $O(n^{2.807})$ exponent wins asymptotically, but "asymptotically" here means matrices larger than those encountered in practice.

Watch Out

The exponent $\omega$ describes arithmetic operations, not wall-clock time

When we say $\omega \leq 2.371$, we mean the number of scalar multiplications and additions. Real execution time depends on memory bandwidth, cache behavior, parallelism, and numerical precision. An algorithm with exponent 2.5 but good cache behavior may outperform one with exponent 2.4 but terrible locality.

Key Takeaways

  • Naive matrix multiplication: $O(n^3)$. This is what runs in practice.
  • Strassen (1969): $O(n^{2.807})$. First proof that $\omega < 3$.
  • Current best upper bound: $\omega \leq 2.371552$ (Vassilevska Williams–Xu–Xu–Zhou, 2024).
  • Lower bound: $\omega \geq 2$ (information-theoretic).
  • Whether $\omega = 2$ is a major open problem in computer science.
  • Theory and practice diverge: the fastest theoretical algorithms are not the fastest in practice.

Exercises

ExerciseCore

Problem

Verify that the naive matrix multiplication algorithm for two $n \times n$ matrices uses exactly $n^3$ multiplications and $n^2(n-1)$ additions.

ExerciseCore

Problem

Strassen's algorithm reduces 8 multiplications to 7 for $2 \times 2$ blocks. Apply the Master theorem to the recurrence $T(n) = 7T(n/2) + O(n^2)$ to derive the exponent.

ExerciseAdvanced

Problem

Suppose someone discovers a way to multiply $3 \times 3$ block matrices using only 21 multiplications (instead of the naive 27). What exponent would this yield via recursive application?

References

Canonical:

  • Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS), Chapter 4.2 (Strassen)
  • Bürgisser, Clausen, Shokrollahi, Algebraic Complexity Theory, Chapters 14–15

Current:

  • Alman & Williams, "A Refined Laser Method and Faster Matrix Multiplication" (SODA 2021)
  • Vassilevska Williams, Xu, Xu & Zhou, "New Bounds for Matrix Multiplication: from Alpha to Omega" (SODA 2024)
  • Bläser, "Fast Matrix Multiplication" (2013), survey in Theory of Computing


Last reviewed: April 2026
