

Open Problems in Matrix Computation

The unsolved questions in numerical linear algebra: the true exponent of matrix multiplication, practical fast algorithms, sparse matrix multiplication, randomized methods, and why these matter for scaling ML.


Why This Matters

Matrix computation is the bottleneck of ML training and inference. The largest language models spend the vast majority of their FLOPs on matrix multiplications. Any theoretical breakthrough in matrix algorithms would have immediate practical implications, yet several fundamental questions remain open after decades of research.

This page catalogs the major open problems, states what is known, and explains why each matters for ML.

Problem 1: The True Exponent of Matrix Multiplication

Status: Open since 1969.

Definition

Matrix Multiplication Exponent

$\omega = \inf\{\alpha : \text{two } n \times n \text{ matrices can be multiplied in } O(n^\alpha) \text{ operations}\}$

Known bounds: $2 \leq \omega \leq 2.371552$.

The lower bound $\omega \geq 2$ is trivial (any algorithm must at least read the input and write the output). The upper bound comes from Vassilevska Williams, Xu, Xu, and Zhou (2024). The gap between 2 and roughly 2.372 remains.

What would resolve it: Either a proof that $\omega > 2$ (a super-quadratic lower bound for matrix multiplication) or an algorithm achieving $O(n^{2+\epsilon})$ for arbitrarily small $\epsilon > 0$.

Barriers: All current upper bound techniques use the "laser method" and its extensions. There is evidence (the "group-theoretic" barrier) suggesting that these methods alone cannot reach $\omega = 2$. New ideas are needed.

Why it matters for ML: If $\omega = 2$, then in principle, multiplying the $d \times d$ weight matrix of a linear layer against a batch of $d$ activation vectors could cost $O(d^2)$ instead of $O(d^3)$. For a transformer with hidden dimension $d = 12288$ (GPT-3 scale), the asymptotic gap between the two costs is a factor of $d = 12288$.
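As a rough sanity check on the size of this gap, the arithmetic is simple enough to compute directly (a back-of-envelope sketch; the factor of 2 counts one multiply and one add per output contribution, and the unit constant on the hypothetical quadratic algorithm is an assumption):

```python
# Back-of-envelope FLOP comparison for multiplying two d x d matrices.
d = 12288  # hidden dimension used in the text

flops_cubic = 2 * d**3      # naive O(d^3) matmul: d^3 multiply-add pairs
flops_quadratic = 2 * d**2  # hypothetical O(d^2) algorithm, unit constant

print(flops_cubic // flops_quadratic)  # asymptotic speedup factor equals d
```

The printed ratio is exactly $d$, which is why the hidden dimension itself measures the potential speedup.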

Problem 2: A Practical $O(n^{2+\epsilon})$ Algorithm

Status: Open. No known algorithm with exponent below Strassen's $\log_2 7 \approx 2.807$ is practical.

Proposition

The Practicality Barrier

Statement

All known matrix multiplication algorithms with exponent $\omega < 2.807$ have constants and memory access patterns that make them slower than the naive $O(n^3)$ algorithm for matrices of sizes encountered in practice (up to roughly $n = 10^5$).

Intuition

The theoretical algorithms achieve lower exponents by doing more additions and subtractions on cleverly chosen linear combinations of matrix entries. This requires irregular memory access. Modern hardware (GPUs, tensor cores) achieves peak throughput only for regular, predictable access patterns. The theoretical savings in arithmetic are overwhelmed by the practical cost of memory irregularity.

Proof Sketch

This is an empirical observation, not a mathematical theorem. Benchmarks consistently show that highly optimized GEMM implementations (using tiling, SIMD, tensor cores) outperform Strassen-based implementations for matrices up to $n \approx 10^4$ on current hardware.
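To make the tradeoff concrete, here is one level of Strassen's recursion in NumPy: 7 block multiplications instead of 8, paid for with 18 block additions/subtractions and the irregular data movement they imply. A minimal sketch assuming square matrices of even dimension, not a tuned implementation:

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen recursion: 7 block products (M1..M7)
    replace the 8 products of the naive block algorithm."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7   # C11
    C[:n, n:] = M3 + M5             # C12
    C[n:, :n] = M2 + M4             # C21
    C[n:, n:] = M1 - M2 + M3 + M6   # C22
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))
assert np.allclose(strassen_one_level(A, B), A @ B)
```

The extra additions are exactly the "cleverly chosen linear combinations" the intuition above refers to; each one is a full pass over an $n/2 \times n/2$ block of memory.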

Why It Matters

The disconnect between theoretical complexity and practical performance is one of the most important open issues in numerical computing. Closing this gap would require either hardware that supports irregular access patterns efficiently or algorithms that achieve low exponents with regular access patterns.

Failure Mode

This "barrier" is hardware-dependent and may not persist as architectures evolve. Some research explores hardware co-design where the algorithm and chip are designed together. Also, for specific structured matrices (sparse, low-rank, Toeplitz), specialized algorithms already beat naive $O(n^3)$.
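As one concrete instance of the last point: a circulant matrix (a special Toeplitz structure) can be applied to a vector in $O(n \log n)$ via the FFT, rather than the $O(n^2)$ of a dense matrix-vector product. A small NumPy sketch:

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by the
    vector x, in O(n log n) via the convolution theorem, without ever
    forming the dense n x n matrix."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# check against the explicit dense circulant matrix
n = 64
rng = np.random.default_rng(1)
c = rng.standard_normal(n)
x = rng.standard_normal(n)
C = np.array([np.roll(c, k) for k in range(n)]).T  # column j = roll(c, j)
assert np.allclose(circulant_matvec(c, x), C @ x)
```

The same FFT trick extends to general Toeplitz matrices by embedding them in a circulant of twice the size.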

Problem 3: Sparse Matrix Multiplication

Given sparse matrices $A$ and $B$ with $\text{nnz}(A)$ and $\text{nnz}(B)$ nonzero entries respectively, can we multiply them faster?

Known results:

  • If the output $C = AB$ has $\text{nnz}(C)$ nonzeros, we need at least $\Omega(\text{nnz}(A) + \text{nnz}(B) + \text{nnz}(C))$ time
  • Current algorithms are far from this bound for general sparsity patterns
  • For specific sparsity structures (banded, block-sparse), specialized algorithms exist

Why it matters for ML: Sparse attention, sparse weight matrices (structured pruning), and graph neural networks all involve sparse matrix operations. Faster sparse matrix multiplication would directly speed up these methods.

Problem 4: Optimal Eigenvalue Computation

Status: The optimal complexity of computing all eigenvalues of an $n \times n$ matrix is not known precisely.

Known results:

  • The QR algorithm computes all eigenvalues in $O(n^3)$ operations ($O(n^2)$ per iteration, with typically $O(n)$ iterations needed)
  • For the top $k$ eigenvalues, Krylov methods (Lanczos, Arnoldi) cost $O(k \cdot \text{nnz} \cdot \text{iterations})$, where $\text{nnz}$ is the number of nonzeros
  • Can we compute all eigenvalues in $O(n^\omega)$? This is open.

Connection to ML: PCA requires the top $k$ eigenpairs of $X^T X$. Spectral clustering requires eigenvalues of the graph Laplacian. The cost of these operations limits the scalability of spectral methods.
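The simplest method in the matrix-vector-product family that Krylov methods refine is power iteration. A minimal sketch (the helper name `top_eigenpair` and the iteration count are illustrative choices, not a library API):

```python
import numpy as np

def top_eigenpair(S, iters=500):
    """Power iteration: estimate the largest eigenvalue and eigenvector
    of a symmetric PSD matrix S using only matrix-vector products,
    the basic building block that Lanczos/Arnoldi accelerate."""
    v = np.random.default_rng(0).standard_normal(S.shape[0])
    for _ in range(iters):
        v = S @ v
        v /= np.linalg.norm(v)
    return v @ S @ v, v  # Rayleigh quotient, unit eigenvector estimate

# top principal direction of a small data matrix X
X = np.random.default_rng(1).standard_normal((200, 30))
S = X.T @ X
lam, v = top_eigenpair(S)
assert np.isclose(lam, np.linalg.eigvalsh(S)[-1])
```

Each step costs one multiply by $S$, which is $O(\text{nnz})$ for a sparse matrix; convergence speed depends on the gap between the top two eigenvalues.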

Problem 5: Randomized Numerical Linear Algebra

Can we get "good enough" answers with random projections?

Theorem

Randomized SVD Approximation

Statement

The randomized SVD algorithm (Halko, Martinsson, Tropp, 2011) produces a rank-$k$ approximation $\tilde{A}$ satisfying:

$\mathbb{E}\left[\|A - \tilde{A}\|_F\right] \leq \left(1 + \frac{k}{p-1}\right)^{1/2} \left(\sum_{j>k} \sigma_j^2\right)^{1/2}$

in $O(mn \log k + (m+n)k^2)$ time, where $p$ is the oversampling parameter and $\sigma_j$ are the singular values of $A$.

Intuition

Project the matrix onto a random low-dimensional subspace, then compute an exact SVD in the reduced space. The random projection preserves the top singular directions with high probability if the oversampling parameter $p$ is large enough (typically $p = 5$ or $10$ suffices).
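The intuition above fits in a few lines of NumPy. A minimal, untuned sketch of the HMT procedure (the function name and defaults are illustrative; the test matrix is made exactly low-rank so the check is clean):

```python
import numpy as np

def randomized_svd(A, k, p=10, rng=None):
    """Rank-k truncated SVD via random projection, QR, and an exact
    SVD in the small (k+p)-dimensional subspace."""
    rng = rng or np.random.default_rng(0)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + p))    # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)             # orthonormal range basis
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
# exactly rank-15 matrix, so a (k+p)=20-dim sketch captures its range
A = rng.standard_normal((500, 15)) @ rng.standard_normal((15, 80))
U, s, Vt = randomized_svd(A, k=10)
s_exact = np.linalg.svd(A, compute_uv=False)
assert np.allclose(s, s_exact[:10])
```

The two passes over $A$ (forming $A\Omega$, then $Q^T A$) are exactly the two-pass structure the failure-mode note below discusses.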

Proof Sketch

The key step is showing that a random Gaussian matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$ yields a subspace $\text{range}(A\Omega)$ that captures most of the energy in the top $k$ singular directions. This follows from concentration inequalities for random projections.

Why It Matters

For $m, n \gg k$ (which is the typical case in ML), randomized SVD is dramatically faster than deterministic $O(mn \min(m,n))$ SVD. It makes PCA on million-dimensional data feasible.

Failure Mode

The guarantee is in expectation and involves the tail singular values. If the spectrum does not decay (all singular values are similar), the approximation quality degrades. Also, the algorithm requires two passes over the data (or one pass with a slight accuracy loss), which matters for out-of-core computation.

Open questions in randomized NLA:

  • What are the optimal tradeoffs between number of random samples and approximation quality?
  • Can single-pass algorithms match multi-pass accuracy?
  • How robust are randomized methods to adversarial inputs?

Problem 6: Communication Complexity of Matrix Operations

Modern hardware is limited not by arithmetic (FLOPS are cheap) but by data movement (memory bandwidth is expensive). The communication complexity of matrix multiplication (how much data must be moved between levels of memory hierarchy) is partially understood.

Known: For dense matrix multiplication, the communication lower bound is $\Omega(n^3 / \sqrt{M})$ words moved for cache size $M$ (Hong-Kung, 1981). GEMM implementations achieve this bound.

Open: What are the communication-optimal algorithms for structured matrix operations (sparse, banded, low-rank)?

Why it matters: On modern GPUs, arithmetic throughput exceeds memory bandwidth by 10x or more. Algorithms that minimize data movement, even at the cost of extra arithmetic, can be faster in practice.
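The mechanism behind communication-optimal GEMM is tiling: each block is loaded into fast memory once and reused across many arithmetic operations. A pure-Python sketch for square matrices whose size divides the tile size (for illustration only, not a performant kernel):

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Cache-blocked (tiled) matrix multiply: each tile x tile block
    is reused across a whole panel of output while resident in fast
    memory, which is how GEMM approaches the Hong-Kung lower bound."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
assert np.allclose(blocked_matmul(A, B), A @ B)
```

The arithmetic count is unchanged from the naive algorithm; only the order of operations (and hence the data movement) differs, which is the whole point.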

The Bigger Picture

These problems share a common theme: the gap between what is theoretically possible and what is practically achievable in matrix computation. For ML, the practical algorithms (dense GEMM, cuBLAS, FlashAttention) already squeeze most of the performance from hardware. But the theoretical questions remain:

  • Can we do better in principle?
  • Can theoretical improvements be made practical?
  • What are the fundamental limits?

Common Confusions

Watch Out

Fast matrix multiplication would not automatically speed up all of ML

Even if $\omega = 2$, the constant factor matters. An $O(n^2)$ algorithm with constant $10^6$ is slower than an $O(n^3)$ algorithm with constant $1$ for all $n < 10^6$. The practical benefit depends on whether the fast algorithm can be made cache-friendly and parallelizable.
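The crossover arithmetic is worth checking directly (the constant $10^6$ is the illustrative value from the paragraph above, not a measured figure):

```python
# Hypothetical fast algorithm costing c_fast * n^2 operations versus
# a naive algorithm costing 1 * n^3. The fast one wins only when
# n^3 > c_fast * n^2, i.e. n > c_fast.
c_fast = 10**6

for n in (10**5, 10**6, 10**7):
    naive, fast = n**3, c_fast * n**2
    print(n, "fast wins" if fast < naive else "naive wins")
```

With these constants the crossover sits at $n = 10^6$, far beyond the matrix sizes in any current ML workload.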

Watch Out

Randomized NLA is approximate, not heuristic

Randomized SVD and randomized matrix multiplication have formal guarantees. They are not "just random guessing." The approximation error is bounded with high probability, and the bounds are often near-optimal.

Key Takeaways

  • $\omega \in [2, 2.371552]$: the true exponent of matrix multiplication is unknown
  • No sub-Strassen algorithm is practical on current hardware
  • Sparse matrix multiplication lacks tight complexity bounds for general sparsity
  • Optimal eigenvalue computation complexity is open
  • Randomized methods provide near-optimal approximations in reduced time
  • Communication complexity (data movement) often matters more than arithmetic complexity

Exercises

Exercise (Advanced)

Problem

Suppose you have a matrix $A \in \mathbb{R}^{10^6 \times 10^3}$ and want to compute the top 10 principal components. Compare the cost of: (a) exact SVD via $A^T A$, (b) randomized SVD with oversampling $p = 10$.

Exercise (Research)

Problem

Explain why finding that $\omega = 2$ would not immediately make training large language models 1000x faster, even though it would reduce the asymptotic complexity of matrix multiplication from $O(n^3)$ to $O(n^2)$.

References

Canonical:

  • Golub & Van Loan, Matrix Computations (2013), comprehensive treatment of algorithms
  • Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011), SIAM Review

Current:

  • Woodruff, "Sketching as a Tool for Numerical Linear Algebra" (2014)
  • Martinsson & Tropp, "Randomized Numerical Linear Algebra" (2020), Acta Numerica


Last reviewed: April 2026
