Algorithms Foundations
Open Problems in Matrix Computation
The unsolved questions in numerical linear algebra: the true exponent of matrix multiplication, practical fast algorithms, sparse matrix multiplication, randomized methods, and why these matter for scaling ML.
Why This Matters
Matrix computation is the bottleneck of ML training and inference. The largest language models spend the vast majority of their FLOPs on matrix multiplications. Any theoretical breakthrough in matrix algorithms would have immediate practical implications, yet several fundamental questions remain open after decades of research.
This page catalogs the major open problems, states what is known, and explains why each matters for ML.
Problem 1: The True Exponent of Matrix Multiplication
Status: Open since 1969.
Matrix Multiplication Exponent
The matrix multiplication exponent is defined as

$$\omega = \inf\{c : \text{two } n \times n \text{ matrices can be multiplied in } O(n^c) \text{ arithmetic operations}\}.$$

Known bounds: $2 \le \omega < 2.3714$.
The lower bound is trivial (you must at least read the output). The upper bound comes from Alman and Williams (2024). The gap between 2 and 2.371 remains.
What would resolve it: Either a proof that $\omega > 2$ (a super-quadratic lower bound for matrix multiplication) or an algorithm achieving $O(n^{2+\epsilon})$ for arbitrarily small $\epsilon > 0$.
Barriers: All current upper bound techniques use the "laser method" and its extensions. There is evidence (the "group-theoretic" barrier) suggesting that these methods alone cannot reach $\omega = 2$. New ideas are needed.
Why it matters for ML: If $\omega = 2$, then in principle, the forward pass of a linear layer with input and output dimension $d$ (applied to a batch of $d$ inputs) could cost $O(d^{2+\epsilon})$ instead of $O(d^3)$. For a transformer with hidden dimension on the order of $d \approx 10^4$ (GPT-4 scale), even the current gap between exponents $3$ and $2.371$ corresponds to a factor of $d^{0.629} \approx 300$ in idealized operation counts.
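To get a feel for the stakes, here is a purely arithmetic comparison of idealized operation counts. The dimension $d = 10^4$ is a hypothetical stand-in for a large hidden size, and constant factors are deliberately ignored:

```python
# Illustrative only: compare hypothetical FLOP counts for multiplying two
# d x d matrices under different matrix-multiplication exponents.
# Constant factors (which dominate in practice) are ignored.

def flops(d: int, omega: float) -> float:
    """Idealized operation count d**omega, ignoring constants."""
    return float(d) ** omega

d = 10_000
naive = flops(d, 3.0)         # schoolbook algorithm
best_known = flops(d, 2.371)  # current best-known upper bound on omega
dream = flops(d, 2.0)         # if omega = 2 were achievable

print(f"naive / best-known exponent: {naive / best_known:.0f}x fewer ops")
print(f"naive / omega = 2:           {naive / dream:.0f}x fewer ops")
```

The gap in exponent compounds with dimension, which is why the question matters more at scale.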
Problem 2: A Practical Fast Algorithm
Status: Open. No known algorithm with exponent below Strassen's $\log_2 7 \approx 2.807$ is practical.
The Practicality Barrier
Statement
All known matrix multiplication algorithms with exponent below Strassen's $2.807$ have constant factors and memory access patterns that make them slower than the naive algorithm for matrices of the sizes encountered in practice.
Intuition
The theoretical algorithms achieve lower exponents by doing more additions and subtractions on cleverly chosen linear combinations of matrix entries. This requires irregular memory access. Modern hardware (GPUs, tensor cores) achieves peak throughput only for regular, predictable access patterns. The theoretical savings in arithmetic are overwhelmed by the practical cost of memory irregularity.
Proof Sketch
This is an empirical observation, not a mathematical theorem. Benchmarks consistently show that highly optimized GEMM implementations (using tiling, SIMD, tensor cores) outperform implementations based on asymptotically faster algorithms at practically relevant matrix sizes on current hardware.
Why It Matters
The disconnect between theoretical complexity and practical performance is one of the most important open issues in numerical computing. Closing this gap would require either hardware that supports irregular access patterns efficiently or algorithms that achieve low exponents with regular access patterns.
Failure Mode
This "barrier" is hardware-dependent and may not persist as architectures evolve. Some research explores hardware co-design where the algorithm and chip are designed together. Also, for specific structured matrices (sparse, low-rank, Toeplitz), specialized algorithms already beat the naive $O(n^3)$ algorithm.
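For reference, Strassen's recursion itself fits in a few lines. Below is a minimal, unoptimized sketch (assuming $n$ is a power of two, with a fallback to the naive product below a cutoff) that makes the "7 multiplications, many extra additions" tradeoff concrete:

```python
# Textbook Strassen recursion: 7 recursive products instead of 8, at the
# cost of 18 block additions/subtractions per level. A sketch, not an
# optimized implementation; requires n to be a power of two.
import numpy as np

def strassen(A: np.ndarray, B: np.ndarray, cutoff: int = 64) -> np.ndarray:
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # base case: regular access, hardware-friendly
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
assert np.allclose(strassen(A, B), A @ B)
```

Note how the seven products operate on sums and differences of blocks: those temporaries are exactly the irregular, memory-hungry traffic that tuned GEMM kernels avoid.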
Problem 3: Sparse Matrix Multiplication
Given sparse $n \times n$ matrices $A$ and $B$ with $\mathrm{nnz}(A)$ and $\mathrm{nnz}(B)$ nonzero entries respectively, can we multiply them faster than dense algorithms allow?
Known results:
- If the output has $z$ nonzero entries, we need at least $\Omega(z)$ time
- Current algorithms are far from this bound for general sparsity patterns
- For specific sparsity structures (banded, block-sparse), specialized algorithms exist
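As a concrete baseline, the classical row-by-row (Gustavson-style) approach does work proportional to the number of nonzero scalar products, not $n^3$; a minimal sketch using dict-of-dicts storage (a simplification of CSR, chosen for readability):

```python
# Gustavson-style row-by-row sparse multiplication over {row: {col: value}}
# dicts. Work is proportional to the number of nonzero scalar products,
# which can still far exceed the nonzeros in the final output -- that slack
# is where the output-sensitivity gap lives.

def sparse_matmul(A: dict, B: dict) -> dict:
    """C = A @ B for dict-of-dicts sparse matrices."""
    C: dict = {}
    for i, row_a in A.items():
        acc: dict = {}
        for k, a_ik in row_a.items():           # nonzeros of row i of A
            for j, b_kj in B.get(k, {}).items():  # nonzeros of row k of B
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        row_c = {j: v for j, v in acc.items() if v != 0.0}  # drop cancellations
        if row_c:
            C[i] = row_c
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: -8.0}}
print(sparse_matmul(A, B))  # → {0: {1: -12.0}, 1: {0: 15.0}}
```

The open question is whether one can systematically beat this flop count for general sparsity patterns, especially when the output is much sparser than the intermediate products suggest.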
Why it matters for ML: Sparse attention, sparse weight matrices (structured pruning), and graph neural networks all involve sparse matrix operations. Faster sparse matrix multiplication would directly speed up these methods.
Problem 4: Optimal Eigenvalue Computation
Status: The optimal complexity of computing all eigenvalues of an $n \times n$ matrix is not known precisely.
Known results:
- The QR algorithm computes all eigenvalues in $O(n^3)$ operations: after an $O(n^3)$ reduction to Hessenberg form, each iteration costs $O(n^2)$, with $O(n)$ iterations typically needed
- For the top $k$ eigenvalues, Krylov methods (Lanczos, Arnoldi) cost $O(k \cdot \mathrm{nnz}(A))$, where $\mathrm{nnz}(A)$ is the number of nonzero entries
- Can we compute all eigenvalues in $o(n^3)$ time, e.g., in matrix-multiplication time $O(n^\omega)$? This is open.
Connection to ML: PCA requires the top $k$ eigenvalues and eigenvectors of the covariance matrix $X^\top X$. Spectral clustering requires the smallest eigenvalues of the graph Laplacian. The cost of these operations limits the scalability of spectral methods.
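The per-iteration cost structure of Krylov-type methods is easy to see even in plain power iteration, their simplest relative. A minimal sketch: each step is a single matrix-vector product, which is $O(\mathrm{nnz}(A))$ for a sparse matrix:

```python
# Power iteration for the top eigenpair of a symmetric matrix: each step
# costs one matrix-vector product (O(nnz) when A is sparse), which is why
# iterative methods scale where O(n^3) dense eigensolvers cannot.
import numpy as np

def top_eigenpair(A: np.ndarray, iters: int = 500, seed: int = 0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v                  # the only O(nnz) operation per step
        v = w / np.linalg.norm(w)
    return v @ A @ v, v            # Rayleigh quotient, eigenvector estimate

A = np.diag([5.0, 2.0, 1.0])
lam, v = top_eigenpair(A)
print(round(lam, 6))  # → 5.0
```

Lanczos and Arnoldi improve on this by reusing all previous matrix-vector products to build a Krylov subspace, converging far faster for clustered spectra.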
Problem 5: Randomized Numerical Linear Algebra
Can we get "good enough" answers with random projections?
Randomized SVD Approximation
Statement
The randomized SVD algorithm (Halko, Martinsson, Tropp, 2011) produces a rank-$k$ approximation $QQ^\top A$ (with $Q$ an $m \times (k+p)$ orthonormal basis for the sampled range) satisfying:

$$\mathbb{E}\,\|A - QQ^\top A\|_F \;\le\; \left(1 + \frac{k}{p-1}\right)^{1/2} \left(\sum_{j > k} \sigma_j^2\right)^{1/2}$$

in $O(mnk)$ time for an $m \times n$ matrix $A$, where $p$ is the oversampling parameter and $\sigma_1 \ge \sigma_2 \ge \cdots$ are the singular values of $A$.
Intuition
Project the matrix onto a random low-dimensional subspace, then compute an exact SVD in the reduced space. The random projection preserves the top singular directions with high probability if the oversampling parameter is large enough (typically $p = 5$ or $p = 10$ suffices).
Proof Sketch
The key step is showing that a random Gaussian matrix yields a subspace that captures most of the energy in the top singular directions. This follows from concentration inequalities for random projections.
Why It Matters
For $k \ll \min(m, n)$ (which is the typical case in ML), randomized SVD is dramatically faster than a full deterministic SVD, which costs $O(mn \min(m, n))$. It makes PCA on million-dimensional data feasible.
Failure Mode
The guarantee is in expectation and involves the tail singular values. If the spectrum does not decay (all singular values are similar), the approximation quality degrades. Also, the algorithm requires two passes over the data (or one pass with a slight accuracy loss), which matters for out-of-core computation.
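The recipe above (random range finding, then a small exact SVD) can be sketched in a few lines of numpy. This is a bare-bones version without the power iterations or structured test matrices used in production implementations:

```python
# Minimal randomized SVD in the Halko-Martinsson-Tropp style:
# (1) sample the range of A with a Gaussian test matrix,
# (2) orthonormalize, (3) exact SVD of the small projected matrix.
import numpy as np

def randomized_svd(A: np.ndarray, k: int, p: int = 10, seed: int = 0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal range basis
    B = Q.T @ A                               # small (k+p) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Sanity check on a matrix of exact rank 3: the rank-3 recovery is exact.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 100))
U, s, Vt = randomized_svd(A, k=3)
print(np.allclose(U * s @ Vt, A))  # → True
```

On a matrix whose spectrum decays slowly, the same code runs but the reconstruction error tracks the tail singular values, exactly as the failure mode above describes.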
Open questions in randomized NLA:
- What are the optimal tradeoffs between number of random samples and approximation quality?
- Can single-pass algorithms match multi-pass accuracy?
- How robust are randomized methods to adversarial inputs?
Problem 6: Communication Complexity of Matrix Operations
Modern hardware is limited not by arithmetic (FLOPS are cheap) but by data movement (memory bandwidth is expensive). The communication complexity of matrix multiplication (how much data must be moved between levels of memory hierarchy) is partially understood.
Known: For dense matrix multiplication, the communication lower bound is $\Omega(n^3 / \sqrt{M})$ words moved for cache size $M$ (Hong-Kung, 1981). Tiled GEMM implementations achieve this bound up to constant factors.
Open: What are the communication-optimal algorithms for structured matrix operations (sparse, banded, low-rank)?
Why it matters: On modern GPUs, arithmetic throughput exceeds memory bandwidth by 10x or more. Algorithms that minimize data movement, even at the cost of extra arithmetic, can be faster in practice.
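A back-of-envelope calculation shows how the Hong-Kung bound interacts with hardware ratios. The accelerator numbers below (FLOP rate, bandwidth, cache size) are hypothetical, chosen only for illustration:

```python
# Roofline-style sanity check: dense n x n multiplication must move at least
# ~ n^3 / sqrt(M) words through a cache of M words (Hong-Kung). Compare the
# time floor from arithmetic against the time floor from data movement.
import math

def min_words_moved(n: int, cache_words: int) -> float:
    """Hong-Kung communication lower bound, up to constant factors."""
    return n**3 / math.sqrt(cache_words)

def time_floors(n, cache_words, flops_per_s, words_per_s):
    flop_time = 2 * n**3 / flops_per_s                    # 2n^3 FLOPs for GEMM
    mem_time = min_words_moved(n, cache_words) / words_per_s
    return flop_time, mem_time

# Hypothetical accelerator: 100 TFLOP/s, 5e11 words/s, 4M-word on-chip cache.
flop_t, mem_t = time_floors(8192, 4 * 2**20, 1e14, 5e11)
print(mem_t < flop_t)  # with a large enough cache, arithmetic dominates
```

Shrinking the cache term $M$ (or the bandwidth) flips the inequality, which is the regime where communication-avoiding algorithms pay off.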
The Bigger Picture
These problems share a common theme: the gap between what is theoretically possible and what is practically achievable in matrix computation. For ML, the practical algorithms (dense GEMM, cuBLAS, FlashAttention) already squeeze most of the performance from hardware. But the theoretical questions remain:
- Can we do better in principle?
- Can theoretical improvements be made practical?
- What are the fundamental limits?
Common Confusions
Fast matrix multiplication would not automatically speed up all of ML
Even if $\omega = 2$, the constant factor matters. For example, an algorithm running in $10^6 \cdot n^{2.1}$ operations is slower than one running in $n^3$ operations for all $n < 10^{6/0.9} \approx 4.6 \times 10^6$. The practical benefit depends on whether the fast algorithm can be made cache-friendly and parallelizable.
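To make the crossover concrete, consider a hypothetical fast algorithm costing $10^6 \cdot n^{2.1}$ operations against a plain $n^3$ one (made-up constants, for illustration only):

```python
# Solve C * n^2.1 = n^3 for the crossover size: n^0.9 = C  =>  n = C^(1/0.9).
# Below this n, the "asymptotically faster" algorithm is actually slower.
C = 1e6
crossover = C ** (1 / 0.9)
print(f"{crossover:.2e}")  # → 4.64e+06
```

Matrices of side four million rarely appear as dense problems, which is the whole point: asymptotic wins can sit entirely outside the practically relevant range.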
Randomized NLA is approximate, not heuristic
Randomized SVD and randomized matrix multiplication have formal guarantees. They are not "just random guessing." The approximation error is bounded with high probability, and the bounds are often near-optimal.
Key Takeaways
- $2 \le \omega < 2.3714$: the true exponent of matrix multiplication is unknown
- No sub-Strassen algorithm is practical on current hardware
- Sparse matrix multiplication lacks tight complexity bounds for general sparsity
- Optimal eigenvalue computation complexity is open
- Randomized methods provide near-optimal approximations in reduced time
- Communication complexity (data movement) often matters more than arithmetic complexity
Exercises
Problem
Suppose you have an $m \times n$ data matrix $X$ and want to compute the top 10 principal components. Compare the cost of: (a) exact SVD via an eigendecomposition of $X^\top X$, (b) randomized SVD with oversampling $p = 10$.
Problem
Explain why finding $\omega = 2$ would not immediately make training large language models 1000x faster, even though it would reduce the asymptotic complexity of matrix multiplication from $O(n^3)$ to (nearly) $O(n^2)$.
References
Canonical:
- Golub & Van Loan, Matrix Computations (2013), comprehensive treatment of algorithms
- Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011), SIAM Review
Current:
- Woodruff, "Sketching as a Tool for Numerical Linear Algebra" (2014)
- Martinsson & Tropp, "Randomized Numerical Linear Algebra" (2020), Acta Numerica
Next Topics
- Unsolved problems in computer science: the broader landscape including P vs NP
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Matrix Multiplication Algorithms (Layer 1)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)