Strassen's Matrix Multiplication — From O(n³) to Sub-Cubic Algorithms
Multiplying two n×n matrices with the schoolbook algorithm requires n³ multiplications. In 1969, Volker Strassen showed that 2×2 matrices can be multiplied using only 7 scalar multiplications instead of 8, and that applying this recursively gives an algorithm running in O(nlog₂7) ≈ O(n2.807) — sub-cubic. For large matrices this is a massive computational saving that now underlies every serious numerical computing library from BLAS to deep learning frameworks.
1. The Naïve O(n³) Algorithm
Matrix multiplication C = A·B for n×n matrices: C[i][j] = Σₖ A[i][k] · B[k][j]. The three-loop implementation is O(n³) multiplications and O(n³) additions:
function matmul(A, B, n) {
const C = Array.from({length: n}, () => new Float64Array(n));
for (let i = 0; i < n; i++)
for (let k = 0; k < n; k++) // k-loop inner for cache
for (let j = 0; j < n; j++)
C[i][j] += A[i][k] * B[k][j];
return C;
}
Note the i-k-j loop order (outer-k-inner-j): this
accesses B row-by-row (cache-friendly) whereas
i-j-k accesses B column-by-column (cache-hostile). For n
= 1000, the naïve algorithm performs 10⁹ multiplications — requiring
about 1 second on a scalar CPU.
2. Divide-and-Conquer
Partition each n×n matrix into four (n/2)×(n/2) block matrices:
Splitting into 4 blocks and recursing is structurally equivalent to the naïve approach. The key insight in Strassen's work: reduce the number of recursive subproblems from 8 to 7.
3. Strassen's 7-Multiplication Trick
Strassen showed that the four products C₁₁, C₁₂, C₂₁, C₂₂ can be computed from just 7 intermediate products M₁…M₇:
These identities can be verified by direct expansion. The cost is 7 multiplications and 18 additions per level. Since additions are Θ(n²) and cheap compared to multiplications (on large matrices), the net result is faster overall.
4. Complexity Analysis
5. Cache-Oblivious Algorithms
Modern CPUs with L1/L2/L3 cache hierarchies penalize random memory access. A blocked (tiled) matrix multiplication improves data locality by operating on small B×B sub-blocks that fit entirely in L1 or L2 cache:
A cache-oblivious algorithm achieves optimal cache performance without knowing the cache size by recursively subdividing until the sub-problem fits in cache. The divide-and-conquer formulation of Strassen's is naturally cache-oblivious.
6. BLAS and Practical Implementation
The Basic Linear Algebra Subprograms (BLAS) standard defines the interface for matrix operations. DGEMM (double general matrix multiply) is the core routine:
- OpenBLAS, MKL, BLIS: Hand-tuned BLAS libraries using SIMD (AVX-512 does 16 double-precision multiplications per cycle), multi-threading, and blocking optimized for specific CPU cache sizes.
- LAPACK: Higher-level routines (LU factorization, eigenvalues) built on BLAS. LU factorization is O(n³) dominated by DGEMM calls.
- cuBLAS, rocBLAS: GPU BLAS leveraging thousands of CUDA/HIP cores. Peak GPU DGEMM typically achieves 30–100 TFLOPS for large matrices.
Deep learning matrix multiplication (batched, mixed-precision) is handled by cuBLAS-lt and Tensor Cores — specialized 4×4 matrix multiplication units in NVIDIA GPUs that achieve 2–8× higher throughput for FP16/BF16 operations, critical for transformer training.
7. The Race Toward ω = 2
The matrix multiplication exponent ω is defined as the infimum of all real numbers α such that two n×n matrices can be multiplied in O(n^α) field operations. Progress:
The Coppersmith-Winograd family of algorithms (and its descendants) use sophisticated algebraic machinery — tensor rank decomposition, laser method, group-theoretic techniques — producing algorithms so complex they are rarely implemented in practice. The crossover point over Strassen is astronomically large (n ~ 1010 or more). But theoretically, they confirm that matrix multiplication is closer to "linear" complexity than "cubic."