Strassen's Matrix Multiplication — From O(n³) to Sub-Cubic Algorithms

Multiplying two n×n matrices with the schoolbook algorithm requires n³ multiplications. In 1969, Volker Strassen showed that 2×2 matrices can be multiplied using only 7 scalar multiplications instead of 8, and that applying this recursively gives an algorithm running in O(n^log₂7) ≈ O(n^2.807) — sub-cubic. For large matrices this is a massive computational saving that now underlies every serious numerical computing library from BLAS to deep learning frameworks.

1. The Naïve O(n³) Algorithm

Matrix multiplication C = A·B for n×n matrices: C[i][j] = Σₖ A[i][k] · B[k][j]. The three-loop implementation is O(n³) multiplications and O(n³) additions:

function matmul(A, B, n) {
  const C = Array.from({length: n}, () => new Float64Array(n));
  for (let i = 0; i < n; i++)
    for (let k = 0; k < n; k++)             // k-loop inner for cache
      for (let j = 0; j < n; j++)
        C[i][j] += A[i][k] * B[k][j];
  return C;
}

Note the i-k-j loop order (outer-k-inner-j): this accesses B row-by-row (cache-friendly) whereas i-j-k accesses B column-by-column (cache-hostile). For n = 1000, the naïve algorithm performs 10⁹ multiplications — requiring about 1 second on a scalar CPU.

2. Divide-and-Conquer

Partition each n×n matrix into four (n/2)×(n/2) block matrices:

A = | A₁₁ A₁₂ | B = | B₁₁ B₁₂ | C = | C₁₁ C₁₂ | | A₂₁ A₂₂ | | B₂₁ B₂₂ | | C₂₁ C₂₂ | C₁₁ = A₁₁\cdotB₁₁ + A₁₂\cdotB₂₁ C₁₂ = A₁₁\cdotB₁₂ + A₁₂\cdotB₂₂ C₂₁ = A₂₁\cdotB₁₁ + A₂₂\cdotB₂₁ C₂₂ = A₂₁\cdotB₁₂ + A₂₂\cdotB₂₂ \to 8 recursive multiplications of (n/2)\times(n/2) matrices \to 4 matrix additions (each O((n/2)²)) Recurrence: T(n) = 8T(n/2) + O(n²) Solution: T(n) = O(n^log₂8) = O(n³) \leftarrow no improvement yet!

Splitting into 4 blocks and recursing is structurally equivalent to the naïve approach. The key insight in Strassen's work: reduce the number of recursive subproblems from 8 to 7.

3. Strassen's 7-Multiplication Trick

Strassen showed that the four products C₁₁, C₁₂, C₂₁, C₂₂ can be computed from just 7 intermediate products M₁…M₇:

M₁ = (A₁₁ + A₂₂)(B₁₁ + B₂₂) M₂ = (A₂₁ + A₂₂) B₁₁ M₃ = A₁₁ (B₁₂ - B₂₂) M₄ = A₂₂ (B₂₁ - B₁₁) M₅ = (A₁₁ + A₁₂) B₂₂ M₆ = (A₂₁ - A₁₁)(B₁₁ + B₁₂) M₇ = (A₁₂ - A₂₂)(B₂₁ + B₂₂) C₁₁ = M₁ + M₄ - M₅ + M₇ C₁₂ = M₃ + M₅ C₂₁ = M₂ + M₄ C₂₂ = M₁ - M₂ + M₃ + M₆

These identities can be verified by direct expansion. The cost is 7 multiplications and 18 additions per level. Since additions are Θ(n²) and cheap compared to multiplications (on large matrices), the net result is faster overall.

4. Complexity Analysis

Recurrence: T(n) = 7\cdotT(n/2) + O(n²) Master theorem (case 1: log_b(a) = log₂7 \approx 2.807 > 2): T(n) = Θ(n^log₂7) \approx Θ(n^2.807) Savings over naïve at n = 4096: Naïve: 4096³ = 68.7 billion multiplications Strassen: 4096^2.807 \approx 15.5 billion multiplications Speedup: ~4.4\times At n = 16384: Naïve: 4.4 \times 10¹² Strassen: 3.5 \times 10¹¹ Speedup: ~12.7\times Crossover point (where Strassen beats naïve, counting adds): Practical threshold: n \approx 256-512 depending on hardware

Numerical stability: Strassen's algorithm performs more additions than the naïve algorithm, slightly amplifying floating-point rounding errors. For double precision, error growth is O(n^0.585) larger — acceptable for most applications but relevant in ill-conditioned systems.

5. Cache-Oblivious Algorithms

Modern CPUs with L1/L2/L3 cache hierarchies penalize random memory access. A blocked (tiled) matrix multiplication improves data locality by operating on small B×B sub-blocks that fit entirely in L1 or L2 cache:

Blocked matmul (block size B): for ii = 0..n step B: for kk = 0..n step B: for jj = 0..n step B: // multiply A[ii:ii+B][kk:kk+B] \times B[kk:kk+B][jj:jj+B] for i = ii..ii+B: for k = kk..kk+B: for j = jj..jj+B: C[i][j] += A[i][k] * B[k][j]; Cache misses: Θ(n³ / B) vs Θ(n³) for naïve Choose B = \sqrt(M/3) where M is L1 cache size in elements.

A cache-oblivious algorithm achieves optimal cache performance without knowing the cache size by recursively subdividing until the sub-problem fits in cache. The divide-and-conquer formulation of Strassen's is naturally cache-oblivious.

6. BLAS and Practical Implementation

The Basic Linear Algebra Subprograms (BLAS) standard defines the interface for matrix operations. DGEMM (double general matrix multiply) is the core routine:

OpenBLAS, MKL, BLIS: Hand-tuned BLAS libraries using SIMD (AVX-512 does 16 double-precision multiplications per cycle), multi-threading, and blocking optimized for specific CPU cache sizes.
LAPACK: Higher-level routines (LU factorization, eigenvalues) built on BLAS. LU factorization is O(n³) dominated by DGEMM calls.
cuBLAS, rocBLAS: GPU BLAS leveraging thousands of CUDA/HIP cores. Peak GPU DGEMM typically achieves 30–100 TFLOPS for large matrices.

Deep learning matrix multiplication (batched, mixed-precision) is handled by cuBLAS-lt and Tensor Cores — specialized 4×4 matrix multiplication units in NVIDIA GPUs that achieve 2–8× higher throughput for FP16/BF16 operations, critical for transformer training.

7. The Race Toward ω = 2

The matrix multiplication exponent ω is defined as the infimum of all real numbers α such that two n×n matrices can be multiplied in O(n^α) field operations. Progress:

1969: Strassen ω \leq log₂7 \approx 2.8074 1979: Pan ω \leq 2.796 1981: Schönhage ω \leq 2.548 1987: Coppersmith-Winograd ω \leq 2.3755 2010: Stothers ω \leq 2.3737 2012: Williams ω \leq 2.3727 2020: Alman-Williams ω \leq 2.3728596 2024: Williams et al. ω \leq 2.371866 Lower bound: ω \geq 2 (trivially — must read n² elements) Conjecture: ω = 2 (open problem)

The Coppersmith-Winograd family of algorithms (and its descendants) use sophisticated algebraic machinery — tensor rank decomposition, laser method, group-theoretic techniques — producing algorithms so complex they are rarely implemented in practice. The crossover point over Strassen is astronomically large (n ~ 10¹⁰ or more). But theoretically, they confirm that matrix multiplication is closer to "linear" complexity than "cubic."

🔢 Explore Algorithms →