Learning #22 — Linear Algebra & Data Science: Matrices, Eigenvectors, Regression and Machine Learning

Matrix Transforms Eigenvectors & PCA OLS Regression Bayesian Inference K-Means++ Gradient Descent Backpropagation Convolution

Linear algebra is the language of modern computing. PageRank, image compression, recommendation systems, physics simulations, 3D rendering and neural networks are all, at their core, sequences of matrix operations. Yet linear algebra is often taught as pure symbol manipulation — rows and columns with no geometric meaning. This post explores it the opposite way: start with the picture, then derive the formula.

1. Matrix Transformations — What Matrices Do

A 2×2 matrix does exactly one thing: it transforms vectors in the plane. It can stretch, rotate, reflect, shear or collapse. The determinant — a single number — tells you how areas change. When det = 0 the matrix collapses the plane to a line or a point; information is permanently lost.

The Geometry of 2×2 Matrices

Transformation:  T(v) = Mv
  M = [[a, b], [c, d]]   maps the unit square to a parallelogram.

Determinant:  det(M) = ad − bc
  |det| = area scaling factor
  det > 0: orientation preserved  |  det < 0: reflection
  det = 0: rank-deficient (singular) — maps to lower dim

Eigenvectors:  Mv = λv
  (M − λI)v = 0   ⟹  det(M − λI) = 0  (characteristic polynomial)
  Eigenvalues λ₁,λ₂ tell you the stretching factors in special directions.

For 2×2:  λ = (tr(M)/2) ± √((tr(M)/2)² − det(M))
  Real eigenvalues: M has orthogonal eigenbasis (symmetric M)
  Complex eigenvalues: M has a rotation component

Trace = λ₁ + λ₂ = a + d
  (sum of eigenvalues = sum of diagonal)

Spectral theorem (symmetric M = Mᵀ):
  M = QΛQᵀ,   Q orthogonal,  Λ = diag(λ₁, λ₂)
  Every symmetric matrix can be diagonalised — the foundation of PCA.

The matrix transforms simulation runs 8 preset transformations — rotation, scaling, shear, reflection, projection — plus free-edit mode. Toggle the unit circle, eigenvector arrows and the grid deformation to see determinant as area and trace as the sum of stretches. Pay attention to the identity and nilpotent cases: one does nothing; the other collapses in two steps.

Linear Algebra

Matrix Transformations 2D

Rotate, scale, shear, project — visualise det, trace, eigenvalues and eigenvectors live.

2. Linear Regression — Fitting Lines to Noisy Data

Ordinary least squares (OLS) is the simplest model in statistics and one of the most widely used. Given n data points, find the line that minimises the sum of squared vertical distances. The solution has a beautiful closed form — and it is a special case of the normal equations in linear algebra.

OLS Regression — Normal Equations & Diagnostics

Goal:  minimise SSE = Σ(yᵢ − ŷᵢ)² = ‖y − Xβ‖²

Normal equations:  (XᵀX)β = Xᵀy
  Solution:  β̂ = (XᵀX)⁻¹Xᵀy  (unique if X full column rank)

For simple regression (one predictor + intercept):
  β̂₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)²  = Cov(x,y)/Var(x)
  β̂₀ = ȳ − β̂₁x̄

Coefficient of determination:
  R² = 1 − SSE/SST = 1 − Σ(yᵢ−ŷᵢ)²/Σ(yᵢ−ȳ)²
  R² = r²  (for simple regression, r = Pearson correlation)

Pearson r:  r = Cov(x,y)/[σₓ σᵧ] ∈ [−1, 1]
  |r| = 0: no linear relationship
  |r| = 1: perfect linear relationship

Geometric view:
  β̂ = (XᵀX)⁻¹Xᵀy is the orthogonal projection of y onto col(X).
  Residual vector e = y − Xβ̂ ⊥ col(X)  →  Xᵀe = 0  (always!)

Gauss–Markov theorem:
  OLS is BLUE (Best Linear Unbiased Estimator) when
  errors are uncorrelated, homoscedastic, zero-mean.
  Violation → Heteroscedasticity, use WLS or GLS.

Click the scatter plot canvas to add points and watch the regression line update instantly. The residual lines (vertical distances) show exactly what OLS minimises. Drag the outlier preset to see leverage — how one extreme point can rotate the entire line — and understand why R² alone is insufficient for model assessment (Anscombe's quartet).

Statistics · ML

Linear Regression OLS

Add/move/delete points, live R² and Pearson r, residual visualisation and 5 presets.

Statistics

Bayesian Inference

Prior × likelihood → posterior. Medical tests, coin flips and iterative belief updating.

3. Bayesian Inference — Updating Beliefs with Data

Bayes' theorem is four words: posterior is prior times likelihood, normalised. Where frequentist statistics asks "What is the probability of this data given the null hypothesis?", Bayesian statistics asks "What is the probability of the hypothesis given this data?" — a far more natural question for most scientific problems.

Bayes' Theorem & Sequential Updating

Bayes' theorem:  P(H|E) = P(E|H)·P(H) / P(E)
  H = hypothesis,  E = evidence (observed data)
  P(H)   = prior probability
  P(E|H) = likelihood
  P(E)   = marginal likelihood (normalisation constant)
  P(H|E) = posterior probability

Odds form (avoids computing P(E)):
  Posterior odds = Prior odds × Likelihood ratio (Bayes factor)
  BF = P(E|H₁)/P(E|H₀)   (BF > 10 → "strong" evidence)

Conjugate priors (closed-form posteriors):
  Beta–Binomial:
    Prior: θ ~ Beta(α, β)
    Likelihood: k successes in n trials → Binomial(n, θ)
    Posterior: θ | data ~ Beta(α+k, β+n−k)

Medical test example — base rate matters:
  Disease prevalence P(D) = 0.001 (1 in 1000)
  Test sensitivity P(+|D) = 0.99
  Test specificity P(−|¬D) = 0.99
  P(D|+) = 0.99×0.001 / [0.99×0.001 + 0.01×0.999] ≈ 0.09
  → 91% false positive rate despite 99% accurate test!

The simulation runs the medical test scenario, letting you adjust prevalence, sensitivity and specificity. The result — that a positive test from a rare disease is usually wrong — is the most important and most counter-intuitive result in applied probability. Toggle to sequential updating and add observations one by one to watch the posterior sharpen from a flat prior.

4. K-Means Clustering — Partitioning Without Labels

Supervised machine learning requires labelled data — expensive to collect. Clustering is unsupervised: given unlabelled points, find natural groupings. K-Means is the simplest and most widely used clustering algorithm, alternating between assigning points to their nearest centroid and moving each centroid to its cluster's mean.

K-Means Algorithm & K-Means++ Initialisation

Lloyd's algorithm (K-Means):
  Initialise k centroids μ₁,…,μₖ  (randomly or K-Means++)
  Repeat until convergence:
    1. Assign:   zᵢ = argmin_j ‖xᵢ − μⱼ‖²  (nearest centroid)
    2. Update:   μⱼ = (1/|Cⱼ|) Σ_{i∈Cⱼ} xᵢ  (cluster mean)
  Objective:  J = Σᵢ ‖xᵢ − μ_{z_i}‖²  (within-cluster sum of squares)

  Convergence guaranteed (J decreases monotonically)
  But not global optimum — sensitive to initialisation!

K-Means++ (Arthur & Vassilvitskii 2007):
  1. Choose μ₁ uniformly at random.
  2. For j = 2,…,k:
     Choose μⱼ with probability ∝ D(x)² = min_i ‖x − μᵢ‖²
  Expected J ≤ 8(ln k + 2) · J*   (approximation guarantee)

Choosing k — Elbow method:
  Plot J vs k.  Look for the "elbow" where ∂J/∂k decreases sharply.

Silhouette score:   s(i) = (b(i) − a(i)) / max(a(i), b(i))
  a(i) = mean intra-cluster distance
  b(i) = mean nearest-cluster distance
  s ∈ [−1, 1]:  1 = perfectly clustered, 0 = boundary, −1 = wrong cluster

Voronoi diagram:   each cell = {x : ‖x−μⱼ‖ ≤ ‖x−μₗ‖ ∀l≠j}
  K-Means boundaries are always linear (Voronoi tessellation).

The simulation renders the full Voronoi partition pixel-by-pixel using ImageData — every pixel is coloured by its nearest centroid. Step through Lloyd's algorithm to see convergence, switch to K-Means++ to compare initialisation quality, and use the elbow plot to discover the "right" k for three different synthetic datasets.

Machine Learning

K-Means Clustering

Voronoi regions, K-Means++ init, elbow method — step through each Lloyd iteration.

5. Gradient Descent — Navigating Loss Landscapes

Nearly every modern machine learning model is trained by gradient descent: repeatedly compute the gradient of the loss function with respect to parameters, then take a small step downhill. The variants — SGD, Momentum, RMSprop, Adam — differ only in how they use the gradient history to adapt the step size.

Gradient Descent Variants & Convergence

Vanilla SGD:
  θ ← θ − α∇L(θ)   (α = learning rate)
  Convergence for convex L: O(1/√T) for non-smooth, O(1/T) for smooth

Momentum (Polyak 1964):
  v ← βv − α∇L(θ)
  θ ← θ + v
  β ≈ 0.9.  Accelerates along consistent gradient directions.
  Nesterov variant: evaluate gradient at θ + βv (lookahead)

RMSprop (Hinton 2012, unpublished):
  s ← ρs + (1−ρ)(∇L)²
  θ ← θ − (α/√(s+ε))∇L
  Adapts per-parameter lr — useful for non-stationary objectives

Adam (Kingma & Ba 2014):
  m ← β₁m + (1−β₁)∇L        (first moment)
  v ← β₂v + (1−β₂)(∇L)²     (second moment)
  m̂ = m/(1−β₁ᵗ),  v̂ = v/(1−β₂ᵗ)   (bias correction)
  θ ← θ − α·m̂/(√v̂+ε)
  Defaults: α=1e-3, β₁=0.9, β₂=0.999, ε=1e-8
  Convergence ≈ O(1/√T) for non-convex objectives

Rosenbrock (banana) function:
  f(x,y) = (1−x)² + 100(y−x²)²
  Global min at (1,1). Notoriously hard: steep valley with gentle slope.
  Adam ≈ 800 iterations to converge vs SGD ≈ 10 000+

The simulation renders a 60×60 loss landscape as a colour-coded 3D surface (painter's algorithm) and traces the optimiser trajectory in real time. Run all four optimisers simultaneously on the Rosenbrock banana function and Himmelblau's multi-minima function. The visual difference between SGD's zigzag, Momentum's overshooting and Adam's clean descent explains why Adam became the universal default for deep learning.

Machine Learning · Optimisation

Gradient Descent

SGD, Momentum, RMSprop, Adam — 3D loss landscapes, rotation, comparison mode.

6. Convolutional Neural Networks — Feature Extraction by Design

A fully connected neural network applied to a 224×224 image would require 50,000+ weights per neuron — computationally intractable. The insight behind CNNs: nearby pixels are correlated, and the same feature (an edge, a texture) can appear anywhere in the image. Replacing full connections with local filters and weight sharing reduces parameters by orders of magnitude while building in translation invariance.

Convolution, ReLU & Pooling Layers

2D discrete convolution:
  (f * g)[i,j] = Σₘ Σₙ f[m,n]·g[i−m, j−n]
  In CNNs, g is a learned filter (kernel) of size k×k.

Output size (no padding):
  H_out = (H_in − k) / stride + 1

Feature map computation (for one filter W, bias b):
  Z[i,j] = (X * W)[i,j] + b  (pre-activation)
  A[i,j] = ReLU(Z[i,j]) = max(0, Z[i,j])

Common filters (hand-designed):
  Horizontal edge:  [[-1,-1,-1],[0,0,0],[1,1,1]]
  Vertical edge:    [[-1,0,1],[-1,0,1],[-1,0,1]]
  Sharpen:          [[0,-1,0],[-1,5,-1],[0,-1,0]]
  Gaussian blur:    1/16 · [[1,2,1],[2,4,2],[1,2,1]]

Max pooling (2×2, stride 2):
  Halves spatial dimensions, keeps the maximum activation.
  Provides invariance to small translations.

Backpropagation through convolution:
  ∂L/∂W = input * ∂L/∂Z  (correlation, not convolution)
  ∂L/∂X = full convolution of ∂L/∂Z with flipped W

Parameter count (one conv layer, k=3, C_in channels, C_out filters):
  (k² · C_in + 1) · C_out  =  (9 · C_in + 1) · C_out
  vs. fully connected: H·W·C_in · H·W·C_out  (orders of magnitude more)

Select one of five 8×8 input images and toggle between the four hand-designed filters. Navigate the layer tabs — Input → Conv → ReLU → MaxPool → deeper layers — and watch the animated scan position extract features step by step. The architecture panel tracks spatial dimensions and parameter counts, making the computational advantage of weight sharing concrete.

Machine Learning · Deep Learning

Convolutional Neural Network

Conv, ReLU, MaxPool layers — animated scan, five inputs, four learned filters.

The Unity of Linear Algebra and Data Science

All six simulations are different faces of the same geometric structure. Linear regression is a projection. K-Means partitions space with a Voronoi diagram whose boundaries are the perpendicular bisectors defined by a Euclidean metric. Gradient descent follows the gradient of a quadratic form — which, near a minimum, is always dominated by the Hessian matrix and its eigenvalues. A convolutional layer is a structured sparse matrix multiplication. Even Bayes' theorem, applied to a Gaussian prior and Gaussian likelihood, gives a Gaussian posterior whose mean is a weighted average — a linear operation.

The three indispensable matrix factorisations: (1) Eigendecomposition M = QΛQᵀ — symmetry, PCA, spectral methods; (2) SVD M = UΣVᵀ — general matrices, recommender systems, image compression, pseudo-inverse; (3) LU / QR / Cholesky — solving linear systems, OLS normal equations. If you understand these three, you understand 90% of numerical linear algebra.

The machine learning series continues with decision trees (non-linear partitions) and reinforcement learning in Learning #9 — Machine Learning, and the statistical foundations are explored in Learning #15 — Probability & Statistics. For the signal processing side of data science — FFT, filters, spectrograms — see Learning #14 — Signal Processing.

Linear Algebra