Machine Learning — From Linear Regression to Deep Learning

1. Learning Paradigms: Supervised, Unsupervised, RL

Machine learning encompasses three broad families of algorithms, distinguished by the nature of the feedback signal available during training:

Supervised Learning

Given a dataset D = {(x₁, y₁), …, (x_N, y_N)} of input-output pairs, learn a function f_θ: X → Y that generalises to unseen inputs. The goal is to minimise expected loss on the true data distribution — approximated by empirical risk minimisation (ERM) on the training set. Examples: image classification, regression, translation, speech recognition.

Unsupervised Learning

Given unlabelled data {x₁, …, x_N}, discover structure: clusters, manifolds, generative models, or compressed representations. The model receives no explicit target — it must find its own signal from patterns in the data. Examples: k-means clustering, principal component analysis (PCA), autoencoders, generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models.

Reinforcement Learning

An agent interacts with an environment, choosing actions a_t in state s_t and receiving scalar reward r_t. The goal is to learn a policy π(a|s) that maximises cumulative discounted reward:

    G_t = Σ_{k=0}^{∞} γ^k · r_{t+k}            (0 < γ < 1, discount factor)

Optimal policy maximises:

    V^π(s) = E_π[G_t | s_t = s]                (value function)
    Q^π(s,a) = E_π[G_t | s_t = s, a_t = a]     (action-value function)
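
As a small illustration of the return G_t, the sketch below sums discounted rewards for a finite episode. The function name and the toy reward list are illustrative assumptions, not part of any RL library; an infinite-horizon return would be truncated in the same way.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma**k * r_k for a finite list of rewards."""
    g = 0.0
    # Work backwards using the recursion G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step episode with a single reward at the end.
episode = [0.0, 0.0, 1.0]
print(discounted_return(episode, gamma=0.9))  # approximately 0.9**2
```

The backward recursion avoids computing powers of γ explicitly and is the same identity that underlies the Bellman equations for V and Q.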

RL algorithms include Q-learning, policy gradients (REINFORCE), actor-critic methods (PPO, SAC), and model-based RL. AlphaGo and AlphaZero (2016–2017) combined RL with Monte Carlo tree search to reach superhuman play in Go, chess, and shogi; ChatGPT uses Reinforcement Learning from Human Feedback (RLHF) to align language model outputs with human preferences.

2. Bias-Variance Tradeoff

Under squared-error loss, the expected test error of a model at a point x can be decomposed into three components:

    E[(y − f̂(x))²] = Bias²[f̂(x)] + Var[f̂(x)] + σ²_noise

    Bias[f̂(x)] = E[f̂(x)] − f(x)              (systematic error: underfitting)
    Var[f̂(x)] = E[(f̂(x) − E[f̂(x)])²]        (sensitivity to training data: overfitting)
    σ²_noise = irreducible noise in the data

A high-bias model (e.g., linear regression on non-linear data) consistently makes the same type of error regardless of training data — it has insufficient capacity to capture the true pattern (underfitting). A high-variance model (e.g., a degree-50 polynomial fitted to 20 noisy points) fits the training data nearly perfectly but fluctuates wildly with different training sets (overfitting).
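
The decomposition can be estimated numerically by retraining a model on many independently drawn training sets and looking at its predictions at one fixed point. The sketch below does this for a k-nearest-neighbour regressor, chosen here only because it needs no linear algebra: small k plays the role of the flexible, high-variance model and large k the rigid, high-bias one. The target function sin(2πx), the noise level, and all names are assumptions made for the illustration.

```python
import math
import random


def true_f(x):
    return math.sin(2 * math.pi * x)  # assumed ground-truth function


def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance(k, x0=0.25, n=30, sigma=0.3, trials=500, seed=0):
    """Monte Carlo estimate of (bias^2, variance) of k-NN at the point x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):  # each trial draws a fresh training set
        xs = [rng.random() for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance


for k in (1, 5, 30):
    b2, var = bias_variance(k)
    print(f"k={k:2d}  bias^2={b2:.3f}  variance={var:.3f}")
```

With k = 30 the model averages every training target and ignores x, so its bias at the peak of the sine is large while its variance is small; with k = 1 the situation is reversed.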

The tradeoff manifests as a U-shaped test error curve as model complexity increases: error first decreases as bias falls, then increases as variance rises. The optimal complexity sits at the minimum of test error. Modern large neural networks appear to violate this picture through the double descent phenomenon: after an initial U-curve, error decreases again as model size grows far beyond the interpolation threshold — a phenomenon not fully explained by classical statistical learning theory.

Practical implication: If your model has high training accuracy but poor test accuracy, you are overfitting — add regularisation, gather more data, or reduce model complexity. If training accuracy is also poor, you are underfitting — increase model capacity, train longer, or improve features.

3. Loss Functions and Gradient Descent

Training a model means minimising a loss function L(θ) over parameters θ. Common loss functions:

    Regression:      MSE = (1/N) Σ_i (y_i − f_θ(x_i))²
    Classification:  Cross-entropy = −(1/N) Σ_i Σ_c y_{i,c} log p_{i,c}
    Binary:          BCE = −(1/N) Σ_i [y_i log p_i + (1−y_i) log(1−p_i)]
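
The three losses translate almost directly into code. The following is a minimal sketch using plain Python lists; the function names are illustrative, and the log-based losses assume predicted probabilities strictly between 0 and 1 (real implementations usually clip or work with logits).

```python
import math


def mse(y_true, y_pred):
    """Mean squared error over N scalar targets."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(y_onehot, probs):
    """Multi-class cross-entropy; both arguments are lists of per-class lists."""
    total = 0.0
    for targets, ps in zip(y_onehot, probs):
        # Skip zero targets so that log(0) is never evaluated for them.
        total += sum(y * math.log(p) for y, p in zip(targets, ps) if y > 0)
    return -total / len(y_onehot)


def bce(y_true, probs):
    """Binary cross-entropy; probs[i] is the predicted probability of class 1."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, probs))
    return -total / len(y_true)


print(mse([1.0, 2.0], [1.0, 4.0]))
print(bce([1, 0], [0.8, 0.3]))
print(cross_entropy([[0, 1], [1, 0]], [[0.2, 0.8], [0.7, 0.3]]))
```

The last two calls describe the same predictions, once as a binary problem and once as a two-class problem, so they should agree up to floating-point rounding.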

Gradient descent iteratively moves parameters in the direction of steepest descent:

    θ_{t+1} = θ_t − η · ∇_θ L(θ_t)

where η is the learning rate (typical values: 10⁻⁴ to 10⁻¹).

Stochastic gradient descent (SGD): compute the gradient on a mini-batch B ⊂ D:

    θ_{t+1} = θ_t − η · (1/|B|) Σ_{(x,y)∈B} ∇_θ ℓ(f_θ(x), y)
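
To make the update rule concrete, here is a minimal sketch of full-batch gradient descent fitting a line y ≈ w·x + b under the MSE loss. The toy data, learning rate, and step count are assumptions chosen so that the iteration is stable; replacing the full sums with sums over a random subset of the data would turn it into mini-batch SGD.

```python
def fit_line(xs, ys, lr=0.1, steps=500):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals y_i - (w*x_i + b) at the current parameters.
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        # Gradients of (1/N) * sum(residual^2) with respect to w and b.
        grad_w = sum(-2.0 * x * r for x, r in zip(xs, residuals)) / n
        grad_b = sum(-2.0 * r for r in residuals) / n
        # theta <- theta - eta * gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 4), round(b, 4))
```

For this quadratic loss the iteration diverges once the learning rate exceeds roughly 0.24 (two divided by the largest curvature), which is one way to see why η needs tuning.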

Modern optimisers use adaptive learning rates. Adam (Kingma & Ba, 2014) maintains exponential moving averages of gradients (m_t) and squared gradients (v_t) per parameter:

    m_t = β₁ m_{t−1} + (1−β₁) g_t          (first moment, β₁ = 0.9)
    v_t = β₂ v_{t−1} + (1−β₂) g_t²         (second moment, β₂ = 0.999)
    m̂_t = m_t / (1−β₁ᵗ)                    (bias-corrected)
    v̂_t = v_t / (1−β₂ᵗ)
    θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)   (ε = 10⁻⁸ for numerical stability)
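
The five Adam equations map line by line onto a short loop. The sketch below applies them to a hypothetical two-parameter quadratic; the function names, the test problem, and the learning rate of 0.1 are assumptions for the illustration rather than recommended settings.

```python
import math


def adam_minimise(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run Adam for a fixed number of steps and return the final parameters."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment estimates
    v = [0.0] * len(theta)  # second-moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for j in range(len(theta)):
            m[j] = beta1 * m[j] + (1 - beta1) * g[j]
            v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)  # bias correction
            v_hat = v[j] / (1 - beta2 ** t)
            theta[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def quadratic_grad(theta):
    """Gradient of the assumed loss (theta0 - 3)^2 + 10 * (theta1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


print(adam_minimise(quadratic_grad, [0.0, 0.0]))
```

On the very first step the bias-corrected ratio m̂/√v̂ equals the sign of the gradient, so each parameter moves by almost exactly η regardless of how steep its direction is; this is the per-parameter normalisation at work.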

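Adam adapts the step size per parameter and often converges faster than vanilla SGD with less tuning, although well-tuned SGD with momentum remains competitive on some tasks. Related optimisers include AdaGrad and RMSProp (earlier adaptive methods), AdamW (Adam with decoupled weight decay), and Lion.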

4. Regularisation: L1, L2, and Dropout

Regularisation adds constraints or noise to prevent overfitting, biasing the model toward simpler solutions.

L2 Regularisation (Ridge / Weight Decay)

Adds a penalty proportional to the squared magnitude of weights:

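    L_total = L_data + λ Σ_j θ_j²
    Gradient update: θ_j ← θ_j (1 − 2ηλ) − η ∂L_data/∂θ_j

The factor 2 comes from differentiating θ_j²; it is often absorbed by writing the penalty as (λ/2) Σ_j θ_j², which gives the familiar shrink factor (1 − ηλ).
Effect: shrinks all weights toward zero; no exact sparsity.
Bayesian interpretation: equivalent to a zero-mean Gaussian prior on weights.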

L1 Regularisation (Lasso)

Adds a penalty proportional to the absolute value of weights:

    L_total = L_data + λ Σ_j |θ_j|

Effect: drives some weights exactly to zero → sparse solutions.
Bayesian interpretation: equivalent to a Laplace prior on weights.
Used for feature selection in high-dimensional problems.
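
The different effects of the two penalties show up in a single update step. In the sketch below the L2 step is the plain gradient update for the penalty λ Σ_j θ_j², while the L1 step uses soft-thresholding (a proximal step). Soft-thresholding is not described in the text above; it is swapped in here as one common way to handle the non-differentiable |θ_j| term. All names and numbers are illustrative assumptions.

```python
def l2_step(theta, grad_data, lr, lam):
    """Gradient step on L_data + lam * sum(theta_j^2)."""
    # The penalty contributes 2 * lam * theta_j to the gradient.
    return [t * (1 - 2 * lr * lam) - lr * g for t, g in zip(theta, grad_data)]


def soft_threshold(x, tau):
    """Shrink x toward zero by tau, clamping to exactly 0 inside [-tau, tau]."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def l1_proximal_step(theta, grad_data, lr, lam):
    """Gradient step on L_data, then soft-thresholding for lam * sum(|theta_j|)."""
    return [soft_threshold(t - lr * g, lr * lam)
            for t, g in zip(theta, grad_data)]


theta = [0.5, -0.02, 0.01]
no_data_grad = [0.0, 0.0, 0.0]  # isolate the effect of the penalty
print(l2_step(theta, no_data_grad, lr=0.1, lam=0.5))
print(l1_proximal_step(theta, no_data_grad, lr=0.1, lam=0.5))
```

The L2 step multiplies every weight by the same factor, so small weights stay small but non-zero; the L1 step subtracts a fixed amount, so weights already smaller than that amount land exactly on zero.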

Dropout

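During training, randomly zero out each neuron with probability p (typically p = 0.1 to 0.5). In the original formulation, all neurons are active at test time and outputs are scaled by (1 − p). The inverted-dropout variant below instead rescales the surviving activations by 1/(1 − p) during training, so no scaling is needed at test time: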

    Training: ĥ_i = h_i · Bernoulli(1−p) / (1−p)    (inverted dropout)
    Test:     ĥ_i = h_i                              (no scaling needed)
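
A minimal sketch of inverted dropout on a list of activations follows. The function signature is an assumption for illustration and does not mirror any particular framework's API.

```python
import random


def dropout(h, p, training, rng=random):
    """Inverted dropout: zero each activation with probability p while training."""
    if not training or p == 0.0:
        return list(h)  # test time: activations pass through unchanged
    keep = 1.0 - p
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [x / keep if rng.random() < keep else 0.0 for x in h]


rng = random.Random(0)
activations = [1.0] * 8
print(dropout(activations, p=0.5, training=True, rng=rng))
print(dropout(activations, p=0.5, training=False))
```

Because the scaling happens during training, the same forward code can be used at test time simply by switching the flag off.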

Dropout can be interpreted as training an ensemble of up to 2^N different thinned networks (for N neurons) that share weights, and approximately averaging them at test time. It prevents co-adaptation: neurons cannot rely on specific other neurons always being present. The original transformer used a dropout rate of 0.1, while many recent large language models are pre-trained with little or no dropout; vision transformers often use no dropout but use stochastic depth (dropping entire residual blocks instead).

5. Convolutional Neural Networks

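Convolutional Neural Networks (CNNs) exploit the spatial locality and translation structure of natural images: the same small filter is useful at every position (translation equivariance), which drastically reduces parameter count compared to fully connected networks. Three key operations define a CNN: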

Convolution Layer

    (X * W)_{i,j,k} = Σ_{di,dj,c} X_{i·s+di, j·s+dj, c} · W_{di,dj,c,k}

where:
    X is the input feature map (height × width × channels)
    W is the filter tensor (kernel_h × kernel_w × in_channels × out_channels)
    s is the stride; k indexes the output channel (feature map)
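
The sum above can be written as a direct, deliberately unoptimised loop. The sketch assumes nested Python lists indexed as X[i][j][c] and W[di][dj][c][k], no padding, and no bias term; real frameworks use very different memory layouts and algorithms.

```python
def conv2d(X, W, stride=1):
    """Naive 'valid' convolution: X[i][j][c], W[di][dj][c][k] -> Y[i][j][k]."""
    height, width, channels = len(X), len(X[0]), len(X[0][0])
    kh, kw, out_channels = len(W), len(W[0]), len(W[0][0][0])
    out_h = (height - kh) // stride + 1
    out_w = (width - kw) // stride + 1
    Y = [[[0.0] * out_channels for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for k in range(out_channels):
                # Same filter W[..., k] at every spatial position (i, j).
                Y[i][j][k] = sum(
                    X[i * stride + di][j * stride + dj][c] * W[di][dj][c][k]
                    for di in range(kh)
                    for dj in range(kw)
                    for c in range(channels))
    return Y


# 3x3 single-channel input with values 1..9 and a 2x2 all-ones filter.
X = [[[float(3 * r + c + 1)] for c in range(3)] for r in range(3)]
W = [[[[1.0]], [[1.0]]], [[[1.0]], [[1.0]]]]
print(conv2d(X, W))
```

With an all-ones filter each output is the sum of a 2×2 patch, which makes the result easy to check by hand.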

A 3×3 convolutional layer with 64 input and 128 output channels has only 3×3×64×128 = 73,728 weights (plus 128 biases), versus (H·W·64)×(H·W·128) weights for a fully connected layer mapping the same H×W×64 input to an H×W×128 output. Weight sharing is the key: the same filter is applied everywhere in the image.
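
The size of the saving is easy to check with a few lines of arithmetic. The sketch assumes the fully connected comparison connects every input unit of an H×W×64 feature map to every unit of an H×W×128 output, and uses a 32×32 map purely as an example.

```python
def conv_weights(kernel_h, kernel_w, c_in, c_out):
    """Weights in a convolutional layer (biases not counted)."""
    return kernel_h * kernel_w * c_in * c_out


def dense_weights(height, width, c_in, c_out):
    """Weights in a dense layer between two full feature maps (assumed setup)."""
    return (height * width * c_in) * (height * width * c_out)


conv = conv_weights(3, 3, 64, 128)
dense = dense_weights(32, 32, 64, 128)
print(conv)           # 73728
print(dense)          # 8589934592
print(dense // conv)  # ratio between the two counts
```

The convolutional count does not depend on the image size at all, whereas the dense count grows with the square of the number of pixels.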

Pooling Layer

Max pooling (common) or average pooling reduces spatial dimensions by taking the maximum or mean over a local window (typically 2×2 with stride 2), halving the spatial resolution. Pooling provides a degree of translation invariance and reduces computation.
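
A minimal sketch of max pooling on a single-channel map stored as a list of rows (the layout and function name are assumptions for illustration):

```python
def max_pool2d(X, size=2, stride=2):
    """Max pooling over size x size windows of a 2-D list X[i][j]."""
    out_h = (len(X) - size) // stride + 1
    out_w = (len(X[0]) - size) // stride + 1
    return [[max(X[i * stride + di][j * stride + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
print(max_pool2d(feature_map))  # [[6, 8], [14, 16]]
```

Shifting a feature by one pixel inside a window leaves the pooled output unchanged, which is the sense in which pooling adds a degree of translation invariance.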

Flatten and Fully Connected Head

After several conv-pool blocks, the spatial feature map is flattened to a vector and passed through fully connected layers for classification. Modern architectures replace the FC head with global average pooling (GAP), further reducing parameters and overfitting.

Landmark CNN architectures: AlexNet (2012, won ImageNet by a large margin), VGG-16 (2014, deep and uniform), ResNet (2015, residual connections enabling 152+ layers), EfficientNet (2019, neural architecture search). Since 2020, Vision Transformers (ViT) have matched or surpassed CNNs on large datasets.

6. Recurrent Networks and LSTMs

Standard feedforward networks process fixed-size inputs. Recurrent Neural Networks (RNNs) maintain a hidden state h_t that accumulates information across a sequence x₁, x₂, …:

    h_t = tanh(W_h · h_{t−1} + W_x · x_t + b)
    y_t = W_y · h_t + b_y

Parameters W_h, W_x are shared across all time steps, enabling variable-length sequence processing with a fixed parameter count.

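Simple RNNs suffer from vanishing and exploding gradients: when backpropagating through many time steps, the gradient is multiplied at each step by the recurrent Jacobian, which is W_h scaled by the derivative of tanh. If the largest singular value of W_h is below 1, gradients shrink exponentially and vanish; if it is well above 1, they can grow exponentially and explode. In practice this makes long-range dependencies very hard to learn.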
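
The effect is easiest to see in a scalar RNN, where the Jacobian is just a number. The sketch below assumes W_x = 1 and b = 0 and accumulates ∂h_T/∂h_0 by the chain rule; the weights 0.5 and 1.5 are arbitrary illustrative values.

```python
import math


def rnn_jacobian_product(w_h, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w_h * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w_h * h + x)
        # Chain rule: d h_t / d h_{t-1} = w_h * tanh'(z), tanh'(z) = 1 - h_t^2.
        grad *= w_h * (1.0 - h * h)
    return grad


zeros = [0.0] * 50  # with zero inputs and h0 = 0 the state stays at 0
print(rnn_jacobian_product(0.5, zeros))  # shrinks like 0.5**50
print(rnn_jacobian_product(1.5, zeros))  # grows like 1.5**50
```

With non-zero activations the factor 1 − h_t² is below 1, which shrinks the product further; this is why vanishing gradients are the more common failure in tanh RNNs.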

Long Short-Term Memory (LSTM)

Hochreiter & Schmidhuber (1997) introduced the LSTM cell with explicit gating mechanisms to control information flow over long sequences:

    f_t = σ(W_f · [h_{t−1}, x_t] + b_f)       (forget gate)
    i_t = σ(W_i · [h_{t−1}, x_t] + b_i)       (input gate)
    o_t = σ(W_o · [h_{t−1}, x_t] + b_o)       (output gate)
    c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)    (candidate cell)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t           (cell state update)
    h_t = o_t ⊙ tanh(c_t)                     (hidden state)

    σ = sigmoid, ⊙ = element-wise product
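
The six equations correspond to one forward step of the cell. The sketch below implements that step with plain lists; the parameter dictionary keyed by 'f', 'i', 'o', 'c' and the helper names are assumptions made for the example, and no training code is included.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W . v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'c' to a (W, b) pair."""
    z = h_prev + x  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]          # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]          # input gate
    o = [sigmoid(a) for a in affine(*params["o"], z)]          # output gate
    c_tilde = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Hypothetical 1-unit cell: forget gate saturated open, input gate shut.
zero_w = [[0.0, 0.0]]
params = {"f": (zero_w, [100.0]), "i": (zero_w, [-100.0]),
          "o": (zero_w, [0.0]), "c": (zero_w, [0.0])}
h, c = lstm_step([1.0], [0.0], [2.0], params)
print(c)  # the cell state is carried over essentially unchanged
```

With the forget gate near 1 and the input gate near 0, c_t ≈ c_{t−1}, so gradients flowing along the cell state are not repeatedly multiplied by a weight matrix.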

The cell state c_t acts as a "memory highway" that can carry information unchanged across hundreds of time steps. The forget gate can learn to clear the memory; the input gate can learn to write selectively. LSTMs dominated sequence modelling (language, speech, time series) from ~2015 until transformers superseded them in 2017–2019.

7. The Attention Mechanism

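Attention allows a model to focus on relevant parts of its input when producing each output, rather than compressing the entire input into a fixed-size vector. Attention for sequence models was introduced with additive scoring (Bahdanau et al., 2015) and dot-product scoring (Luong et al., 2015); the scaled dot-product form used in transformers (Vaswani et al., 2017) computes: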

    Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where:
    Q ∈ R^{n×d_k} = query matrix (from decoder or current token)
    K ∈ R^{m×d_k} = key matrix (from encoder or past tokens)
    V ∈ R^{m×d_v} = value matrix (from encoder or past tokens)
    d_k = key dimension (scaling prevents softmax saturation)

Attention weight:

    A_{ij} = exp(q_i · k_j / √d_k) / Σ_l exp(q_i · k_l / √d_k)

Each query token i attends to every key token j with weight A_{ij}, which can be interpreted as a soft retrieval: the output for query i is a weighted sum of values, where the weights measure query-key compatibility. The scaling by 1/√d_k prevents the dot products from growing large in high dimensions, which would cause softmax outputs to concentrate near zero or one (vanishing gradients).
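
The soft-retrieval view can be sketched in a few lines. The code below assumes Q, K, and V are lists of row vectors and omits masking and batching; it returns both the outputs and the weight matrix A so the weights can be inspected.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, A = attention([[0.0, 0.0], [10.0, 0.0]], K, V)
print(A)
print(out)
```

The first query is equally compatible with both keys and receives the average of the values; the second matches the first key strongly and retrieves almost exactly the first value.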

Multi-Head Attention

    MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W_O
    head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

Typical: h = 8 heads, d_model = 512, d_k = d_v = d_model/h = 64

Multiple heads allow the model to attend to different aspects simultaneously — one head might capture syntactic relationships, another semantic similarity, another positional proximity. The outputs are concatenated and projected back to d_model.

8. The Transformer Architecture

Vaswani et al. (2017) "Attention Is All You Need" introduced the transformer — discarding recurrence entirely in favour of pure attention. The encoder-decoder transformer consists of stacked identical layers:

Encoder Layer

    h' = LayerNorm(x + MultiHeadSelfAttention(x, x, x))
    h  = LayerNorm(h' + FFN(h'))
    FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂      (two-layer MLP, ReLU activation)
    W₁ ∈ R^{d_model × d_ff}, d_ff = 4·d_model typically

Self-attention allows each position to attend to all positions in the previous layer, so any two positions are connected by a path of length O(1), compared with O(n) for RNNs. Residual connections (x + …) and Layer Normalisation are crucial for stable training of deep stacks (6–96 layers in practice).
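
The second sub-layer of the encoder, h = LayerNorm(h' + FFN(h')), can be sketched for a single position as follows. The layer normalisation here omits the learned gain and bias, and all function names and the tiny identity weights are assumptions for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x.W1 + b1).W2 + b2 for one position (row vector x)."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]


def ffn_sublayer(h, W1, b1, W2, b2):
    """Residual connection around the FFN, followed by layer normalisation."""
    return layer_norm([a + b for a, b in zip(h, ffn(h, W1, b1, W2, b2))])


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
print(ffn([1.0, -1.0], identity, zeros, identity, zeros))  # ReLU zeroes the -1
print(ffn_sublayer([1.0, -1.0], identity, zeros, identity, zeros))
```

The residual path means the sub-layer only has to learn a correction to its input, and the normalisation keeps activations on a comparable scale from layer to layer.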

Positional Encoding

Unlike RNNs, attention has no built-in notion of order. Positional information is injected by adding a positional encoding PE to the input embeddings:

    PE(pos, 2i)   = sin(pos / 10000^{2i/d_model})
    PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

Modern LLMs typically use Rotary Position Embedding (RoPE) or ALiBi instead, which tend to generalise better to sequences longer than those seen during training.
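
The sinusoidal encoding can be generated with a short double loop. In the sketch below the loop variable even_dim plays the role of 2i in the formula; the function name and the small sizes are illustrative assumptions.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even_dim in range(0, d_model, 2):  # even_dim corresponds to 2i
            angle = pos / (10000 ** (even_dim / d_model))
            pe[pos][even_dim] = math.sin(angle)
            if even_dim + 1 < d_model:
                pe[pos][even_dim + 1] = math.cos(angle)
    return pe


table = positional_encoding(max_len=4, d_model=4)
for row in table:
    print([round(v, 4) for v in row])
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position receives a distinct pattern across the d_model dimensions.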

Scaling Laws

Kaplan et al. (2020) showed that transformer performance scales as a power law in model size N, dataset size D, and compute budget C:

    L(N) ∝ N^{−0.076}     (loss decreasing with parameters)
    L(D) ∝ D^{−0.095}     (loss decreasing with tokens)

Chinchilla law (Hoffmann et al., 2022): N_opt ∝ C^{0.5}, D_opt ∝ C^{0.5}
→ optimal: ~20 tokens per parameter for compute-optimal training
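
A rough worked example shows what these exponents mean in practice. The sketch assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not stated above, together with the ~20 tokens-per-parameter rule; the function names are illustrative.

```python
import math


def compute_optimal(c_flops, tokens_per_param=20.0):
    """Split a FLOP budget into parameters N and tokens D.

    Assumes C ~= 6 * N * D (an approximation, not taken from the text)
    and D = tokens_per_param * N.
    """
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def loss_ratio_when_doubling_params(exponent=0.076):
    """Factor by which L(N) changes when N doubles, from L(N) ~ N^(-exponent)."""
    return 2.0 ** (-exponent)


n, d = compute_optimal(1.2e20)
print(f"N = {n:.3g} parameters, D = {d:.3g} tokens")
print(f"loss factor per doubling of N: {loss_ratio_when_doubling_params():.3f}")
```

Under these assumptions, quadrupling the compute budget doubles both the model size and the token count, which is the N_opt ∝ C^{0.5}, D_opt ∝ C^{0.5} relationship. The small exponent also shows how slowly loss falls: doubling the parameter count reduces it by only about 5%.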

9. Beyond Supervised Learning

The transformer architecture powers systems well beyond text classification. A few notable extensions illustrate the breadth of modern ML:

Diffusion models (DDPM, 2020): learn to reverse a gradual Gaussian noise process. At inference, start from pure noise and iteratively denoise to generate images, audio, or video. Now the dominant approach for high-quality image and video generation (Stable Diffusion, DALL-E 3, Sora).
Graph Neural Networks (GNNs): extend convolution to irregular graph structures. Widely used for molecular property prediction, social network analysis, and chip design (Google reported using a reinforcement-learning agent built on a graph neural network for floorplanning its TPU accelerators).
Self-supervised pre-training: mask tokens and predict them (BERT), or predict the next token (GPT). Unlabelled data provides a near-unlimited training signal. Pre-trained models fine-tune to downstream tasks with few labelled examples.
Neural Scaling and Emergent Abilities: capabilities such as multi-step reasoning, in-context learning, and chain-of-thought have been reported to appear abruptly at certain scale thresholds, making them hard to predict by extrapolating from smaller models; whether this abruptness partly reflects the choice of evaluation metric is still debated.

Current frontier (2025–2026): Mixture-of-Experts (MoE) architectures activate only a fraction of parameters per token, enabling trillion-parameter models at manageable compute cost. State Space Models (Mamba, S4) offer linear-time sequence modelling as an alternative to quadratic-cost attention. Test-time compute scaling (chain-of-thought, search, world models) is emerging as a new axis alongside parameter and data scaling.