Machine Learning — From Linear Regression to Deep Learning
Machine learning has transformed every field it touches — computer vision, natural language, protein folding, drug discovery, game playing. Behind the impressive results lies a surprisingly coherent mathematical framework: optimise a loss function over parameters using gradient descent, control generalisation with regularisation, and scale up with deep architectures. This article traces the conceptual path from linear regression to the transformer — the architecture powering today's large language models.
1. Learning Paradigms: Supervised, Unsupervised, RL
Machine learning encompasses three broad families of algorithms, distinguished by the nature of the feedback signal available during training:
Supervised Learning
Given a dataset D = {(x₁, y₁), …, (x_N, y_N)} of input-output pairs, learn a function f_θ: X → Y that generalises to unseen inputs. The goal is to minimise expected loss on the true data distribution — approximated by empirical risk minimisation (ERM) on the training set. Examples: image classification, regression, translation, speech recognition.
Unsupervised Learning
Given unlabelled data {x₁, …, x_N}, discover structure: clusters, manifolds, generative models, or compressed representations. The model receives no explicit target — it must find its own signal from patterns in the data. Examples: k-means clustering, principal component analysis (PCA), autoencoders, generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models.
Reinforcement Learning
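An agent interacts with an environment, choosing actions a_t in state s_t and receiving scalar reward r_t. The goal is to learn a policy π(a|s) that maximises the expected cumulative discounted reward E[Σ_{t≥0} γᵗ r_t], where the discount factor γ ∈ [0, 1) weights near-term reward more heavily than distant reward.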
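As a minimal sketch, assuming the standard definition in which a reward k steps ahead is weighted by γ^k (γ being a discount factor in [0, 1); the function name is illustrative), the discounted return of a finite episode can be accumulated backwards:

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_k gamma**k * rewards[k] for a finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```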
RL algorithms include Q-learning, policy gradients (REINFORCE), actor-critic methods (PPO, SAC), and model-based RL. AlphaGo and AlphaZero (2016–2017) combined RL with tree search to reach superhuman play in board games; ChatGPT uses Reinforcement Learning from Human Feedback (RLHF) to align language model outputs with human preferences.
2. Bias-Variance Tradeoff
For squared-error loss, the expected test error of a model can be decomposed into three components: squared bias, variance, and irreducible noise.
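Written out for squared-error loss, with f the true function, f̂ the learned predictor and σ² the noise variance (notation assumed here), and with expectations taken over training sets and label noise, the standard decomposition is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```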
A high-bias model (e.g., linear regression on non-linear data) consistently makes the same type of error regardless of training data — it has insufficient capacity to capture the true pattern (underfitting). A high-variance model (e.g., a degree-50 polynomial fitted to 20 noisy points) fits the training data nearly perfectly but fluctuates wildly with different training sets (overfitting).
The tradeoff manifests as a U-shaped test error curve as model complexity increases: error first decreases as bias falls, then increases as variance rises. The optimal complexity sits at the minimum of test error. Modern large neural networks appear to violate this picture through the double descent phenomenon: after an initial U-curve, error decreases again as model size grows far beyond the interpolation threshold — a phenomenon not fully explained by classical statistical learning theory.
3. Loss Functions and Gradient Descent
Training a model means minimising a loss function L(θ) over parameters θ. Common loss functions:
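Two of the most widely used choices are mean squared error for regression and cross-entropy for classification. The sketch below assumes those two (the text does not fix a particular list) and uses illustrative function names:

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of (y - y_hat)^2."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(true_class, probs, eps=1e-12):
    """Negative log-probability assigned to the correct class."""
    # eps guards against log(0) when a probability is exactly zero.
    return -math.log(max(probs[true_class], eps))


print(mse([1.0, 2.0], [1.0, 4.0]))          # (0 + 4) / 2 = 2.0
print(cross_entropy(0, [0.5, 0.25, 0.25]))  # -log(0.5), about 0.693
```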
Gradient descent iteratively moves parameters in the direction of steepest descent, θ ← θ − η ∇L(θ), where the learning rate η sets the step size.
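A minimal sketch of this iteration, theta ← theta − lr · grad(theta), on a one-dimensional quadratic chosen only because its minimum is known. Here lr plays the role of the learning rate, and all names are illustrative:

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(theta_star)  # close to 3.0
```

With lr = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step. For this particular loss, a learning rate above 1.0 would make the iterates diverge, which is the usual motivation for tuning or adapting the step size.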
Modern optimisers use adaptive learning rates. Adam (Kingma & Ba, 2014) maintains exponential moving averages of gradients (m_t) and squared gradients (v_t) per parameter:
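A sketch of one Adam step for a single scalar parameter, following the update rule of Kingma & Ba with the paper's default β₁, β₂ and ε. The function name and the toy problem are illustrative assumptions:

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # moving average of gradients (m_t)
    v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients (v_t)
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise the toy loss (theta - 3)^2 starting from 0.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)
```

On the very first step the bias-corrected ratio m̂/√v̂ equals the sign of the gradient, so the parameter moves by almost exactly lr regardless of the gradient's scale.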
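Adam adapts the step size per parameter, which often gives faster convergence than vanilla SGD with less learning-rate tuning. Related optimisers include its predecessors AdaGrad and RMSProp, as well as AdamW (weight decay decoupled from the gradient update) and Lion.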
4. Regularisation: L1, L2, and Dropout
Regularisation adds constraints or noise to prevent overfitting, biasing the model toward simpler solutions.
L2 Regularisation (Ridge / Weight Decay)
Adds a penalty proportional to the squared magnitude of weights:
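Written out, with λ ≥ 0 denoting the regularisation strength (a symbol assumed here), the penalised objective and its gradient are:

```latex
L_{\text{ridge}}(\theta) = L(\theta) + \lambda \sum_i \theta_i^2,
\qquad
\nabla_\theta L_{\text{ridge}}(\theta) = \nabla_\theta L(\theta) + 2\lambda\,\theta
```

The extra gradient term shrinks every weight towards zero in proportion to its size, which is where the name weight decay comes from.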
L1 Regularisation (Lasso)
Adds a penalty proportional to the absolute value of weights:
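A small sketch contrasting the L1 penalty with its L2 counterpart, assuming a regularisation strength lam (a hypothetical name). The L1 subgradient has the same magnitude for every non-zero weight, which is what tends to drive small weights exactly to zero and produce sparse solutions:

```python
def sign(w):
    """Sign of w, with sign(0) = 0 as the conventional subgradient choice."""
    return (w > 0) - (w < 0)


def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)


def l1_subgradient(weights, lam):
    # Same magnitude lam for every non-zero weight, regardless of its size.
    return [lam * sign(w) for w in weights]


def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)


def l2_gradient(weights, lam):
    # Proportional to the weight, so small weights receive only a small push.
    return [2.0 * lam * w for w in weights]


w = [2.0, -0.5, 0.0]
print(l1_penalty(w, 0.1), l1_subgradient(w, 0.1))
print(l2_penalty(w, 0.1), l2_gradient(w, 0.1))
```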
Dropout
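During training, each neuron's output is set to zero independently with probability p (typically p = 0.1 to 0.5). At test time, all neurons are active and outputs are scaled by the keep probability (1 − p), so that expected activations match those seen during training.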
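A sketch of the scheme as described: drop with probability p during training, scale by (1 − p) at test time. Most current frameworks use the equivalent "inverted" form instead, which scales the surviving activations by 1/(1 − p) during training and leaves test-time outputs untouched. Both are shown, with illustrative function names:

```python
import random


def dropout_train(x, p, rng):
    """Classic dropout: zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else xi for xi in x]


def dropout_test(x, p):
    """Classic dropout at test time: keep everything, scale by the keep probability."""
    return [xi * (1.0 - p) for xi in x]


def inverted_dropout_train(x, p, rng):
    """Inverted dropout: rescale survivors so no test-time scaling is needed."""
    keep = 1.0 - p
    return [0.0 if rng.random() < p else xi / keep for xi in x]


rng = random.Random(0)
x = [1.0] * 10000
p = 0.5
train_mean = sum(dropout_train(x, p, rng)) / len(x)
test_mean = sum(dropout_test(x, p)) / len(x)
print(train_mean, test_mean)  # the two means should roughly agree
```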
Dropout can be interpreted as training an ensemble of up to 2^N different thinned networks (for N neurons) that share weights, with the test-time scaling approximating an average over them. It prevents co-adaptation: neurons cannot rely on specific other neurons always being present. The original transformer used a dropout rate of 0.1, while many recent large language models are pre-trained with little or no dropout; vision transformers often rely on stochastic depth (dropping entire residual blocks) instead.
5. Convolutional Neural Networks
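Convolutional Neural Networks (CNNs) exploit two regularities of natural images: useful features are local, and a feature detector that works in one place works equally well elsewhere. Sharing the same weights across positions makes convolution translation-equivariant and drastically reduces parameter count compared to fully connected networks. Three key operations define a CNN: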
Convolution Layer
A 3×3 convolutional layer with 64 input and 128 output channels has only 3×3×64×128 = 73,728 weights (plus 128 biases), independent of image size. A fully connected layer mapping an H×W×64 input to an H×W×128 output would need (H·W·64)×(H·W·128) weights. Weight sharing is the key: the same filter is applied at every spatial position.
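The arithmetic can be checked directly. The sketch below counts weights only (biases omitted) and, for the comparison, assumes a hypothetical 32×32 feature map and a dense layer that connects every one of the H·W·64 input values to every one of the H·W·128 output values:

```python
def conv_params(kernel_h, kernel_w, c_in, c_out):
    """Weights in a 2-D convolution layer, ignoring biases."""
    return kernel_h * kernel_w * c_in * c_out


def dense_params(h, w, c_in, c_out):
    """Weights in a dense layer mapping an h*w*c_in input to an h*w*c_out output."""
    return (h * w * c_in) * (h * w * c_out)


print(conv_params(3, 3, 64, 128))     # 73728
print(dense_params(32, 32, 64, 128))  # 8589934592
```

Even at this modest resolution the dense layer needs over 100,000 times as many weights, and the gap grows with the square of the number of pixels.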
Pooling Layer
Max pooling (the more common choice) or average pooling reduces spatial dimensions by taking the maximum or mean over a local window (typically 2×2 with stride 2), which halves the resolution along each spatial dimension. Pooling provides a degree of translation invariance and reduces computation.
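A minimal sketch of 2×2 max pooling with stride 2 on a single-channel feature map stored as a list of lists. The function name is illustrative, and even height and width are assumed:

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a list-of-lists with even dimensions."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```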
Flatten and Fully Connected Head
After several conv-pool blocks, the spatial feature map is flattened to a vector and passed through fully connected layers for classification. Modern architectures replace the FC head with global average pooling (GAP), further reducing parameters and overfitting.
Landmark CNN architectures: AlexNet (2012, won ImageNet by a large margin), VGG-16 (2014, deep and uniform), ResNet (2015, residual connections enabling 152+ layers), EfficientNet (2019, compound scaling of a baseline found by neural architecture search). Since 2020, Vision Transformers (ViT) have matched or surpassed CNNs when pre-trained on large datasets.
6. Recurrent Networks and LSTMs
Standard feedforward networks process fixed-size inputs. Recurrent Neural Networks (RNNs) maintain a hidden state h_t that accumulates information across a sequence x₁, x₂, …:
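In the usual Elman form, assuming a tanh nonlinearity, input weights W_x and bias b (symbols introduced here; W_h is the recurrent weight matrix), the update is:

```latex
h_t = \tanh\!\left(W_h\, h_{t-1} + W_x\, x_t + b\right)
```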
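Simple RNNs suffer from vanishing and exploding gradients: when backpropagating through many time steps, the gradient is multiplied at each step by a Jacobian that contains W_h and the derivative of the activation. Roughly, if the largest singular value of W_h is below 1, gradients shrink exponentially with the number of steps; if W_h is large enough, they can instead grow exponentially. Either way, long-range dependencies become very hard to learn.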
Long Short-Term Memory (LSTM)
Hochreiter & Schmidhuber (1997) introduced the LSTM cell with explicit gating mechanisms to control information flow over long sequences:
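The now-standard formulation, which includes the forget gate added in later work, can be written as follows. Here σ is the logistic sigmoid, ⊙ is element-wise multiplication, [h_{t−1}, x_t] denotes concatenation, and the weight and bias names are conventional choices assumed for this sketch:

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```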
The cell state c_t acts as a "memory highway" that can carry information unchanged across hundreds of time steps. The forget gate can learn to clear the memory; the input gate can learn to write selectively. LSTMs dominated sequence modelling (language, speech, time series) from ~2015 until transformers superseded them in 2017–2019.
7. The Attention Mechanism
Attention allows a model to focus on relevant parts of its input when producing each output, rather than compressing the entire input into a fixed-size vector. Bahdanau et al. (2015) introduced an additive form for machine translation and Luong et al. (2015) studied dot-product variants; the scaled dot-product attention of Vaswani et al. (2017) computes:
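In the notation of Vaswani et al. (2017), with query, key and value matrices Q, K, V, key dimension d_k, and the softmax applied row-wise:

```latex
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right),
\qquad
\operatorname{Attention}(Q, K, V) = A\,V
```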
Each query token i attends to every key token j with weight A_{ij}, which can be interpreted as a soft retrieval: the output for query i is a weighted sum of values, where the weights measure query-key compatibility. The scaling by 1/√d_k prevents the dot products from growing large in high dimensions, which would cause softmax outputs to concentrate near zero or one (vanishing gradients).
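The soft-retrieval reading can be made concrete with a small pure-Python sketch of scaled dot-product attention. It operates on lists of row vectors, omits masking and batching, and uses illustrative names throughout:

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (output, weights) where weights[i][j] is A_ij.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]
    output = [
        [sum(a * v[c] for a, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0], [20.0]]
out, A = attention(Q, K, V)
print(A)    # the query matches the first key more strongly
print(out)  # so the output sits closer to 10 than to 20
```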
Multi-Head Attention
Multiple heads allow the model to attend to different aspects simultaneously — one head might capture syntactic relationships, another semantic similarity, another positional proximity. The outputs are concatenated and projected back to d_model.
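In the formulation of Vaswani et al. (2017), with h heads and learned projection matrices W_i^Q, W_i^K, W_i^V and W^O, this reads:

```latex
\begin{aligned}
\text{head}_i &= \operatorname{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}), \qquad i = 1, \dots, h \\
\operatorname{MultiHead}(Q, K, V) &= \operatorname{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^{O}
\end{aligned}
```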
8. The Transformer Architecture
Vaswani et al. (2017) "Attention Is All You Need" introduced the transformer — discarding recurrence entirely in favour of pure attention. The encoder-decoder transformer consists of stacked identical layers:
Encoder Layer
Self-attention allows each position to attend to all positions in the same layer — capturing long-range dependencies in O(1) path length instead of the O(n) path length of RNNs. Residual connections (x + …) and Layer Normalisation are crucial for stable training of deep stacks (6–96 layers in practice).
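In the post-norm arrangement of the original paper (many later models move the normalisation inside the residual branch instead), one encoder layer maps its input x to an output y as follows; z is an intermediate name introduced here:

```latex
\begin{aligned}
z &= \operatorname{LayerNorm}\big(x + \operatorname{MultiHead}(x, x, x)\big) \\
y &= \operatorname{LayerNorm}\big(z + \operatorname{FFN}(z)\big), \qquad
\operatorname{FFN}(z) = \max(0,\, z W_1 + b_1)\, W_2 + b_2
\end{aligned}
```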
Positional Encoding
Unlike RNNs, attention has no built-in notion of order. Positional information is injected by adding a positional encoding PE to the input embeddings:
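One common choice is the fixed sinusoidal encoding of Vaswani et al. (2017), PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below assumes that variant; learned and rotary encodings are alternatives, and the function name is illustrative:

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent i / d_model
            # matches 2k / d_model for the k-th sin/cos pair.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(num_positions=4, d_model=8)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```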
Scaling Laws
Kaplan et al. (2020) showed that transformer performance scales as a power law in model size N, dataset size D, and compute budget C:
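Schematically, when the other two factors are not the bottleneck, the test loss L follows laws of the form below. N_c, D_c, C_c and the exponents α_N, α_D, α_C are empirically fitted constants whose values are not reproduced here:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```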
9. Beyond Supervised Learning
Modern ML extends well beyond supervised classification, and well beyond any single architecture. A few notable directions illustrate the breadth:
- Diffusion models (DDPM, 2020): learn to reverse a gradual Gaussian noising process. At inference, start from pure noise and iteratively denoise to generate images, audio or video. Now the dominant approach for high-quality image and video generation (Stable Diffusion, DALL-E 3, Sora).
- Graph Neural Networks (GNNs): extend convolution to irregular graph structures. Essential for molecular property prediction, social network analysis, and chip design (Google has reported using a GNN-based reinforcement learning agent for TPU floorplanning).
- Self-supervised pre-training: mask tokens and predict them (BERT), or predict the next token (GPT). Unlabelled data provides a near-unlimited training signal. Pre-trained models fine-tune to downstream tasks with few labelled examples.
- Neural Scaling and Emergent Abilities: capabilities such as multi-step reasoning, in-context learning, and chain-of-thought have been reported to appear abruptly at certain scale thresholds, making them hard to predict by extrapolating from smaller models. How much of this abruptness reflects the choice of evaluation metric remains debated.