Neural Networks & Backpropagation — The Maths Behind AI
Every image recognised by your phone, every sentence completed by a language model, every recommendation served by a streaming platform — all emerge from the same algorithm: a network of weighted sums shaped by backpropagation. The core idea is disarmingly simple: make a prediction, measure the error, and nudge every weight in the direction that would have reduced that error. Repeat millions of times. What follows is the precise mathematics that transforms this intuition into the engine of the AI revolution.
1. From Perceptron to MLP
The story begins in 1958 with Frank Rosenblatt's perceptron — a model of a single biological neuron. It takes n input signals x₁, x₂, ..., xₙ, multiplies each by a learned weight, sums them with a bias, and fires (outputs 1) if the result exceeds a threshold, otherwise stays silent (outputs 0):
The perceptron learning rule — increase weights for correctly classified inputs, decrease for misclassified — was shown to converge for any linearly separable problem. The 1969 book Perceptrons by Minsky and Papert famously proved that a single-layer perceptron cannot compute XOR, effectively freezing AI research for over a decade.
The solution, already known theoretically, was to stack perceptrons into layers. A multilayer perceptron (MLP) — also called a feedforward neural network or deep network — has:
- An input layer of dimension n (one node per input feature)
- One or more hidden layers that learn intermediate representations
- An output layer that produces the final prediction
With at least one hidden layer and a non-linear activation function, an MLP can approximate any continuous function on a compact domain — a fact formalised by the Universal Approximation Theorem (Section 9). The practical revolution came in the 1980s when Rumelhart, Hinton, and Williams showed how to efficiently train MLPs using backpropagation.
2. The Forward Pass: z = Wx + b
Consider a network with L layers. Let layer l have nₗ neurons. The computation at each layer proceeds in two steps:
Step 1 — Linear transformation: Form the pre-activation vector z by multiplying the weight matrix W^(l) (shape nₗ × nₗ₋₁) by the activations from the previous layer a^(l-1) and adding the bias vector b^(l):
Step 2 — Non-linear activation: Apply a scalar activation function σ element-wise to produce the activations for the next layer:
The input to the network sets a^(0) = x (the raw features). The output of the last layer a^(L) = ŷ is the network's prediction. For classification tasks the softmax function is typically applied to the final layer to convert raw scores (logits) into a probability distribution over classes:
Each output node ŷ_k gives the network's estimated probability that input x belongs to class k. All outputs are positive and sum to 1, satisfying the axioms of a probability distribution.
3. Activation Functions
The activation function σ is the source of all non-linearity — and therefore all expressive power — in a neural network. Three families dominate modern practice:
Sigmoid
Historically dominant. Maps any real input to (0,1), making it natural for binary classification output layers. Its gradient σ'(z) = σ(z)(1 − σ(z)) peaks at 0.25 and collapses toward zero for |z| > 3 — the root cause of the vanishing gradient problem (Section 7).
Hyperbolic Tangent (tanh)
Zero-centred variant of sigmoid, often preferred in hidden layers. Same vanishing gradient problem.
Rectified Linear Unit (ReLU)
Introduced as an activation for deep networks by Nair and Hinton (2010). Computationally trivial, gradient is 1 for z > 0 and 0 for z < 0 — dramatically alleviating vanishing gradients. Powers virtually all modern deep networks. Variants include Leaky ReLU (small negative slope for z < 0), ELU, and GELU (used in Transformers).
Neural Network Simulator Build layers, choose activations, and watch training converge in real time4. Loss Function: Cross-Entropy
Training requires a scalar measure of how wrong the network's predictions are. For multi-class classification with C classes, the standard choice is the cross-entropy loss:
where y_k ∈ {0,1} is the one-hot true label (1 for the correct class, 0 for all others) and ŷ_k is the network's predicted probability for class k. Because exactly one y_k = 1, this reduces to:
The loss is zero when ŷ_correct = 1 (perfect confidence) and approaches infinity as ŷ_correct → 0. This is exactly what we want: heavily penalise confident wrong predictions.
Cross-entropy arises naturally from maximum likelihood estimation. If we model the output as a categorical distribution, maximising the log-likelihood of the training labels is equivalent to minimising the cross-entropy loss. This gives the loss a solid probabilistic foundation beyond mere convenience.
For binary classification (C = 2) this simplifies to binary cross-entropy:
For regression tasks the common choice is mean squared error (MSE):
5. Backpropagation and the Chain Rule
We need to compute ∂L/∂w for every weight w in the network so we can update weights to reduce the loss. A network with millions of parameters makes this seem intractable, but backpropagation solves it in a single backward pass using the chain rule of calculus.
The chain rule states that if L depends on z which depends on w, then:
In a network, the loss L depends on the output ŷ, which depends on activations, which depend on pre-activations z, which depend on weights W. Expanding the chain:
The key insight of backpropagation is that these partial derivatives can be computed efficiently by passing a gradient signal (called delta, δ) backwards through the network, reusing computations at each layer. For layer l, define:
Starting from the output layer (using, e.g., softmax + cross-entropy, where the gradient is simply ŷ − y), propagate backwards:
where ⊙ denotes element-wise multiplication and σ'(z^(l)) is the derivative of the activation function evaluated at the pre-activations of layer l. The weight gradient is then simply:
Total computation: one forward pass + one backward pass = 2× a forward pass. Regardless of network depth, all gradients are computed in constant multiples of a single forward pass — this is what makes training deep networks tractable.
6. Stochastic Gradient Descent
Once gradients are computed, weights are updated to descend the loss landscape. Gradient descent takes a step in the negative gradient direction:
where η (eta) is the learning rate — the step size. Too large and updates overshoot minima; too small and training is impractically slow.
In practice, computing the exact gradient over the entire training set (batch gradient descent) is computationally prohibitive for large datasets. Instead, stochastic gradient descent (SGD) estimates the gradient from a randomly selected mini-batch of B examples (typically B = 32–512):
This noisy gradient estimate is surprisingly effective: the noise acts as implicit regularisation, helping escape shallow local minima, and mini-batch parallelism makes GPU acceleration straightforward.
Momentum and Adaptive Methods
Vanilla SGD converges slowly in ravines — regions where the curvature is much higher in one direction than another. Modern optimisers address this:
- SGD + Momentum accumulates an exponentially decaying average of past gradients, accelerating in consistent directions.
- Adam (Kingma and Ba, 2014) maintains per-parameter estimates of both the first moment (mean gradient) and second moment (uncentred variance), dividing by the square root of the latter to adaptively scale the learning rate. Adam is the default for most modern deep learning.
- AdamW decouples weight decay from the Adam update, improving generalisation and now dominant in language model training.
7. The Vanishing Gradient Problem
Backpropagation multiplies gradients layer by layer using σ'(z). For sigmoid, the maximum value of σ'(z) is 0.25. In a network with L = 20 layers, the gradient of the loss with respect to the first layer's weights passes through 20 multiplication by values typically much less than 1:
With sigmoid, typical products like 0.2^20 ≈ 10⁻¹⁴ make the gradient essentially zero. Early layers receive almost no training signal — their weights barely move while later layers overfit. This was the dominant obstacle to deep networks throughout the 1990s.
Several solutions emerged:
- ReLU activation — gradient is 1 for positive pre-activations, eliminating repeated multiplication by values less than 1.
- Careful weight initialisation — He initialisation for ReLU (weights drawn from N(0, 2/nₗ)) and Glorot/Xavier initialisation for tanh maintain signal variance across layers.
- Residual connections (He et al., ResNet 2015) — skip connections that add the input of a block directly to its output, creating gradient highways: ∂L/∂W_early = ∂L/∂W_late + (∂L/∂W_late · ∂block/∂W_early). The additive term ensures gradient reaches early layers regardless of depth.
- Batch normalisation — controls the distribution of pre-activations, keeping them in regions where gradients are healthy (see Section 8).
The exploding gradient problem (product > 1 at each step) is addressed by gradient clipping: if ||∇L|| > threshold, rescale ∇L → threshold · ∇L / ||∇L||.
Backpropagation Visualiser See gradient flow layer by layer and observe vanishing gradients live8. Batch Normalisation
Proposed by Ioffe and Szegedy in 2015, batch normalisation (BN) normalises the pre-activations within each mini-batch to have zero mean and unit variance, then applies learnable scale γ and shift β parameters:
σ²_B = (1/B) · Σ (z_i − μ_B)² (mini-batch variance)
z_hat_i = (z_i − μ_B) / √(σ²_B + ε)
BN(z_i) = γ · z_hat_i + β
where ε ≈ 10⁻⁵ is a small constant for numerical stability. The learnable parameters γ and β allow the network to undo the normalisation if needed, preserving representational capacity.
Batch normalisation has several beneficial effects:
- Reduces internal covariate shift — keeps each layer's input distribution stable during training, allowing higher learning rates.
- Acts as regulariser — the noise from mini-batch statistics has a regularising effect similar to dropout, often reducing the need for explicit dropout.
- Smooths the loss landscape — theoretical analysis (Santurkar et al., 2018) shows BN makes the loss surface smoother and gradients more predictable, which is its primary benefit.
- Mitigates vanishing/exploding gradients — by keeping pre-activations in a healthy range.
At inference time, the mini-batch statistics are replaced by running averages computed during training. For sequence models and small-batch settings, Layer Normalisation (normalising across features rather than the batch dimension) is preferred — it powers every modern Transformer.
9. Universal Approximation Theorem
Why do neural networks work at all? The theoretical grounding is the Universal Approximation Theorem, proved in various forms by Cybenko (1989), Hornik (1991), and Barron (1993).
The classical version states:
In plain language: a single hidden layer with enough neurons can approximate any continuous function arbitrarily well. This establishes that MLPs are not limited to simple hypotheses — they are, in principle, as expressive as any model.
However, the theorem comes with important caveats:
- Width vs depth — universal approximation tells us a shallow network can represent any function, but the required width may be exponential in the input dimension. Deep networks can represent many functions exponentially more efficiently (in number of parameters) than shallow ones.
- Approximation ≠ learning — the theorem guarantees existence of the right weights, but says nothing about whether SGD will find them. Optimisation and generalisation are separate questions.
- Generalisation — a network that perfectly fits training data may fail on unseen examples. Regularisation (weight decay, dropout, data augmentation) is essential for models that generalise.
More modern results (Barron 1993; Eldan and Shamir 2016) quantify the depth-width tradeoff precisely. For functions with bounded Fourier energy, a two-layer network needs O(1/ε²) neurons, while deeper networks can achieve the same approximation quality with exponentially fewer neurons. Depth is not just convenient — it is computationally fundamental.
Together, the Universal Approximation Theorem and the efficiency of backpropagation explain why neural networks became the dominant paradigm in machine learning: they are both expressive enough to model the complexity of the real world and trainable enough that gradient descent actually finds good solutions.