How Neural Networks Learn
A neural network is just a big nested function. Learning means finding the parameter values (weights and biases) that minimise a number (the loss). The maths (matrix multiplications, partial derivatives, gradient descent) is surprisingly accessible once you see the whole pipeline at once.
Architecture: Layers and Weights
A neural network consists of layers of neurons. Each connection between neurons has a weight (a real number, usually initialised to a small value near zero) and each neuron has a bias. Learning is entirely about finding the right values for these numbers.
A network with n_in inputs and n_h hidden neurons has a weight matrix W of shape [n_h × n_in] for the first layer, and a bias vector b of shape [n_h]. The second layer has its own weight matrix and bias.
A typical small image-classification network (MNIST, 28×28 pixels) with one hidden layer of 128 neurons has 784×128 + 128 + 128×10 + 10 = 101,770 trainable parameters (weights plus biases). GPT-4 is unofficially estimated to have around 1.8 trillion; OpenAI has not published the figure.
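The parameter count above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name count_parameters is hypothetical and assumes a fully connected network with one bias per neuron.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, inputs first,
    e.g. [784, 128, 10] for the MNIST example.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron in the receiving layer
    return total


mnist_params = count_parameters([784, 128, 10])
print(mnist_params)  # 101770
```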
Forward Pass: Computing a Prediction
To make a prediction, you pass data through the network from input to output. For a single hidden layer with activation function f, the computation is:
h = f(W₁ · x + b₁)
ŷ = W₂ · h + b₂
Where · is matrix–vector multiplication. In Python/NumPy this corresponds directly to the @ operator.
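A minimal NumPy sketch of this forward pass follows. The layer sizes, the random initial weights, and the choice of ReLU as the hidden activation are assumptions made for illustration, not part of the original description.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_h, n_out = 4, 3, 2                # illustrative sizes (assumed)
W1 = rng.normal(0.0, 0.5, (n_h, n_in))    # first-layer weights, shape [n_h x n_in]
b1 = np.zeros(n_h)                        # first-layer biases, shape [n_h]
W2 = rng.normal(0.0, 0.5, (n_out, n_h))   # second-layer weights
b2 = np.zeros(n_out)                      # second-layer biases


def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)      # hidden activations (ReLU assumed)
    y_hat = W2 @ h + b2                   # raw output of the network
    return h, y_hat


x = rng.normal(size=n_in)
h, y_hat = forward(x)
print(h.shape, y_hat.shape)  # (3,) (2,)
```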
This cascades through as many layers as you have — each layer receives the previous layer's activations as its input. The final output is the network's raw prediction.
Activation Functions
Without an activation function, any stack of linear layers is equivalent to a single linear layer — no matter how deep. You could collapse the entire network into one matrix multiplication.
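Activation functions introduce non-linearity. With enough hidden neurons, this lets the network approximate any continuous function on a bounded input range to arbitrary accuracy (the universal approximation theorem), not just linear ones.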
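The collapse of stacked linear layers can be checked numerically. The sketch below uses arbitrary random matrices (an assumption for illustration) and shows that two linear layers with no activation in between give the same output as one combined layer.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)

# Two linear layers applied in sequence, with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU between the layers, the single-layer shortcut
# no longer reproduces the output in general.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
```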
ReLU (Rectified Linear Unit)
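The most widely used. Outputs 0 for negative inputs and x for positive ones: ReLU(x) = max(0, x). Very fast to compute, and because its gradient is exactly 1 for positive inputs it is far less prone to vanishing gradients than sigmoid, although a neuron whose input stays negative receives no gradient at all.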
Sigmoid
Squashes output to (0, 1): σ(x) = 1 / (1 + e⁻ˣ). Used in the output layer for binary classification (probability that the answer is "yes").
Softmax
Like sigmoid but for multiple classes. Turns logits into a probability distribution that sums to 1. Used in the output layer for multi-class classification.
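The three activation functions described in this section can be sketched in NumPy as follows. Subtracting the maximum logit inside softmax is a common numerical-stability trick that is not mentioned above; it does not change the result.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(logits):
    shifted = logits - np.max(logits)  # avoids overflow in exp; result unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)


z = np.array([-2.0, 0.0, 3.0])
print(relu(z))           # [0. 0. 3.]
print(sigmoid(0.0))      # 0.5
print(softmax(z).sum())  # 1.0 up to floating-point rounding
```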
Loss Functions: Measuring the Error
The loss function (or cost function) measures how wrong the prediction is. Learning is about minimising this number.
Mean Squared Error (MSE) — for regression
Average squared difference between predicted and actual values: MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)². Squaring penalises large errors more heavily.
Cross-Entropy — for classification
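Measures the difference between the predicted probability distribution and the true one. If the network assigns probability 0.99 to the correct class, the loss is −ln(0.99) ≈ 0.01. If it assigns only 0.01 to the correct class, the loss is −ln(0.01) ≈ 4.6, and it grows without bound as that probability approaches 0.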
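Both losses fit in a few lines of plain Python. This is a sketch under simple assumptions: cross_entropy takes a list of predicted class probabilities and the index of the true class, and uses the natural logarithm.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])


print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))       # 4/3, about 1.333

confident_right = cross_entropy([0.99, 0.01], 0)   # about 0.01
confident_wrong = cross_entropy([0.99, 0.01], 1)   # about 4.6
```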
Backpropagation: The Chain Rule
Backpropagation computes how much each weight contributed to the loss — i.e., the gradient ∂L/∂wᵢⱼ for every weight. This is done efficiently using the chain rule from calculus.
The chain rule says: if y depends on u, and u depends on x, then:
dy/dx = (dy/du) × (du/dx)
For a network with layers L₁ → L₂ → L₃ → Loss, we compute gradients working backwards:
- Compute ∂Loss/∂L₃ (gradient of loss w.r.t. output)
- Multiply by ∂L₃/∂L₂ (using chain rule through layer 3 weights)
- Multiply by ∂L₂/∂L₁ (chain rule through layer 2)
- Continue to every weight in the network
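The backward steps listed above can be written out by hand for a very small network. The sketch below assumes a scalar network with one sigmoid hidden unit and a squared-error loss (a simplification chosen for illustration), and checks one gradient against a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_fn(w1, b1, w2, b2, x, t):
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return (y - t) ** 2


def gradients(w1, b1, w2, b2, x, t):
    # Forward pass, keeping the intermediate values.
    z = w1 * x + b1
    h = sigmoid(z)
    y = w2 * h + b2
    # Backward pass: apply the chain rule from the loss towards the input.
    dL_dy = 2.0 * (y - t)            # gradient of the loss w.r.t. the output
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2               # pass the gradient back through layer 2
    dL_dz = dL_dh * h * (1.0 - h)    # derivative of sigmoid is h(1 - h)
    dL_dw1 = dL_dz * x
    dL_db1 = dL_dz
    return dL_dw1, dL_db1, dL_dw2, dL_db2


params = (0.5, -0.1, 1.5, 0.2)       # w1, b1, w2, b2 (arbitrary values)
x, t = 2.0, 1.0
analytic = gradients(*params, x, t)

# Finite-difference check of dL/dw1.
eps = 1e-6
numeric_w1 = (loss_fn(params[0] + eps, *params[1:], x, t)
              - loss_fn(params[0] - eps, *params[1:], x, t)) / (2 * eps)
print(abs(analytic[0] - numeric_w1) < 1e-6)
```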
Modern libraries like PyTorch use automatic differentiation (autograd): the computation graph is tracked during the forward pass, and gradients are computed exactly (up to floating-point precision) by traversing it in reverse. You rarely need to write the chain rule by hand.
Gradient Descent and Optimisers
Once we have the gradient ∂L/∂w for every weight, we move each weight slightly in the direction that reduces the loss:
w ← w − η × ∂L/∂w
Where η (eta) is the learning rate — how big a step we take. Too large: overshoots the minimum. Too small: takes forever.
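The effect of the learning rate is easy to see on a one-dimensional toy loss. The sketch below assumes L(w) = (w − 3)², which has its minimum at w = 3; the function name and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimise L(w) = (w - 3)^2, whose gradient is 2(w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - lr * grad  # w <- w - eta * dL/dw
    return w


w_good = gradient_descent(lr=0.1)     # ends very close to 3
w_small = gradient_descent(lr=0.001)  # has barely moved after 50 steps
w_large = gradient_descent(lr=1.1)    # every step overshoots, so w diverges
print(w_good, w_small, w_large)
```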
SGD → Adam → AdamW
- SGD (Stochastic Gradient Descent): Update weights after each random sample (or mini-batch). Noisy but fast.
- Momentum: Keep a running average of previous gradients to smooth the update direction. Like a ball rolling downhill that builds up speed.
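- Adam: Combines momentum with per-weight adaptive learning rates. A common default optimiser in deep learning. Maintains running averages of both gradients (m) and squared gradients (v), corrects them for their bias towards zero in early steps (giving m̂ and v̂), then computes: w ← w − η × m̂/(√v̂ + ε).
- AdamW: Adam with decoupled weight decay. Instead of adding an L2 penalty to the loss, which Adam's adaptive scaling would distort, it shrinks each weight directly at every step. A common default for training language models.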
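The Adam update from the list above can be sketched for a single scalar weight. The hyperparameter defaults are the commonly used ones apart from the learning rate, which is raised for this toy problem; the function name and the optional weight_decay argument (AdamW-style decoupled decay when non-zero) are illustrative assumptions.

```python
import math


def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, weight_decay=0.0, steps=300):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g       # running average of gradients
        v = beta2 * v + (1 - beta2) * g * g   # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        # Adam step, plus decoupled weight decay (AdamW) when weight_decay > 0.
        w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w


# Same toy loss as plain gradient descent: L(w) = (w - 3)^2.
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # should end close to 3 on this toy problem
```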
The Full Training Loop
A single pass through all training data is called an epoch. The number of epochs varies widely with the model and dataset; 10–100 is common for small networks. Within each epoch, the dataset is shuffled and split into small batches, with one weight update per batch (mini-batch gradient descent) for efficiency.
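Putting the pieces together, a complete training loop might look like the NumPy sketch below. Everything specific here is an assumption for illustration: the toy dataset (points labelled by whether they fall inside the unit circle), the 8-unit ReLU hidden layer, the sigmoid output with binary cross-entropy, plain mini-batch SGD, and the hyperparameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (assumed): label 1 if the point lies inside the unit circle.
X = rng.normal(size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(float)

# Parameters: 2 inputs -> 8 hidden (ReLU) -> 1 output (sigmoid).
W1, b1 = rng.normal(0.0, 0.5, (8, 2)), np.zeros(8)
W2, b2 = rng.normal(0.0, 0.5, (1, 8)), np.zeros(1)


def forward(Xb):
    Z1 = Xb @ W1.T + b1                  # pre-activations, shape (batch, 8)
    H = np.maximum(0.0, Z1)              # hidden activations
    logits = (H @ W2.T + b2)[:, 0]       # shape (batch,)
    p = 1.0 / (1.0 + np.exp(-logits))    # predicted probability of class 1
    return Z1, H, p


def bce(p, yb):
    p = np.clip(p, 1e-12, 1 - 1e-12)     # avoid log(0)
    return float(-np.mean(yb * np.log(p) + (1 - yb) * np.log(1 - p)))


lr, batch_size, epochs = 0.1, 20, 50
initial_loss = bce(forward(X)[2], y)

for epoch in range(epochs):
    order = rng.permutation(len(X))      # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        Z1, H, p = forward(Xb)           # forward pass
        # Backward pass (chain rule, output to input).
        d_logits = (p - yb) / len(idx)   # gradient of mean BCE w.r.t. logits
        dW2 = d_logits[None, :] @ H      # shape (1, 8)
        db2 = np.array([d_logits.sum()])
        dZ1 = (d_logits[:, None] * W2) * (Z1 > 0)  # back through ReLU
        dW1 = dZ1.T @ Xb                 # shape (8, 2)
        db1 = dZ1.sum(axis=0)
        # Gradient descent update.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

final_loss = bce(forward(X)[2], y)
print(initial_loss, final_loss)          # the loss should go down
```

Each pass of the outer loop is one epoch; each pass of the inner loop is one mini-batch update consisting of a forward pass, a backward pass, and a weight update.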
Try It Yourself
- Neural Network from Scratch Simulation — Visualise a training neural network in real time: watch weights update, loss decrease, and a decision boundary form.