📅 March 2026 ⏱ ~9 min read 🟡 Intermediate

How Neural Networks Learn

A neural network is a large nested function. Learning means finding the parameter values (weights and biases) that minimise a single number, the loss. The maths (matrix multiplications, partial derivatives, gradient descent) is surprisingly accessible once you see the whole pipeline at once.

Architecture: Layers and Weights

A neural network consists of layers of neurons. Each connection between neurons has a weight (a real number, usually initialised to a small value near zero but free to take any value during training) and each neuron has a bias. Learning is entirely about finding the right values for these numbers.

[Diagram: Input layer (x₁, x₂, x₃) → Hidden layer (h₁, h₂, h₃, h₄) → Output layer (ŷ₁, ŷ₂)]

A network with n_in inputs and n_h hidden neurons has a weight matrix W of shape [n_h × n_in] for the first layer, and a bias vector b of shape [n_h]. The second layer has its own weight matrix and bias.

A typical small image-classification network (MNIST, 28×28 pixels) with one hidden layer of 128 neurons has 784×128 + 128 + 128×10 + 10 = 101,770 parameters (weights plus biases). GPT-4 is rumoured to have around 1.8 trillion, although OpenAI has not published an official figure.
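The count above can be checked with a few lines of Python. This is a minimal sketch; `count_parameters` and the list-of-layer-sizes convention are names introduced here for illustration, not part of any library.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # entries in this layer's weight matrix
        total += n_out         # one bias per neuron in this layer
    return total

mnist_params = count_parameters([784, 128, 10])
print(mnist_params)  # 101770
```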

Forward Pass: Computing a Prediction

To make a prediction, you pass data through the network from left to right. For a single hidden layer, the computation is:

h = activate( W₁ · x + b₁ )
ŷ = W₂ · h + b₂

Where · is matrix–vector multiplication. In Python/NumPy this is literally:

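import numpy as np

def relu(z):
    return np.maximum(0, z)  # element-wise max(0, z)

# Single forward pass (one sample, batch_size=1)
# Assumes W1: [n_hidden, n_in], b1: [n_hidden], W2: [n_out, n_hidden], b2: [n_out], x: [n_in]
h = relu(W1 @ x + b1)    # shape: [n_hidden]
y_hat = W2 @ h + b2      # shape: [n_out]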

This cascades through as many layers as you have — each layer receives the previous layer's activations as its input. The final output is the network's raw prediction.
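To illustrate the cascade, here is one way the same computation might be written for any number of layers. It is a sketch under stated assumptions: the `forward` helper, the list of `(W, b)` pairs, and the choice of ReLU on every layer except the last are illustrative, and the weights are random placeholders rather than trained values.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, layers):
    """layers: list of (W, b) pairs; ReLU on all but the last layer."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b
        is_last = i == len(layers) - 1
        a = z if is_last else relu(z)  # raw output on the final layer
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # 3 inputs, two hidden layers of 4, 2 outputs
layers = [(rng.normal(size=(n_out, n_in)) * 0.5, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = np.array([1.0, -2.0, 0.5])
y_hat = forward(x, layers)
print(y_hat.shape)  # (2,)
```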

Activation Functions

Without an activation function, any stack of linear layers is equivalent to a single linear layer — no matter how deep. You could collapse the entire network into one matrix multiplication.

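Activation functions introduce non-linearity. With enough hidden neurons, a network with a non-linear activation can approximate any continuous function on a bounded input range to arbitrary accuracy (the universal approximation theorem), not just linear ones.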
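The collapse of stacked linear layers can be verified numerically. The sketch below uses small hand-picked matrices (illustrative values, not from any real model): two linear layers match a single combined layer exactly, and inserting ReLU between them breaks that equivalence.

```python
import numpy as np

W1 = np.array([[1.0, 0.0, -1.0],
               [0.0, 2.0, 1.0],
               [-1.0, 1.0, 0.0],
               [2.0, -1.0, 1.0]])
b1 = np.array([0.5, -1.0, 0.0, 1.0])
W2 = np.array([[1.0, -1.0, 0.5, 0.0],
               [0.0, 1.0, 1.0, -1.0]])
b2 = np.array([0.0, 1.0])
x = np.array([1.0, -1.0, 2.0])

# Two linear layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as ONE linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined
linear_collapses = np.allclose(two_layers, one_layer)

# With ReLU in between, the single combined layer no longer matches
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
relu_collapses = np.allclose(with_relu, one_layer)

print(linear_collapses, relu_collapses)  # True False
```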

ReLU (Rectified Linear Unit)

ReLU(x) = max(0, x)

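The most widely used. Simply outputs 0 for negative inputs and x for positive ones. Very fast to compute, and because its gradient is exactly 1 for positive inputs it is far less prone to vanishing gradients than sigmoid. Its gradient is 0 for negative inputs, though, so a neuron that always receives negative input stops learning (the "dying ReLU" problem).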

Sigmoid

σ(x) = 1 / (1 + e^−x)

Squashes output to (0, 1). Used in the output layer for binary classification (probability that the answer is "yes").

Softmax

softmax(xᵢ) = e^xᵢ / Σⱼ e^xⱼ

Like sigmoid but for multiple classes. Turns logits into a probability distribution that sums to 1. Used in the output layer for multi-class classification.
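The three activations above might be written in NumPy as follows. This is a sketch rather than a library implementation; the max-subtraction in `softmax` is a common numerical-stability trick added here, and it does not change the result.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max leaves the output unchanged but avoids overflow in exp
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, -1.0])
print(relu(logits))           # the negative entry becomes 0
print(sigmoid(0.0))           # 0.5
print(softmax(logits).sum())  # sums to 1, up to floating-point rounding
```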

Loss Functions: Measuring the Error

The loss function (or cost function) measures how wrong the prediction is. Learning is about minimising this number.

Mean Squared Error (MSE) — for regression

L = (1/N) × Σ (yᵢ − ŷᵢ)²

Average squared difference between predicted and actual values. Squaring penalises large errors more heavily.
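A small worked example shows the effect of squaring. Both sets of predictions below have the same total absolute error (3), but concentrating it in one large miss triples the MSE. The numbers are made up for illustration.

```python
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

y_true = [3.0, 5.0, 2.0]
small_errors = mse(y_true, [2.0, 6.0, 3.0])   # three errors of size 1
one_big_error = mse(y_true, [3.0, 5.0, 5.0])  # one error of size 3
print(small_errors, one_big_error)  # 1.0 3.0
```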

Cross-Entropy — for classification

L = −Σ yᵢ × log(ŷᵢ)

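Measures the difference between the predicted probability distribution and the true one. For a one-hot label, only the probability assigned to the correct class matters. With 99% on the correct class the loss is −log(0.99) ≈ 0.01 (natural log); with only 1% on the correct class (99% confident and wrong) it is −log(0.01) ≈ 4.6. The loss approaches 0 as that probability approaches 1 and grows without bound as it approaches 0.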
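To put numbers on this, the sketch below evaluates the cross-entropy formula for a two-class problem with a one-hot label, using the natural logarithm. The `cross_entropy` helper is illustrative; real libraries usually work from logits for numerical stability.

```python
import math

def cross_entropy(y_true, y_pred):
    """y_true: one-hot list; y_pred: predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [1, 0]  # the correct class is class 0
confident_right = cross_entropy(y_true, [0.99, 0.01])
unsure = cross_entropy(y_true, [0.5, 0.5])
confident_wrong = cross_entropy(y_true, [0.01, 0.99])

print(round(confident_right, 4), round(unsure, 4), round(confident_wrong, 4))
# 0.0101 0.6931 4.6052
```

The loss for a confident wrong answer is large but finite; it grows without bound only as the probability on the correct class approaches 0.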

Why not just use accuracy? Accuracy is not differentiable — changing a weight by a tiny amount typically doesn't change accuracy at all (you either get the answer right or wrong). Loss functions like cross-entropy change smoothly with every tiny weight adjustment, which is what gradient descent needs.

Backpropagation: The Chain Rule

Backpropagation computes how much each weight contributed to the loss — i.e., the gradient ∂L/∂wᵢⱼ for every weight. This is done efficiently using the chain rule from calculus.

The chain rule says: if y depends on u, and u depends on x, then:

∂y/∂x = (∂y/∂u) × (∂u/∂x)
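As a quick sanity check of the chain rule, take the illustrative pair u = 3x + 1 and y = u². The analytic product of the two local derivatives should match a finite-difference estimate of dy/dx.

```python
def u(x):
    return 3 * x + 1   # inner function, du/dx = 3

def y(u_val):
    return u_val ** 2  # outer function, dy/du = 2u

x0 = 2.0
u0 = u(x0)                 # 7.0
analytic = (2 * u0) * 3    # (dy/du) * (du/dx) = 42.0

# Central finite difference as an independent check
eps = 1e-6
numeric = (y(u(x0 + eps)) - y(u(x0 - eps))) / (2 * eps)

print(analytic, round(numeric, 4))
```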

For a network with layers L₁ → L₂ → L₃ → Loss, we compute gradients working backwards:

Compute ∂Loss/∂L₃ (gradient of loss w.r.t. output)
Multiply by ∂L₃/∂L₂ (using chain rule through layer 3 weights)
Multiply by ∂L₂/∂L₁ (chain rule through layer 2)
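Continue backwards in the same way; at each layer, the gradient accumulated so far also gives ∂Loss/∂w for that layer's own weights and biases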
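The steps above can be sketched by hand for the one-hidden-layer network from the forward-pass section. This is an illustrative implementation under stated assumptions: ReLU hidden layer, MSE loss averaged over the outputs, random placeholder weights, and helper names (`forward`, `backward`, `numeric_grad_W1`) invented here. A finite-difference check confirms the hand-derived gradient.

```python
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)          # ReLU
    y_hat = W2 @ h + b2
    return z1, h, y_hat

def loss_fn(y_hat, y):
    return np.mean((y - y_hat) ** 2)

def backward(params, x, y):
    W1, b1, W2, b2 = params
    z1, h, y_hat = forward(params, x)
    d_yhat = 2 * (y_hat - y) / y.size   # dLoss/d(output)
    dW2 = np.outer(d_yhat, h)           # output-layer weights
    db2 = d_yhat
    d_h = W2.T @ d_yhat                 # pass gradient back to hidden layer
    d_z1 = d_h * (z1 > 0)               # ReLU passes gradient only where z1 > 0
    dW1 = np.outer(d_z1, x)             # hidden-layer weights
    db1 = d_z1
    return dW1, db1, dW2, db2

def numeric_grad_W1(params, x, y, eps=1e-6):
    W1 = params[0]
    grad = np.zeros_like(W1)
    for idx in np.ndindex(W1.shape):
        original = W1[idx]
        W1[idx] = original + eps
        loss_plus = loss_fn(forward(params, x)[2], y)
        W1[idx] = original - eps
        loss_minus = loss_fn(forward(params, x)[2], y)
        W1[idx] = original              # restore the weight
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]
x, y = rng.normal(size=3), rng.normal(size=2)

dW1, db1, dW2, db2 = backward(params, x, y)
max_error = np.max(np.abs(numeric_grad_W1(params, x, y) - dW1))
print(dW1.shape, dW2.shape, max_error < 1e-5)
```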

Modern libraries like PyTorch use automatic differentiation (autograd) — the computation graph is tracked during the forward pass, and gradients are computed exactly by traversing it in reverse. You never write the chain rule by hand.

# PyTorch training step
optimizer.zero_grad()    # clear old gradients
y_hat = model(x)         # forward pass
loss = criterion(y_hat, y)  # compute loss
loss.backward()          # backprop: compute all ∂L/∂w
optimizer.step()         # update weights
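For a version of the step above that runs on its own, the sketch below fills in the missing pieces with stand-ins: a single `nn.Linear` layer as the model, MSE as the criterion, plain SGD as the optimiser, and a synthetic batch. All of these choices are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)                    # tiny stand-in model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 3)                      # synthetic batch of 8 samples
y = x.sum(dim=1, keepdim=True)             # synthetic target

loss_before = criterion(model(x), y).item()

optimizer.zero_grad()                      # clear old gradients
y_hat = model(x)                           # forward pass
loss = criterion(y_hat, y)                 # compute loss
loss.backward()                            # autograd fills .grad on each parameter
grad_shape = tuple(model.weight.grad.shape)
optimizer.step()                           # update weights

loss_after = criterion(model(x), y).item()
print(grad_shape, loss_before > loss_after)
```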

Gradient Descent and Optimisers

Once we have the gradient ∂L/∂w for every weight, we move the weights slightly in the direction that reduces the loss:

w ← w − η × ∂L/∂w

Where η (eta) is the learning rate — how big a step we take. Too large: overshoots the minimum. Too small: takes forever.
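The effect of the learning rate is easy to see on a one-dimensional toy loss. The sketch below minimises L(w) = (w − 3)², an illustrative function chosen because its gradient, 2(w − 3), is simple; the three learning rates are arbitrary picks for "reasonable", "too small", and "too large".

```python
def gradient_descent(lr, steps=20, w0=0.0):
    """Minimise L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)
        w = w - lr * grad
    return w

w_good = gradient_descent(lr=0.1)     # ends close to the minimum at 3
w_small = gradient_descent(lr=0.001)  # barely moves in 20 steps
w_large = gradient_descent(lr=1.1)    # overshoots further on every step

print(w_good, w_small, w_large)
```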

SGD → Adam → AdamW

SGD (Stochastic Gradient Descent): Update weights after each random sample (or mini-batch). Noisy but fast.
Momentum: Keep a running average of previous gradients to smooth the update direction. Like a ball rolling downhill that builds up speed.
Adam: Combines momentum with per-weight adaptive learning rates. The default optimiser for most deep learning. Maintains running averages of both gradients (m) and squared gradients (v), then computes: w ← w − η × m̂/(√v̂ + ε).
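AdamW: Adam with decoupled weight decay. Each weight is shrunk directly by a small factor at every step, rather than adding an L2 penalty to the loss; under Adam's adaptive scaling the two are not equivalent. A common default for training language models.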
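The Adam update can be written out for a single scalar weight. This is a minimal sketch, not a library implementation: `adam_step` is a name introduced here, the hyperparameter defaults are the commonly quoted ones, and the optional `weight_decay` term is applied directly to the weight in the decoupled (AdamW) style.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    m = beta1 * m + (1 - beta1) * grad       # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    adam_update = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay acts on w itself, not through the gradient
    w = w - lr * (adam_update + weight_decay * w)
    return w, m, v

# Minimise L(w) = (w - 3)^2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 301):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
print(w)
```

Note that the very first step moves the weight by almost exactly the learning rate, whatever the size of the gradient, because m̂/√v̂ is then grad/|grad|.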

The Full Training Loop

for epoch in range(NUM_EPOCHS):
    model.train()
    running_loss = 0.0
    for x_batch, y_batch in dataloader:
        # 1. Forward pass
        predictions = model(x_batch)

        # 2. Compute loss
        loss = loss_fn(predictions, y_batch)

        # 3. Backward pass (compute gradients)
        optimizer.zero_grad()
        loss.backward()

        # 4. Update weights
        optimizer.step()
        running_loss += loss.item()

    # 5. Evaluate on validation set (evaluate() is a user-defined helper)
    train_loss = running_loss / len(dataloader)  # mean loss over the epoch's batches
    val_loss = evaluate(model, val_loader)
    print(f"Epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")

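A single pass through all training data is called an epoch. Small and medium-sized networks often train for somewhere in the region of 10–100 epochs, though the right number depends on the model and dataset. Each epoch covers the entire dataset in small random batches (mini-batch gradient descent) for efficiency.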
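What a data loader does within one epoch can be sketched in a few lines of NumPy: shuffle the indices once, then slice them into batches. The `minibatches` generator and the toy arrays are illustrative; PyTorch's `DataLoader` offers the same idea with many more features.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs covering the dataset once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
y = np.arange(10)

batches = list(minibatches(X, y, batch_size=4, rng=rng))
batch_sizes = [len(xb) for xb, _ in batches]
print(batch_sizes)  # [4, 4, 2]
```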

Overfitting: If validation loss starts rising while training loss keeps falling, the network has memorised the training data rather than learning general patterns. Fixes include: more data, dropout, weight decay, data augmentation, or early stopping.
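Early stopping is the simplest of these fixes to sketch. The helper below is one possible formulation, with a hypothetical `patience` parameter: it scans the per-epoch validation losses and stops once the loss has failed to improve for `patience` consecutive epochs, returning the epoch whose weights would be kept. The loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss seen
    before training would have been stopped."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch

val_losses = [0.90, 0.62, 0.48, 0.45, 0.47, 0.52, 0.60]
print(early_stopping_epoch(val_losses))  # 3
```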

Try It Yourself

Neural Network from Scratch Simulation — Visualise a training neural network in real time: watch weights update, loss decrease, and a decision boundary form.
