Machine Learning Fundamentals — Gradient Descent, Backprop & Neural Networks from Scratch

1. Gradient Descent

Machine learning reduces to one universal problem: find the parameter vector θ that minimises a scalar loss function L(θ). Gradient descent does this by repeatedly stepping in the direction of steepest descent:

SGD and Variants

Vanilla SGD: θ ← θ − η · ∇L(θ)

Momentum (heavy-ball):
  v ← β v − η · ∇L(θ)
  θ ← θ + v

Adam (adaptive moments):
  m ← β₁ m + (1−β₁) g   (1st moment)
  v ← β₂ v + (1−β₂) g² (2nd moment)
  θ ← θ − η · m̂ / (√v̂ + ε)

The loss landscape for small networks is a surface in high-dimensional space. Saddle points are far more common than local minima — modern optimisers escape them through noise and adaptive step-sizes.

⛰️ Gradient Descent Visualiser Watch SGD, momentum, RMSProp, and Adam descend different 2D loss landscapes in real time. Explore saddle points, local minima, and narrow ravines to build intuition for optimizer behaviour. SGD · Momentum · RMSProp · Adam 🕸️ Neural Network Trainer Train a multi-layer perceptron on four classification datasets in the browser. Configure layers, activation functions, batch size, and watch accuracy and loss update in real time. Backpropagation · minibatch SGD · ReLU / Tanh / Sigmoid

2. Backpropagation — the Chain Rule Applied

Backprop is nothing more than the chain rule of calculus applied systematically to a computation graph. For a loss L = f(g(x)):

Chain Rule — Forward & Backward Pass

Forward: z¹ = W¹x + b¹, a¹ = σ(z¹), z² = W²a¹ + b², ŷ = softmax(z²)
Loss: L = −Σ yᵢ log ŷᵢ (cross-entropy)

Backward: ∂L/∂W² = (∂L/∂ŷ)(∂ŷ/∂z²)(∂z²/∂W²) = (ŷ − y)ᵀ · a¹
∂L/∂W¹ = (W²ᵀ δ²) ⊙ σ′(z¹) · xᵀ (where δ² = ŷ − y)

The key insight: gradients flow backwards through each layer, each multiplied by the local Jacobian. With ReLU activations (σ′ = 0 or 1) the gradient either passes through unmodified or is zero — which is both the source of efficiency and the "dying ReLU" pathology.

// Minimal backprop for a 2-layer network (JavaScript pseudocode)
function backward(x, y, W1, W2, b1, b2) {
  // Forward
  const z1 = matMul(W1, x).add(b1);
  const a1 = relu(z1);
  const z2 = matMul(W2, a1).add(b2);
  const yHat = softmax(z2);

  // Output layer gradient (softmax + cross-entropy simplifies nicely)
  const dz2 = yHat.sub(y);                    // (C × 1)
  const dW2 = matMul(dz2, a1.T);             // (C × H)

  // Hidden layer gradient
  const da1 = matMul(W2.T, dz2);             // (H × 1)
  const dz1 = da1.mul(reluGrad(z1));         // element-wise ReLU′
  const dW1 = matMul(dz1, x.T);             // (H × D)

  return { dW1, dW2, db1: dz1, db2: dz2 };
}

3. Convolutional Intuition

A convolutional layer slides a small kernel (e.g. 3×3) over the input. Each position produces one scalar by computing a dot product between the kernel weights and the local receptive field. This parameter sharing — re-using the same kernel at every location — gives CNNs translation invariance and dramatically reduces parameter count compared with fully-connected alternatives.

4. Reinforcement Learning & Evolution

🎰 Reinforcement Learning Q-learning and policy gradient agents learning to navigate, balance, and collect rewards in a gridworld environment. Visualise the value function evolving in real time. Q-learning · ε-greedy · REINFORCE policy gradient 🧬 Natural Selection Genetic algorithm evolving creature morphologies and locomotion strategies. The fitness landscape shapes population trajectories over hundreds of generations. Genetic algorithm · crossover & mutation · fitness selection

🗺️ Self-Organising Map Unsupervised topology-preserving dimensionality reduction. A 2D grid of neurons self-organises to reflect the structure of high-dimensional input data during competitive learning. Kohonen SOM · best-matching unit · neighbourhood function 🌳 Decision Tree Visualiser Watch a CART decision tree grow node by node, splitting on Gini impurity at each step. Change the training data and max depth to observe overfitting and pruning effects. CART · Gini impurity · information gain · pruning

Algorithms at a Glance

Stochastic Gradient Descent Adam optimiser Backpropagation Chain rule Softmax cross-entropy ReLU /Tanh Convolutional layers Q-learning Policy gradient Genetic algorithm Kohonen SOM CART Gini impurity

Why does ReLU work better than sigmoid? The sigmoid saturates at 0 and 1 — gradients vanish for large activations (the "vanishing gradient" problem). ReLU is linear for positive inputs, so gradients flow through without shrinkage. Variants like Leaky ReLU (slope 0.01 for x < 0) and GELU (Gaussian-error linear unit, used in transformers) address the "dying ReLU" problem where neurons permanently output zero.