1. Gradient Descent
Machine learning reduces to one universal problem: find the parameter vector θ that minimises a scalar loss function L(θ). Gradient descent does this by repeatedly stepping in the direction of steepest descent:
Vanilla SGD: θ ← θ − η · ∇L(θ)
Momentum (heavy-ball):
v ← β v − η · ∇L(θ)
θ ← θ + v
Adam (adaptive moments):
m ← β₁ m + (1−β₁) g (1st moment)
v ← β₂ v + (1−β₂) g² (2nd moment)
θ ← θ − η · m̂ / (√v̂ + ε)
The loss landscape for small networks is a surface in high-dimensional space. Saddle points are far more common than local minima — modern optimisers escape them through noise and adaptive step-sizes.
2. Backpropagation — the Chain Rule Applied
Backprop is nothing more than the chain rule of calculus applied systematically to a computation graph. For a loss L = f(g(x)):
Forward: z¹ = W¹x + b¹, a¹ = σ(z¹), z² = W²a¹ + b², ŷ =
softmax(z²)
Loss: L = −Σ yᵢ log ŷᵢ (cross-entropy)
Backward: ∂L/∂W² = (∂L/∂ŷ)(∂ŷ/∂z²)(∂z²/∂W²) = (ŷ − y)ᵀ · a¹
∂L/∂W¹ = (W²ᵀ δ²) ⊙ σ′(z¹) · xᵀ (where δ² = ŷ − y)
The key insight: gradients flow backwards through each layer, each multiplied by the local Jacobian. With ReLU activations (σ′ = 0 or 1) the gradient either passes through unmodified or is zero — which is both the source of efficiency and the "dying ReLU" pathology.
// Minimal backprop for a 2-layer network (JavaScript pseudocode)
function backward(x, y, W1, W2, b1, b2) {
// Forward
const z1 = matMul(W1, x).add(b1);
const a1 = relu(z1);
const z2 = matMul(W2, a1).add(b2);
const yHat = softmax(z2);
// Output layer gradient (softmax + cross-entropy simplifies nicely)
const dz2 = yHat.sub(y); // (C × 1)
const dW2 = matMul(dz2, a1.T); // (C × H)
// Hidden layer gradient
const da1 = matMul(W2.T, dz2); // (H × 1)
const dz1 = da1.mul(reluGrad(z1)); // element-wise ReLU′
const dW1 = matMul(dz1, x.T); // (H × D)
return { dW1, dW2, db1: dz1, db2: dz2 };
}
3. Convolutional Intuition
A convolutional layer slides a small kernel (e.g. 3×3) over the input. Each position produces one scalar by computing a dot product between the kernel weights and the local receptive field. This parameter sharing — re-using the same kernel at every location — gives CNNs translation invariance and dramatically reduces parameter count compared with fully-connected alternatives.
4. Reinforcement Learning & Evolution
Algorithms at a Glance
Why does ReLU work better than sigmoid? The sigmoid saturates at 0 and 1 — gradients vanish for large activations (the "vanishing gradient" problem). ReLU is linear for positive inputs, so gradients flow through without shrinkage. Variants like Leaky ReLU (slope 0.01 for x < 0) and GELU (Gaussian-error linear unit, used in transformers) address the "dying ReLU" problem where neurons permanently output zero.