About this simulation

This simulation drops four optimisers — SGD, Momentum, RMSprop and Adam — onto the same 3D loss surface and lets you watch them race towards a minimum. Pick from four classic optimisation test functions (Rosenbrock, Rastrigin, Himmelblau, Beale), each with its own known trap for gradient-based methods, from narrow curved valleys to fields of local minima. Every optimiser starts from the same point and takes its update rule from the real formulas used to train neural networks, so the coloured trails you see are genuine algorithm behaviour, not a scripted animation. Gradients are computed numerically with a central finite difference rather than symbolically, exactly as a from-scratch implementation would.
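As a rough illustration of the central finite difference mentioned above, here is a minimal Python sketch. The function names and the step size `h` are assumptions made for this example, not values taken from the simulation's source.

```python
def rosenbrock(x, y):
    # Rosenbrock test function; global minimum at (1, 1).
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2


def numerical_gradient(f, x, y, h=1e-5):
    # Central difference: sample f on both sides of the point along each axis.
    # h is an assumed step size; the simulation may use a different one.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy


gx, gy = numerical_gradient(rosenbrock, -1.0, 1.0)
```

At (-1, 1) the analytic gradient of Rosenbrock is (-4, 0), so `gx` and `gy` should land very close to those values. That makes this point a handy sanity check for any finite-difference routine.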

🔬 What it shows

A live 3D mesh of a chosen loss function (Rosenbrock, Rastrigin, Himmelblau or Beale), coloured by height, with four spheres — one per optimiser — descending it in parallel. Each sphere leaves a coloured trail (red SGD, teal Momentum, yellow RMSprop, green Adam) so you can compare paths, speed and whether an optimiser gets stuck in a local minimum or oscillates across a valley.

🎮 How to use

Choose a loss surface from the dropdown, then tune Learning Rate (log scale), Momentum β₁, Adam β₂ and Steps/frame. Toggle any of the four optimisers on or off to isolate its behaviour, drag to orbit the 3D scene, scroll to zoom, and use Pause/Reset to freeze the run or restart every optimiser from its shared starting point.

💡 Did you know?

Adam (Adaptive Moment Estimation) was introduced by Kingma and Ba in 2014 and combines the ideas behind Momentum and RMSprop: it tracks exponential moving averages of both the gradient and its square, then bias-corrects them, because both averages start at zero and would otherwise underestimate the true values during the first few steps. That combination of momentum and per-parameter step scaling is a large part of why Adam is such a common default optimiser in deep learning today.
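One way to sketch a single Adam update for one scalar parameter is shown below. The default values are the ones suggested in the Kingma and Ba paper; the function name and argument order are assumptions made for this example, and the simulation's own code may be organised differently.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving average of the gradient (the Momentum-like part).
    m = beta1 * m + (1 - beta1) * grad
    # Moving average of the squared gradient (the RMSprop-like part).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: m and v start at zero, so rescale them for small t.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step size is normalised by the root of the squared-gradient average.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step (t = 1) from param = 1.0 with gradient 4.0 and zeroed averages.
new_param, m, v = adam_step(1.0, 4.0, 0.0, 0.0, t=1)
```

On the very first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly `lr` whatever the size of the gradient.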

Frequently asked questions

What is gradient descent?

Gradient descent is an iterative optimisation algorithm that minimises a loss function by repeatedly moving parameters in the direction of the negative gradient. It is the backbone of training neural networks and most machine-learning models.
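The loop below is a minimal sketch of that idea on a simple bowl, f(x, y) = x² + y². The bowl, the starting point and the learning rate are illustrative assumptions and are not among the simulation's surfaces or settings.

```python
def bowl_gradient(x, y):
    # Analytic gradient of f(x, y) = x**2 + y**2 (assumed example surface).
    return 2 * x, 2 * y


def gradient_descent(grad_fn, start, lr=0.1, steps=100):
    x, y = start
    for _ in range(steps):
        gx, gy = grad_fn(x, y)
        # Move against the gradient, scaled by the learning rate.
        x -= lr * gx
        y -= lr * gy
    return x, y


x, y = gradient_descent(bowl_gradient, (3.0, -2.0))
```

With these settings each step shrinks both coordinates by a factor of 0.8, so the point ends up very close to the minimum at (0, 0).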

How does the learning rate affect convergence?

The learning rate controls the step size taken in the direction of the negative gradient. Too large a rate can make the optimiser overshoot minima, oscillate or diverge; too small a rate makes progress very slow. Adaptive methods like Adam adjust the effective learning rate for each parameter automatically.
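A small worked example can make this concrete. On f(x) = x² the gradient is 2x, so one plain gradient step multiplies x by (1 − 2·lr). The sketch below runs three learning rates chosen for illustration; the function name and the specific rates are assumptions, not the simulation's settings.

```python
def run_descent(lr, steps=20, x=1.0):
    # Plain gradient descent on f(x) = x**2, whose gradient is 2 * x.
    for _ in range(steps):
        x -= lr * 2 * x
    return x


too_small = run_descent(0.001)  # factor 0.998 per step: barely moves
well_chosen = run_descent(0.3)  # factor 0.4 per step: converges quickly
too_large = run_descent(1.1)    # factor -1.2 per step: overshoots and grows
```

For this function any rate above 1.0 gives a per-step factor larger than 1 in magnitude, which is why the last run moves away from the minimum instead of towards it.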

What is the difference between SGD and Adam?

SGD (Stochastic Gradient Descent) updates parameters by subtracting the gradient multiplied by a fixed learning rate, so every parameter shares the same step scale. Adam (Adaptive Moment Estimation) maintains exponential moving averages of both gradients and squared gradients, giving each parameter its own adaptive learning rate with bias correction. This often converges faster on surfaces where gradient magnitudes differ widely between directions.

What is the Rosenbrock function, and why is it a hard test?

The Rosenbrock function f(x,y) = (1−x)² + 100(y−x²)² has its global minimum at (1,1) inside a narrow, curved valley. The gradient along the valley floor is tiny compared to the gradient across it. The learning rate has to stay small enough to cope with the steep walls, so plain SGD tends to zig-zag from wall to wall while creeping along the floor, whereas Momentum and Adam usually move through the valley much faster.
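To get a feel for the difference in scale, the sketch below evaluates the analytic Rosenbrock gradient at one point on the valley floor (where y = x²) and at a point slightly up the wall. The two sample points are arbitrary choices made for this example.

```python
import math


def rosenbrock_gradient(x, y):
    # Partial derivatives of f(x, y) = (1 - x)**2 + 100 * (y - x**2)**2.
    dfdx = -2 * (1 - x) - 400 * x * (y - x * x)
    dfdy = 200 * (y - x * x)
    return dfdx, dfdy


# On the valley floor, y = x**2.
floor_grad = rosenbrock_gradient(0.5, 0.25)
# Same x, shifted 0.1 up the valley wall.
wall_grad = rosenbrock_gradient(0.5, 0.35)

floor_norm = math.hypot(*floor_grad)
wall_norm = math.hypot(*wall_grad)
```

Moving just 0.1 off the floor makes the gradient roughly 29 times larger in this case, and most of it points back across the valley, not along it.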

Why does Momentum sometimes outrun Adam, and sometimes not?

Momentum accumulates a velocity vector, so it keeps accelerating in directions where the gradient consistently points the same way, which suits smooth valleys like Rosenbrock's. Adam additionally normalises each parameter's step by a running estimate of its squared gradient, which keeps step sizes steadier on landscapes where gradient magnitudes differ sharply between directions or change quickly from place to place, as they do across Rastrigin's many small bumps.
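The velocity build-up can be sketched as follows. This uses one common form of the momentum update (velocity = beta × velocity + gradient), which is an assumption here; the simulation may use a slightly different variant, and the names and defaults are illustrative.

```python
def momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
    # Velocity keeps a decaying sum of past gradients.
    velocity = beta * velocity + grad
    # The parameter moves along the accumulated velocity.
    param = param - lr * velocity
    return param, velocity


# Feed a constant gradient of 1.0 to see how far the velocity builds up.
param, velocity = 0.0, 0.0
for _ in range(200):
    param, velocity = momentum_step(param, 1.0, velocity)
```

With a constant gradient the velocity approaches 1 / (1 − beta), which is 10 for beta = 0.9, so the effective step becomes about ten times larger than a plain gradient step with the same learning rate. When the gradient keeps flipping sign, successive contributions largely cancel and no such build-up occurs.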