This simulation visualises how four popular optimisation algorithms navigate a 3D loss landscape in real time. The coloured spheres represent parameter pairs (x, y), and the coloured trails show the path each optimiser has taken.
Use the controls to switch between loss surfaces, adjust the learning rate and momentum, and toggle individual optimisers on and off. Drag the 3D scene to rotate; scroll to zoom.
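Numerical gradient (central difference), with h = 1e-4:

$$\frac{\partial f}{\partial x} \approx \frac{f(x+h,\,y) - f(x-h,\,y)}{2h}$$

Adam bias-corrected moments and update:

$$\hat{m} = \frac{m}{1-\beta_1^{t}}, \qquad \hat{v} = \frac{v}{1-\beta_2^{t}}, \qquad \theta \leftarrow \theta - \alpha\,\frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$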
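The central-difference formula can be sketched in a few lines of Python. This is an illustrative sketch, not the simulation's own code: the function name `numerical_gradient` and the test surface `bowl` are hypothetical, and the y-derivative is assumed to be computed the same way as the x-derivative.

```python
def numerical_gradient(f, x, y, h=1e-4):
    """Approximate (df/dx, df/dy) with central differences of width 2h."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy


def bowl(x, y):
    # Hypothetical test surface whose analytic gradient is (2x, 6y).
    return x ** 2 + 3 * y ** 2


gx, gy = numerical_gradient(bowl, 1.0, -2.0)
print(gx, gy)  # should be close to the analytic gradient (2, -12)
```

Central differences cost two function evaluations per parameter, which is cheap for a two-parameter surface but is the reason real training code relies on automatic differentiation instead.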
Adam was introduced by Diederik Kingma and Jimmy Ba in 2014 and quickly became the default optimiser for training deep neural networks. Its name comes from Adaptive Moment Estimation. On the Rosenbrock banana function — notoriously tricky due to its narrow curved valley — Adam typically converges in a fraction of the steps required by plain SGD.
This simulation drops four optimisers (SGD, Momentum, RMSprop and Adam) onto the same 3D loss surface and lets you watch them race towards a minimum. Pick from four classic optimisation test functions (Rosenbrock, Rastrigin, Himmelblau, Beale), each with its own known trap for gradient-based methods, from narrow curved valleys to fields of local minima. Every optimiser starts from the same point and applies the same update rule that is used when training neural networks, so the coloured trails you see are genuine algorithm behaviour, not a scripted animation. Gradients are computed numerically with a central finite difference rather than symbolically, as a simple from-scratch implementation might do.
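For reference, the four test functions can be written down in their standard textbook forms. This is a sketch under the assumption that the simulation uses these forms unscaled; the `KNOWN_MINIMA` table is a hypothetical helper, and Himmelblau actually has four equal minima, of which only one is listed.

```python
import math


def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2


def rastrigin(x, y):
    return (20 + x ** 2 - 10 * math.cos(2 * math.pi * x)
            + y ** 2 - 10 * math.cos(2 * math.pi * y))


def himmelblau(x, y):
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2


def beale(x, y):
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)


# One global minimiser per function; every function has minimum value 0.
KNOWN_MINIMA = {
    "Rosenbrock": (rosenbrock, (1.0, 1.0)),
    "Rastrigin": (rastrigin, (0.0, 0.0)),
    "Himmelblau": (himmelblau, (3.0, 2.0)),
    "Beale": (beale, (3.0, 0.5)),
}

for name, (f, (x, y)) in KNOWN_MINIMA.items():
    print(name, f(x, y))  # each value should be (numerically) zero
```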
A live 3D mesh of a chosen loss function (Rosenbrock, Rastrigin, Himmelblau or Beale), coloured by height, with four spheres — one per optimiser — descending it in parallel. Each sphere leaves a coloured trail (red SGD, teal Momentum, yellow RMSprop, green Adam) so you can compare paths, speed and whether an optimiser gets stuck in a local minimum or oscillates across a valley.
Choose a loss surface from the dropdown, then tune Learning Rate (log scale), Momentum β₁, Adam β₂ and Steps/frame. Toggle any of the four optimisers on or off to isolate its behaviour, drag to orbit the 3D scene, scroll to zoom, and use Pause/Reset to freeze the run or restart every optimiser from its shared starting point.
Adam (Adaptive Moment Estimation) was introduced by Kingma and Ba in 2014 and combines the ideas behind Momentum and RMSprop: it tracks an exponential moving average of both the gradient and its square, then bias-corrects both averages, because they start at zero and would otherwise underestimate the true moments during the first steps. That combination of momentum and per-parameter scaling is a large part of why Adam is a common default optimiser in deep learning today.
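A short worked example may make the bias correction concrete. It assumes the moving averages start at zero and uses the commonly quoted values β₁ = 0.9 and β₂ = 0.999; both are assumptions here, since the simulation lets you change them. After the first gradient g₁ the raw averages are

$$m_1 = (1-\beta_1)\,g_1 = 0.1\,g_1, \qquad v_1 = (1-\beta_2)\,g_1^2 = 0.001\,g_1^2,$$

which are 10 and 1000 times smaller than the quantities they are meant to estimate. Dividing by the correction factors restores them:

$$\hat{m}_1 = \frac{m_1}{1-\beta_1^{1}} = g_1, \qquad \hat{v}_1 = \frac{v_1}{1-\beta_2^{1}} = g_1^2,$$

so the first update is

$$\alpha\,\frac{\hat{m}_1}{\sqrt{\hat{v}_1}+\epsilon} = \alpha\,\frac{g_1}{|g_1|+\epsilon} \approx \alpha\,\operatorname{sign}(g_1).$$

Without the correction the ratio would instead be 0.1 / √0.001 ≈ 3.16 times that size, so under these assumptions the uncorrected first step would be about three times too large.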
Gradient descent is an iterative optimisation algorithm that minimises a loss function by repeatedly moving parameters in the direction of the negative gradient. It is the backbone of training neural networks and most machine-learning models.
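The basic loop can be sketched as follows. This is a minimal illustration, not the simulation's code: `gradient_descent`, `bowl_grad` and the quadratic surface they minimise are hypothetical names and choices made for the example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


def bowl_grad(x, y):
    # Gradient of the hypothetical surface f(x, y) = (x - 3)^2 + (y + 1)^2,
    # whose minimum is at (3, -1).
    return 2 * (x - 3), 2 * (y + 1)


x_final, y_final = gradient_descent(bowl_grad, start=(0.0, 0.0))
print(x_final, y_final)  # should end up very close to (3, -1)
```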
The learning rate controls the step size taken in the direction of the negative gradient. Too large a rate causes the optimiser to overshoot minima and diverge; too small a rate makes training very slow. Adaptive methods like Adam adjust effective learning rates per parameter automatically.
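The effect is easy to reproduce on the one-dimensional function f(x) = x², whose gradient is 2x. The sketch below is an illustration with hypothetical names and learning-rate values chosen for the example; each step multiplies x by (1 − 2 · learning rate), so the three rates below shrink slowly, shrink quickly, and grow without bound respectively.

```python
def final_distance(learning_rate, steps=50, x=1.0):
    """Run gradient descent on f(x) = x^2 and return the final |x|."""
    for _ in range(steps):
        x -= learning_rate * 2 * x  # gradient of x^2 is 2x
    return abs(x)


for lr in (0.001, 0.1, 1.1):
    # 0.001: barely moves; 0.1: converges; 1.1: overshoots further each step
    print(lr, final_distance(lr))
```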
SGD (Stochastic Gradient Descent) updates each parameter by subtracting the gradient scaled by a fixed learning rate. Adam (Adaptive Moment Estimation) maintains bias-corrected exponential moving averages of both the gradients and the squared gradients, which gives each parameter its own adaptive step size. On complex surfaces Adam typically converges much faster.
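One way to see the difference is to feed both rules the same gradient at different scales. The sketch below is illustrative only: the function names are hypothetical, the parameter is one-dimensional, and the Adam hyperparameters are the commonly quoted defaults rather than values taken from the simulation.

```python
import math


def sgd_update(g, lr=0.01):
    """Change applied to the parameter by one plain SGD step."""
    return -lr * g


def adam_updates(grads, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Changes applied by Adam for a stream of 1-D gradients."""
    m = v = 0.0
    changes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        changes.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return changes


for scale in (0.01, 1.0, 100.0):
    grads = [scale] * 5
    # SGD's step grows with the gradient; Adam's stays near -lr.
    print(scale, sgd_update(scale), adam_updates(grads)[-1])
```

With a constant gradient, Adam's step stays close to the learning rate whatever the gradient's magnitude, while the SGD step scales with it directly.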
The Rosenbrock function f(x,y) = (1−x)² + 100(y−x²)² has its global minimum at (1,1) inside a narrow, curved valley. The gradient along the valley floor is tiny compared to the gradient across it, so plain SGD tends to zig-zag slowly along the valley walls while Momentum and Adam accelerate through it far faster.
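The imbalance shows up directly in the analytic gradient, obtained by differentiating the formula above. In this sketch the function name and the two sample points are choices made for illustration: one point sits on the valley floor y = x², the other is nudged 0.1 up the valley wall.

```python
def rosenbrock_grad(x, y):
    """Analytic gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    dfdx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dfdy = 200 * (y - x ** 2)
    return dfdx, dfdy


on_floor = rosenbrock_grad(0.0, 0.0)   # on the valley floor
off_floor = rosenbrock_grad(0.0, 0.1)  # 0.1 above the floor
at_minimum = rosenbrock_grad(1.0, 1.0)

print(on_floor)    # only a small pull along the valley
print(off_floor)   # a much larger pull back towards the floor
print(at_minimum)  # the gradient vanishes at (1, 1)
```

At these points the across-valley component is about ten times the along-valley one after a displacement of only 0.1, which is why a step size small enough to stay stable across the valley makes slow progress along it.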
Momentum accumulates a velocity vector, so it keeps accelerating in directions where the gradient consistently points the same way, which suits smooth valleys like Rosenbrock's. Adam additionally normalises each parameter's step by a running estimate of its squared gradient, which keeps step sizes roughly uniform where gradient magnitudes differ widely between directions or change abruptly from point to point, as they do across Rastrigin's many small bumps.
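The accumulation effect can be sketched with a one-dimensional velocity. This assumes one common formulation of momentum (v ← βv + g, then θ ← θ − αv); the simulation may use a different but equivalent scaling, and the function name and gradient streams are hypothetical.

```python
def momentum_changes(grads, lr=0.01, beta=0.9):
    """Parameter changes under v = beta * v + g, theta -= lr * v."""
    v = 0.0
    changes = []
    for g in grads:
        v = beta * v + g
        changes.append(-lr * v)
    return changes


steady = momentum_changes([1.0] * 50)          # gradient always points one way
flipping = momentum_changes([1.0, -1.0] * 25)  # gradient changes sign each step

print(steady[0], steady[-1])  # the step grows as velocity builds up
print(flipping[-1])           # opposing gradients largely cancel
```

Under this formulation a steady gradient lets the step grow towards 1 / (1 − β) times its initial size, while a gradient that flips sign each step is damped, which is the behaviour that helps along a valley floor and suppresses zig-zagging across it.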