Here is a surprising fact: in a high-dimensional neural network loss surface, the vast majority of critical points (where the gradient is zero) are saddle points, not local minima. A 2014 analysis by Dauphin et al. argued that the probability of a critical point being a local minimum decreases exponentially with the number of parameters. In practice, this means vanilla stochastic gradient descent (SGD) can spend a large share of training time crawling across nearly flat saddle-point plateaus rather than descending toward a good solution. Understanding why this happens, and what each modern optimiser does about it, is one of the most practically useful topics in machine learning.
1. Vanilla SGD and the Saddle-Point Problem
Standard gradient descent updates each parameter by a small step opposite to the gradient of the loss:
θ ← θ − η · ∇θL
where η is the learning rate and L is the loss. In the stochastic variant (SGD), the gradient is estimated from a random mini-batch rather than the full dataset. This introduces noise, but the cost of each update then depends on the batch size rather than on the size of the dataset.
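As a minimal sketch of the update rule above, the following NumPy code fits a small linear regression with mini-batch SGD. The function name sgd_linear_regression, the synthetic data, and the learning rate and batch size are illustrative assumptions, not values taken from any particular library or paper.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=8, steps=500, seed=0):
    """Fit theta so that X @ theta approximates y, using mini-batch SGD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Sample a random mini-batch; cost per step depends on batch_size only.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Gradient of 0.5 * mean((Xb @ theta - yb)**2) with respect to theta.
        grad = Xb.T @ (Xb @ theta - yb) / batch_size
        theta = theta - lr * grad          # theta <- theta - eta * grad
    return theta

# Synthetic, noiseless data (an assumption made so the fit can be checked).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta
theta_hat = sgd_linear_regression(X, y)
```

Each step touches only batch_size rows of X, which is why the per-update cost stays the same whether the dataset has 200 rows or 200 million.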
The problem becomes apparent on a loss landscape with mixed curvature. At a saddle point, some directions curve upward (positive eigenvalues of the Hessian) and others curve downward (negative eigenvalues). Near the saddle the gradient is approximately zero, so the SGD update Δθ ≈ 0 and the optimiser stalls. In low dimensions this is merely annoying; in a network with millions of parameters, the probability that every dimension curves upward simultaneously (a true local minimum) is vanishingly small, so almost every flat region is a saddle, not a minimum.
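The stall can be reproduced on a toy function. The sketch below uses f(x, y) = x² − y², a hypothetical two-parameter stand-in for a real loss, and plain gradient descent without mini-batch noise. It checks the signs of the Hessian eigenvalues and counts how many steps the gradient norm spends below a small threshold before the iterate finally escapes.

```python
import numpy as np

def loss(theta):
    x, y = theta
    return x**2 - y**2            # curves up along x, down along y

def grad(theta):
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

# The Hessian of this toy function is constant.
hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
eigenvalues = np.linalg.eigvalsh(hessian)   # ascending: one negative, one positive

theta = np.array([1.0, 1e-8])     # almost exactly on the ridge leading into the saddle
lr = 0.1
grad_norms = []
for _ in range(100):
    g = grad(theta)
    grad_norms.append(np.linalg.norm(g))
    theta = theta - lr * g

# Number of steps during which the update was tiny (gradient norm below 0.01).
stalled_steps = sum(n < 1e-2 for n in grad_norms)
final_loss = loss(theta)
```

With this starting point, roughly half of the 100 steps are spent with a near-zero gradient; the loss only drops below the saddle value of 0 once the tiny y component has grown enough to take over.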
A second pathology is the ravine: a narrow curved valley where the loss is steep in one direction and nearly flat in another. SGD oscillates across the steep walls while barely advancing along the valley floor, wasting steps and requiring a small learning rate to remain stable.
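A ravine is easy to imitate with an ill-conditioned quadratic. The sketch below assumes f(x, y) = 0.5 · (100x² + y²), with curvature values chosen purely for illustration, and runs plain gradient descent with a learning rate just under the stability limit set by the steep direction.

```python
import numpy as np

curvature = np.array([100.0, 1.0])   # steep across the valley (x), shallow along it (y)

def grad(theta):
    # Gradient of f(x, y) = 0.5 * (100 * x**2 + y**2)
    return curvature * theta

theta = np.array([1.0, 1.0])
lr = 0.019                           # just under the stability limit 2 / 100 = 0.02
path = [theta.copy()]
for _ in range(50):
    theta = theta - lr * grad(theta)
    path.append(theta.copy())
path = np.array(path)

# Count how often x jumps from one valley wall to the other.
x_sign_flips = int(np.sum(np.sign(path[1:, 0]) != np.sign(path[:-1, 0])))
# Distance still left to travel along the valley floor after 50 steps.
y_remaining = path[-1, 1]
```

The x coordinate changes sign on every step while y shrinks by less than 2% per step, which is the oscillate-and-crawl behaviour described above. Raising lr to speed up y would push x past the stability limit.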
2. Momentum: Accumulating Velocity Through Flat Regions
The classical fix is momentum, introduced by Polyak (1964) as the heavy-ball method for general optimisation and popularised in machine learning by Rumelhart, Hinton & Williams (1986) in the context of backpropagation. Instead of moving purely in the direction of the current gradient, we accumulate a running velocity vector:
v ← β · v − η · ∇θL
θ ← θ + v
The hyperparameter β (typically 0.9) controls how much previous velocity is retained. Physically, the optimiser behaves like a ball rolling down a hill: it picks up speed through consistent gradient directions and slows down when gradients oscillate. Across a saddle plateau, the accumulated velocity from the descent that preceded it carries the ball through rather than letting it stall. In a ravine, oscillating gradients cancel in the cross-valley direction while reinforcing along the valley floor.
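To make the ravine claim concrete, here is a small comparison under stated assumptions: the loss is the toy quadratic f(x, y) = 0.5 · (100x² + y²), both runs use the same learning rate, and the helper names run_gd and run_momentum are made up for this sketch.

```python
import numpy as np

curvature = np.array([100.0, 1.0])   # steep in x, shallow in y

def loss(theta):
    return 0.5 * float(np.sum(curvature * theta**2))

def grad(theta):
    return curvature * theta

def run_gd(theta0, lr, steps):
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def run_momentum(theta0, lr, beta, steps):
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v - lr * grad(theta)   # v <- beta * v - eta * grad
        theta = theta + v                 # theta <- theta + v
    return theta

theta0 = np.array([1.0, 1.0])
initial_loss = loss(theta0)
gd_loss = loss(run_gd(theta0, lr=0.019, steps=200))
momentum_loss = loss(run_momentum(theta0, lr=0.019, beta=0.9, steps=200))
```

On this problem the momentum run ends at a much lower loss after the same number of steps. Part of the gain comes from the larger effective step along the shallow direction (roughly η / (1 − β) once the velocity has built up), so this is an illustration rather than a tuned benchmark.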
A popular variant is Nesterov Accelerated Gradient (NAG), which evaluates the gradient at the lookahead position θ + βv rather than the current position. The gradient then reflects where the velocity is about to carry the parameters, which lets the update correct an overshoot before it happens. For smooth convex problems with exact (full-batch) gradients, Nesterov's method attains a provable O(1/k²) convergence rate, versus O(1/k) for plain gradient descent.
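The lookahead step can be sketched in a few lines. This version uses a constant β on a toy convex quadratic; the function name nag_step and the hyperparameter values are assumptions for illustration, and the constant-β form shown here is the common deep-learning variant rather than the exact schedule used in the convergence proofs.

```python
import numpy as np

curvature = np.array([100.0, 1.0])

def grad(theta):
    # Gradient of the convex quadratic f(theta) = 0.5 * sum(curvature * theta**2)
    return curvature * theta

def nag_step(theta, v, lr, beta):
    lookahead = theta + beta * v              # where the velocity is about to carry us
    v_new = beta * v - lr * grad(lookahead)   # gradient evaluated at the lookahead point
    return theta + v_new, v_new

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for _ in range(300):
    theta, v = nag_step(theta, v, lr=0.005, beta=0.9)

distance_to_minimum = float(np.linalg.norm(theta))
```

The only difference from classical momentum is the point at which grad is called; everything else in the update is unchanged.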
3. RMSprop and Adam: Adaptive Per-Dimension Learning Rates
Momentum addresses the temporal dimension of optimisation (using past gradients to build velocity) but treats every parameter dimension with the same learning rate. In practice, some parameters receive large, frequent gradient signals (e.g., weights connected to common words in an embedding layer) while others receive small, rare signals. A fixed global η is a poor fit for both.
RMSprop
Geoffrey Hinton proposed RMSprop (unpublished, ca. 2012) as an improvement over AdaGrad, which accumulates squared gradients without forgetting. RMSprop maintains an exponential moving average of squared gradients per parameter:
s ← ρ · s + (1 − ρ) · (∇θL)²
θ ← θ − (η / √(s + ε)) · ∇θL
Here ρ (typically 0.9–0.99) controls the decay rate and ε (e.g., 10⁻⁸) prevents division by zero. For a parameter dimension with consistently large gradients, s is large, so the effective step size η/√s shrinks automatically. For a dimension with small or infrequent gradients, s stays small and the effective learning rate remains large. The result is normalisation across dimensions: each parameter adapts at a rate appropriate to its own gradient history.
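The normalisation effect shows up clearly if two dimensions are fed constant gradients of very different sizes. The sketch below follows the two update lines above; the function name rmsprop_step and the gradient values are hypothetical.

```python
import numpy as np

def rmsprop_step(theta, s, g, lr=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1.0 - rho) * g**2            # running average of squared gradients
    theta = theta - lr / np.sqrt(s + eps) * g   # per-dimension scaled step
    return theta, s

# Hypothetical constant gradients whose scales differ by a factor of 10,000.
g = np.array([100.0, 0.01])
theta = np.zeros(2)
s = np.zeros(2)
for _ in range(100):
    previous = theta
    theta, s = rmsprop_step(theta, s, g)

# Size of the final update in each dimension.
last_step = np.abs(theta - previous)
```

Once s has settled, both dimensions move by about lr per step even though their raw gradients differ by four orders of magnitude.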
Adam: Adaptive Moment Estimation
Adam (Kingma & Ba, 2015) fuses momentum with RMSprop-style scaling and adds bias correction for the early training steps when the moment estimates are initialised at zero. It maintains two running statistics:
m ← β1 · m + (1 − β1) · ∇θL (first moment: mean)
v ← β2 · v + (1 − β2) · (∇θL)² (second moment: uncentred variance)
m̂ = m / (1 − β1^t)
v̂ = v / (1 − β2^t)
θ ← θ − η · m̂ / (√v̂ + ε)
Default hyperparameters are β1 = 0.9, β2 = 0.999, ε = 10⁻⁸, and η = 10⁻³. These defaults work surprisingly well across a wide range of architectures, which is a major reason for Adam’s dominance: practitioners often need to tune little beyond the learning rate.
The bias-correction terms m̂ and v̂ matter most in early training, when the step count t is small. At step 1 with β1 = 0.9, the raw first moment m = 0.1 · g would severely underestimate the true mean gradient; dividing by 1 − 0.9 = 0.1 (the correction factor at t = 1) recovers the correct scale. As t → ∞ the correction factors approach 1 and Adam converges in behaviour toward RMSprop with momentum.
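The update and the step-1 arithmetic can be checked with a short sketch. The function name adam_step, the example gradient, and the toy ravine at the end are assumptions for illustration; the defaults match the values quoted above, and t is assumed to count from 1.

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g          # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * g**2       # second moment (uncentred variance)
    m_hat = m / (1.0 - beta1**t)               # bias correction, t counts from 1
    v_hat = v / (1.0 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Step 1: the raw moment is 0.1 * g, and bias correction restores it to g.
g = np.array([50.0, -0.002])
theta0 = np.zeros(2)
theta1, m1, v1 = adam_step(theta0, np.zeros(2), np.zeros(2), g, t=1)
m_hat1 = m1 / (1.0 - 0.9**1)
first_update = theta1 - theta0                 # about -lr * sign(g) in each dimension

# A short run on the toy ravine f(x, y) = 0.5 * (100 * x**2 + y**2),
# using a larger learning rate than the default so it moves in 200 steps.
curvature = np.array([100.0, 1.0])
theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
initial_loss = 0.5 * float(np.sum(curvature * theta**2))
for t in range(1, 201):
    theta, m, v = adam_step(theta, m, v, curvature * theta, t, lr=0.01)
final_loss = 0.5 * float(np.sum(curvature * theta**2))
```

Because of the bias correction, the very first update has magnitude close to lr in every dimension, whatever the gradient scale. Without it, the first step would be scaled by 0.1 / √0.001 ≈ 3.2 instead.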
Try It Yourself
These mysimulator.uk simulations let you watch the optimisation dynamics unfold in real time — adjust learning rate, momentum, and noise level and observe the path each algorithm takes across the loss landscape:
- Gradient Descent Visualiser: place a starting point on a 2-D loss landscape and compare SGD, momentum, RMSprop, and Adam paths side by side. Watch saddle-point stalling and ravine oscillation in vanilla SGD disappear with momentum.
- Neural Network Training: train a small multi-layer network on a classification task, switching optimiser mid-run. See how loss curves and weight distributions evolve differently under each algorithm.
- Loss Landscape Explorer: fly through a 3-D projection of a real network’s loss surface and mark critical points to count saddles versus minima at different network widths.
Closing Thought
The progression from SGD to momentum to Adam is not merely a history of engineering improvements — it reflects a deepening understanding of loss landscape geometry. Vanilla SGD assumes all dimensions are equally difficult; momentum recognises that history matters; RMSprop recognises that dimensions are not equal; Adam recognises both simultaneously. Each insight removed a failure mode that practitioners had hit repeatedly in real training runs.
The open question for the next generation of optimisers is how to use curvature information. Second-order and preconditioned methods such as K-FAC, which approximates the Fisher information matrix as a proxy for the Hessian, and Shampoo, which builds structured preconditioners from gradient statistics, take more geometrically informed steps. They are more expensive per iteration but can need fewer steps to converge, and as hardware scales the tradeoff becomes increasingly attractive. The fundamental battle, escaping flat regions and navigating ill-conditioned curvature, will remain at the heart of deep learning optimisation for years to come.