Learning #43: Optimisation — Gradient Descent, Momentum & Adam on Loss Landscapes

Why vanilla SGD stalls near saddle points, how momentum carries it through flat valleys, how RMSprop rescales per dimension, and why Adam has become the de facto default for deep learning, explained from first principles with the mathematics needed to understand each method.

Here is a surprising fact: in a high-dimensional neural network loss surface, the vast majority of critical points (where the gradient is zero) are saddle points, not local minima. Dauphin et al. (2014) argued that the probability of a critical point being a local minimum decreases exponentially with the number of parameters. In practice, vanilla stochastic gradient descent (SGD) can spend a large fraction of training time crawling across nearly flat saddle-point plateaus rather than descending toward a good solution. Understanding why this happens, and what each modern optimiser does about it, is one of the most practically useful topics in machine learning.

1. Vanilla SGD and the Saddle-Point Problem

Standard gradient descent updates each parameter by a small step opposite to the gradient of the loss:

θ ← θ − η · ∇θL

where η is the learning rate and L is the loss. In the stochastic variant (SGD), the gradient is estimated from a random mini-batch rather than the full dataset. This introduces noise, but the cost of each update then depends on the batch size rather than on the size of the dataset.
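As a minimal sketch of the update rule above, the following fits a straight line with mini-batch SGD. The dataset, the model y = w·x + b, the batch size of 8, and the helper names `sgd_step` and `minibatch_grad` are all illustrative assumptions, not part of the original discussion.

```python
import random

def sgd_step(theta, grad, lr):
    """One update: theta <- theta - lr * grad, applied element-wise."""
    return [t - lr * g for t, g in zip(theta, grad)]

def minibatch_grad(theta, batch):
    """Gradient of the mean squared error of y = w*x + b over one mini-batch."""
    w, b = theta
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [gw, gb]

random.seed(0)
# Hypothetical noiseless data drawn from y = 3x + 1 on x in [-1, 1].
data = [(k / 50.0, 3 * k / 50.0 + 1) for k in range(-50, 51)]

theta = [0.0, 0.0]  # [w, b]
for step in range(2000):
    batch = random.sample(data, 8)  # work per step scales with the batch, not the dataset
    theta = sgd_step(theta, minibatch_grad(theta, batch), lr=0.1)

print(theta)  # should end up close to [3, 1]
```

Each step touches only 8 of the 101 points, which is the property that makes the stochastic variant attractive on large datasets.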

The problem becomes apparent on a loss landscape with mixed curvature. At a saddle point, some directions curve upward (positive eigenvalues of the Hessian) and others curve downward (negative eigenvalues). The gradient at the saddle is approximately zero, so the SGD update Δθ ≈ 0. The optimiser stalls. In low dimensions this is merely annoying; in a network with millions of parameters, the probability that every dimension curves upward simultaneously (a true local minimum) is vanishingly small — so almost every flat region is a saddle, not a minimum.
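To make the stall concrete, here is a small sketch on the textbook saddle f(x, y) = x² − y², whose Hessian has eigenvalues +2 and −2 at the origin. The function, the starting points, and the helper `steps_to_escape` are assumptions chosen for illustration. It uses plain (noise-free) gradient descent so that the effect of the vanishing gradient is isolated.

```python
import math

def grad(x, y):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle at the origin."""
    return 2 * x, -2 * y

def steps_to_escape(y0, lr=0.1, x0=1.0, max_steps=10_000):
    """Run plain gradient descent from (x0, y0) until |y| exceeds 1.

    Returns the number of steps taken and the smallest gradient norm seen.
    """
    x, y = x0, y0
    min_grad_norm = float("inf")
    for step in range(1, max_steps + 1):
        gx, gy = grad(x, y)
        min_grad_norm = min(min_grad_norm, math.hypot(gx, gy))
        x, y = x - lr * gx, y - lr * gy
        if abs(y) > 1.0:
            return step, min_grad_norm
    return max_steps, min_grad_norm

near_steps, near_min = steps_to_escape(y0=1e-6)
far_steps, far_min = steps_to_escape(y0=1e-12)
print(near_steps, near_min)
print(far_steps, far_min)
```

The x coordinate contracts toward the saddle while the y coordinate grows only geometrically from its tiny starting value, so the gradient norm dips close to zero along the way. Starting a million times closer to the x axis roughly doubles the number of steps needed to escape, since the escape time grows with the logarithm of 1/y0.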

A second pathology is the ravine: a narrow curved valley where the loss is steep in one direction and nearly flat in another. SGD oscillates across the steep walls while barely advancing along the valley floor, wasting steps and requiring a small learning rate to remain stable.
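The ravine behaviour can be sketched on an assumed quadratic f(x, y) = 0.5·(x² + 100·y²), where y is the steep cross-valley direction and x the shallow valley floor. For a quadratic with largest curvature 100, plain gradient descent is stable only for η below 2/100, so the two learning rates here are chosen just either side of that limit. The function `gd_path` is a hypothetical helper.

```python
def gd_path(lr, steps=10, start=(10.0, 1.0)):
    """Plain gradient descent on f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        # Gradient is (x, 100 * y): shallow along x, steep along y.
        x, y = x - lr * x, y - lr * 100.0 * y
        path.append((x, y))
    return path

stable = gd_path(lr=0.018)    # just under the 2/100 stability limit
unstable = gd_path(lr=0.021)  # just over it

for x, y in stable:
    print(f"x = {x:8.4f}   y = {y:8.4f}")
```

With η = 0.018 the y coordinate flips sign on every step while x has barely moved after ten updates; with η = 0.021 the cross-valley oscillation grows instead of shrinking.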

2. Momentum: Accumulating Velocity Through Flat Regions

The classical fix is momentum, introduced by Polyak (1964) as the heavy-ball method for iterative optimisation and popularised in machine learning by Rumelhart, Hinton & Williams (1986) in the context of backpropagation. Instead of moving purely in the direction of the current gradient, we accumulate a running velocity vector:

v ← β · v − η · ∇θL
θ ← θ + v

The hyperparameter β (typically 0.9) controls how much previous velocity is retained. Physically, the optimiser behaves like a ball rolling down a hill: it picks up speed through consistent gradient directions and slows down when gradients oscillate. Across a saddle plateau, the accumulated velocity from the descent that preceded it carries the ball through rather than letting it stall. In a ravine, oscillating gradients cancel in the cross-valley direction while reinforcing along the valley floor.
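A rough way to see the effect is to count steps on an assumed ravine-shaped quadratic, f(x, y) = 0.5·(x² + 100·y²), using exactly the two-line update above. The test function, learning rate, tolerance, and the helper `run` are illustrative choices; setting β = 0 recovers plain gradient descent.

```python
def grad(theta):
    """Gradient of the ravine f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = theta
    return [x, 100.0 * y]

def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 100.0 * y * y)

def run(beta, lr=0.018, tol=1e-6, max_steps=100_000):
    """Count steps until the loss drops below tol (beta = 0 is plain GD)."""
    theta = [10.0, 1.0]
    v = [0.0, 0.0]
    for step in range(1, max_steps + 1):
        g = grad(theta)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]  # v <- beta*v - lr*grad
        theta = [ti + vi for ti, vi in zip(theta, v)]      # theta <- theta + v
        if loss(theta) < tol:
            return step
    return max_steps

gd_steps = run(beta=0.0)
momentum_steps = run(beta=0.9)
print(gd_steps, momentum_steps)
```

On this toy problem the momentum run should need noticeably fewer steps, because velocity keeps building along the shallow x direction where plain gradient descent only creeps.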

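A popular variant is Nesterov Accelerated Gradient (NAG), which evaluates the gradient at the lookahead position θ + βv rather than at the current position θ. Because the gradient is measured where the velocity is about to carry the parameters, the update can correct course one step earlier. For smooth convex problems with exact (full-batch) gradients, Nesterov's method has a provable O(1/k²) convergence rate, versus O(1/k) for plain gradient descent; the guarantee does not carry over unchanged to noisy mini-batch gradients.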
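In the same notation as the momentum update, a NAG step can be sketched as follows. The function name `nag_step`, the quadratic test problem, and the choice η = 0.01 (equal to 1 over the largest curvature, a conservative setting) are assumptions for illustration.

```python
def nag_step(theta, v, grad_fn, lr, beta):
    """One Nesterov step: take the gradient at the lookahead point theta + beta*v."""
    lookahead = [t + beta * vi for t, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
    theta = [t + vi for t, vi in zip(theta, v)]
    return theta, v

def ravine_grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = theta
    return [x, 100.0 * y]

theta, v = [10.0, 1.0], [0.0, 0.0]
for _ in range(500):
    theta, v = nag_step(theta, v, ravine_grad, lr=0.01, beta=0.9)

final_loss = 0.5 * (theta[0] ** 2 + 100.0 * theta[1] ** 2)
print(final_loss)
```

The only difference from classical momentum is the point at which `grad_fn` is called; everything else in the update is unchanged.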

Intuition check: momentum does not escape saddle points because the gradient there is large; at the saddle itself the gradient is approximately zero. It escapes for two reasons: noise in the stochastic gradient estimate provides a small perturbation away from the saddle, and the velocity accumulated on the approach carries the optimiser through before it fully stalls.

3. RMSprop and Adam: Adaptive Per-Dimension Learning Rates

Momentum addresses the temporal dimension of optimisation (using past gradients to build velocity) but treats every parameter dimension with the same learning rate. In practice, some parameters receive large, frequent gradient signals (e.g., weights connected to common words in an embedding layer) while others receive small, rare signals. A fixed global η is a poor fit for both.

RMSprop

Geoffrey Hinton proposed RMSprop (unpublished, ca. 2012) as an improvement over AdaGrad, which accumulates squared gradients without forgetting. RMSprop maintains an exponential moving average of squared gradients per parameter:

s ← ρ · s + (1 − ρ) · (∇θL)²
θ ← θ − (η / √(s + ε)) · ∇θL

Here ρ (typically 0.9–0.99) controls the decay rate and ε (e.g., 10⁻⁸) prevents division by zero. For a parameter dimension with consistently large gradients, s is large, so the effective step size η/√s shrinks automatically. For a dimension with small or infrequent gradients, s stays small and the effective learning rate remains large. The result is normalisation across dimensions: each parameter adapts at a rate appropriate to its own gradient history.
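The normalisation can be checked with a short sketch that follows the two update lines above literally, including ε inside the square root. The gradient values (one dimension 10,000 times larger than the other) and the helper `rmsprop_step` are made up for illustration.

```python
import math

def rmsprop_step(theta, s, grad, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update with a per-dimension running average of squared gradients."""
    s = [rho * si + (1 - rho) * g * g for si, g in zip(s, grad)]
    theta = [t - lr / math.sqrt(si + eps) * g for t, si, g in zip(theta, s, grad)]
    return theta, s

lr = 1e-3
grad = [100.0, 0.01]  # hypothetical constant gradients on very different scales
theta, s = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    previous = theta
    theta, s = rmsprop_step(theta, s, grad, lr=lr)

# Size of the final update in each dimension.
last_step = [abs(new - old) for new, old in zip(theta, previous)]
print(last_step)
```

Once the running average has warmed up, both dimensions move by roughly η per step despite the four orders of magnitude between their raw gradients.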

Adam: Adaptive Moment Estimation

Adam (Kingma & Ba, 2015) fuses momentum with RMSprop-style scaling and adds bias correction for the early training steps when the moment estimates are initialised at zero. It maintains two running statistics:

m ← β1 · m + (1 − β1) · ∇θL       (first moment: mean)
v ← β2 · v + (1 − β2) · (∇θL)²  (second moment: uncentred variance)

m̂ = m / (1 − β1^t)
v̂ = v / (1 − β2^t)

θ ← θ − η · m̂ / (√v̂ + ε)

Default hyperparameters are β1 = 0.9, β2 = 0.999, ε = 10⁻⁸, and η = 10⁻³. These defaults work surprisingly well across a wide range of architectures, which is a major reason for Adam’s dominance: practitioners rarely need to tune beyond the learning rate.

The bias corrections that produce m̂ and v̂ matter most in early training, when the step count t is small. At step 1 with β1 = 0.9, the raw first moment m = 0.1 · g would severely underestimate the true mean gradient; dividing by 1 − 0.9 = 0.1 recovers the correct scale. As t → ∞ the correction factors approach 1 and Adam converges in behaviour toward RMSprop with momentum.
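The full Adam update, including the bias correction just described, can be sketched in a few lines. The helper name `adam_step` and the example gradient are assumptions; the hyperparameter defaults are the ones quoted above, and t counts steps starting from 1.

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]  # undo the pull toward the zero init
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

grad = [5.0, -0.002]  # hypothetical gradient with very different scales
theta, m, v = adam_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], grad, t=1)
print(m)      # raw first moment: about 0.1 * grad
print(theta)  # each coordinate has moved by roughly lr, against the gradient sign
```

At t = 1 the corrected estimates reduce to m̂ ≈ g and v̂ ≈ g², so the very first update has magnitude close to η in every dimension regardless of the gradient's scale. Without the correction, the two raw moments would be shrunk by different factors (0.1 and 0.001) and the first steps would be distorted.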

Why Adam sometimes fails: several papers (notably Wilson et al., 2017) demonstrated that adaptive methods can find solutions that generalise worse than SGD with momentum on image classification benchmarks. A commonly proposed explanation, though not a settled one, is that the larger effective learning rates in rare-gradient dimensions steer the optimiser toward sharper minima. Variants such as AdamW (decoupled weight decay) and Adan address some of these shortcomings, and AdamW in particular is now a common choice in large-model training runs.

Try It Yourself

The simulations on mysimulator.uk let you watch the optimisation dynamics unfold in real time: adjust the learning rate, momentum, and noise level, and observe the path each algorithm takes across the loss landscape.

Closing Thought

The progression from SGD to momentum to Adam is not merely a history of engineering improvements — it reflects a deepening understanding of loss landscape geometry. Vanilla SGD assumes all dimensions are equally difficult; momentum recognises that history matters; RMSprop recognises that dimensions are not equal; Adam recognises both simultaneously. Each insight removed a failure mode that practitioners had hit repeatedly in real training runs.

The open question for the next generation of optimisers is curvature information. Second-order-style methods such as K-FAC and Shampoo approximate the Hessian, or a proxy for it such as the Fisher matrix, to take more geometrically informed steps. They are more expensive per iteration but can need far fewer steps to converge, and as hardware scales the trade-off becomes increasingly attractive. The fundamental battle of escaping flat regions and navigating ill-conditioned curvature will remain at the heart of deep learning optimisation for years to come.