Gradient Descent & Modern Optimisers — Adam, RMSprop, Momentum

Gradient descent is the optimisation engine that powers almost every machine learning model in use today, from simple linear regression to vast deep neural networks. At its heart it is a wonderfully simple idea: to make a model better, repeatedly measure how wrong it is, work out which way to nudge each parameter to reduce that error, and take a small step in that downhill direction. Repeat thousands or millions of times and a randomly initialised model gradually learns to recognise images, translate languages or forecast demand. Gradient descent matters for practical and economic reasons: training modern models is enormously expensive, so the choice between a naive optimiser and a clever one can mean days of extra computation, real money, and the difference between a model that converges and one that never does. This article explains the core mechanism, then explores the momentum, RMSprop and Adam refinements that make modern training feasible.

The core mechanism: following the slope

Imagine the model's error as a landscape of hills and valleys, where the horizontal coordinates are the model's parameters and the height is the loss. Gradient descent tries to walk to the lowest valley. The gradient is a vector of partial derivatives that points in the direction of steepest increase, so we move in the opposite direction. The fundamental update rule is written as θ = θ − η · ∇J(θ), where θ is the parameter vector, η is the learning rate, and ∇J(θ) is the gradient of the loss function J with respect to the parameters.
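
As a minimal sketch of this update rule, the snippet below applies θ = θ − η · ∇J(θ) to a toy two-parameter loss. The loss function, the helper names and the chosen learning rate are illustrative assumptions, not part of any particular library.

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeatedly apply theta = theta - lr * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


# Hypothetical loss J(theta) = (theta0 - 3)^2 + (theta1 + 1)^2,
# whose minimum sits at (3, -1).
def grad_J(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta = gradient_descent(grad_J, [0.0, 0.0])
print(theta)  # close to [3.0, -1.0]
```

On this loss each step shrinks the distance to the minimum by a constant factor, which is why a modest number of iterations is enough.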

The learning rate η controls how big each step is, and choosing it well is critical. Set it too high and the steps overshoot the valley, bouncing back and forth or diverging entirely; set it too low and training crawls, wasting compute. In practice we rarely compute the gradient over the whole dataset at once, because that is slow and memory-hungry. Instead we use mini-batch gradient descent, estimating the gradient from a small random sample of examples on each step. This introduces helpful noise that can shake the optimiser out of poor regions, and it makes each update fast. The extreme case of one example per step is stochastic gradient descent (SGD), while using the entire dataset is batch gradient descent. Mini-batch sizes of 32 to 512 are common compromises, balancing gradient quality against hardware efficiency, particularly on GPUs that thrive on parallel work.
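
The mini-batch idea can be sketched on a one-variable linear regression. The synthetic data, the function name `minibatch_sgd` and all hyperparameter values here are assumptions chosen for illustration.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Fit y ~ w*x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # fresh random ordering each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Synthetic data (an assumption for the demo): y = 2x + 1 plus small noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(512)]
ys = [2 * x + 1 + data_rng.gauss(0, 0.1) for x in xs]
w, b = minibatch_sgd(xs, ys)
print(w, b)  # roughly 2 and 1
```

In this sketch, setting `batch_size=1` gives stochastic gradient descent and `batch_size=len(xs)` gives batch gradient descent, so the three variants differ only in how many examples feed each gradient estimate.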

Momentum, RMSprop and Adam: smarter steps

Plain SGD struggles when the loss surface is poorly conditioned, as in long, narrow ravines where it zig-zags across the walls while making slow progress along the floor. Momentum fixes this by accumulating a velocity term: an exponentially decaying sum of past gradients, which behaves like an exponentially weighted moving average up to a constant factor. The update becomes v = β·v + ∇J(θ); θ = θ − η·v, with β typically around 0.9. Consistent gradient directions reinforce one another and build speed, while oscillating components cancel out, so the optimiser glides smoothly along the ravine floor and can coast through flat plateaus and small bumps.
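
The momentum update can be sketched on a toy ravine. The quadratic loss below, steep in one direction and shallow in the other, and the chosen learning rate are assumptions made for the demonstration.

```python
def sgd_momentum(grad, theta, lr=0.01, beta=0.9, steps=200):
    """v = beta*v + g; theta = theta - lr*v (beta=0 gives plain descent)."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta


# Hypothetical ravine: J = 0.5 * (x^2 + 50 * y^2), steep in y, shallow in x.
def grad_ravine(theta):
    return [theta[0], 50 * theta[1]]


def loss_ravine(theta):
    return 0.5 * (theta[0] ** 2 + 50 * theta[1] ** 2)


start = [10.0, 1.0]
plain = sgd_momentum(grad_ravine, start, beta=0.0)
heavy = sgd_momentum(grad_ravine, start, beta=0.9)
print(loss_ravine(plain), loss_ravine(heavy))  # momentum ends far lower
```

With the same learning rate and step count, the run without momentum is still crawling along the shallow x direction, while the run with β = 0.9 has built up enough velocity to reach the bottom.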

RMSprop attacks a different problem: parameters can have wildly different gradient scales, so a single global learning rate suits none of them well. RMSprop keeps a per-parameter running average of squared gradients and divides each parameter's step by the square root of that average, giving every parameter its own adaptive effective learning rate. The update uses s = ρ·s + (1−ρ)·g²; θ = θ − η·g / (√s + ε), where g is the gradient, ρ is a decay factor (0.9 is a common choice) and ε is a small constant that prevents division by zero. Steps for parameters with large, noisy gradients are damped, while steps for parameters with small gradients are amplified.
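
A minimal sketch of the RMSprop update follows, applied to a toy loss whose two parameters have gradients a million times apart in scale. The loss and the hyperparameter values are assumptions for illustration.

```python
import math


def rmsprop(grad, theta, lr=0.01, rho=0.9, eps=1e-8, steps=500):
    """s = rho*s + (1-rho)*g^2; theta = theta - lr*g / (sqrt(s) + eps)."""
    s = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        s = [rho * si + (1 - rho) * gi * gi for si, gi in zip(s, g)]
        theta = [t - lr * gi / (math.sqrt(si) + eps)
                 for t, gi, si in zip(theta, g, s)]
    return theta


# Hypothetical loss with very different gradient scales per parameter:
# J = 0.5 * (1000 * a^2 + 0.001 * b^2).
def grad_scaled(theta):
    return [1000 * theta[0], 0.001 * theta[1]]


theta = rmsprop(grad_scaled, [1.0, 1.0])
print(theta)  # both coordinates end near 0
```

Because each step is divided by the recent gradient magnitude, both parameters travel at a similar pace despite the difference in scale. Plain gradient descent with one learning rate would either diverge on the first parameter or barely move the second.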

Adam (Adaptive Moment Estimation) combines both ideas. It maintains a momentum-like first moment (a running average of gradients) and an RMSprop-like second moment (a running average of squared gradients), then applies bias correction because both averages start at zero and would otherwise be skewed toward zero in the early steps. With sensible defaults (β₁=0.9, β₂=0.999, ε=1e−8) Adam often trains reliably with minimal tuning, which is why it became a default workhorse across deep learning. It is not universally best, but it is a dependable starting point.
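
The pieces described above can be sketched in a few lines. This is a simplified illustration of the standard Adam update, not a production implementation. The toy loss and the learning rate of 0.05 are assumptions for the demo, and the β and ε defaults match the values quoted above.

```python
import math


def adam(grad, theta, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    m = [0.0] * len(theta)  # first moment: running average of gradients
    v = [0.0] * len(theta)  # second moment: running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: both averages start at zero, so rescale early estimates.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [th - lr * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


# Hypothetical loss J = (theta0 - 3)^2 + (theta1 + 1)^2, minimum at (3, -1).
def grad_J(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta = adam(grad_J, [0.0, 0.0])
print(theta)  # close to [3.0, -1.0]
```

Without the two bias-correction lines, the first few steps would use moment estimates that are far too small, because `m` and `v` begin at zero and the averages warm up slowly.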

Common misconceptions

A frequent misconception is that gradient descent finds the single global minimum. In practice, especially for neural networks, it finds a good-enough minimum, and that is usually sufficient. Another is the belief that local minima are the main danger; research suggests that in high-dimensional spaces, saddle points and flat plateaus are the bigger obstacle, because true local minima are statistically rare. People also assume Adam is always superior to SGD — yet well-tuned SGD sometimes generalises better. Finally, many treat the learning rate as a fixed number, when in reality a decaying schedule or warm-up often outperforms any constant value. Understanding these nuances prevents wasted experiments.
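
To make the point about schedules concrete, here is one possible shape: a linear warm-up followed by cosine decay. The function name, the shape of the schedule and every default value are assumptions for illustration. Many other schedules are used in practice.

```python
import math


def lr_schedule(step, total_steps, base_lr=0.1, warmup_steps=100, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_schedule(step, total_steps=1000))
```

The learning rate returned for each step would replace the constant η in the update rule, so early steps are cautious, the middle of training moves quickly, and the final steps settle gently.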

Frequently Asked Questions

What is gradient descent in simple terms?

Gradient descent is an iterative method that repeatedly nudges a model's parameters in the direction that most reduces an error (loss) function. By following the downhill slope of that function, it gradually settles into a configuration that makes accurate predictions.

What is the difference between batch, stochastic and mini-batch gradient descent?

Batch gradient descent uses the entire dataset for each update, stochastic gradient descent uses a single example, and mini-batch uses a small subset. Mini-batch is the practical compromise used in nearly all deep learning, balancing stable gradients with computational efficiency.

Why is the learning rate so important?

The learning rate scales each update step. Too large and training diverges or oscillates; too small and convergence is painfully slow. Choosing a sensible learning rate, often with a schedule that decays over time, is one of the most influential decisions in training a model.

What does momentum actually do?

Momentum accumulates an exponentially weighted average of past gradients, so the update gains velocity along consistent directions and dampens oscillations across narrow valleys. This typically speeds up convergence and helps escape shallow regions of the loss surface.

How does RMSprop differ from plain gradient descent?

RMSprop divides each gradient by a running root-mean-square of recent gradient magnitudes, giving every parameter its own adaptive effective learning rate. Parameters with large, noisy gradients are slowed, while those with small gradients are accelerated.

Why is Adam so popular?

Adam combines momentum's smoothed direction with RMSprop's per-parameter scaling, plus bias correction for early steps. It often works well with little tuning, which is why it became a default choice for many deep learning practitioners.

Can gradient descent get stuck in local minima?

In low dimensions it can, but research suggests that in the very high-dimensional loss landscapes of neural networks, saddle points and flat plateaus are a greater obstacle than poor local minima. Momentum and adaptive methods help traverse these regions.

What is a saddle point?

A saddle point is a point where the gradient is zero but the surface curves upward in some directions and downward in others. Plain gradient descent can stall near one because the gradient there is close to zero, whereas momentum-based methods can carry through on accumulated velocity.
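
A toy illustration, assuming the textbook saddle f(x, y) = x² − y²: both runs below start almost exactly on the ridge that leads into the saddle, and the only difference is the momentum coefficient. The function is unbounded below, so this shows the escape behaviour only and is not a realistic loss.

```python
def descend(grad, theta, lr=0.05, beta=0.0, steps=50):
    """Gradient descent with optional momentum (beta=0 means no momentum)."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta


# Hypothetical saddle: f(x, y) = x^2 - y^2 has zero gradient at the origin,
# curving up along x and down along y.
def grad_saddle(theta):
    return [2 * theta[0], -2 * theta[1]]


start = [1.0, 1e-6]
plain = descend(grad_saddle, start, beta=0.0)
with_momentum = descend(grad_saddle, start, beta=0.9)
print(abs(plain[1]), abs(with_momentum[1]))  # distance travelled along the escape direction
```

After the same number of steps, the plain run is still sitting close to the origin, while the momentum run has moved much further along the downhill y direction.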

Does Adam always beat SGD?

Not always. Adam often trains faster, but well-tuned stochastic gradient descent with momentum sometimes generalises better, particularly in computer vision. The best optimiser depends on the model, data and tuning budget.

How is gradient descent connected to backpropagation?

Backpropagation is the algorithm that efficiently computes the gradients of the loss with respect to every parameter using the chain rule. Gradient descent then uses those gradients to update the parameters. They work together on every training step.
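
The division of labour can be sketched on a tiny network with one tanh hidden unit. The network, the parameter values and the helper names are assumptions for illustration. The backward pass is written out by hand with the chain rule and checked against a numerical estimate.

```python
import math


def forward(params, x, y):
    """Squared error of y_hat = w2 * tanh(w1 * x + b1) + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    return (w2 * h + b2 - y) ** 2


def backward(params, x, y):
    """Gradient of the loss for each parameter, via the chain rule."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y_hat = w2 * h + b2
    d_yhat = 2 * (y_hat - y)           # d loss / d y_hat
    d_h = d_yhat * w2                  # back through the output layer
    d_z = d_h * (1 - h * h)            # back through tanh
    return [d_z * x, d_z, d_yhat * h, d_yhat]


def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to check backward()."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y) - forward(down, x, y)) / (2 * eps))
    return grads


params, x, y = [0.5, -0.3, 0.8, 0.1], 1.5, 0.7
grads = backward(params, x, y)
# One gradient descent step using the backpropagated gradients.
new_params = [p - 0.05 * g for p, g in zip(params, grads)]
print(forward(params, x, y), forward(new_params, x, y))  # loss goes down
```

Here `backward` plays the role of backpropagation and the final list comprehension is the gradient descent step, and real frameworks automate the first part for arbitrarily large networks.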

Conclusion

Gradient descent turns the abstract goal of "make the model better" into a concrete, repeatable procedure: measure the slope, step downhill, repeat. The plain algorithm is elegant but fragile on difficult loss surfaces, which is why momentum, RMSprop and Adam were developed to add velocity, per-parameter adaptation and robustness. Together they make training enormous modern models practical rather than hopeless. There is no single best optimiser for every problem, so understanding the trade-offs, and experimenting with learning rates and schedules, remains a core skill. The best way to build intuition is to run small experiments and watch optimisation unfold.