Question 1

What is gradient descent?

Accepted Answer

Gradient descent is an iterative optimisation algorithm that minimises a loss function by repeatedly moving the parameters a small step in the direction of the negative gradient: θ ← θ − η∇L(θ), where η is the learning rate. It is the backbone of training neural networks and many other machine-learning models.
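
A minimal sketch of that loop may help. The function name, the toy loss f(x) = (x − 3)², and the default learning rate and step count are illustrative assumptions, not taken from the simulation.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # move in the direction of the negative gradient
    return x


# Hypothetical toy loss f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)  # very close to 3, the minimiser of the toy loss
```

With these settings the distance to the minimum shrinks by a factor of 0.8 per step, so 100 steps are more than enough for this toy problem.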

Question 2

How does the learning rate affect convergence?

Accepted Answer

The learning rate controls the step size taken in the direction of the negative gradient. Too large a rate causes the optimiser to overshoot minima and diverge; too small a rate makes training very slow. Adaptive methods like Adam adjust effective learning rates per parameter automatically.
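
The three regimes can be seen on a one-dimensional example. This is a sketch on the assumed toy loss f(x) = x² (gradient 2x); the specific learning rates are chosen only to illustrate slow, fast, and divergent behaviour.

```python
def run(lr, x0=1.0, steps=50):
    """Plain gradient descent on f(x) = x^2, returning the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # each step multiplies x by (1 - 2 * lr)
    return x


too_small = run(0.01)  # shrinks by 0.98 per step: still far from 0 after 50 steps
reasonable = run(0.4)  # shrinks by 0.2 per step: essentially at the minimum
too_large = run(1.1)   # multiplied by -1.2 per step: overshoots and diverges

print(abs(too_small), abs(reasonable), abs(too_large))
```

On this loss any learning rate above 1.0 diverges, because the per-step factor |1 − 2·lr| exceeds 1.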

Question 3

What is the difference between SGD and Adam?

Accepted Answer

SGD (Stochastic Gradient Descent) updates each parameter by subtracting the gradient, usually estimated on a mini-batch, scaled by a single learning rate shared by all parameters. Adam (Adaptive Moment Estimation) maintains exponential moving averages of both the gradients and the squared gradients and applies bias correction to them, which gives each parameter its own adaptive step size. On badly scaled or curved loss surfaces this often converges faster than plain SGD, though not in every case.
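
The two update rules can be sketched side by side. The function names, the list-based representation, and the hyperparameter defaults (lr, b1, b2, eps) are assumptions for illustration; the simulation's own code may differ.

```python
import math


def sgd_step(params, grads, lr=0.01):
    """One SGD step: the same learning rate scales every gradient component."""
    return [p - lr * g for p, g in zip(params, grads)]


def adam_step(params, grads, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based iteration count."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grads)]      # average of gradients
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grads)]  # average of squared gradients
    m_hat = [mi / (1 - b1 ** t) for mi in m]                     # bias correction
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    params = [p - lr * mh / (math.sqrt(vh) + eps)
              for p, mh, vh in zip(params, m_hat, v_hat)]
    return params, m, v


# Two parameters whose gradients differ by four orders of magnitude.
grads = [100.0, 0.01]
sgd_params = sgd_step([0.0, 0.0], grads)
adam_params, m, v = adam_step([0.0, 0.0], grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(sgd_params)   # step sizes differ as much as the gradients do
print(adam_params)  # both steps have magnitude close to lr
```

The contrast is the point: SGD's step is proportional to the raw gradient, while Adam's first step is roughly lr in every dimension regardless of gradient scale.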

Question 4

What is the Rosenbrock function?

Accepted Answer

The Rosenbrock function f(x,y) = (1-x)² + 100(y-x²)² is a classic optimisation test. Its global minimum is at (1,1) inside a narrow, curved valley, making it challenging for first-order optimisers because the gradient along the valley is very small compared to across it.
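
For reference, the function and its analytic gradient are short enough to write out. The function names and the probe points below are illustrative choices.

```python
def rosenbrock(x, y):
    """f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2"""
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2


def rosenbrock_grad(x, y):
    """Analytic partial derivatives of the Rosenbrock function."""
    dfdx = -2 * (1 - x) - 400 * x * (y - x * x)
    dfdy = 200 * (y - x * x)
    return dfdx, dfdy


print(rosenbrock(1.0, 1.0))        # 0.0 at the global minimum
print(rosenbrock_grad(-1.0, 1.0))  # on the valley floor y = x^2: modest gradient
print(rosenbrock_grad(-1.0, 1.1))  # 0.1 off the floor: much steeper
```

At (−1, 1), on the valley floor, the gradient is (−4, 0). Moving just 0.1 away in y gives a gradient of about (36, 20), which is the steep-across, shallow-along structure that makes the valley hard to follow.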

Question 5

Why does Momentum accelerate gradient descent?

Accepted Answer

Momentum accumulates a velocity vector, an exponentially decaying sum of past gradients. Along directions where successive gradients point the same way, the velocity builds up over many steps and movement speeds up. Along directions where the gradient keeps flipping sign, successive contributions partly cancel, which damps the oscillation.
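
The build-up and the cancellation can be checked numerically. This sketch assumes the classical (heavy-ball) form v ← βv − η·g; the function name and the values β = 0.9, η = 0.1 are illustrative.

```python
def velocity_after(grads, lr=0.1, beta=0.9):
    """Return the momentum velocity after seeing a sequence of 1-D gradients."""
    v = 0.0
    for g in grads:
        v = beta * v - lr * g  # decay the old velocity, add the new gradient step
    return v


steady = velocity_after([1.0] * 50)        # gradient always points the same way
zigzag = velocity_after([1.0, -1.0] * 25)  # gradient flips sign every step

print(abs(steady), abs(zigzag))
```

With a steady gradient the velocity approaches η/(1 − β) = 1.0, ten times a single plain step of 0.1. With an alternating gradient it settles near η/(1 + β) ≈ 0.05, smaller than a single plain step.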

Question 6

What is RMSprop?

Accepted Answer

RMSprop (Root Mean Square Propagation) divides the learning rate for each parameter by the square root of an exponential moving average of that parameter's squared gradients. This normalises the update size per dimension, which mitigates the very large or very small gradients that hamper vanilla SGD on non-convex surfaces.
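
A minimal sketch of one RMSprop step follows. The function name and the defaults (lr = 0.01, decay = 0.9, eps = 1e-8) are assumptions, not values confirmed by the simulation.

```python
import math


def rmsprop_step(params, grads, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop step; sq_avg holds the running average of squared gradients."""
    sq_avg = [decay * s + (1 - decay) * g * g for s, g in zip(sq_avg, grads)]
    params = [p - lr * g / (math.sqrt(s) + eps)
              for p, g, s in zip(params, grads, sq_avg)]
    return params, sq_avg


# Gradients that differ by four orders of magnitude produce near-equal steps.
params, sq_avg = rmsprop_step([0.0, 0.0], [100.0, 0.01], sq_avg=[0.0, 0.0])
print(params)
```

Both parameters move by roughly the same amount here, because each gradient is divided by a running estimate of its own magnitude.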

Question 7

What makes the Rastrigin function hard to optimise?

Accepted Answer

The Rastrigin function is highly multimodal: it has many local minima arranged on a regular grid, making it easy for gradient-based optimisers to get trapped. The global minimum is at the origin. It tests an optimiser's ability to escape shallow local minima.
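
The answer above does not spell out the formula. The sketch below uses the standard two-dimensional definition with A = 10, f(x, y) = 2A + (x² − A·cos 2πx) + (y² − A·cos 2πy); whether the simulation uses exactly this form is an assumption.

```python
import math


def rastrigin(x, y, a=10.0):
    """Standard 2-D Rastrigin function with amplitude a."""
    return (2 * a
            + (x * x - a * math.cos(2 * math.pi * x))
            + (y * y - a * math.cos(2 * math.pi * y)))


print(rastrigin(0.0, 0.0))  # global minimum at the origin
print(rastrigin(1.0, 1.0))  # inside a neighbouring basin: small but non-zero
print(rastrigin(0.5, 0.5))  # on the ridge between the two basins: much higher
```

The value is about 2 near (1, 1) but about 40.5 at (0.5, 0.5), so an optimiser sitting in the basin near (1, 1) has to climb a high ridge to reach the origin, and following the local gradient alone will not take it there.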

Question 8

What is bias correction in Adam?

Accepted Answer

Adam initialises its moment estimates at zero, which biases them towards zero in early iterations. Bias correction divides the first moment estimate by (1-β₁ᵗ) and the second by (1-β₂ᵗ), where t is the iteration count starting from 1. Both factors approach 1 as t grows, so the correction matters mainly in the first steps, where it keeps the update correctly scaled.
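
The effect is easy to see with a constant gradient, where the true average is known. The function name and the choice β₁ = 0.9 in this sketch are illustrative assumptions.

```python
def first_moment(g, steps, beta1=0.9):
    """Return (raw, bias-corrected) first-moment estimates for a constant gradient g."""
    m = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * g   # exponential moving average, started at zero
    m_hat = m / (1 - beta1 ** steps)      # divide by (1 - beta1^t) to undo the zero start
    return m, m_hat


raw_1, corrected_1 = first_moment(g=1.0, steps=1)
raw_3, corrected_3 = first_moment(g=1.0, steps=3)
print(raw_1, corrected_1)  # raw estimate is about 0.1; corrected is about 1.0
print(raw_3, corrected_3)  # raw estimate is about 0.271; corrected is about 1.0
```

The raw estimate underestimates the true gradient of 1.0 by exactly the factor (1 − β₁ᵗ), which is why dividing by that factor recovers it.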

Question 9

How is the numerical gradient computed in this simulation?

Accepted Answer

The simulation uses a central finite difference: df/dx ≈ (f(x+h,y) − f(x−h,y)) / (2h), and similarly for y. A small step h (typically 1×10⁻⁴) is used. Central differences are more accurate than forward differences because their truncation error is O(h²) rather than O(h).
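
The formula above translates directly into code. In this sketch the function names, the forward-difference comparison, and the test function f(x, y) = x²y + y³ are illustrative additions, not part of the simulation.

```python
def central_gradient(f, x, y, h=1e-4):
    """Central differences: truncation error of order h^2."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy


def forward_gradient(f, x, y, h=1e-4):
    """Forward differences, for comparison: truncation error of order h."""
    fxy = f(x, y)
    return (f(x + h, y) - fxy) / h, (f(x, y + h) - fxy) / h


def f(x, y):
    return x ** 2 * y + y ** 3


exact = (4.0, 13.0)  # analytic gradient (2xy, x^2 + 3y^2) at (1, 2)
central_err = max(abs(c - e) for c, e in zip(central_gradient(f, 1.0, 2.0), exact))
forward_err = max(abs(c - e) for c, e in zip(forward_gradient(f, 1.0, 2.0), exact))
print(central_err, forward_err)
```

For this function and h = 1×10⁻⁴, the central-difference error is around 10⁻⁸ while the forward-difference error is around 6×10⁻⁴, consistent with the O(h²) versus O(h) behaviour.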

Question 10

Why do all four optimisers start from the same point?

Accepted Answer

Starting all optimisers from identical initial conditions isolates the effect of the algorithm itself. If starting points differed, one optimiser might simply be luckier with initialisation rather than genuinely faster. Identical starts make the comparison fair and educational.

📉 Gradient Descent

