Question 1

What is Q-learning?

Accepted Answer

Q-learning is a model-free reinforcement learning algorithm that learns the value of taking each action in each state. It builds a table of Q(s,a) values purely from trial and reward, with no prior model of the environment, and eventually those values point to the best action everywhere.

Question 2

What is the agent trying to do here?

Accepted Answer

The yellow agent starts at the top-left corner and tries to reach the green goal cell, which gives a reward of +10. Red traps give -5 and end the episode, and every step costs -0.02, so the agent is pushed to find the shortest safe path.
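
The reward scheme above can be sketched in a few lines of Python. This is only an illustration, not the page's own code: the cell labels and function names are hypothetical, and it assumes that the step cost is also charged on the final move and that reaching the goal ends the episode.

```python
# Reward values taken from the answer above.
GOAL_REWARD = 10.0
TRAP_REWARD = -5.0
STEP_COST = -0.02


def step_reward(cell):
    """Return (reward, episode_over) for moving into `cell`.

    `cell` is a hypothetical label: "goal", "trap" or "empty".
    Assumption: the step cost is charged on terminal moves too,
    and reaching the goal ends the episode.
    """
    if cell == "goal":
        return STEP_COST + GOAL_REWARD, True
    if cell == "trap":
        return STEP_COST + TRAP_REWARD, True
    return STEP_COST, False


def episode_return(cells):
    """Sum the undiscounted rewards along a sequence of entered cells."""
    total = 0.0
    for cell in cells:
        reward, done = step_reward(cell)
        total += reward
        if done:
            break  # nothing after a terminal cell counts
    return total


short_path = ["empty"] * 3 + ["goal"]
long_path = ["empty"] * 7 + ["goal"]
print(episode_return(short_path) > episode_return(long_path))
```

Both paths reach the goal, but the longer one pays the step cost more often, which is why the shorter safe path scores higher.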

Question 3

What does the Bellman update do?

Accepted Answer

After each move the agent applies the Bellman update. The temporal-difference error is the gap between the target (the reward plus the discounted best Q-value available from the next state) and the current estimate Q(s,a). The learning rate alpha controls how much of that error is added to the estimate on each update.
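
Written out, the standard tabular rule is Q(s,a) ← Q(s,a) + alpha × [r + gamma × max over a' of Q(s',a') − Q(s,a)]. A minimal Python sketch of one such update follows. The dictionary-based table, the action names and the function signature are assumptions made for illustration, not the page's implementation.

```python
from collections import defaultdict

# Hypothetical action set for a four-connected grid.
ACTIONS = ("up", "down", "left", "right")


def q_update(Q, state, action, reward, next_state, done, alpha, gamma):
    """Apply one tabular Q-learning update and return the TD error."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    td_error = td_target - Q[(state, action)]
    # Move the estimate a fraction alpha of the way toward the target.
    Q[(state, action)] += alpha * td_error
    return td_error


Q = defaultdict(float)  # every Q-value starts at zero
err = q_update(Q, (0, 0), "right", -0.02, (0, 1), False, alpha=0.5, gamma=0.9)
print(err, Q[((0, 0), "right")])
```

With an all-zero table the first update sees only the step cost, so the TD error equals that cost and the estimate moves half of the way toward it when alpha is 0.5.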

Question 4

What do the learning rate, discount and exploration sliders change?

Accepted Answer

The learning rate alpha (0.01 to 1) sets how fast Q-values move toward new estimates; high values learn quickly but can be unstable. The discount gamma (0.1 to 0.99) weights future reward, so values near 1 make the agent plan further ahead. Exploration epsilon (0 to 1) is the probability of choosing a random action instead of the current best.
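
To get a feel for the discount slider, the sketch below computes how much a goal reward of 10 is worth when it lies a given number of moves away. It deliberately ignores the step cost, and the function name is hypothetical.

```python
def discounted_goal_value(gamma, steps, goal_reward=10.0):
    """Discounted value of a goal reward received on the `steps`-th move.

    Simplifying assumption: step costs are ignored, so the only reward
    is the goal reward, discounted once per earlier move.
    """
    return goal_reward * gamma ** (steps - 1)


for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(discounted_goal_value(gamma, steps=10), 3))
```

With gamma at 0.5 a goal ten moves away is worth only about 0.02 at the start, while with gamma at 0.99 it is still worth about 9.1, which is what "planning further ahead" means in practice.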

Question 5

What is the epsilon-greedy strategy?

Accepted Answer

With probability epsilon the agent picks a random action to explore, and with probability 1 - epsilon it picks the action with the highest known Q-value to exploit what it has learnt. In this page epsilon starts at the slider value and decays by 0.5% per episode, so the agent explores boldly early on and settles into exploitation later.
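
A minimal sketch of this strategy in Python is shown below. The action names and function names are hypothetical, and the decay is assumed to be multiplicative (epsilon is scaled by 0.995 each episode); the page may also apply a lower bound that is not described here.

```python
import random

# Hypothetical action set for a four-connected grid.
ACTIONS = ("up", "down", "left", "right")


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a dict mapping action -> Q-value."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: uniform random action
    # Exploit: highest Q-value; ties go to the first action in ACTIONS.
    return max(ACTIONS, key=lambda a: q_values[a])


def decay_epsilon(epsilon, rate=0.005):
    """Shrink epsilon by 0.5% (assumed multiplicative decay)."""
    return epsilon * (1.0 - rate)


epsilon = 0.3
q_values = {"up": 0.0, "down": 1.2, "left": -0.4, "right": 0.7}
action = choose_action(q_values, epsilon)
epsilon = decay_epsilon(epsilon)  # called once at the end of each episode
```

Under this assumption epsilon falls to roughly 60% of its starting value after 100 episodes, since 0.995 to the power 100 is about 0.61.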

Question 6

What do the colours and arrows on the grid mean?

Accepted Answer

Each cell's brightness encodes its maximum Q-value, so brighter cells are more valuable. Arrows show the greedy policy direction from that cell once its value is positive. The green star is the goal, red crosses are traps, dark-blue cells are walls and the yellow dot is the agent.

Question 7

Why does the agent appear to wander early on?

Accepted Answer

At the start every Q-value is zero, so the agent has no idea where the goal is and explores almost at random, especially with a high epsilon. As rewards propagate backwards through the Bellman update, a value gradient forms toward the goal and the wandering gives way to a clear, purposeful path.

Question 8

Is this a physically or mathematically accurate model?

Accepted Answer

Yes, for the idealised case it represents. Tabular Q-learning is proven to converge to the optimal action-value function in a finite Markov decision process, provided every state-action pair is visited infinitely often and the learning rate decays appropriately. The grid here is a faithful small MDP, though it uses a fixed step count rather than formal decay schedules.
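
The requirement that the learning rate "decays appropriately" is usually stated as the standard step-size conditions from stochastic approximation. The notation below is introduced here for illustration: alpha_t is the learning rate used at the t-th update of a given state-action pair.

```latex
\sum_{t=0}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=0}^{\infty} \alpha_t^{2} < \infty
```

A schedule such as alpha_t = 1/(t+1) satisfies both conditions. A constant learning rate satisfies the first but not the second, so in that case the formal guarantee does not strictly apply, even though the estimates usually settle close to the optimal values in a small grid.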

Question 9

What is the difference between the value function and the policy?

Accepted Answer

The value function tells you how good each state is, shown here as cell brightness from the maximum Q-value. The policy tells you what to do, shown as the greedy arrows. A good value function makes a good policy easy to read off: simply move toward the neighbouring cell with the highest value.
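
The relationship between the two can be sketched in Python as follows. The dictionary-based Q-table, the action names and the function names are assumptions for illustration; the rule of drawing an arrow only for positive values follows the grid description elsewhere in this FAQ.

```python
# Hypothetical action set for a four-connected grid.
ACTIONS = ("up", "down", "left", "right")


def state_value(Q, state):
    """Value of a state: the maximum Q-value over its actions."""
    return max(Q.get((state, a), 0.0) for a in ACTIONS)


def greedy_action(Q, state):
    """Policy for a state: the action with the highest Q-value."""
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))


def arrow_for(Q, state):
    """Return the greedy action only once the state's value is positive."""
    return greedy_action(Q, state) if state_value(Q, state) > 0 else None


# Unvisited state-action pairs default to zero.
Q = {((0, 0), "right"): 2.5, ((0, 0), "down"): 1.0}
print(state_value(Q, (0, 0)), greedy_action(Q, (0, 0)), arrow_for(Q, (1, 1)))
```

Both quantities are read from the same table: the value takes the maximum over actions, while the policy takes the action that achieves it.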

Question 10

Where is reinforcement learning used in the real world?

Accepted Answer

The same principles drive game-playing systems such as AlphaGo and Atari agents, robot control and locomotion, traffic-light and energy management, recommendation engines and the fine-tuning of large language models. Grid worlds like this one are the classic teaching environment because they make the value map and policy easy to visualise.

🤖 Reinforcement Learning — Q-Learning Grid World
