A Q-learning reinforcement learning agent in an 8×8 grid maze. The agent explores with an ε-greedy policy: it takes a random action with probability ε and the greedy (highest-Q) action otherwise. After each step the table is updated by Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)], where α is the learning rate, γ the discount factor, r the reward and s' the next state. The Q-table is visualised as a colour heatmap, and the greedy policy is shown as arrows. Three maze presets are available, each with walls and fire-penalty cells.
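
The ε-greedy choice and the update rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual code: the function names (choose_action, q_update, step, train, greedy_arrows), the reward values (+1 goal, −1 fire, −0.01 per step), the start cell (0,0) and the bottom-right goal are all assumptions made for the example.

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
ARROWS = "↑↓←→"                               # same order as ACTIONS


def choose_action(q_row, epsilon, rng):
    """ε-greedy: random action with probability ε, otherwise a best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def q_update(q, s, a, r, s_next, alpha, gamma):
    """Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def step(state, action, size, walls, fires, goal):
    """Hypothetical environment; the reward values are assumptions."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < size and 0 <= nc < size) or (nr, nc) in walls:
        nr, nc = r, c  # bumped into a wall or the border: stay put
    nxt = (nr, nc)
    if nxt == goal:
        return nxt, 1.0, True
    if nxt in fires:
        return nxt, -1.0, False
    return nxt, -0.01, False


def train(size=8, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1,
          walls=frozenset(), fires=frozenset(), goal=None,
          max_steps=200, seed=0):
    rng = random.Random(seed)
    goal = goal if goal is not None else (size - 1, size - 1)
    q = {(r, c): [0.0] * 4 for r in range(size) for c in range(size)}
    for _ in range(episodes):
        state = (0, 0)  # assumed start cell
        for _ in range(max_steps):
            a = choose_action(q[state], epsilon, rng)
            nxt, reward, done = step(state, a, size, walls, fires, goal)
            # The goal's own row is never updated, so it stays at 0 and the
            # bootstrap term vanishes on the final transition.
            q_update(q, state, a, reward, nxt, alpha, gamma)
            state = nxt
            if done:
                break
    return q


def greedy_arrows(q, size):
    """One text row per grid row, showing the greedy action in each cell."""
    return ["".join(ARROWS[max(range(4), key=lambda a: q[(r, c)][a])]
                    for c in range(size)) for r in range(size)]
```

Calling train(walls={(1, 1)}, fires={(2, 3)}) returns the Q-table as a dict keyed by (row, column); max(q[cell]) per cell is one plausible quantity to colour in a heatmap, and greedy_arrows(q, 8) gives a text version of the policy arrows.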