Adaptive Systems
July 2026 · 14 min read · Reinforcement Learning · Markov Decision Processes · Q-Learning · Last updated: 3 July 2026

Reinforcement Learning: Intro to Q-Learning

Written by MySimulator Team · Reviewed by MySimulator Editorial Review

No labelled dataset, no supervisor telling it the right answer — just an agent, an environment, and a trickle of reward. Q-learning, introduced by Chris Watkins in 1989, is the algorithm that made this setting tractable: it learns the value of every action in every situation purely from experience, and provably converges to the optimal way to behave.

1. Markov Decision Processes

Reinforcement learning problems are formalised as a Markov Decision Process (MDP): a tuple (S, A, P, R, γ) where S is the set of states, A is the set of actions, P(s'|s,a) is the transition probability of landing in state s' after taking action a in state s, R(s,a,s') is the reward received, and γ ∈ [0,1) is the discount factor that trades off immediate versus future reward.

The defining property is the Markov property: the future depends only on the current state, not on the history that led there. A grid-world agent's next position depends only on where it is now and which action it takes — not on the path it walked to get there. This memorylessness is what makes the problem tractable: the agent never needs to remember more than the present state.

The agent's behaviour is a policy π(a|s), a mapping from states to actions (or a probability distribution over actions). The goal of reinforcement learning is to find the policy π* that maximises expected cumulative discounted reward, the return:

G_t = R_t+1 + γR_t+2 + γ²R_t+3 + ... = Σ_{k=0}^∞ γᵏ R_t+k+1
Why discount? γ < 1 keeps the sum finite for infinite-horizon tasks, encodes a preference for reward sooner rather than later, and models uncertainty about the future — a reward promised 50 steps from now is worth less than one available immediately, because the episode might end or the environment might change before then.

2. Value Functions and the Bellman Equation

Rather than evaluate entire trajectories, RL algorithms decompose the problem using value functions. The state-value function Vπ(s) is the expected return starting from state s and following policy π thereafter. More useful for control is the action-value function, or Q-function:

Qπ(s,a) = Eπ[ G_t | S_t = s, A_t = a ]

Q(s,a) answers a very specific question: "if I take action a right now, and then follow policy π forever after, how much total discounted reward do I expect?" Once you know Q for the optimal policy, acting optimally is trivial — just pick argmaxa Q*(s,a) in every state. This is the entire premise of Q-learning: learn Q* directly, and the optimal policy falls out for free.

Value functions satisfy a recursive relationship called the Bellman equation, which expresses the value of a state in terms of the values of its successor states:

Qπ(s,a) = Σ_s' P(s'|s,a) [ R(s,a,s') + γ Σ_a' π(a'|s') Qπ(s',a') ]

For the optimal Q-function, the policy inside the recursion is replaced by a max, giving the Bellman optimality equation:

Q*(s,a) = Σ_s' P(s'|s,a) [ R(s,a,s') + γ max_a' Q*(s',a') ]

This equation is the mathematical heart of Q-learning. It says the optimal value of taking action a in state s equals the immediate reward, plus the discounted value of behaving optimally from whatever state comes next.

3. The Q-Learning Update Rule

The Bellman optimality equation is a fixed-point equation involving the (usually unknown) transition probabilities P and reward function R. Q-learning sidesteps needing a model of the environment entirely — it is a model-free, temporal-difference (TD) method that learns Q* directly from sampled transitions (s, a, r, s').

After observing a single transition, the agent nudges its current estimate of Q(s,a) toward a better estimate — the observed reward plus the discounted value of the best action in the next state:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ] └──────────── TD target ────────────┘ └TD error┘

Here α ∈ (0,1] is the learning rate, controlling how much each new experience overwrites the old estimate. The bracketed quantity r + γ max_a' Q(s',a') − Q(s,a) is the TD error — the discrepancy between what the agent predicted and what it actually observed one step later. Q-learning is off-policy: the update uses max_a' Q(s',a'), the value of the best next action, regardless of which action the agent actually takes next. The agent can behave exploratively while still learning about the greedy, optimal policy.

4. The Full Algorithm

In its tabular form, Q-learning maintains a table Q[s][a] with one entry per state-action pair, initialised arbitrarily (commonly to zero or small random values):

Initialize Q(s,a) arbitrarily for all s in S, a in A(s) Initialize Q(terminal, ·) = 0 for each episode: s ← initial state while s is not terminal: a ← choose action from s using policy derived from Q (e.g. ε-greedy) take action a, observe reward r and next state s' Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ] s ← s'

Each pass through the loop consumes exactly one interaction with the environment. Over thousands of episodes, values propagate backwards from states near the goal (where reward is received) toward states further away, gradually filling in an accurate map of long-term consequences for every state-action pair — a process often visualised as "reward flowing upstream" through the state graph.

5. Exploration vs Exploitation

If the agent always picks the action with the highest current Q-estimate (exploitation), it can get permanently stuck on a mediocre policy simply because it never tried the better action enough times to learn its true value. If it always acts randomly (exploration), it never capitalises on what it has learned. This tension is the exploration-exploitation trade-off, and every practical RL algorithm needs a strategy to balance it.

ε-Greedy

The simplest and most widely used strategy: with probability ε, take a uniformly random action; with probability 1−ε, take the greedy action argmaxa Q(s,a). ε typically starts high (e.g. 1.0, pure exploration) and is annealed toward a small floor (e.g. 0.05) over training, shifting the agent from discovery to exploitation as its estimates become trustworthy.

Softmax / Boltzmann Exploration

Instead of a hard random/greedy split, actions are sampled proportional to exp(Q(s,a)/τ), where τ is a temperature parameter. High τ makes the distribution near-uniform (exploration); low τ makes it sharply peaked at the best action (exploitation). Unlike ε-greedy, this naturally avoids wasting exploration on actions already known to be far worse than the best.

Optimistic Initialisation

Initialising all Q-values above their true expected value encourages exploration implicitly: any action tried gets its estimate revised downward toward reality, making untried actions look relatively more attractive until every action has been sampled enough to establish its real value.

6. Why Q-Learning Converges

Watkins and Dayan proved in 1992 that tabular Q-learning converges to Q* with probability 1, provided two conditions hold:

The intuition connects to fixed-point theory: the Bellman optimality operator is a contraction mapping in the max-norm, meaning repeated application shrinks the distance to the true Q* geometrically regardless of starting point. Each Q-learning update is a stochastic approximation of applying that operator, so under the conditions above, the noisy updates converge to the same fixed point that the exact operator would reach.

In practice: "infinitely often" and "decaying learning rate" are asymptotic guarantees. In finite training runs, a constant small α (e.g. 0.1) and a slowly-annealed ε usually work well enough, trading the convergence guarantee for the ability to keep adapting if the environment is non-stationary.

7. Q-Learning vs SARSA

SARSA (State-Action-Reward-State-Action) is Q-learning's closest relative and differs in exactly one line of the update rule:

Q-learning (off-policy): Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ] SARSA (on-policy): Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') − Q(s,a) ] where a' is the action actually taken next, chosen by the same ε-greedy policy the agent is currently following

Q-learning bootstraps from the best possible next action; SARSA bootstraps from the action the agent's own (exploring) policy would actually take. The practical consequence shows up sharply in risky environments: in a gridworld with a cliff edge next to the optimal path, Q-learning learns the objectively optimal — but risky — route hugging the cliff, because it evaluates as if future exploration mistakes won't happen. SARSA, accounting for its own ε-greedy randomness, learns a safer route further from the edge, because it correctly predicts that an exploratory step near the cliff could occasionally be catastrophic.

8. Scaling Up: Deep Q-Networks

A table Q[s][a] is fine when the state space is small — a few thousand grid cells. It is hopeless for raw pixels (an Atari frame has more possible states than atoms in the observable universe) or continuous sensor input. The Deep Q-Network (DQN), introduced by DeepMind in 2015, replaces the table with a neural network Q(s,a; θ) that takes a state and outputs a Q-value for every action, trained by gradient descent to minimise the squared TD error:

L(θ) = E[ ( r + γ max_a' Q(s',a'; θ⁻) − Q(s,a; θ) )² ]

Naively applying gradient descent to this loss is unstable, because the network is simultaneously the predictor and, through the bootstrapped target, the source of its own moving target. DQN introduced two stabilising tricks that became standard in nearly all deep RL since:

With these additions, the same underlying Q-learning update rule that works on a handful of gridworld states scales to learning directly from raw pixels, matching or exceeding human performance across dozens of Atari games with an identical architecture and no game-specific tuning.

🤖 Browse Adaptive Systems Simulations Watch agents learn policies from reward in interactive gridworld and control environments

9. Practical Pitfalls

Overestimation Bias

The max operator in the TD target systematically overestimates Q-values, because it takes the maximum over noisy estimates rather than the estimate of the true maximum. Double Q-learning corrects this by decoupling action selection from action evaluation using two independent value estimates, reducing the systematic bias.

Reward Sparsity

If reward only arrives at the end of a long episode (e.g. "win" or "lose" after 500 steps), the TD signal has almost nothing to propagate backward for most of training. Reward shaping (adding small intermediate rewards that guide behaviour), curiosity- driven exploration bonuses, and hindsight experience replay all address this in different ways.

The Deadly Triad

Sutton and Barto identify three ingredients that, combined, can cause divergence rather than convergence: function approximation (a neural network instead of a table), bootstrapping (updating estimates from other estimates, as the Bellman equation does), and off-policy learning (learning about a policy different from the one generating the data — exactly what Q-learning's max_a' term does). All three are individually valuable and jointly risky; DQN's target networks and replay buffers are, in essence, engineering countermeasures against this instability.

State and Action Discretisation

Tabular Q-learning requires a finite, discrete state-action space. Continuous problems must either be discretised (coarse bins lose precision, fine bins explode table size — the curse of dimensionality) or handled with function approximation (DQN) or algorithms designed natively for continuous actions, such as DDPG or SAC.