🎮 Reinforcement Learning · AI
📅 March 2026 ⏱ ≈ 9 min read 🟡 Intermediate

Reinforcement Learning Explained

Reinforcement learning (RL) teaches an agent to act well in an environment by trial and error. AlphaGo, Atari-playing AIs, ChatGPT's fine-tuning (RLHF), and robotic arms all use some form of RL. At the core are just two ideas: the Bellman equation and the Q-learning update rule.

Agent–Environment Loop

At every time step, the agent observes the state of the environment, chooses an action, and receives a reward and the next state.

Agent → aₜ → Environment → sₜ₊₁, rₜ → Agent

The goal is to find the policy π* that maximises the expected cumulative reward over time.
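The loop described above can be sketched in a few lines. Everything here is hypothetical and chosen only for illustration: the `CorridorEnv` class, its `reset`/`step` methods, and the two example policies are not taken from any library.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action 1 moves right, any other action moves left; walls clamp the position
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

def run_episode(env, policy, max_steps=100):
    """Observe state, choose action, receive reward and next state, repeat."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps

def always_right(state):
    return 1

def random_policy(state):
    return random.choice([0, 1])
```

Calling `run_episode(CorridorEnv(), always_right)` walks straight to the goal, while `random_policy` usually needs more steps to collect the same reward.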

Markov Decision Processes

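The formal framework for RL is the Markov Decision Process (MDP), defined by a tuple (S, A, P, R, γ): a set of states S, a set of actions A, the transition probabilities P(s' | s, a), the reward function R(s, a), and the discount factor γ between 0 and 1.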

The Markov property says the next state depends only on the current state and action, not on the history. In practice, the agent often doesn't know P or R — it must learn from experience.
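As a rough illustration, a small MDP can be written down as plain tables. The two states, two actions, and all numbers below are made up for this sketch; `sample_step` is a hypothetical helper that shows the Markov property in code: the sampled outcome depends only on the current state and action.

```python
import random

# Hypothetical two-state MDP. P[s][a] lists (probability, next_state) pairs;
# R[s][a] is the immediate reward. All values are illustrative assumptions.
S = ["cool", "hot"]
A = ["slow", "fast"]
P = {
    "cool": {"slow": [(1.0, "cool")], "fast": [(0.5, "cool"), (0.5, "hot")]},
    "hot": {"slow": [(0.5, "cool"), (0.5, "hot")], "fast": [(1.0, "hot")]},
}
R = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot": {"slow": 1.0, "fast": -10.0},
}
GAMMA = 0.9

def sample_step(state, action, rng=random):
    """Sample (reward, next_state) using only the current state and action."""
    probs, next_states = zip(*P[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return R[state][action], next_state
```

A model-free agent never sees `P` or `R` directly; it only sees the outcomes that something like `sample_step` returns.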

Returns and Discounting

The return Gₜ is the total reward from time t onwards. We don't weight future rewards equally — a reward now is better than the same reward far in the future. The discounted return is:

Gₜ = rₜ + γ·rₜ₊₁ + γ²·rₜ₊₂ + ... = Σₖ γᵏ rₜ₊ₖ

With γ = 0.99, a reward 100 steps away is worth only 0.99¹⁰⁰ ≈ 0.37 of a reward right now. γ = 0 means the agent is completely myopic; γ → 1 means it plans very far ahead. Typical values: 0.95–0.999.
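A short sketch of the discounted return, assuming the rewards of one episode are available as a list. The function name `discounted_return` is chosen here for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum over k of gamma**k * rewards[k], folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(round(0.99 ** 100, 3))                    # 0.366
```

Folding from the last reward backwards uses the recursion Gₜ = rₜ + γ·Gₜ₊₁, which is the same structure the Bellman equation relies on.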

Value Functions and Policy

The state-value function V(s) estimates the expected return when starting from state s under policy π:

Vπ(s) = 𝔼π[Gₜ | sₜ = s]

The action-value function Q(s, a) estimates the expected return when taking action a in state s, then following π:

Qπ(s, a) = 𝔼π[Gₜ | sₜ = s, aₜ = a]

If you know the optimal Q-values Q*(s, a), the optimal policy is simply to pick the action with the highest Q-value in every state: π*(s) = argmaxₐ Q*(s, a).
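Extracting a greedy policy from a table of Q-values might look like the sketch below. The table `Q_example` and its numbers are invented for illustration, and the nested-dictionary layout is an assumption.

```python
def greedy_policy(Q):
    """Map each state to the action with the highest Q-value."""
    return {s: max(q_row, key=q_row.get) for s, q_row in Q.items()}

def greedy_values(Q):
    """Value of acting greedily: V(s) = max over a of Q(s, a)."""
    return {s: max(q_row.values()) for s, q_row in Q.items()}

# Hypothetical Q-values for two states and two actions.
Q_example = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

print(greedy_policy(Q_example))  # {'s0': 'right', 's1': 'left'}
print(greedy_values(Q_example))  # {'s0': 0.7, 's1': 0.4}
```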

The Bellman Equation

The Bellman optimality equation for Q* expresses the recursive relationship between Q-values:

Q*(s, a) = 𝔼[ r + γ · maxₐ' Q*(s', a') ]

This says: the value of (state s, action a) is the immediate reward plus the discounted value of the best action from the next state. This self-consistency condition uniquely determines Q*.

Why it matters: The Bellman equation turns the problem of finding the optimal policy into a fixed-point problem. We can start with any Q-values and repeatedly apply the Bellman update. When γ < 1, each update shrinks the largest error by at least a factor of γ, so the iteration converges to Q*.
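The fixed-point iteration can be sketched for a case where P and R are known. The two-state MDP below is a made-up example (action 0 stays, action 1 moves to the other state, and staying in state 1 pays a reward of 1); the function name `q_value_iteration` and the data layout are assumptions of this sketch.

```python
def q_value_iteration(P, R, gamma, n_sweeps=500):
    """Repeatedly apply the Bellman optimality update to every (s, a) pair."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_sweeps):
        # The comprehension reads the old Q and builds a new table (synchronous update).
        Q = [
            [
                R[s][a] + gamma * sum(p * max(Q[s2]) for s2, p in P[s][a].items())
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return Q

# P[s][a] maps next_state -> probability; R[s][a] is the immediate reward.
P = [
    [{0: 1.0}, {1: 1.0}],  # from state 0: stay, move
    [{1: 1.0}, {0: 1.0}],  # from state 1: stay, move
]
R = [[0.0, 0.0], [1.0, 0.0]]

Q = q_value_iteration(P, R, gamma=0.9)
print([[round(q, 3) for q in row] for row in Q])  # [[8.1, 9.0], [10.0, 8.1]]
```

The result can be checked by hand: staying in state 1 forever is worth 1 / (1 − 0.9) = 10, moving there from state 0 is worth 0.9 · 10 = 9, and the two remaining entries are 0.9 · 9 = 8.1.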

Q-Learning

Q-learning is a model-free algorithm that uses sampled experience to converge to Q* without knowing the environment's transition probabilities. The update rule, applied after each (s, a, r, s') transition:

Q(s, a) ← Q(s, a) + α · [ r + γ · maxₐ' Q(s', a') − Q(s, a) ]

The part in brackets is the TD error (temporal difference error) — how wrong the current Q-value is relative to the Bellman target. α is the learning rate.
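A single update step, written out as a sketch. The `done` flag is an assumption added here (a terminal transition has no next state to bootstrap from), and the list-of-lists Q-table and the numbers are illustrative.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one Q-learning update in place and return the TD error."""
    bootstrap = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * bootstrap - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Two states, two actions; only Q[1][1] is non-zero to begin with.
Q = [[0.0, 0.0], [0.0, 1.0]]
delta = q_update(Q, s=0, a=1, r=0.5, s_next=1, alpha=0.1, gamma=0.9)
# Bellman target: 0.5 + 0.9 * 1.0 = 1.4, so the TD error is about 1.4
# and Q[0][1] moves a tenth of the way there, to about 0.14.
```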

For small discrete state/action spaces, Q-values are stored in a table. Example for a simple 2×2 grid with 4 movement actions:

State   Left   Right   Up    Down
s₀      0.0    0.8     0.2   0.1
s₁      0.3    0.1     0.9   0.4
s₂      1.0    0.0     0.6   0.2

The agent picks the highest-Q action in each state: Right in s₀, Up in s₁, and Left in s₂. For any finite MDP, Q-learning converges to the optimal values provided every state-action pair keeps being visited and the learning rate is decayed appropriately.
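Putting the update rule into a full training loop might look like the sketch below. The environment is a hypothetical five-cell corridor (not the grid from the table), the constants are illustrative, and the behaviour policy is uniformly random. That last choice is an assumption made to guarantee exploration; because Q-learning is off-policy, it can still learn the greedy-optimal values from random behaviour.

```python
import random

N_CELLS, LEFT, RIGHT = 5, 0, 1
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Deterministic corridor: reward 1.0 on reaching the right-most cell."""
    if action == RIGHT:
        nxt = min(state + 1, N_CELLS - 1)
    else:
        nxt = max(state - 1, 0)
    done = nxt == N_CELLS - 1
    return nxt, (1.0 if done else 0.0), done

def train(n_episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_CELLS)]
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            a = rng.choice((LEFT, RIGHT))          # uniform random behaviour
            s2, r, done = step(s, a)
            target = r if done else r + GAMMA * max(Q[s2])
            Q[s][a] += ALPHA * (target - Q[s][a])  # Q-learning update
            s = s2
    return Q

Q = train()
# Greedy action for each non-terminal cell.
greedy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(N_CELLS - 1)]
```

In this deterministic corridor the learned values for moving right approach 0.729, 0.81, 0.9, and 1.0 from the left-most cell onwards, which is the reward of 1 discounted by the number of steps remaining after the move.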

Exploration vs Exploitation

A pure greedy agent always picks the highest-Q action. But what if the Q-values are wrong early in training? It might miss better alternatives. The agent needs to explore.

ε-greedy

With probability ε, take a random action; otherwise take the greedy action. ε is typically annealed from 1.0 → 0.05 over training.

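import random

def select_action(Q, state, epsilon):
    # Q[state] is a sequence of Q-values, one per action index
    n_actions = len(Q[state])
    if random.random() < epsilon:
        return random.randrange(n_actions)                   # explore
    return max(range(n_actions), key=lambda a: Q[state][a])  # exploit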
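One way to anneal ε is a linear schedule. The text above only gives the start and end values, so the linear shape, the function name `linear_epsilon`, and its parameters are assumptions of this sketch.

```python
def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly interpolate epsilon over total_steps, then hold it at eps_end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)

for step in (0, 500, 1000, 2000):
    print(step, round(linear_epsilon(step, total_steps=1000), 3))
# 0 1.0
# 500 0.525
# 1000 0.05
# 2000 0.05
```

Early in training the agent acts almost entirely at random; by the end it exploits its Q-values 95% of the time while still exploring occasionally.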

Deep Q-Networks (DQN)

For large or continuous state spaces (e.g., raw pixels from an Atari game), a Q-table has too many entries to store. A Deep Q-Network (DQN) replaces the table with a neural network: Q(s, a; θ) ≈ Q*(s, a).

The network takes the state as input and outputs one Q-value for each possible action. In DeepMind's 2015 DQN work, the input was four stacked 84×84 grayscale frames; a CNN followed by two fully-connected layers output up to 18 Q-values (one per valid Atari joystick and button combination, with the exact count depending on the game).

Two stability tricks DQN introduced

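# DQN update (pseudocode)
for (s, a, r, s_next) in sample_batch(replay_buffer):
    target = r + GAMMA * max(target_net(s_next))  # bootstrap from the target network
    prediction = online_net(s)[a]                 # online network's Q-value for action a
    loss = mse(prediction, target)
    backprop(loss)                                # gradient step on the online network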
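The pseudocode above relies on two pieces: a replay buffer to sample from and a separate target network. A minimal sketch of both follows. The `ReplayBuffer` class, the `sync_target` helper, and the `sync_every` interval are hypothetical, and network parameters are stood in for by a plain dictionary rather than a real neural network.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) transitions, sampled uniformly."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the front
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample_batch(self, batch_size):
        # Uniform sampling breaks up the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def sync_target(online_params, target_params, step, sync_every=1000):
    """Return a fresh copy of the online parameters every sync_every steps."""
    if step % sync_every == 0:
        return copy.deepcopy(online_params)
    return target_params

buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.push(i, 0, 0.0, i + 1)
batch = buffer.sample_batch(2)  # two of the three most recent transitions
```

Keeping the target parameters fixed between syncs means the regression target does not shift with every gradient step on the online network.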

Beyond Q-Learning
