🤖 LLMs · Transformers · NLP
📅 March 2026 ⏱ ≈ 10 min read 🟡 Intermediate

How ChatGPT Works

ChatGPT generates text one token at a time by predicting the most likely continuation of a sequence. The model that does this is a transformer — a neural network architecture built on a mechanism called self-attention. Here's how it all connects, from raw text to a streamed response.

Tokenisation

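Language models don't see words; they see tokens: variable-length subword units drawn from a BPE (byte-pair encoding) vocabulary. GPT-2 and GPT-3 use a vocabulary of ~50,000 token types; newer OpenAI tokenisers are larger (roughly 100,000–200,000 types). Because the vocabulary is built up from bytes, any text in any language can be encoded.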

For example, the sentence "The transformer is brilliant!" becomes:

The trans former ␣is ␣brill iant !

(␣ = space prefix.) Common words are single tokens; rare words are split into two or more pieces. Code, maths, and non-English text typically need more tokens per character.
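To make the merge idea behind BPE concrete, here is a minimal sketch of how merges could be learned from a toy corpus. The function names (`learn_bpe`, `most_frequent_pair`, `merge_pair`) and the corpus are illustrative assumptions; production tokenisers operate on bytes, train on far larger corpora, and add pre-tokenisation rules that are not shown here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Start from characters and repeatedly merge the most frequent pair."""
    words = {tuple(w): c for w, c in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)  # the learned merge rules, most frequent first
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are still split into several pieces, which mirrors the common-versus-rare behaviour described above.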

Cost implication: OpenAI and other providers charge per token. GPT-4o at its 2024 launch: about $5 per million input tokens and $15 per million output tokens (approximate). A 10-page PDF is roughly 3,000–5,000 tokens.
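The per-token pricing turns into simple arithmetic. The sketch below uses the approximate prices quoted above as defaults; the function name `estimate_cost_usd` and the token counts in the example are assumptions for illustration.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m=5.0, output_price_per_m=15.0):
    """Cost in USD, given prices per one million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000

# Example: a ~4,000-token document plus a 500-token answer.
cost = estimate_cost_usd(input_tokens=4_000, output_tokens=500)
print(f"${cost:.4f}")
```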

Token Embeddings

Each token ID is looked up in an embedding matrix (shape: [vocab_size × d_model]) to produce a dense vector of real numbers. For the largest GPT-3 model, d_model is 12,288; the value for GPT-4 has not been published.

x = Embedding[token_id] ∈ ℝ^{d_model}

Semantically similar words cluster together in embedding space. The classic illustration, from word2vec-style word embeddings, is "king" − "man" + "woman" ≈ "queen". The model learns these relationships purely from co-occurrence statistics during pre-training.
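The lookup and the vector arithmetic can be sketched with a tiny hand-built table. The three-dimensional vectors below are hypothetical and were picked by hand so the analogy works; real embeddings are learned and have thousands of dimensions.

```python
import math

# Hypothetical embeddings; axes are roughly (royalty, male, female).
EMBEDDING = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z
              for x, y, z in zip(EMBEDDING[a], EMBEDDING[b], EMBEDDING[c])]
    candidates = [w for w in EMBEDDING if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMBEDDING[w], target))

print(analogy("king", "man", "woman"))
```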

Positional Encoding

Self-attention (next section) treats tokens as a set, not a sequence — it has no notion of order. Positional encodings are added to each token embedding to inject position information.

GPT-2 and GPT-3 use learned positional embeddings: each position 0…n−1 gets its own learnable vector, added to the token embedding. Newer models (e.g., Llama, Mistral) use rotary positional embeddings (RoPE), which rotate the query and key vectors so that attention scores depend on relative position, and which tend to generalise better to longer sequences.
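A minimal sketch of the learned-position variant: the input to the first block is the token vector plus the vector for its position. The table sizes, the random initialisation, and the names `init_table` and `embed` are assumptions for illustration; in a real model both tables are trained.

```python
import random

def init_table(rows, d_model, rng):
    """Small random table standing in for a learned embedding matrix."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(rows)]

def embed(token_ids, token_table, position_table):
    """Add the position vector to each token vector, element by element."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

rng = random.Random(0)
token_table = init_table(rows=10, d_model=4, rng=rng)    # vocab_size = 10
position_table = init_table(rows=8, d_model=4, rng=rng)  # max length = 8

# Token 3 appears at positions 0 and 2 and gets a different vector each time.
x = embed([3, 7, 3], token_table, position_table)
print(x[0] != x[2])
```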

Self-Attention

Self-attention is the core mechanism that lets the model relate any token to any other token in the sequence. For each token, three vectors are computed by multiplying the token embedding by three weight matrices:

Q = x · W_Q (Query: "what am I looking for?")
K = x · W_K (Key: "what do I represent?")
V = x · W_V (Value: "what do I communicate if attended?")

The attention output is:

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V

The Q·Kᵀ dot product gives a score for every token pair: how much should token i attend to token j? Dividing by √d_k keeps the scores from growing with the key dimension d_k, which would otherwise push softmax into a saturated region with very small gradients. Softmax normalises each row of scores into weights that sum to 1. The final output is a weighted sum of the Value vectors.
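The scaling factor can be motivated with a short variance calculation, under the simplifying assumption (not part of the text above) that the components of a query q and a key k are independent with zero mean and unit variance:

```latex
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
  = d_k
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

So the raw scores have a standard deviation of about √d_k, and dividing by that value brings them back to unit scale regardless of the key dimension.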

Multi-head attention runs this process h times in parallel (with separate weight matrices), then concatenates and projects the results — allowing the model to attend to different aspects of the context simultaneously.

Causal masking: during both training and generation, a token is not allowed to attend to tokens that come after it. Entries of the Q·Kᵀ score matrix with j > i are set to −∞ before the softmax, so token i can only see tokens 0…i.
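The attention formula and the causal mask fit in a few lines of plain Python. This is a single-head sketch that takes Q, K, and V as given (the weight matrices and the multi-head split are left out); the name `causal_attention` and the toy inputs are assumptions for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries of -inf become exactly 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Q, K, V are lists of per-token vectors. Returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w[j] * V[j][c] for j in range(len(V)))
            for c in range(len(V[0]))
        ])
    return outputs, weights

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(vectors, vectors, vectors)
for row in weights:
    print([round(p, 3) for p in row])
```

Each printed row sums to 1 and has zeros to the right of the diagonal, so the first token can only attend to itself.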

The Transformer Block

One transformer block consists of two sub-layers, each wrapped in layer normalisation and a residual (skip) connection:

1. Multi-Head Self-Attention

Each token attends to itself and all previous tokens. The output is added back to the input via a residual connection.

2. Feed-Forward Network (FFN)

A 2-layer MLP applied independently to each token position. The hidden layer is typically 4× the model width. Much of the model's "factual" knowledge is thought to be stored here.

3. Layer Norm + Residual

Applied around both sub-layers: normalise activations and add each sub-layer's input directly to its output (skip connection). GPT-2 and later models normalise before each sub-layer; the original Transformer normalised after. The skip connections let gradients flow through a very deep stack of layers.

The largest GPT-3 model stacks 96 such blocks; the depth of GPT-4 has not been published. The final hidden state of each token is projected through a language model head (an unembedding matrix, shape [d_model × vocab_size]) to produce logits: one number per vocabulary token.
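The wiring of a block can be sketched as follows. Several simplifications are assumptions of this sketch rather than details from the text: normalisation is applied before each sub-layer (the GPT-2 style), the FFN uses ReLU where GPT models use GELU, layer norm has no learned scale or bias, and `causal_mean` is a hypothetical stand-in for real multi-head attention that simply averages the current and earlier positions.

```python
import math
import random

def layer_norm(v, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def linear(v, W):
    """W has len(v) rows and one column per output feature."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def ffn(v, W1, W2):
    """2-layer MLP applied to a single token position."""
    hidden = [max(0.0, h) for h in linear(v, W1)]  # ReLU for simplicity
    return linear(hidden, W2)

def causal_mean(xs):
    """Stand-in for attention: average of positions 0..i for each i."""
    out = []
    for i in range(len(xs)):
        prefix = xs[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def transformer_block(xs, W1, W2, mix=causal_mean):
    # Sub-layer 1: token mixing, added back through a residual connection.
    mixed = mix([layer_norm(x) for x in xs])
    xs = [[a + b for a, b in zip(x, m)] for x, m in zip(xs, mixed)]
    # Sub-layer 2: per-token FFN, again with a residual connection.
    return [[a + b for a, b in zip(x, ffn(layer_norm(x), W1, W2))] for x in xs]

rng = random.Random(0)
d_model = 4
W1 = [[rng.gauss(0, 0.1) for _ in range(4 * d_model)] for _ in range(d_model)]
W2 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(4 * d_model)]
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]

hidden = tokens
for _ in range(2):  # weights are reused for brevity; real blocks have their own
    hidden = transformer_block(hidden, W1, W2)
print(len(hidden), len(hidden[0]))
```

The block maps a list of d_model-sized vectors to a list of the same shape, which is what makes it possible to stack blocks.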

Autoregressive Generation

At each step, the model takes the entire context (prompt + tokens generated so far), runs a forward pass, reads the logits for the last position, and samples one new token. That token is appended to the context; the process repeats.

prompt → [forward pass] → logits → sample → new_token
prompt + new_token → [forward pass] → logits → sample → next_token
...repeat until <end-of-sequence>

This is why generation is inherently sequential and why "streaming" works — each token can be sent to the user as soon as it's sampled.
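The loop above can be sketched with a generator, which also shows why streaming falls out naturally. `toy_logits` is a hypothetical stand-in for the model's forward pass, and the four-token vocabulary is invented for the example; greedy selection is used here in place of sampling so the result is deterministic.

```python
VOCAB = ["<eos>", "the", "cat", "sat"]
EOS_ID = 0

def toy_logits(context):
    """Stand-in for a forward pass: favours the token ID after the last one."""
    next_id = (context[-1] + 1) % len(VOCAB)
    return [5.0 if i == next_id else 0.0 for i in range(len(VOCAB))]

def greedy(logits):
    """Pick the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

def generate(prompt_ids, max_new_tokens=10, pick=greedy):
    """Yield one new token per step, stopping at end-of-sequence."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(context)   # forward pass over the whole context
        new_token = pick(logits)       # choose the next token
        context.append(new_token)      # feed it back in
        yield new_token                # can be sent to the user immediately
        if new_token == EOS_ID:
            break

for token_id in generate([1]):  # prompt: "the"
    print(VOCAB[token_id])
```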

Temperature and Sampling

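The logits are converted to probabilities with softmax, but first every logit is divided by a temperature parameter T:

p(token) = softmax( logits / T )

T < 1 sharpens the distribution (more deterministic output), T > 1 flattens it (more varied output), and as T → 0 sampling approaches greedy argmax decoding.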
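A small sketch of the formula, assuming a made-up three-token logit vector, shows how the probability of the top token changes as T varies. The function name `softmax_with_temperature` is illustrative.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """softmax(logits / T), computed in a numerically stable way."""
    if T <= 0:
        raise ValueError("T must be positive")
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

The top token's probability is highest at T = 0.5 and lowest at T = 2.0, while the ranking of the tokens stays the same.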

Top-k and Top-p (nucleus) sampling

Top-k restricts sampling to the k most likely tokens (k = 40–200 typical). Top-p restricts to the smallest set of tokens whose cumulative probability exceeds p (p = 0.9 typical). Both prevent sampling extremely unlikely garbage tokens while maintaining diversity.
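Both filters can be sketched as functions that take a probability list and return the surviving token IDs with renormalised probabilities. The function names and the four-token distribution are assumptions for illustration; real implementations usually operate on logits in batches.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely token IDs and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng=random):
    """Draw one token ID from a {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(top_k_filter(probs, k=2)))    # IDs kept by top-k
print(sorted(top_p_filter(probs, p=0.9)))  # IDs kept by top-p
print(sample(top_p_filter(probs, p=0.9), random.Random(0)))
```

With this distribution, top-k with k = 2 keeps two tokens, while top-p with p = 0.9 needs three before the cumulative probability reaches the threshold; the least likely token is dropped in both cases.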

Pre-training and RLHF

Pre-training

The base GPT model is trained to predict the next token on a massive text corpus (hundreds of billions to trillions of tokens scraped from the web, books, code, etc.). Loss is cross-entropy. This is entirely self-supervised — no human labels required.
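The cross-entropy loss for a single position is the negative log-probability the model assigns to the token that actually came next. A minimal sketch, with the function name `next_token_loss` and the toy logits as assumptions:

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log softmax(logits)[target_id]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_id]

# Uniform logits over 4 tokens: the loss equals ln(4).
print(next_token_loss([0.0, 0.0, 0.0, 0.0], target_id=2))
# A confident, correct prediction gives a loss close to 0.
print(next_token_loss([10.0, 0.0, 0.0, 0.0], target_id=0))
```

During pre-training this quantity is averaged over every position in every sequence, and the "label" at each position is simply the next token in the text.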

Instruction fine-tuning (SFT)

The base model is fine-tuned on human-written examples of the form "User: [question] → Assistant: [good answer]". This teaches the model to be helpful and to follow instructions, not to just complete text with the most statistically likely continuation.

RLHF (Reinforcement Learning from Human Feedback)

A separate reward model is trained on human preference data (pairs of responses ranked by quality). The main model is then fine-tuned with PPO (Proximal Policy Optimisation) to generate responses that the reward model scores highly. This stage pushes ChatGPT towards helpful, safe, and coherent responses, though it does not guarantee them.
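The text does not say which loss the reward model uses. A commonly used choice for ranked pairs is a pairwise (Bradley-Terry style) loss, sketched here as an assumption: the loss is small when the preferred response receives the higher reward. The name `preference_loss` and the scalar rewards are illustrative.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

print(preference_loss(2.0, 0.0))  # reward model agrees with the human ranking
print(preference_loss(0.0, 2.0))  # reward model disagrees: larger loss
```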

Why does ChatGPT sometimes make things up? The model is fundamentally a next-token predictor, not a knowledge lookup. It will confidently generate a plausible-sounding continuation even when it doesn't "know" the answer — a phenomenon called hallucination. RLHF reduces but doesn't eliminate it. Mitigation approaches include retrieval-augmented generation (RAG) and chain-of-thought prompting.