How ChatGPT Works
ChatGPT generates text one token at a time by predicting the most likely continuation of a sequence. The model that does this is a transformer — a neural network architecture built on a mechanism called self-attention. Here's how it all connects, from raw text to a streamed response.
Tokenisation
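Language models don't see words; they see tokens: variable-length subword units drawn from a vocabulary built with byte-pair encoding (BPE). GPT-2 and GPT-3 use a vocabulary of roughly 50,000 token types, and the tokenizer used by GPT-3.5 and GPT-4 has roughly 100,000. Because the BPE used by GPT models falls back to individual bytes, any text can be encoded, although text that is rare in the training data needs more tokens.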
For example, the sentence "The transformer is brilliant!" is split into a short sequence of tokens, with each leading space attached to the token that follows it (often displayed as ␣). Common words are single tokens; rare words are split into two or more. Code, maths, and non-English text typically need more tokens per character.
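As a rough illustration of how BPE merge rules turn characters into subword tokens, here is a minimal sketch. The merge list `TOY_MERGES` and the function name `bpe_encode` are hypothetical; a real tokenizer learns tens of thousands of merges from data and works on bytes rather than characters.

```python
def bpe_encode(word, merges):
    """Apply BPE merge rules, in priority order, to a single word."""
    symbols = list(word)  # start from individual characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse an adjacent (left, right) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, for illustration only.
TOY_MERGES = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

print(bpe_encode("the", TOY_MERGES))    # ['the']
print(bpe_encode("thing", TOY_MERGES))  # ['th', 'ing']
print(bpe_encode("zebra", TOY_MERGES))  # ['z', 'e', 'b', 'r', 'a']
```

The frequent word ends up as one token, while the word with no matching merges stays split into many small pieces, which is the pattern described above.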
Token Embeddings
Each token ID is looked up in an embedding matrix (shape: [vocab_size × d_model]) to produce a dense vector of real numbers. For the largest GPT-3 model, d_model is 12,288; the corresponding figure for GPT-4 has not been published.
Semantically similar words tend to cluster together in embedding space, and some relationships show up as approximate vector arithmetic, the classic example being "king" − "man" + "woman" ≈ "queen". The model learns these relationships purely from statistics during pre-training.
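The lookup and the analogy can be sketched with a tiny hand-made table. Everything here is an assumption for illustration: the five-word vocabulary, the 3-dimensional vectors, and the helper names are invented, and the vectors were chosen by hand so that the analogy works, whereas a real model learns its embeddings.

```python
import math

# Hypothetical vocabulary and hand-picked 3-dimensional embeddings.
VOCAB = {"king": 0, "queen": 1, "man": 2, "woman": 3, "apple": 4}
EMBEDDINGS = [
    [0.9, 0.8, 0.1],  # king
    [0.9, 0.1, 0.1],  # queen
    [0.1, 0.8, 0.1],  # man
    [0.1, 0.1, 0.1],  # woman
    [0.0, 0.4, 0.9],  # apple
]


def embed(word):
    """Look up a word's row in the embedding matrix via its token ID."""
    return EMBEDDINGS[VOCAB[word]]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar to vector."""
    candidates = [w for w in VOCAB if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, embed(w)))


target = [k - m + w for k, m, w in zip(embed("king"), embed("man"), embed("woman"))]
print(nearest(target, exclude=("king", "man", "woman")))  # queen
```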
Positional Encoding
Self-attention (next section) treats tokens as a set, not a sequence — it has no notion of order. Positional encodings are added to each token embedding to inject position information.
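GPT-2 and GPT-3 use learned absolute positional embeddings: each position from 0 up to the maximum context length gets its own learnable vector, which is added to the token embedding. Newer models (e.g., Llama, Mistral) use rotary positional embedding (RoPE), which rotates the query and key vectors by a position-dependent angle instead of adding a vector, so that attention scores depend on relative position; this tends to generalise better to longer sequences.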
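A minimal sketch of the learned, additive scheme follows. The table sizes, the random initialisation, and the names `token_table`, `position_table`, and `embed_sequence` are assumptions; in a trained model both tables are learned parameters.

```python
import random


def make_table(rows, cols, seed):
    """Random stand-in for a learned parameter table."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


VOCAB_SIZE, MAX_POSITIONS, D_MODEL = 10, 8, 4  # toy sizes
token_table = make_table(VOCAB_SIZE, D_MODEL, seed=0)
position_table = make_table(MAX_POSITIONS, D_MODEL, seed=1)


def embed_sequence(token_ids):
    """Input vector at each position = token embedding + position embedding."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


x = embed_sequence([3, 7, 3])
# The same token (ID 3) at positions 0 and 2 gets different input vectors.
print(x[0] != x[2])
```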
Self-Attention
Self-attention is the core mechanism that lets the model relate any token to any other token in the sequence. For each token, three vectors are computed by multiplying the token embedding by three weight matrices:
Q = x · WQ (Query: "what am I looking for?")
K = x · WK (Key: "what do I represent?")
V = x · WV (Value: "what do I communicate if attended?")
The attention output is:
Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) · V
The Q·Kᵀ dot product gives a score for every token pair: how much should token i attend to token j? Dividing by √dk, where dk is the dimension of the key vectors, keeps the scores from growing with dk, which would otherwise saturate the softmax and shrink its gradients. Softmax normalises each row of scores into weights that sum to 1. The final output for each token is the correspondingly weighted sum of the Value vectors.
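The computation softmax(Q·Kᵀ / √dk) · V can be sketched in plain Python as below. This is an illustrative sketch, not an efficient implementation: the function names are made up, matrices are lists of rows, and the causal restriction (each position sees only itself and earlier positions) is the variant used in GPT-style decoders.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Scaled dot-product attention; position i attends to positions 0..i."""
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Scaled scores against the allowed positions only.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the Value vectors.
        outputs.append(
            [sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(len(V[0]))]
        )
    return outputs


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention(Q, K, V)
print(out[0])  # [1.0, 2.0]: the first token can only attend to itself
print(out[1])  # a blend of both Value rows, weighted towards the second
```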
Multi-head attention runs this process h times in parallel (with separate weight matrices), then concatenates and projects the results — allowing the model to attend to different aspects of the context simultaneously.
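A compact sketch of the split, attend, concatenate, project pattern follows. The names (`multi_head_attention`, `heads`, `W_o`) and the toy weights are assumptions, and the causal mask is omitted to keep the example short.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Unmasked scaled dot-product attention."""
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


def multi_head_attention(x, heads, W_o):
    """heads: one (W_q, W_k, W_v) tuple per head; W_o: output projection."""
    head_outputs = [
        attention(matmul(x, W_q), matmul(x, W_k), matmul(x, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate per-head outputs along the feature axis, then project.
    concatenated = [sum((h[i] for h in head_outputs), []) for i in range(len(x))]
    return matmul(concatenated, W_o)


# Toy weights: head 1 reads the first two features, head 2 the last two.
first_half = [[1, 0], [0, 1], [0, 0], [0, 0]]
second_half = [[0, 0], [0, 0], [1, 0], [0, 1]]
identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
heads = [(first_half,) * 3, (second_half,) * 3]

print(multi_head_attention([[1, 2, 3, 4]], heads, identity))
```

With a single token there is nothing else to attend to, so each head returns its own slice of the input and the concatenation reproduces the original vector.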
The Transformer Block
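One transformer block consists of two sub-layers, multi-head self-attention and a feed-forward network, each wrapped in a residual (skip) connection and paired with layer normalisation. The original transformer normalises after each sub-layer; GPT-2 and later GPT models normalise each sub-layer's input instead. The components are: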
Multi-Head Self-Attention
Each token attends to itself and all previous tokens; a causal mask hides later positions. The output is added back via the residual connection.
Feed-Forward Network (FFN)
A 2-layer MLP applied independently to each token position. Its hidden layer is typically 4× the model width. These layers are thought to store much of the model's "factual" knowledge.
Layer Norm + Residual
Normalise activations; add input directly to output (skip connection). Allows gradients to flow through hundreds of layers.
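How the pieces fit together can be sketched as follows. This sketch assumes the pre-norm ordering (normalise, apply the sub-layer, add the result back to the input) used in GPT-2-style models; a post-norm block would apply `layer_norm` after the addition instead. The sub-layers are passed in as plain functions, and the learned gain and bias of layer normalisation are left out for brevity.

```python
import math


def layer_norm(v, eps=1e-5):
    """Rescale one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def transformer_block(x, attention, ffn):
    """x: list of token vectors. attention maps a whole sequence; ffn maps one vector."""
    # Sub-layer 1: attention mixes information across positions.
    attended = attention([layer_norm(v) for v in x])
    x = [[a + b for a, b in zip(v, d)] for v, d in zip(x, attended)]
    # Sub-layer 2: the FFN transforms each position independently.
    x = [[a + b for a, b in zip(v, ffn(layer_norm(v)))] for v in x]
    return x


# Hypothetical stand-ins for the two sub-layers.
def average_attention(seq):
    """Toy attention: every position receives the mean of all positions."""
    mean = [sum(col) / len(seq) for col in zip(*seq)]
    return [list(mean) for _ in seq]


def toy_ffn(v):
    return [max(0.0, a) for a in v]  # a ReLU standing in for the 2-layer MLP


print(transformer_block([[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]], average_attention, toy_ffn))
```

Because each sub-layer only adds to its input, a sub-layer that outputs zeros leaves the token vectors unchanged, which is part of why very deep stacks remain trainable.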
The largest GPT-3 model stacks 96 such blocks; GPT-4's depth has not been published. After the last block, the final hidden state of each token is projected through a language model head (an unembedding matrix, shape [d_model × vocab_size]) to produce logits: one number per vocabulary token.
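The projection to logits is a single matrix product. A minimal sketch, with a hypothetical function name and toy sizes (d_model = 2, vocab_size = 3):

```python
def lm_head(hidden, W_unembed):
    """hidden: d_model floats. W_unembed: d_model rows by vocab_size columns."""
    # One dot product per vocabulary entry (per column of the matrix).
    return [sum(h * w for h, w in zip(hidden, column)) for column in zip(*W_unembed)]


W_unembed = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
logits = lm_head([1.0, 2.0], W_unembed)
print(logits)  # [1.0, 2.0, 3.0]
```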
Autoregressive Generation
At each step, the model takes the entire context (prompt + tokens generated so far), runs a forward pass, reads the logits for the last position, and samples one new token. That token is appended to the context; the process repeats.
context → [forward pass] → logits → sample → next_token → append to context
...repeat until <end-of-sequence> or a length limit
This is why generation is inherently sequential and why "streaming" works — each token can be sent to the user as soon as it's sampled.
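The loop can be sketched with a stand-in for the model. `toy_model` is a hypothetical placeholder for a real transformer forward pass: it just favours "last token + 1" over a five-token vocabulary in which ID 0 plays the role of end-of-sequence.

```python
import math
import random

EOS = 0  # hypothetical end-of-sequence token ID


def toy_model(context):
    """Placeholder forward pass: logits for the token after the last position."""
    vocab_size = 5
    logits = [0.0] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 5.0
    return logits


def sample(logits, rng):
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]  # unnormalised softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(prompt, max_new_tokens=20, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = sample(toy_model(context), rng)  # forward pass, then sample
        context.append(token)  # a streaming server could emit the token here
        if token == EOS:
            break
    return context


print(generate([1, 2]))
```

Each iteration needs the token produced by the previous one, so the steps cannot run in parallel. Production systems cache intermediate attention results between steps rather than recomputing the whole context, but the dependency between steps remains.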
Temperature and Sampling
The logits z are converted to probabilities with softmax, but before that, a temperature parameter T divides every logit:
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
- T → 0: Greedy — always pick the highest-probability token. Deterministic but repetitive.
- T = 1: Use the model's unmodified distribution. A common default, and a reasonable choice for creative tasks.
- T > 1: Flatter distribution — more surprising/random output.
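The effect of T is easy to see numerically. A minimal sketch (the function name is illustrative; T must be greater than 0, and the T → 0 limit is normally implemented as a plain argmax):

```python
import math


def softmax_with_temperature(logits, T=1.0):
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token; higher temperatures spread it out.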
Top-k and Top-p (nucleus) sampling
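Top-k restricts sampling to the k most likely tokens (k = 40 to 200 is typical). Top-p restricts sampling to the smallest set of tokens whose cumulative probability reaches p (p = 0.9 is typical), so the size of the set adapts to how confident the model is. In both cases the surviving probabilities are renormalised before sampling. Both prevent sampling extremely unlikely garbage tokens while maintaining diversity.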
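Both filters can be sketched as functions that zero out excluded tokens and renormalise the rest. The function names are illustrative, and this version keeps tokens until the cumulative probability reaches or exceeds p.

```python
def renormalise(probs, keep):
    """Zero out every index not in keep and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])


def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    return renormalise(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))    # only the two most likely tokens survive
print(top_p_filter(probs, p=0.9))  # three tokens are needed to reach 0.9
```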
Pre-training and RLHF
Pre-training
The base GPT model is trained to predict the next token on a massive text corpus (hundreds of billions to trillions of tokens scraped from the web, books, code, etc.). Loss is cross-entropy. This is entirely self-supervised — no human labels required.
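The training objective can be sketched as the mean cross-entropy between the logits at each position and the token that actually comes next. The function name and the toy inputs are assumptions; a real training loop computes this over large batches and backpropagates through the model.

```python
import math


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    losses = []
    # The logits at position t are scored against the token at position t+1.
    for logits, target in zip(logits_per_position, token_ids[1:]):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp
        losses.append(log_z - logits[target])  # equals -log softmax(logits)[target]
    return sum(losses) / len(losses)


tokens = [0, 1, 2]
uniform = [[0.0] * 4 for _ in tokens]           # a model with no preference
confident = [[0.0, 9.0, 0.0, 0.0], [0.0, 0.0, 9.0, 0.0], [0.0] * 4]
print(next_token_loss(uniform, tokens))    # log(4), about 1.386
print(next_token_loss(confident, tokens))  # close to 0
```

The targets are simply the input shifted by one position, which is why no human labelling is needed.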
Instruction fine-tuning (SFT)
The base model is fine-tuned on human-written examples of the form "User: [question] → Assistant: [good answer]". This teaches the model to be helpful and to follow instructions, not to just complete text with the most statistically likely continuation.
RLHF (Reinforcement Learning from Human Feedback)
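A separate reward model is trained on human preference data (pairs of responses ranked by quality). The main model is then fine-tuned with PPO (Proximal Policy Optimisation) to generate responses that the reward model scores highly, usually with a penalty that keeps it from drifting too far from the instruction-tuned model. This step pushes ChatGPT towards responses that people rate as helpful, safe, and coherent; it reduces erratic output but does not eliminate it.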
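One common way to train a reward model on ranked pairs is a pairwise logistic loss, -log σ(r_chosen − r_rejected), which is small when the preferred response receives the higher score. The sketch below assumes that loss; the source does not say which loss ChatGPT's reward model uses, and the function name is illustrative.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(preference_loss(2.0, 0.0))  # small: the preferred response scored higher
print(preference_loss(0.0, 2.0))  # large: the ranking is inverted
```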