How AI Thinks: Neural Networks, Training & Inference
ChatGPT, Midjourney, and AlphaFold all work on the same basic principle: feed data into a network of billions of numbers (parameters), adjust those numbers until the network produces useful outputs, then use the trained network to generate new text, images, or predictions. Here's what actually happens inside.
1. The Artificial Neuron
An artificial neuron takes several numbers as input, multiplies each by a weight (how important is this input?), adds them up, adds a bias term, and passes the result through an activation function (to add non-linearity).
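As a minimal sketch of the description above, the snippet below implements a single neuron in plain Python. The sigmoid activation and the input, weight, and bias values are assumptions chosen for illustration; real networks use many different activation functions.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Hypothetical values, picked so the weighted sum is exactly zero.
output = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.0], bias=0.0)
print(output)  # 0.5, because 0.5 - 0.5 + 0.0 = 0 and sigmoid(0) = 0.5
```

The third input has weight 0.0, so changing it has no effect on the output: the weight really does encode how much an input matters.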
The weights and biases are the network's parameters: the numbers it learns during training. OpenAI has not published GPT-4's size; unofficial estimates put it at around 1.8 trillion parameters. Each parameter is just a floating-point number, typically stored in 16 bits (2 bytes).
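To make the storage arithmetic concrete, here is a small sketch that multiplies a parameter count by the bytes per parameter. The function name and the 7-billion-parameter example are hypothetical; the result counts only the parameters themselves, not the extra memory needed during training or inference.

```python
def parameter_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Memory needed just to store the parameters, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# A hypothetical 7-billion-parameter model at 16 bits (2 bytes) per parameter.
small = parameter_memory_gb(7e9)
# The 1.8 trillion estimate quoted above, also at 2 bytes per parameter.
large = parameter_memory_gb(1.8e12)
print(small, large)  # 14.0 3600.0
```

At 3,600 GB, a model of the larger size could not fit on a single accelerator and would have to be split across many devices.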
2. Layers & Depth
Neurons are arranged in layers. A "deep" neural network has many layers stacked on top of each other:
- Input layer: Receives raw data (pixels, words as numbers, sensor readings).
- Hidden layers: Process and transform the data. Each layer tends to extract increasingly abstract features. In an image model, for example, layer 1 might detect edges; layer 5 might detect eyes; layer 10 might detect faces.
- Output layer: Produces the final answer (probability distribution over classes, next word prediction, pixel values for an image).
Why depth matters: A shallow network (1 hidden layer) can in theory approximate any continuous function on a bounded input range, but it might need an astronomically wide layer. Deep networks are usually far more efficient: they compose simple transformations into complex ones, reusing intermediate features. This is why "deep learning" works: depth enables compositionality.
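The following sketch illustrates composition with the smallest useful example: a network with one hidden layer that computes XOR, something no single neuron can do because XOR is not linearly separable. The weights here are hand-picked for illustration rather than learned, and the ReLU activation is an assumption.

```python
def relu(z: float) -> float:
    """Pass positive values through unchanged; clamp negatives to zero."""
    return max(0.0, z)


def identity(z: float) -> float:
    return z


def dense_layer(inputs, weights, biases, activation):
    """One layer: every neuron takes a weighted sum of all the inputs."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def xor_network(x1: float, x2: float) -> float:
    # Hidden layer: two neurons that each compute a simple feature.
    hidden = dense_layer(
        [x1, x2], weights=[[1.0, 1.0], [1.0, 1.0]], biases=[0.0, -1.0], activation=relu
    )
    # Output layer: combines the two hidden features into the final answer.
    output = dense_layer(hidden, weights=[[1.0, -2.0]], biases=[0.0], activation=identity)
    return output[0]


for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, xor_network(a, b))  # last column: 0.0, 1.0, 1.0, 0.0
```

The first hidden neuron counts how many inputs are on, the second fires only when both are on, and the output layer subtracts one from the other. Each piece is simple; the composition is not.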
3. Training: Learning from Data
Training is the process of finding good values for all the parameters (weights and biases). It works by:
- Forward pass: Feed a training example through the network. Get an output.
- Loss calculation: Compare the output to the correct answer. Compute a loss number (how wrong was it?). For text: how surprised was the model by the correct next word?
- Backpropagation: Calculate how much each parameter contributed to the error. This uses the chain rule of calculus to compute gradients, which give the direction in which each parameter would increase the loss; moving the opposite way reduces it.
- Update: Adjust each parameter slightly in the direction that reduces the loss (gradient descent). Learning rate controls how big each step is.
- Repeat: Millions of times across the entire dataset. One pass through the dataset = one epoch. GPT-3 saw ~300 billion tokens during training.
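The loop above can be sketched on a deliberately tiny problem: one sigmoid neuron learning the logical OR of two inputs. The dataset, learning rate, and epoch count are assumptions chosen for illustration, and the gradient is written out by hand for this one neuron; real frameworks compute it automatically for every layer.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset (hypothetical): inputs and the correct answer for logical OR.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias = [0.0, 0.0], 0.0
learning_rate = 0.5


def predict(inputs):
    """Forward pass: weighted sum plus bias, through the activation."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)


def mean_loss():
    """Cross-entropy: how surprised the model is by the correct answers."""
    total = 0.0
    for inputs, target in data:
        p = predict(inputs)
        total += -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    return total / len(data)


initial_loss = mean_loss()
for epoch in range(1000):  # one epoch = one pass through the dataset
    for inputs, target in data:
        prediction = predict(inputs)  # forward pass
        # Backpropagation for this neuron: with sigmoid plus cross-entropy,
        # the chain rule gives d(loss)/d(weighted sum) = prediction - target.
        error = prediction - target
        # Update: step each parameter against its gradient.
        weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
        bias -= learning_rate * error
final_loss = mean_loss()

print(initial_loss, final_loss)
print([round(predict(inputs)) for inputs, _ in data])  # [0, 1, 1, 1]
```

The initial loss is ln 2 (about 0.693), the surprise of a model that assigns 50% to everything; training drives it down as the parameters settle on values that separate the one negative example from the three positive ones.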
4. Transformers & Attention
The Transformer architecture (Vaswani et al., 2017, "Attention Is All You Need") underpins most prominent modern AI systems: GPT and BERT are Transformers, and Stable Diffusion and AlphaFold 2 both rely on attention layers.
The key innovation is self-attention: each element in the input (each word, each image patch) computes how much it should "attend to" every other element. This allows the model to capture long-range dependencies — a word at position 500 can directly reference a word at position 1.
- Query, Key, Value: Each token produces three vectors (Q, K, V). Attention score between tokens i and j = Q_i · K_j (dot product), scaled by 1/√d_k, where d_k is the length of the key vectors. High score = i pays attention to j. A softmax turns each token's scores into weights that sum to 1, and the output for token i is the sum of all V vectors weighted by them.
- Multi-head attention: Multiple independent attention heads (8–128) run in parallel, each free to specialise in a different aspect (for example syntax, semantics, or position). Their outputs are concatenated and passed through a final linear projection.
- Why it works: Unlike older RNNs (which process words sequentially), Transformers process all positions in parallel (fast on GPUs) and attention directly connects any pair of positions (no information bottleneck).
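As an illustration of the Query, Key, Value mechanism, here is a single attention head in plain Python. It uses the scaled dot-product form from the paper cited above, which divides each score by the square root of the key dimension and normalises the scores with a softmax. The Q, K, and V vectors are hand-written assumptions; in a real Transformer they come from learned projections of the token embeddings.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, one sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of all value vectors.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Two tokens with hypothetical 2-dimensional Q, K and V vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = self_attention(queries, keys, values)
print(weights)
print(outputs)
```

Each query here matches its own key best, so each token's output leans toward its own value vector while still mixing in some of the other. Every token scores every other token directly, which is the pairwise connection the section describes.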
5. Large Language Models
An LLM (like GPT-4, Claude, Gemini, Llama) is a Transformer trained to predict the next token (roughly, the next word or word fragment). That is its entire pre-training objective. Everything else (answering questions, writing code, translating languages) emerges from this simple task performed at enormous scale, then refined by fine-tuning.
- Tokenisation: Text is broken into subword tokens (averaging ~3–4 characters each in English). For example, "understanding" might become "under" + "stand" + "ing", though the exact split depends on the tokeniser. GPT-4's vocabulary is ~100,000 tokens.
- Autoregressive generation: The model predicts one token at a time, appends it to the input, and predicts the next. This is why text appears token by token. The temperature parameter controls randomness: low temperature = more deterministic, high temperature = more varied (and more error-prone).
- RLHF (Reinforcement Learning from Human Feedback): After pre-training, the model is fine-tuned using human preference data. Humans compare multiple outputs for the same prompt, a separate reward model is trained to predict those preferences, and the LLM is then optimised with reinforcement learning to produce outputs the reward model scores highly. This aligns the model with human expectations (helpfulness, harmlessness).
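The generation loop and the effect of temperature can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: the four-token vocabulary, the hand-written score table, and the fact that the toy looks only at the last token, whereas a real LLM conditions on the whole sequence.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide scores by the temperature, then normalise to probabilities."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy "model": fixed next-token scores given the last token.
VOCAB = ["the", "cat", "sat", "."]
NEXT_TOKEN_LOGITS = {
    "the": [0.0, 3.0, 1.0, 0.0],
    "cat": [0.0, 0.0, 3.0, 1.0],
    "sat": [1.0, 0.0, 0.0, 3.0],
    ".": [3.0, 0.0, 0.0, 0.0],
}


def generate(prompt, max_new_tokens, temperature, rng):
    """Predict one token, append it to the sequence, and repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = NEXT_TOKEN_LOGITS[tokens[-1]]
        probabilities = softmax_with_temperature(logits, temperature)
        next_token = rng.choices(VOCAB, weights=probabilities, k=1)[0]
        tokens.append(next_token)
    return tokens


print(softmax_with_temperature([0.0, 3.0, 1.0, 0.0], temperature=0.1))
print(softmax_with_temperature([0.0, 3.0, 1.0, 0.0], temperature=10.0))
print(generate(["the"], 3, temperature=0.1, rng=random.Random(0)))
print(generate(["the"], 3, temperature=10.0, rng=random.Random(0)))
```

At temperature 0.1 nearly all the probability lands on the top-scoring token, so the output follows the most likely path. At temperature 10.0 the distribution is almost flat and the output is close to a random shuffle of the vocabulary.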
6. Image Generation (Diffusion)
Diffusion models (Stable Diffusion, DALL-E 3, Midjourney) work by learning to reverse noise:
- Forward process: Take a real image and gradually add Gaussian noise over many steps until it becomes pure static.
- Training: Train a neural network to undo the noise at each step. Given a noisy image and the noise level, the network estimates the noise that was added; subtracting that estimate yields a cleaner image.
- Generation: Start with pure random noise. Apply the denoising network step by step. Each step removes a little noise, gradually revealing a coherent image. The text prompt guides the denoising direction (via cross-attention with a text encoder like CLIP).
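The forward process can be sketched numerically. The snippet below assumes one common formulation (the DDPM-style closed form with a linear noise schedule); the schedule constants, the four-value "image", and the function names are illustrative assumptions, and production systems differ in their details. It also shows why estimating the noise is enough to denoise: given the exact noise, the original values can be recovered.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal remaining after each step."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Jump straight to a noise level: sqrt(a) * x0 + sqrt(1 - a) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(pixels, noise)
    ]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise, given an estimate of the noise that was added."""
    return [
        (y - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for y, n in zip(noisy, predicted_noise)
    ]


alpha_bars = make_alpha_bars(1000)
rng = random.Random(0)
image = [1.0, -1.0, 0.5, 0.0]  # hypothetical 4-pixel "image"
for step in (0, 499, 999):
    noisy, noise = add_noise(image, alpha_bars[step], rng)
    # A trained network would predict `noise`; here we cheat and use the truth.
    recovered = remove_noise(noisy, noise, alpha_bars[step])
    print(step, alpha_bars[step], [round(value, 6) for value in recovered])
```

The signal fraction starts just below 1 and falls to nearly 0 by the last step, at which point the noisy values are almost pure static. The trained network's job is to replace the cheat in the loop with its own noise estimate.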
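The model does not store or look up training images. It learns the statistical distribution of images and generates new samples from that distribution, so a generated image is normally novel, composed from learned patterns (textures, shapes, compositions). The exception is memorisation: images that appear many times in the training data can occasionally be reproduced almost exactly.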
7. Limitations & Misunderstandings
- Hallucinations: LLMs generate plausible-sounding text that may be factually wrong. They optimise for "sounds right" not "is right." They have no internal fact-checking mechanism.
- No real-world model: LLMs don't have a physical model of the world. They learn correlations in text, which often approximate real knowledge — but can fail in surprising ways on novel scenarios outside their training distribution.
- Training data dependency: The model can only generalise from data it's seen. Biases in training data become biases in outputs. Knowledge has a cutoff date.
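- Energy cost: OpenAI has not published figures, but unofficial estimates put GPT-4's training at around 50 GWh, enough to power ~5,000 UK homes for a year (assuming about 10,000 kWh per home per year). Inference (running the model) is far cheaper per use but still significant at scale; rough estimates are on the order of ~0.01 kWh per conversation.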
- Not sentient: Despite often-convincing conversation, LLMs are mathematical functions — matrices of numbers processed through equations. They have no consciousness, emotions, desires, or self-awareness. They simulate these convincingly because that's what the training data contains.