How AI Generates Images — Diffusion Models Explained
Type "an astronaut riding a horse, oil painting, golden hour" and receive a convincing image in seconds. Stable Diffusion, DALL-E, and Midjourney all use the same underlying idea: learn to reverse a process that gradually destroys an image with noise.
1. The Big Picture
The key insight from Ho et al. (DDPM, 2020): instead of training a network to generate images in one shot, train it to remove a small amount of Gaussian noise from a slightly-noisy image. Repeat this ~1000 times, starting from pure noise.
This turns a hard problem (generate realistic images) into a curriculum of easy problems (remove a little noise). Surprisingly, the resulting models can match or exceed GANs in sharpness and diversity, without the instability of adversarial training.
[Figure: clean image → slight noise → half noise → mostly noise → pure noise]
Generation runs these steps in reverse: start from pure Gaussian noise and iteratively denoise into a coherent image.
2. Forward Diffusion — Adding Noise
At each timestep t, a small amount of Gaussian noise is added according to a variance schedule β_t (typically 0.0001 → 0.02 over T=1000 steps):
q(x_t | x_{t-1}) = 𝒩(x_t; √(1−β_t) · x_{t-1}, β_t · I)
Using the reparametrisation trick, we can sample directly at any step:
x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε , where ε ~ 𝒩(0, I)
ᾱ_t = ∏_{s=1}^{t} (1 − β_s)
This closed-form expression is critical: during training we can jump directly to any noise level without running all t steps.
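The closed-form jump can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular library's implementation: the linear schedule shape, the function names `linear_beta_schedule` and `q_sample`, and the toy 3×8×8 "image" are all assumptions made for the example.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (array index 0 holds beta_1)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from x_0 in one shot; t is a 1-based timestep."""
    eps = rng.standard_normal(x0.shape)      # eps ~ N(0, I)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)          # alpha_bar[t-1] = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy stand-in for a normalised image
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)   # partly noised
x_end, _ = q_sample(x0, 1000, alpha_bar, rng)  # almost pure noise
```

Because ᾱ_t shrinks towards zero as t grows, the x_0 term fades and the noise term dominates, which is why x_T is treated as pure Gaussian noise.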
3. Reverse Diffusion — Denoising
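The true reverse process p(x_{t-1} | x_t), going from noisy to clean, depends on the whole distribution of images that could have produced x_t, which makes it intractable. Instead, we train a neural network ε_θ to approximate it. The network predicts the noise, and that prediction gives the mean of the learned reverse step: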
μ_θ(x_t, t) = (1/√α_t) · [x_t − β_t/√(1−ᾱ_t) · ε_θ(x_t, t)] , where α_t = 1 − β_t
The network ε_θ takes the noisy image x_t and timestep t as input and predicts the noise ε that was mixed into x_0 to produce x_t. Once trained, we iteratively apply the reverse formula starting from pure noise, adding a small amount of fresh Gaussian noise after every step except the last.
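A rough sketch of the sampling loop built on the mean formula above. Several things here are assumptions for illustration: the step noise is set to σ_t² = β_t (one common choice; the text does not specify it), `dummy_eps_model` is a hypothetical stand-in for a trained network, and the schedule is shortened to 50 steps so the demo runs quickly.

```python
import numpy as np

def p_sample_step(x_t, t, eps_pred, betas, alpha_bar, rng):
    """One reverse step x_t -> x_{t-1}; t is a 1-based timestep."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    # mu_theta = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_pred) / np.sqrt(alpha_t)
    if t == 1:
        return mean                           # no noise is added on the final step
    sigma_t = np.sqrt(beta_t)                 # assumed choice: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def sample(eps_model, shape, betas, rng):
    """Start from pure noise and apply the reverse step for t = T, ..., 1."""
    alpha_bar = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):
        x = p_sample_step(x, t, eps_model(x, t), betas, alpha_bar, rng)
    return x

def dummy_eps_model(x_t, t):
    """Hypothetical placeholder for a trained denoiser: always predicts zero noise."""
    return np.zeros_like(x_t)

betas = np.linspace(1e-4, 0.02, 50)           # shortened schedule for the demo
out = sample(dummy_eps_model, (3, 8, 8), betas, np.random.default_rng(0))
```

With the placeholder model the output is still noise; the point is only the control flow, which stays the same when `eps_model` is a real trained network.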
4. Training the Denoiser
Training is surprisingly simple. For each image in the dataset:
- Sample a random timestep t ~ Uniform(1, T).
- Sample random noise ε ~ 𝒩(0, I).
- Compute the noisy image: x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε.
- Run the network: ε̂ = ε_θ(x_t, t).
- Minimise: ‖ε − ε̂‖² (predict the noise that was added).
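The steps above can be sketched as a single loss evaluation. This is an illustrative NumPy version only: it computes the loss but not its gradient (a real setup would use an autodiff framework), it averages the squared error rather than summing it, and `zero_model` and `oracle_model` are hypothetical stand-ins used to show the two extremes.

```python
import numpy as np

def ddpm_loss(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo estimate of the noise-prediction loss for a single image."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))            # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)        # eps ~ N(0, I)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    eps_hat = eps_model(x_t, t)                # network's noise prediction
    return float(np.mean((eps - eps_hat) ** 2))  # mean squared error (scaled norm)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))    # toy stand-in for a training image

def zero_model(x_t, t):
    """Hypothetical untrained model: always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Hypothetical perfect model: cheats by using the known x0 to recover eps."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * x0) / np.sqrt(1.0 - a)

loss_zero = ddpm_loss(zero_model, x0, alpha_bar, rng)      # close to 1
loss_oracle = ddpm_loss(oracle_model, x0, alpha_bar, rng)  # close to 0
```

Training amounts to pushing a real network from the first regime towards the second, averaged over many images, timesteps, and noise draws.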
5. The U-Net Architecture
The denoising network is almost always a U-Net: a convolutional architecture with a contracting encoder path, a bottleneck, and a symmetric expanding decoder path. Skip connections between matching encoder and decoder levels preserve fine spatial detail.
Key additions for diffusion:
- Timestep embedding: t is embedded with a sinusoidal encoding (like Transformer positional encodings) and added to each residual block.
- Self-attention layers: Inserted at low spatial resolution to capture global scene coherence.
- Cross-attention: Used for conditioning (text, class labels) — see section 7.
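As an illustration of the timestep embedding, here is a minimal sinusoidal encoding in NumPy. The function name, the sin-then-cos layout, and the `max_period` value of 10000 are assumptions borrowed from the usual Transformer convention; real models typically pass the result through a small learned projection before adding it to the residual blocks.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of timesteps; returns shape (len(t), dim). dim must be even."""
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding([1, 500, 1000], dim=128)   # one 128-dim vector per timestep
```

Each timestep gets a distinct, smoothly varying vector, so the network can tell how much noise to expect in its input.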
Modern models (Stable Diffusion 3, DiT) replace the U-Net with pure Transformer architectures operating on patch sequences.
6. Latent Diffusion — Working in Compressed Space
Pixel-space diffusion is extremely expensive: a 512×512 RGB image has 262 144 pixels (786 432 values across three channels), and the network must process all of them at each of ~1000 denoising steps. The solution (Rombach et al. 2022, the "Stable Diffusion" paper) is to work in latent space:
- Train an autoencoder (VAE) to compress 512×512 images to 64×64 latent tensors (8× compression per side).
- Run the entire diffusion process on the 64×64 latent, which has 64× fewer spatial positions than the 512×512 image.
- At inference, decode the final latent back to pixels with the VAE decoder.
This shrinks the denoiser's input by 64× spatially and cuts compute by a broadly similar factor with little quality loss, which is what makes generation at resolutions up to 1024×1024 practical on consumer GPUs.
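The compression arithmetic can be checked directly. The latent channel count of 4 is an assumption (it is not stated above); with it, the reduction in raw values is somewhat smaller than the reduction in spatial positions.

```python
pixel_h = pixel_w = 512
latent_h = latent_w = 64
pixel_channels = 3      # RGB
latent_channels = 4     # assumption: not given in the text above

# Ratio of spatial positions the denoiser has to process
spatial_ratio = (pixel_h * pixel_w) / (latent_h * latent_w)

# Ratio of total scalar values, channels included
value_ratio = (pixel_h * pixel_w * pixel_channels) / (latent_h * latent_w * latent_channels)

print(spatial_ratio, value_ratio)
```

Actual compute savings depend on the architecture (attention cost, for example, grows faster than linearly with the number of positions), so these ratios are a rough guide rather than a measured speed-up.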
7. Text Conditioning with CLIP
To control what image is generated, the denoising U-Net also receives a text embedding via cross-attention. The text encoder is typically the text tower of a CLIP model (Contrastive Language–Image Pre-training).
CLIP is trained on (image, caption) pairs using a contrastive loss — matching images to their correct descriptions from a batch of negatives. The resulting text embeddings encode rich semantic content that the diffusion model can steer toward.
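In the U-Net's cross-attention layers, the queries Q are projected from the latent features, while the keys K and values V are both projected from the text embeddings. Each spatial location "attends" to the most relevant text tokens, injecting semantic guidance at every denoising step.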
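A single-head version of this cross-attention can be sketched as follows. The dimensions, the random projection matrices, and the function names are all illustrative assumptions; real models use multiple heads and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, text, W_q, W_k, W_v):
    """latent: (N, d_latent) flattened spatial positions; text: (M, d_text) token embeddings."""
    Q = latent @ W_q                           # queries come from the image latent
    K = text @ W_k                             # keys come from the text tokens
    V = text @ W_v                             # values come from the text tokens
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N, M): one row per location
    return weights @ V, weights

rng = np.random.default_rng(0)
N, M, d_latent, d_text, d_attn = 16, 5, 32, 24, 8       # toy sizes
latent = rng.standard_normal((N, d_latent))
text = rng.standard_normal((M, d_text))
W_q = rng.standard_normal((d_latent, d_attn))
W_k = rng.standard_normal((d_text, d_attn))
W_v = rng.standard_normal((d_text, d_attn))
out, weights = cross_attention(latent, text, W_q, W_k, W_v)
```

Each row of `weights` is a probability distribution over the text tokens for one spatial location, which is the sense in which a location "attends" to parts of the prompt.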
8. Classifier-Free Guidance
Even with text conditioning, early models produced images that only loosely matched the prompt. Classifier-free guidance (CFG) amplifies the text influence at inference time:
ε̃ = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))
c: text prompt embedding
∅: empty/null prompt (unconditional)
w: guidance scale (typically 7–15)
The denoiser is run twice per step: once with the text prompt, once without. The difference is amplified by scale w and added back. Higher w means stronger prompt adherence but less diversity and potential artefacts.
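The combination step is small enough to show in full. This sketch assumes the common form ε̃ = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅)); the function name `cfg_noise` and the random arrays standing in for the two network outputs are hypothetical.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, w):
    """Push the noise estimate away from the unconditional one, towards the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 8, 8))   # stand-in for the denoiser output with the null prompt
eps_cond = rng.standard_normal((4, 8, 8))     # stand-in for the denoiser output with the text prompt

eps_guided = cfg_noise(eps_uncond, eps_cond, w=7.5)
```

With w = 1 this reduces to the ordinary conditional prediction and with w = 0 to the unconditional one; values above 1 extrapolate beyond the conditional prediction.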
9. Sampling Schedulers
The original DDPM (Ho et al.) requires ~1000 denoising steps. Subsequent advances in sampling schedulers dramatically reduced this:
- DDIM (Song et al., 2020) — non-Markovian process allowing ~50 steps with similar quality. Also enables interpolation in latent space (deterministic).
- DPM-Solver++ / DPM-Solver-2M: treats sampling as solving an ODE and applies dedicated higher-order solver steps (in the spirit of Runge-Kutta and multistep methods). Roughly 20 steps can approach the quality of 1000-step DDPM.
- PLMS / Heun — predictor-corrector methods adapted from ODE solvers.
- Flow Matching (Lipman et al., 2022): strictly a training formulation rather than a sampler. It replaces the Gaussian noise schedule with straight-line "flows" between noise and data. Used in Stable Diffusion 3 and Flux, where it is reported to give faster convergence and better quality.
The field continues to improve — 4-8 steps is now achievable with techniques like Consistency Models and Adversarial Diffusion Distillation (used in Stable Diffusion Turbo).
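To make the step-skipping idea concrete, here is a sketch of the deterministic DDIM update (the η = 0 case) run over 50 of the 1000 training timesteps. The function name, the evenly spaced timestep subset, and the use of the true noise as an "oracle" in place of a trained network are assumptions made so the example is self-checking.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_t, a_prev):
    """Deterministic DDIM update from cumulative signal level a_t to a_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)   # implied clean image
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
timesteps = np.linspace(1000, 1, 50).round().astype(int)   # 50 descending 1-based timesteps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))                # toy stand-in for an image
eps = rng.standard_normal(x0.shape)
x = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps   # x_T

for i, t in enumerate(timesteps):
    a_t = alpha_bar[t - 1]
    # After the last timestep, jump to the clean image (signal level 1)
    a_prev = alpha_bar[timesteps[i + 1] - 1] if i + 1 < len(timesteps) else 1.0
    x = ddim_step(x, eps, a_t, a_prev)   # oracle: the true eps stands in for the network
```

With a perfect noise prediction the loop lands back on `x0` no matter how many timesteps are skipped; with a real network the prediction is imperfect, so fewer steps trade quality for speed.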