How a CPU Works: Fetch, Decode, Execute
Your CPU runs at 3–5 billion clock cycles per second. In any given cycle it can have dozens of instructions in flight at different stages of completion, predict the future path of your code, reorder operations on the fly, and retrieve data from a hierarchy of caches, all to overcome the fundamental constraint that main memory is hundreds of times slower than arithmetic.
1. Von Neumann Architecture
John von Neumann (1945) described the architecture all mainstream CPUs still follow: a single shared memory holds both program instructions and data. The processor reads instructions from memory sequentially, executes them, and stores results back to memory.
Core components:
- Arithmetic Logic Unit (ALU): Performs integer arithmetic (+, −, ×, ÷), bitwise operations, comparisons. Implemented with transistor-based combinational logic circuits (adders, multiplexers).
- Floating-Point Unit (FPU): Handles IEEE 754 floating-point arithmetic. The FPU in your Core i9 or Ryzen 9 is larger than an entire 1980s CPU.
- Control Unit: Reads instructions, decodes them, and orchestrates the other units.
- Register file: 16–31 general-purpose 64-bit registers (16 on x86-64: RAX, RBX, RCX, ...; 31 on AArch64), plus up to 32 SIMD registers for vectorized operations.
- Program Counter (PC) / Instruction Pointer: Holds the address of the next instruction to execute.
2. The Basic Cycle: Fetch → Decode → Execute
- Fetch: Read the next instruction bytes from memory (via instruction cache) into the instruction register. Increment the PC.
- Decode: Parse the opcode and operands. Determine what operation is needed, which registers are sources/destinations, what addressing mode is used.
- Execute: Send operands to the ALU/FPU. Perform the operation. For memory-referencing instructions, calculate the effective address and read/write data.
- Write-back: Store the result back into the destination register.
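The four steps above can be sketched as a small interpreter loop. This is only an illustration: the instruction names (LOADI, ADD, SUB, JNZ, HALT), the tuple encoding, and the four-register machine are invented for this example and do not correspond to any real instruction set.

```python
def run(program, num_registers=4):
    """Run a toy program. Returns (registers, instructions_executed)."""
    regs = [0] * num_registers
    pc = 0                                # program counter
    executed = 0
    while pc < len(program):
        instruction = program[pc]         # fetch
        pc += 1                           # PC now points at the next instruction
        opcode, *operands = instruction   # decode
        executed += 1
        if opcode == "LOADI":             # LOADI rd, imm
            rd, imm = operands
            result = imm
        elif opcode == "ADD":             # ADD rd, rs1, rs2
            rd, rs1, rs2 = operands
            result = regs[rs1] + regs[rs2]    # execute
        elif opcode == "SUB":             # SUB rd, rs1, rs2
            rd, rs1, rs2 = operands
            result = regs[rs1] - regs[rs2]
        elif opcode == "JNZ":             # JNZ rs, target: jump if rs != 0
            rs, target = operands
            if regs[rs] != 0:
                pc = target               # a branch overwrites the PC
            continue                      # branches write no register
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        regs[rd] = result                 # write-back

    return regs, executed


# Sum 5 + 4 + 3 + 2 + 1 into r0.
program = [
    ("LOADI", 0, 0),    # r0 = 0 (accumulator)
    ("LOADI", 1, 5),    # r1 = 5 (loop counter)
    ("LOADI", 2, 1),    # r2 = 1 (constant)
    ("ADD", 0, 0, 1),   # r0 += r1        <- loop target, index 3
    ("SUB", 1, 1, 2),   # r1 -= 1
    ("JNZ", 1, 3),      # repeat while r1 != 0
    ("HALT",),
]

regs, executed = run(program)
print(regs[0], executed)
```

A real control unit does this in hardware and on binary encodings, but the shape is the same: the PC selects an instruction, the opcode selects an operation, and the result lands in a destination register.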
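An unpipelined processor that spends one clock cycle on each of these four steps completes 1 instruction per 4 clock cycles. At 1 GHz: 250 million instructions/second. Modern CPUs can sustain up to 4–8 instructions per cycle, a 16–32× improvement per cycle (more once higher clock speeds are counted), achieved by pipelining and superscalar execution.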
3. Pipelining
Just as a car assembly line overlaps production stages, pipelining overlaps instruction stages: while instruction N is executing, instruction N+1 is being decoded and instruction N+2 is being fetched.
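In the ideal case, a 5-stage pipeline delivers up to 5× the throughput of an unpipelined design at the same clock, because one instruction can complete every cycle once the pipeline is full. Modern x86 pipelines are 14–19 stages deep (Intel Skylake: 14). Deeper pipelines allow higher clock speeds but increase the cost of pipeline flushes from branch mispredictions.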
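The overlap can be made concrete with a short sketch. It assumes the classic textbook 5-stage split (IF, ID, EX, MEM, WB), which is an assumption of this example rather than something the text specifies, and it ignores hazards entirely, so it shows the ideal case only.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")   # assumed textbook stage names


def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    # Each instruction runs through every stage before the next one starts.
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    # The first instruction needs n_stages cycles to fill the pipeline;
    # after that, one instruction completes per cycle (no hazards assumed).
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def timing_diagram(n_instructions):
    # One row per instruction, one column per clock cycle.
    rows = []
    for i in range(n_instructions):
        cells = ["   "] * i + [f"{stage:<3}" for stage in STAGES]
        rows.append(f"I{i}: " + " ".join(cells).rstrip())
    return "\n".join(rows)


print(timing_diagram(3))
n = 1000
speedup = unpipelined_cycles(n) / pipelined_cycles(n)
print(f"{n} instructions: {unpipelined_cycles(n)} vs {pipelined_cycles(n)} cycles, "
      f"speedup {speedup:.2f}x")
```

For 1000 instructions the speedup is just under 5×: the fill time of the pipeline is the only overhead in this idealized model, so the ratio approaches the stage count as the instruction stream gets longer.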
Hazards that stall the pipeline:
- Data hazard: An instruction needs a result that an earlier instruction has not yet produced. Resolved by forwarding (bypass path from EX output back to EX input) or stalling.
- Control hazard: Branch instruction — we don't know which instruction to fetch next. Resolved by branch prediction.
- Structural hazard: Two instructions need the same hardware unit. Resolved by duplication (multiple ALUs in superscalar CPUs).
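To illustrate the data-hazard case, here is a rough stall counter for a textbook 5-stage in-order pipeline. The timing rules are assumptions taken from the classic teaching model, not from any specific CPU: without forwarding a consumer must decode at least 3 cycles after its producer, with forwarding an ALU result is usable by the very next instruction, and a loaded value costs one extra cycle. The instruction tuples and register names are made up for the example.

```python
def count_stalls(instructions, forwarding):
    """Count stall cycles caused by read-after-write dependencies.

    Each instruction is (dest_register, source_registers, is_load).
    """
    ready = {}          # register -> earliest decode cycle for a consumer
    decode_cycle = 0
    stalls = 0
    for dest, sources, is_load in instructions:
        earliest = decode_cycle + 1       # one instruction decodes per cycle
        needed = max([ready.get(src, 0) for src in sources], default=0)
        actual = max(earliest, needed)
        stalls += actual - earliest       # bubbles inserted before decode
        decode_cycle = actual
        if forwarding:
            delay = 2 if is_load else 1   # load data arrives one stage later
        else:
            delay = 3                     # wait for write-back to the register file
        if dest is not None:
            ready[dest] = decode_cycle + delay
    return stalls


program = [
    ("r1", ("r2",), True),          # load r1, [r2]
    ("r3", ("r1", "r4"), False),    # add  r3, r1, r4   (needs the loaded r1)
    ("r5", ("r3", "r6"), False),    # sub  r5, r3, r6   (needs the add result)
]

print(count_stalls(program, forwarding=False))   # 4 stall cycles
print(count_stalls(program, forwarding=True))    # 1 stall cycle (load-use)
```

Under these assumptions forwarding removes every stall except the one after the load, which is why compilers try to place an independent instruction between a load and its first use.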
4. Branch Prediction
~20% of instructions are branches. Without prediction, every branch would stall the pipeline until it resolves, which costs 14+ cycles on a deep pipeline. Modern CPUs use sophisticated predictors that typically exceed 95% accuracy and can pass 99% on regular, loop-heavy code:
- Static prediction: Assume backward branches are taken (loops), forward branches not taken. ~65% accuracy.
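- Dynamic (two-bit saturating counter): Each branch has a 2-bit state (strongly not taken / weakly not taken / weakly taken / strongly taken), updated on actual outcomes. Handles loops well: the single not-taken outcome at loop exit does not flip the prediction, so the loop branch is mispredicted only once per pass through the loop.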
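- TAGE-style predictors (documented for AMD Zen 2 and later; Intel does not publish its design but is widely believed to use something similar): Tagged tables indexed with branch histories of geometrically increasing lengths. Achieves >97% accuracy on SPEC benchmarks.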
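The two-bit counter is simple enough to simulate directly. The sketch below tracks a single branch and compares it with a one-bit "same as last time" predictor, which is added here purely for contrast. The loop shape (9 taken outcomes followed by one not-taken exit, repeated 100 times) and the initial states are arbitrary choices for the example.

```python
def two_bit_correct(outcomes, state=2):
    """Count correct predictions of a 2-bit saturating counter.

    States: 0 strongly not taken, 1 weakly not taken,
            2 weakly taken,       3 strongly taken.
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


def one_bit_correct(outcomes, last_taken=True):
    """Count correct predictions when always predicting the previous outcome."""
    correct = 0
    for taken in outcomes:
        if last_taken == taken:
            correct += 1
        last_taken = taken
    return correct


# A loop branch: taken 9 times, then falls through once; the loop runs 100 times.
outcomes = ([True] * 9 + [False]) * 100

print(two_bit_correct(outcomes) / len(outcomes))   # 0.9
print(one_bit_correct(outcomes) / len(outcomes))   # 0.801
```

The two-bit counter misses only the loop exit, while the one-bit scheme also misses the first iteration of every re-entry, roughly doubling the mispredictions for this pattern.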
When a misprediction occurs, the CPU must flush the pipeline: discard all in-flight instructions after the branch and refetch from the correct path. Penalty: 14–20 cycles. In tight loops with unpredictable conditions, a misprediction on every iteration severely limits IPC (instructions per cycle).
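A back-of-the-envelope model shows how quickly flush penalties erode throughput. It assumes each misprediction simply adds its full penalty to the cycle count, which is a simplification (real cores overlap some of that time), and the base IPC of 4 and the 17-cycle penalty are illustrative values picked from the ranges quoted in this article.

```python
def effective_ipc(base_ipc, branch_fraction, accuracy, penalty_cycles):
    """Estimate IPC after adding branch misprediction penalties.

    Simplified model: CPI = base CPI + (branches per instruction)
                            * (misprediction rate) * (flush penalty).
    """
    base_cpi = 1 / base_ipc
    cpi = base_cpi + branch_fraction * (1 - accuracy) * penalty_cycles
    return 1 / cpi


for accuracy in (1.0, 0.99, 0.97, 0.90, 0.50):
    ipc = effective_ipc(base_ipc=4, branch_fraction=0.20,
                        accuracy=accuracy, penalty_cycles=17)
    print(f"accuracy {accuracy:.0%}: about {ipc:.2f} instructions per cycle")
```

In this model, dropping from perfect prediction to 97% accuracy already costs more than a quarter of the throughput, and a coin-flip branch (50%) pulls a 4-wide core down to roughly half an instruction per cycle.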
5. Out-of-Order Execution
In-order execution stalls whenever an instruction is waiting for data (e.g., a load that misses every cache level can take ~200 cycles). Out-of-order execution (OoO) lets the CPU execute later, independent instructions that are ready while others wait.
Implementation uses a Reorder Buffer (ROB), a circular buffer holding in-flight instructions in program order. Instructions enter the ROB in order, may execute out of order as soon as their inputs are ready, and commit results to architectural state in order (preserving precise exception semantics).
A modern core (Intel Core Ultra, AMD Zen 5) can track 300–500+ in-flight instructions in the ROB simultaneously. This "window" of instructions allows the hardware to find independent operations and execute them in parallel on multiple execution units.
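The effect of the instruction window can be sketched with a deliberately simplified timing model. Its assumptions are not from the text: one instruction enters the ROB per cycle, execution units are unlimited, commit is in order with no width limit, and a ROB entry is reused in the cycle its instruction commits. The 200-cycle load latency matches the figure used above; the program itself is invented.

```python
def finish_cycle(instructions, rob_size, in_order):
    """Return the cycle in which the last instruction commits.

    Each instruction is (latency_cycles, [indices of instructions it depends on]).
    """
    done = []       # cycle in which each instruction's result is ready
    commit = []     # cycle in which each instruction commits (program order)
    prev_enter = prev_issue = -1
    for i, (latency, deps) in enumerate(instructions):
        enter = prev_enter + 1                        # one new ROB entry per cycle
        if i >= rob_size:
            enter = max(enter, commit[i - rob_size])  # wait for a free entry
        issue = max([enter] + [done[d] for d in deps])   # wait for inputs
        if in_order:
            issue = max(issue, prev_issue + 1)        # cannot pass older instructions
        done.append(issue + latency)
        commit.append(max(done[i], commit[i - 1]) if i else done[i])
        prev_enter, prev_issue = enter, issue
    return commit[-1]


LOAD = 200   # cycles for a load that misses every cache level
program = (
    [(LOAD, []), (1, [0])]      # 0: load A, 1: use A
    + [(1, [])] * 6             # 2-7: independent arithmetic
    + [(LOAD, []), (1, [8])]    # 8: load B, 9: use B
)

print(finish_cycle(program, rob_size=16, in_order=True))    # 408
print(finish_cycle(program, rob_size=16, in_order=False))   # 209
print(finish_cycle(program, rob_size=4, in_order=False))    # 405
```

With a 16-entry window the second load starts while the first is still outstanding, so the two misses overlap and the run takes about half as long. With only 4 entries the window fills up behind the first load and the second load cannot even enter, which is why larger ROBs matter.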
6. Cache Hierarchy
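Memory latency: main DRAM takes ~60–100 ns (~200 cycles or more at 3–5 GHz). Without caches, the CPU would spend the vast majority of its time (on the order of 98%) waiting. The cache hierarchy stores recently used data close to the CPU in progressively larger and slower levels (L1, L2, L3).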
Cache miss penalty: L1 miss → L2: ~10 cycles extra. L2 → L3: ~30 cycles. L3 → DRAM: ~170 cycles. Writing cache-friendly code (linear access patterns, small working sets, avoiding false sharing in multi-threaded code) is often the single largest performance optimization available.
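The penalties above combine into an average memory access time. In the sketch below the per-level penalties are the figures just quoted, while the 4-cycle L1 hit time and all three miss rates are assumed values chosen only to make the arithmetic concrete; real miss rates depend entirely on the workload.

```python
def average_access_cycles(l1_hit, l1_miss_rate, l2_miss_rate, l3_miss_rate,
                          l2_penalty=10, l3_penalty=30, dram_penalty=170):
    """Average cycles per memory access for a three-level cache hierarchy.

    Miss rates are local: the fraction of accesses reaching a level that miss in it.
    """
    beyond_l2 = l3_penalty + l3_miss_rate * dram_penalty   # extra cost after an L2 miss
    beyond_l1 = l2_penalty + l2_miss_rate * beyond_l2      # extra cost after an L1 miss
    return l1_hit + l1_miss_rate * beyond_l1


# Assumed: 4-cycle L1 hit, 5% L1 misses, 40% of those miss L2, 50% of those miss L3.
print(average_access_cycles(4, 0.05, 0.40, 0.50))   # about 6.8 cycles

# Every access goes to DRAM: 4 + 10 + 30 + 170 cycles.
print(average_access_cycles(4, 1.0, 1.0, 1.0))      # 214.0
```

Even with these fairly pessimistic L2 and L3 miss rates, the average stays within a few cycles of the L1 hit time, which is the whole point of the hierarchy. Raising the L1 miss rate is what moves the average most.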
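Cache lines are 64 bytes on x86-64 and most other current CPUs. When any byte is accessed, the entire 64-byte line is loaded. Processing a 1 GB array sequentially uses every byte of each loaded line and produces a predictable access pattern that hardware prefetchers handle well. Random access over a working set much larger than the cache defeats both effects, so most accesses pay the full DRAM latency.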
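A small simulation makes the difference between the two access patterns visible. It models a fully associative cache with LRU replacement and no prefetcher, which is simpler than real hardware, and the sizes (a 256 KiB array of 4-byte elements, a 32 KiB cache) are arbitrary choices for the example.

```python
import random
from collections import OrderedDict

LINE_BYTES = 64


def miss_rate(addresses, cache_lines):
    """Fraction of accesses that miss in a fully associative LRU cache."""
    cache = OrderedDict()            # line number -> True, oldest entry first
    misses = 0
    for address in addresses:
        line = address // LINE_BYTES     # all 64 bytes share one cache line
        if line in cache:
            cache.move_to_end(line)      # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)    # evict the least recently used line
    return misses / len(addresses)


ELEMENT_BYTES = 4
N_ELEMENTS = 64 * 1024               # 256 KiB array
CACHE_LINES = 512                    # 32 KiB cache

sequential = [i * ELEMENT_BYTES for i in range(N_ELEMENTS)]
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)   # same addresses, random order

print(miss_rate(sequential, CACHE_LINES))   # 0.0625: one miss per 16 elements
print(miss_rate(shuffled, CACHE_LINES))     # most accesses miss
```

Both runs touch exactly the same bytes. The sequential pass misses once per line and then gets 15 hits for free, while the shuffled pass keeps evicting lines before their remaining elements are used.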
7. Modern CPUs: Cores, Threads & More
- Multiple cores: Modern desktop CPUs have 8–32 physical cores, each with its own pipeline, registers, and L1/L2 cache; typically only the L3 cache and memory controller are shared.
- Hyper-Threading (Intel's name for simultaneous multithreading, SMT): Two hardware threads share one physical core. The core issues instructions from both threads, filling pipeline bubbles left by one thread with work from the other. Typical benefit: 20–30% throughput increase on server workloads.
- SIMD: Single Instruction, Multiple Data. One AVX-512 instruction operates on 16 float32 values at once. With two 512-bit FMA units per core, a single core can reach 2 units × 16 lanes × 2 FLOPs (FMA) × 3.5 GHz = 224 GFLOPS peak.
- Power management: CPUs dynamically scale voltage and frequency (DVFS). Idle cores power-gate completely. Turbo Boost raises the frequency of active cores above the base clock when power and thermal headroom allow.
- Process nodes: TSMC's 3 nm family is used in Apple M3 (N3B) and AMD Zen 5c (N3E); standard Zen 5 cores use TSMC 4 nm (N4P). Intel 18A (1.8 nm generation) entering production 2025. Transistor count: Apple M3 Ultra has 184 billion transistors.
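The peak-throughput arithmetic in the SIMD item can be written as a small helper. The function and parameter names are made up for this sketch, and the result is a theoretical ceiling: it assumes every FMA unit issues a full-width instruction on every cycle, which real code rarely sustains.

```python
def peak_gflops(fma_units, simd_bits, element_bits, ghz, flops_per_fma=2):
    """Theoretical peak GFLOPS for one core.

    A fused multiply-add counts as 2 floating-point operations per lane.
    """
    lanes = simd_bits // element_bits          # values per SIMD register
    return fma_units * lanes * flops_per_fma * ghz


# 2 FMA units, 512-bit vectors of float32, 3.5 GHz.
print(peak_gflops(fma_units=2, simd_bits=512, element_bits=32, ghz=3.5))   # 224.0

# Same core on float64: half as many lanes per register.
print(peak_gflops(fma_units=2, simd_bits=512, element_bits=64, ghz=3.5))   # 112.0
```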