💻 Computer Architecture
📅 March 2026 ⏱ 10 min read 🟡 Intermediate

How a CPU Works: Fetch, Decode, Execute

Your CPU runs at 3–5 billion cycles per second. In each cycle, it can simultaneously process dozens of instructions at different stages of completion, predict the future path of your code, reorder operations on the fly, and retrieve data from a hierarchy of caches, all to overcome one fundamental constraint: main memory is hundreds of times slower than arithmetic.

1. Von Neumann Architecture

John von Neumann (1945) described the architecture all mainstream CPUs still follow: a single shared memory holds both program instructions and data. The processor reads instructions from memory sequentially, executes them, and stores results back to memory.

Core components:

2. The Basic Cycle: Fetch → Decode → Execute

  1. Fetch: Read the next instruction bytes from memory (via instruction cache) into the instruction register. Increment the PC.
  2. Decode: Parse the opcode and operands. Determine what operation is needed, which registers are sources/destinations, what addressing mode is used.
  3. Execute: Send operands to the ALU/FPU. Perform the operation. For memory-referencing instructions, calculate the effective address and read/write data.
  4. Write-back: Store the result back into the destination register.
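The four steps above can be sketched as a toy interpreter. This is a minimal illustration, not a model of any real instruction set: the opcodes (LOAD, ADD, STORE, HALT), the tuple instruction format, and the four-register machine are all invented for the example. Instructions and data share one memory list, as in the von Neumann design.

```python
class ToyCPU:
    """Hypothetical 4-register machine; opcodes and encoding are made up."""

    def __init__(self, memory):
        self.memory = memory      # one shared memory: instructions and data
        self.regs = [0, 0, 0, 0]
        self.pc = 0               # program counter
        self.halted = False

    def step(self):
        # 1. Fetch: read the instruction at PC, then increment PC.
        instr = self.memory[self.pc]
        self.pc += 1

        # 2. Decode: split the instruction into opcode and operands.
        opcode, *operands = instr

        # 3. Execute: perform the operation or the memory access.
        dest, result = None, None
        if opcode == "LOAD":      # LOAD rd, addr
            dest, addr = operands
            result = self.memory[addr]
        elif opcode == "ADD":     # ADD rd, rs1, rs2
            dest, rs1, rs2 = operands
            result = self.regs[rs1] + self.regs[rs2]
        elif opcode == "STORE":   # STORE rs, addr
            src, addr = operands
            self.memory[addr] = self.regs[src]
        elif opcode == "HALT":
            self.halted = True
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

        # 4. Write-back: store the result in the destination register.
        if dest is not None:
            self.regs[dest] = result

    def run(self):
        while not self.halted:
            self.step()
        return self.regs


# Addresses 0-4 hold instructions, addresses 5-7 hold data.
memory = [
    ("LOAD", 0, 5),
    ("LOAD", 1, 6),
    ("ADD", 2, 0, 1),
    ("STORE", 2, 7),
    ("HALT",),
    2, 40, 0,
]
cpu = ToyCPU(memory)
print(cpu.run())     # [2, 40, 42, 0]
print(memory[7])     # 42
```

Each call to step() handles exactly one instruction from start to finish, which is the unpipelined behaviour: nothing overlaps.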

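An unpipelined processor with these four steps completes 1 instruction every 4 clock cycles, or 0.25 instructions per cycle (IPC). At 1 GHz that is 250 million instructions per second. Modern CPUs can execute 4–8 instructions per cycle at peak, a 16–32× improvement per cycle before counting higher clock speeds, achieved by pipelining and superscalar execution.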

3. Pipelining

Just as a car assembly line overlaps production stages, pipelining overlaps instruction stages. While instruction N is executing, instruction N+1 is decoding and N+2 is being fetched simultaneously:

Cycle:     1    2    3    4    5    6
Instr 1    IF   ID   EX   WB
Instr 2         IF   ID   EX   WB
Instr 3              IF   ID   EX   WB

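In the ideal case, a full pipeline completes one instruction per cycle instead of one every k cycles, so a 5-stage pipeline delivers up to 5× the throughput of an unpipelined design. Modern x86 pipelines are 14–19 stages deep (Intel Skylake: 14). Deeper pipelines allow higher clock speeds but increase the cost of pipeline flushes from branch mispredictions.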
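The cycle counts behind the diagram can be checked with a little arithmetic. The sketch below assumes an ideal pipeline with no stalls or flushes; the function names are made up for this example.

```python
def unpipelined_cycles(n_instr, n_stages):
    # Each instruction runs all stages before the next one starts.
    return n_instr * n_stages


def pipelined_cycles(n_instr, n_stages):
    # The first instruction needs n_stages cycles to drain through;
    # every later instruction finishes one cycle after its predecessor.
    if n_instr == 0:
        return 0
    return n_stages + (n_instr - 1)


def pipeline_diagram(n_instr, stage_names=("IF", "ID", "EX", "WB")):
    """Return one row per instruction, one column per cycle."""
    total = pipelined_cycles(n_instr, len(stage_names))
    rows = []
    for i in range(n_instr):
        row = [""] * total
        for s, name in enumerate(stage_names):
            row[i + s] = name     # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for row in pipeline_diagram(3):
    print(" ".join(f"{cell:>2}" for cell in row))

print(unpipelined_cycles(3, 4), pipelined_cycles(3, 4))   # 12 6

# The speedup approaches the stage count only for long instruction streams.
n = 1000
print(unpipelined_cycles(n, 5) / pipelined_cycles(n, 5))
```

For three instructions the gain is only 2×, because filling and draining the pipeline dominates. The "up to 5×" figure is the limit for long, stall-free instruction streams.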

Hazards that stall the pipeline:

4. Branch Prediction

~20% of instructions are branches. Without prediction, every branch stalls the pipeline for 14+ cycles. Modern CPUs use sophisticated predictors with >99% accuracy on typical workloads:

When a misprediction occurs, the CPU must flush the pipeline: it discards all in-flight instructions after the branch and refetches from the correct path. Penalty: 14–20 cycles. For tight loops with unpredictable conditions, a misprediction on every iteration severely limits IPC.
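To get a feel for how prediction accuracy turns into lost cycles, here is a sketch of a 2-bit saturating counter, a classic textbook predictor. It is an illustration only: real CPUs use far more elaborate history-based designs, and the class name, branch address, and outcome sequences below are assumptions made for the example. The 14-cycle penalty is the lower bound quoted above.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter, one counter per branch address.

    Counter values 0-1 predict not taken, 2-3 predict taken.
    """

    def __init__(self):
        self.counters = {}   # branch address -> counter in 0..3

    def predict(self, branch_addr):
        # Unseen branches start at 1 (weakly not taken), an arbitrary choice.
        return self.counters.get(branch_addr, 1) >= 2

    def update(self, branch_addr, taken):
        c = self.counters.get(branch_addr, 1)
        self.counters[branch_addr] = min(3, c + 1) if taken else max(0, c - 1)


def count_mispredictions(outcomes, branch_addr=0x400):
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            misses += 1
        predictor.update(branch_addr, taken)
    return misses


FLUSH_PENALTY = 14  # cycles per misprediction (lower bound from the text)

loop_branch = [True] * 99 + [False]    # loop back-edge: taken 99x, then exit
alternating = [True, False] * 50       # flips direction on every execution

for name, outcomes in [("loop", loop_branch), ("alternating", alternating)]:
    misses = count_mispredictions(outcomes)
    print(name, misses, "mispredictions,", misses * FLUSH_PENALTY, "cycles lost")
# loop 2 mispredictions, 28 cycles lost
# alternating 100 mispredictions, 1400 cycles lost
```

The loop branch is mispredicted only on the first iteration and at the exit. The alternating branch defeats this simple counter completely, although a predictor that tracks branch history would learn that pattern easily.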

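Spectre (2018): Speculative execution leaves traces in the cache even when its results are discarded. By mistraining the branch predictor, an attacker can steer the CPU into speculatively accessing restricted memory before the misprediction is detected. The speculative work is rolled back, but the data it touched is already in cache and can be inferred via timing. Spectre affected virtually all CPUs that use speculative execution and opened up a whole class of transient-execution attacks.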

5. Out-of-Order Execution

In-order execution stalls whenever an instruction is waiting for data (e.g., a load that misses every cache level can take ~200 cycles). Out-of-order execution (OoO) lets the CPU execute instructions whose inputs are ready while others wait.

The implementation uses a Reorder Buffer (ROB), a circular buffer holding in-flight instructions in program order. Instructions enter the ROB in order, may execute out of order when their inputs are ready, and commit their results to architectural state in order (preserving exception semantics).

A modern core (Intel Core Ultra, AMD Zen 5) can track 300–500+ in-flight instructions in the ROB simultaneously. This "window" of instructions allows the hardware to find independent operations and execute them in parallel on multiple execution units.
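The effect of the instruction window can be sketched with a small timing model. This is a deliberately simplified illustration, not a description of any real core: it assumes unlimited execution units, a ROB large enough to hold the whole program, and perfect register renaming. The instruction names and latencies are made up; the 200-cycle load latency is the figure used above.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str        # register written
    srcs: tuple      # registers read
    latency: int     # cycles from start of execution to result


def simulate(program, in_order):
    """Return (name, start, done, commit) per instruction, in program order."""
    ready_at = {}        # register -> cycle at which its value is available
    prev_start = 0
    prev_commit = 0
    timeline = []
    for ins in program:  # entries are allocated in program order
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        if in_order:
            # An in-order core cannot start an instruction before the
            # previous one has started, so one stalled use blocks the rest.
            start = max(operands_ready, prev_start)
        else:
            # Out of order: start as soon as the inputs are ready.
            start = operands_ready
        done = start + ins.latency
        ready_at[ins.dest] = done
        # Commit is always in program order, never before the predecessor.
        commit = max(done, prev_commit)
        timeline.append((ins.name, start, done, commit))
        prev_start, prev_commit = start, commit
    return timeline


program = [
    Instr("load_a", "a", (), 200),     # cache-missing load
    Instr("use_a", "x", ("a",), 1),    # depends on load_a
    Instr("load_b", "b", (), 200),     # independent cache-missing load
    Instr("use_b", "y", ("b",), 1),    # depends on load_b
]

for mode in (True, False):
    timeline = simulate(program, in_order=mode)
    label = "in-order" if mode else "out-of-order"
    print(label, "total cycles:", timeline[-1][3])
# in-order total cycles: 401
# out-of-order total cycles: 201
```

In the out-of-order run both loads start at cycle 0 and overlap, so the second miss is almost free. Results still commit in program order, which is what the ROB guarantees.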

6. Cache Hierarchy

Memory latency: main DRAM takes ~60–100 ns (~200 cycles). Without caches, the CPU would spend 98% of time waiting. The cache hierarchy stores recently used data close to the CPU:

Level            Size                                        Latency                   Bandwidth
L1-I + L1-D      32–64 KB each, per core                     4–5 cycles                ~256 GB/s
L2 Cache         256 KB – 1 MB per core (Zen 4: 1 MB)        12–14 cycles
L3 Cache (LLC)   6–192 MB shared (Zen 4 3D V-Cache: 96 MB)   30–50 cycles
DRAM             GBs                                         60–100 ns (~200 cycles)   50–100 GB/s

Cache miss penalty: L1 miss → L2: ~10 cycles extra. L2 → L3: ~30 cycles. L3 → DRAM: ~170 cycles. Writing cache-friendly code (linear access patterns, small working sets, avoiding false sharing in multi-threaded code) is often the single largest performance optimization available.
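The miss penalties above can be combined into an expected cost per memory access. The sketch below uses the extra-cycle figures from this section; the miss rates are made-up illustrative values, not measurements, and the function name is an assumption for the example.

```python
def average_access_cycles(l1_hit_cycles, miss_rates, extra_penalties):
    """Expected cycles per memory access.

    miss_rates[i] is the fraction of accesses reaching level i that miss it;
    extra_penalties[i] is the added cost of going one level further out.
    """
    expected = 0.0
    # Work from the outermost level inward, folding each penalty in.
    for miss_rate, penalty in zip(reversed(miss_rates), reversed(extra_penalties)):
        expected = miss_rate * (penalty + expected)
    return l1_hit_cycles + expected


L1_HIT = 4                     # cycles, from the hierarchy listed above
EXTRA = [10, 30, 170]          # L1->L2, L2->L3, L3->DRAM extra cycles

# Hypothetical miss rates: 5% in L1, 20% in L2, 50% in L3.
friendly = average_access_cycles(L1_HIT, [0.05, 0.20, 0.50], EXTRA)

# Worst case: every access misses every level and goes to DRAM.
all_miss = average_access_cycles(L1_HIT, [1.0, 1.0, 1.0], EXTRA)

print(round(friendly, 2), round(all_miss, 2))   # 5.65 214.0
```

With these assumed rates the average access costs under 6 cycles even though DRAM is more than 200 cycles away, which is why keeping the working set in cache matters so much.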

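Cache lines are 64 bytes on x86. When any byte is accessed, the entire 64-byte line is loaded. Processing a 1 GB array sequentially uses every byte of each loaded line and produces predictable access patterns that hardware prefetchers handle well. Random access over a working set larger than the caches defeats both: most accesses miss and pay close to the full DRAM latency.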
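The difference between sequential and scattered access can be shown by counting line fills in a toy cache model. This is a rough sketch under stated assumptions: a direct-mapped 32 KB cache (real L1 caches are set-associative), no prefetcher, 4-byte elements, and a 1 MiB working set chosen so the example runs quickly.

```python
import random

LINE_SIZE = 64  # bytes per cache line, as stated above


def count_line_misses(addresses, num_lines=512):
    """Count misses in a direct-mapped cache of num_lines lines (512 * 64 B = 32 KB)."""
    slots = [None] * num_lines          # line number currently held in each slot
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE        # which 64-byte line the byte falls in
        slot = line % num_lines         # direct-mapped: one possible slot per line
        if slots[slot] != line:
            slots[slot] = line          # load the whole line, evicting the old one
            misses += 1
    return misses


n_bytes = 1 << 20                               # 1 MiB working set
element_addrs = list(range(0, n_bytes, 4))      # one address per 4-byte element

sequential_misses = count_line_misses(element_addrs)

shuffled = element_addrs[:]
random.Random(0).shuffle(shuffled)              # fixed seed for repeatability
scattered_misses = count_line_misses(shuffled)

print(len(element_addrs), "accesses")
print("sequential misses:", sequential_misses)  # 16384, one per 64-byte line
print("scattered misses: ", scattered_misses)
```

Sequential access misses once per line and then hits on the other 15 elements in it. The shuffled order visits the same addresses but misses on nearly every access, because the line it needs has almost always been evicted by then.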

7. Modern CPUs: Cores, Threads & More