Information Theory — Shannon Entropy and the Limits of Communication
Shannon entropy, channel capacity, data compression, error correction, and applications in cryptography and biology.
1. What Is Information?
Claude Shannon's landmark 1948 paper "A Mathematical Theory of Communication" created information theory from scratch. In a single paper, Shannon defined what information is, proved fundamental limits on compression and transmission, and founded the mathematical framework that underlies every digital communication system in existence.
The key insight is conceptually surprising: information is about surprise, not meaning. A certain event conveys no information — if you already know the sun will rise, being told it rose tells you nothing. A rare event conveys a great deal. This breaks completely from intuitive notions of "meaningful content" and replaces them with something mathematically precise.
The self-information (or surprisal) of an event with probability p(x) is defined as:
The logarithm base 2 gives units of bits — a natural unit because a fair binary choice carries exactly 1 bit. Using base e gives nats (natural units), and base 10 gives hartleys. These are all equivalent up to a constant multiplicative factor.
A crucial property: information is additive for independent events. If two fair coins are flipped independently, witnessing the outcome of both gives 2 bits of information — the sum of the individual surprisals. This additivity property (from the logarithm) makes the definition mathematically natural and operationally meaningful.
2. Shannon Entropy
While self-information characterizes a single event, Shannon entropy H(X) characterizes an entire random variable — it is the expected (average) self-information, representing the average uncertainty before observing an outcome.
For a fair coin (p = 0.5 each side): H = 1 bit. For a biased coin with p = 0.9 for heads: H ≈ 0.47 bits — less uncertain, so less average information per flip. Entropy is maximized by the uniform distribution and equals exactly 0 when the outcome is certain. It can never be negative.
The key relationships between entropies capture how information flows between variables:
- Joint entropy H(X,Y): total uncertainty about both variables together
- Chain rule: H(X,Y) = H(X) + H(Y|X) — joint entropy equals the entropy of X plus the remaining uncertainty about Y given X
- Mutual information I(X;Y): the reduction in uncertainty about X from knowing Y, or equivalently, how much X and Y share
Entropy also connects to thermodynamics: Boltzmann's entropy formula S = k log W has the same mathematical structure as Shannon entropy. This is not a coincidence — both measure the number of distinguishable states or configurations underlying an observed macrostate.
3. Source Coding and Data Compression
Shannon's source coding theorem establishes the fundamental limit on lossless data compression: it is impossible to compress a data source below its entropy rate H(X) bits per symbol on average, but it is always achievable to get arbitrarily close. Entropy is the true irreducible information content.
Huffman coding (1952) is the simplest optimal prefix-free code. The algorithm builds a binary tree bottom-up: repeatedly merge the two least-probable symbols into a combined node, assigning them bit suffixes 0 and 1. The resulting code assigns shorter binary strings to more probable symbols.
Example with symbols A (p=0.5), B (p=0.25), C (p=0.125), D (p=0.125):
- Huffman codes: A =
0, B =10, C =110, D =111 - Average code length: 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits
- Entropy: H = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits — exactly optimal here
Arithmetic coding encodes an entire message as a single number in [0, 1). It maintains a current interval and narrows it with each symbol, proportional to the symbol's probability. Unlike Huffman, it is not constrained to whole-bit codewords and achieves entropy more precisely, especially for small alphabets or heavily skewed distributions.
Lempel-Ziv-Welch (LZW) takes a different approach: dictionary-based compression that replaces repeated patterns with references to a shared codebook. It works without prior knowledge of the source distribution (it is universal). LZW underlies GIF image compression, the ZIP format, and PDF streams.
4. Channel Capacity and Error Correction
A real communication channel introduces noise: bit flips, erasures, and interference. Shannon's noisy channel coding theorem (the deepest result in information theory) states something remarkable: reliable communication is possible at any rate below channel capacity C, no matter how noisy the channel — as long as you use the right error-correcting code. Conversely, reliable communication is impossible above C.
The Shannon-Hartley theorem gives capacity for an additive white Gaussian noise (AWGN) channel:
This formula sets absolute limits on what any communication system can achieve, regardless of the modulation scheme or signal processing used. The only levers are bandwidth and signal power.
Error-correcting codes add controlled redundancy so receivers can detect and correct errors:
- Hamming codes: the first practical error-correcting codes (1950). They can detect 2-bit errors and correct 1-bit errors using a minimum Hamming distance of 3 between codewords. Still used in ECC (error-correcting) RAM modules in servers.
- Reed-Solomon codes: polynomial-based codes that excel at correcting burst errors (multiple consecutive bit errors). Used in CDs, DVDs, QR codes, and deep-space communication (Voyager probes). A scratched CD can lose centimeters of data and still play correctly.
- Turbo codes and LDPC (Low-Density Parity-Check) codes: iterative decoding algorithms that approach the Shannon limit to within 0.1 dB. Turbo codes are used in 4G LTE; LDPC codes power 5G NR and Wi-Fi 6.
Applications of information theory extend far beyond communications:
- DNA data storage: the four-base alphabet (A, T, G, C) gives 2 bits per base in theory; in practice, biological synthesis/sequencing errors require specialized codes. Researchers have stored gigabytes of data in synthetic DNA with information-theoretically designed codebooks.
- Quantum error correction: quantum bits (qubits) decohere from environmental noise; quantum error-correcting codes (like the surface code) protect logical qubits from physical errors, enabling fault-tolerant quantum computation.
- Cryptography: Shannon proved that the one-time pad achieves perfect secrecy — an eavesdropper learns nothing about the plaintext — when the key is truly random and has entropy at least equal to the message entropy. All symmetric encryption aims to approach this ideal.
Explore Algorithm Simulations
See data compression, sorting and search algorithms visualized in real time.