What is the Kraft inequality?

The Kraft inequality states that for any uniquely decodable code over an alphabet of size r with codeword lengths l₁, l₂, ..., lₙ, the sum Σ r^(-lᵢ) ≤ 1. Conversely, for any set of lengths satisfying Kraft's inequality, a prefix-free (instantaneously decodable) code exists with those lengths. This fundamental result connects code structure to compression theory and proves that any uniquely decodable code can be replaced by a prefix-free code of equal or shorter length.

What is the AEP (Asymptotic Equipartition Property)?

The Asymptotic Equipartition Property (AEP) is information theory's law of large numbers. For a sequence of n i.i.d. random variables with entropy H, almost all sequences of length n fall into a "typical set" of approximately 2^(nH) roughly equal-probability sequences, each with probability ≈ 2^(-nH). AEP proves that only the typical sequences need efficient encoding — the rest can be handled with longer codes — and underpins Shannon's source coding theorem.

What is channel capacity and the noisy channel coding theorem?

Channel capacity C = max_{p(x)} I(X;Y) is the maximum mutual information between input X and output Y over all input distributions. Shannon's noisy channel coding theorem guarantees: (1) rates below C are achievable with error probability approaching zero using long block codes; (2) rates above C result in unavoidable errors. This theorem was constructive in principle but Shannon's original proof used random coding — practical capacity-achieving codes (turbo codes, LDPC, polar codes) were developed decades later.

Mathematics

📅 June 22, 2026 ⏱ ~8 min read

Information Theory — Shannon Entropy and the Limits of Communication

Shannon entropy, channel capacity, data compression, error correction, and applications in cryptography and biology.

1. What Is Information?

Claude Shannon's landmark 1948 paper "A Mathematical Theory of Communication" created information theory from scratch. In a single paper, Shannon defined what information is, proved fundamental limits on compression and transmission, and founded the mathematical framework that underlies every digital communication system in existence.

The key insight is conceptually surprising: information is about surprise, not meaning. A certain event conveys no information — if you already know the sun will rise, being told it rose tells you nothing. A rare event conveys a great deal. This breaks completely from intuitive notions of "meaningful content" and replaces them with something mathematically precise.

The self-information (or surprisal) of an event with probability p(x) is defined as:

Self-information: I(x) = -log₂(p(x)) [measured in bits] Examples: Fair coin flip (p = 0.5): I = -log₂(0.5) = 1 bit Fair die, one face (p = 1/6): I = -log₂(1/6) \approx 2.58 bits Certain event (p = 1.0): I = -log₂(1) = 0 bits (no surprise) Impossible event (p \to 0): I \to \infty

The logarithm base 2 gives units of bits — a natural unit because a fair binary choice carries exactly 1 bit. Using base e gives nats (natural units), and base 10 gives hartleys. These are all equivalent up to a constant multiplicative factor.

A crucial property: information is additive for independent events. If two fair coins are flipped independently, witnessing the outcome of both gives 2 bits of information — the sum of the individual surprisals. This additivity property (from the logarithm) makes the definition mathematically natural and operationally meaningful.

2. Shannon Entropy

While self-information characterizes a single event, Shannon entropy H(X) characterizes an entire random variable — it is the expected (average) self-information, representing the average uncertainty before observing an outcome.

Shannon entropy: H(X) = -Σₓ p(x) · log₂(p(x)) [bits] Maximum entropy: uniform distribution over N outcomes H_max = log₂(N) Conditional entropy: H(Y|X) = -Σ_{x,y} p(x,y) log₂ p(y|x) Mutual information: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = how much knowing Y reduces uncertainty about X

For a fair coin (p = 0.5 each side): H = 1 bit. For a biased coin with p = 0.9 for heads: H ≈ 0.47 bits — less uncertain, so less average information per flip. Entropy is maximized by the uniform distribution and equals exactly 0 when the outcome is certain. It can never be negative.

The key relationships between entropies capture how information flows between variables:

Joint entropy H(X,Y): total uncertainty about both variables together
Chain rule: H(X,Y) = H(X) + H(Y|X) — joint entropy equals the entropy of X plus the remaining uncertainty about Y given X
Mutual information I(X;Y): the reduction in uncertainty about X from knowing Y, or equivalently, how much X and Y share

Mutual information in practice: Mutual information measures statistical dependence without assuming linearity — it captures any form of relationship between variables. This makes it a powerful tool for feature selection in machine learning (which inputs are actually informative?) and in neuroscience (how much does a neuron's response encode a given stimulus?).

Entropy also connects to thermodynamics: Boltzmann's entropy formula S = k log W has the same mathematical structure as Shannon entropy. This is not a coincidence — both measure the number of distinguishable states or configurations underlying an observed macrostate.

3. Source Coding and Data Compression

Shannon's source coding theorem establishes the fundamental limit on lossless data compression: it is impossible to compress a data source below its entropy rate H(X) bits per symbol on average, but it is always achievable to get arbitrarily close. Entropy is the true irreducible information content.

Huffman coding (1952) is the simplest optimal prefix-free code. The algorithm builds a binary tree bottom-up: repeatedly merge the two least-probable symbols into a combined node, assigning them bit suffixes 0 and 1. The resulting code assigns shorter binary strings to more probable symbols.

Example with symbols A (p=0.5), B (p=0.25), C (p=0.125), D (p=0.125):

Huffman codes: A = 0, B = 10, C = 110, D = 111
Average code length: 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits
Entropy: H = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits — exactly optimal here

Arithmetic coding encodes an entire message as a single number in [0, 1). It maintains a current interval and narrows it with each symbol, proportional to the symbol's probability. Unlike Huffman, it is not constrained to whole-bit codewords and achieves entropy more precisely, especially for small alphabets or heavily skewed distributions.

Lempel-Ziv-Welch (LZW) takes a different approach: dictionary-based compression that replaces repeated patterns with references to a shared codebook. It works without prior knowledge of the source distribution (it is universal). LZW underlies GIF image compression, the ZIP format, and PDF streams.

Modern compressors: Tools like zstd and Brotli combine LZ77 (dictionary matching for repeated sequences) with Huffman or ANS (Asymmetric Numeral Systems) entropy coding. ANS achieves near-arithmetic coding compression ratios at speeds approaching Huffman coding, and has become the entropy coder of choice in state-of-the-art compressors.

4. Channel Capacity and Error Correction

A real communication channel introduces noise: bit flips, erasures, and interference. Shannon's noisy channel coding theorem (the deepest result in information theory) states something remarkable: reliable communication is possible at any rate below channel capacity C, no matter how noisy the channel — as long as you use the right error-correcting code. Conversely, reliable communication is impossible above C.

The Shannon-Hartley theorem gives capacity for an additive white Gaussian noise (AWGN) channel:

Channel capacity: C = B \cdot log₂(1 + S/N) [bits per second] B = bandwidth (Hz) S/N = signal-to-noise ratio (linear, not dB) Example: telephone channel (B=3kHz, S/N=1000): C = 3000 \cdot log₂(1001) \approx 30,000 bits/s Binary symmetric channel with error prob. p: C = 1 + p\cdotlog₂(p) + (1-p)\cdotlog₂(1-p) [bits per channel use]

This formula sets absolute limits on what any communication system can achieve, regardless of the modulation scheme or signal processing used. The only levers are bandwidth and signal power.

Error-correcting codes add controlled redundancy so receivers can detect and correct errors:

Hamming codes: the first practical error-correcting codes (1950). They can detect 2-bit errors and correct 1-bit errors using a minimum Hamming distance of 3 between codewords. Still used in ECC (error-correcting) RAM modules in servers.
Reed-Solomon codes: polynomial-based codes that excel at correcting burst errors (multiple consecutive bit errors). Used in CDs, DVDs, QR codes, and deep-space communication (Voyager probes). A scratched CD can lose centimeters of data and still play correctly.
Turbo codes and LDPC (Low-Density Parity-Check) codes: iterative decoding algorithms that approach the Shannon limit to within 0.1 dB. Turbo codes are used in 4G LTE; LDPC codes power 5G NR and Wi-Fi 6.

Applications of information theory extend far beyond communications:

DNA data storage: the four-base alphabet (A, T, G, C) gives 2 bits per base in theory; in practice, biological synthesis/sequencing errors require specialized codes. Researchers have stored gigabytes of data in synthetic DNA with information-theoretically designed codebooks.
Quantum error correction: quantum bits (qubits) decohere from environmental noise; quantum error-correcting codes (like the surface code) protect logical qubits from physical errors, enabling fault-tolerant quantum computation.
Cryptography: Shannon proved that the one-time pad achieves perfect secrecy — an eavesdropper learns nothing about the plaintext — when the key is truly random and has entropy at least equal to the message entropy. All symmetric encryption aims to approach this ideal.

Explore Algorithm Simulations

See data compression, sorting and search algorithms visualized in real time.

Explore Algorithm Simulations →

Information Theory — Shannon Entropy and the Limits of Communication

1. What Is Information?

2. Shannon Entropy

3. Source Coding and Data Compression

4. Channel Capacity and Error Correction

Explore Algorithm Simulations

Related Content

Frequently Asked Questions

1. What Is Information?

2. Shannon Entropy

3. Source Coding and Data Compression

4. Channel Capacity and Error Correction

Explore Algorithm Simulations

Related Content

Related Articles

Frequently Asked Questions