Mathematics
📅 June 22, 2026 ⏱ ~8 min read

Information Theory — Shannon Entropy and the Limits of Communication

Shannon entropy, channel capacity, data compression, error correction, and applications in cryptography and biology.

1. What Is Information?

Claude Shannon's landmark 1948 paper "A Mathematical Theory of Communication" created information theory from scratch. In a single paper, Shannon defined what information is, proved fundamental limits on compression and transmission, and founded the mathematical framework that underlies every digital communication system in existence.

The key insight is conceptually surprising: information is about surprise, not meaning. A certain event conveys no information — if you already know the sun will rise, being told it rose tells you nothing. A rare event conveys a great deal. This breaks completely from intuitive notions of "meaningful content" and replaces them with something mathematically precise.

The self-information (or surprisal) of an event with probability p(x) is defined as:

Self-information: I(x) = -log₂(p(x)) [measured in bits] Examples: Fair coin flip (p = 0.5): I = -log₂(0.5) = 1 bit Fair die, one face (p = 1/6): I = -log₂(1/6) ≈ 2.58 bits Certain event (p = 1.0): I = -log₂(1) = 0 bits (no surprise) Impossible event (p → 0): I → ∞

The logarithm base 2 gives units of bits — a natural unit because a fair binary choice carries exactly 1 bit. Using base e gives nats (natural units), and base 10 gives hartleys. These are all equivalent up to a constant multiplicative factor.

A crucial property: information is additive for independent events. If two fair coins are flipped independently, witnessing the outcome of both gives 2 bits of information — the sum of the individual surprisals. This additivity property (from the logarithm) makes the definition mathematically natural and operationally meaningful.

2. Shannon Entropy

While self-information characterizes a single event, Shannon entropy H(X) characterizes an entire random variable — it is the expected (average) self-information, representing the average uncertainty before observing an outcome.

Shannon entropy: H(X) = -Σₓ p(x) · log₂(p(x)) [bits] Maximum entropy: uniform distribution over N outcomes H_max = log₂(N) Conditional entropy: H(Y|X) = -Σ_{x,y} p(x,y) log₂ p(y|x) Mutual information: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = how much knowing Y reduces uncertainty about X

For a fair coin (p = 0.5 each side): H = 1 bit. For a biased coin with p = 0.9 for heads: H ≈ 0.47 bits — less uncertain, so less average information per flip. Entropy is maximized by the uniform distribution and equals exactly 0 when the outcome is certain. It can never be negative.

The key relationships between entropies capture how information flows between variables:

Mutual information in practice: Mutual information measures statistical dependence without assuming linearity — it captures any form of relationship between variables. This makes it a powerful tool for feature selection in machine learning (which inputs are actually informative?) and in neuroscience (how much does a neuron's response encode a given stimulus?).

Entropy also connects to thermodynamics: Boltzmann's entropy formula S = k log W has the same mathematical structure as Shannon entropy. This is not a coincidence — both measure the number of distinguishable states or configurations underlying an observed macrostate.

3. Source Coding and Data Compression

Shannon's source coding theorem establishes the fundamental limit on lossless data compression: it is impossible to compress a data source below its entropy rate H(X) bits per symbol on average, but it is always achievable to get arbitrarily close. Entropy is the true irreducible information content.

Huffman coding (1952) is the simplest optimal prefix-free code. The algorithm builds a binary tree bottom-up: repeatedly merge the two least-probable symbols into a combined node, assigning them bit suffixes 0 and 1. The resulting code assigns shorter binary strings to more probable symbols.

Example with symbols A (p=0.5), B (p=0.25), C (p=0.125), D (p=0.125):

Arithmetic coding encodes an entire message as a single number in [0, 1). It maintains a current interval and narrows it with each symbol, proportional to the symbol's probability. Unlike Huffman, it is not constrained to whole-bit codewords and achieves entropy more precisely, especially for small alphabets or heavily skewed distributions.

Lempel-Ziv-Welch (LZW) takes a different approach: dictionary-based compression that replaces repeated patterns with references to a shared codebook. It works without prior knowledge of the source distribution (it is universal). LZW underlies GIF image compression, the ZIP format, and PDF streams.

Modern compressors: Tools like zstd and Brotli combine LZ77 (dictionary matching for repeated sequences) with Huffman or ANS (Asymmetric Numeral Systems) entropy coding. ANS achieves near-arithmetic coding compression ratios at speeds approaching Huffman coding, and has become the entropy coder of choice in state-of-the-art compressors.

4. Channel Capacity and Error Correction

A real communication channel introduces noise: bit flips, erasures, and interference. Shannon's noisy channel coding theorem (the deepest result in information theory) states something remarkable: reliable communication is possible at any rate below channel capacity C, no matter how noisy the channel — as long as you use the right error-correcting code. Conversely, reliable communication is impossible above C.

The Shannon-Hartley theorem gives capacity for an additive white Gaussian noise (AWGN) channel:

Channel capacity: C = B · log₂(1 + S/N) [bits per second] B = bandwidth (Hz) S/N = signal-to-noise ratio (linear, not dB) Example: telephone channel (B=3kHz, S/N=1000): C = 3000 · log₂(1001) ≈ 30,000 bits/s Binary symmetric channel with error prob. p: C = 1 + p·log₂(p) + (1-p)·log₂(1-p) [bits per channel use]

This formula sets absolute limits on what any communication system can achieve, regardless of the modulation scheme or signal processing used. The only levers are bandwidth and signal power.

Error-correcting codes add controlled redundancy so receivers can detect and correct errors:

Applications of information theory extend far beyond communications:

Explore Algorithm Simulations

See data compression, sorting and search algorithms visualized in real time.

Explore Algorithm Simulations →

Related Content

Frequently Asked Questions

What is the Kraft inequality?

The Kraft inequality states that for any uniquely decodable code over an alphabet of size r with codeword lengths l₁, l₂, ..., lₙ, the sum Σ r^(-lᵢ) ≤ 1. Conversely, for any set of lengths satisfying Kraft's inequality, a prefix-free (instantaneously decodable) code exists with those lengths. This fundamental result connects code structure to compression theory and proves that any uniquely decodable code can be replaced by a prefix-free code of equal or shorter length.

What is the AEP (Asymptotic Equipartition Property)?

The Asymptotic Equipartition Property (AEP) is information theory's law of large numbers. For a sequence of n i.i.d. random variables with entropy H, almost all sequences of length n fall into a "typical set" of approximately 2^(nH) roughly equal-probability sequences, each with probability ≈ 2^(-nH). AEP proves that only the typical sequences need efficient encoding — the rest can be handled with longer codes — and underpins Shannon's source coding theorem.

What is channel capacity and the noisy channel coding theorem?

Channel capacity C = max_{p(x)} I(X;Y) is the maximum mutual information between input X and output Y over all input distributions. Shannon's noisy channel coding theorem guarantees: (1) rates below C are achievable with error probability approaching zero using long block codes; (2) rates above C result in unavoidable errors. This theorem was constructive in principle but Shannon's original proof used random coding — practical capacity-achieving codes (turbo codes, LDPC, polar codes) were developed decades later.

What is differential entropy and how does it differ from discrete entropy?

Differential entropy h(X) = -∫f(x)log f(x)dx extends entropy to continuous random variables. Unlike discrete entropy, differential entropy can be negative and is not invariant to deterministic transformations (scaling X by a multiplies h(X) by log|a|). The maximum differential entropy under power constraint σ² is achieved by the Gaussian distribution: h(X) = (1/2)log(2πeσ²) bits. These properties make differential entropy less "physical" than discrete entropy but useful for deriving Gaussian channel capacity.

What is the Rényi entropy?

Rényi entropy H_α(X) = (1/(1-α)) log Σ pᵢ^α generalizes Shannon entropy (recovered as α→1). Different values of α emphasize different aspects: H₀ is log of support size (number of possible outcomes), H₁ is Shannon entropy, H₂ is collision entropy (probability two random draws match), H_∞ is min-entropy (-log max pᵢ, used in cryptography). Rényi entropy appears in fractal dimension calculations, quantum information, and cryptographic security proofs.

What is the channel coding of practical systems?

Modern digital communication uses powerful error-correcting codes approaching Shannon capacity. Turbo codes (1993, used in 3G/4G) achieve near-capacity performance using two recursive convolutional codes with iterative decoding. LDPC (Low-Density Parity-Check) codes (rediscovered 1996) use sparse parity matrices and belief propagation decoding — used in 5G, WiFi, and DVB. Polar codes (Arıkan 2009) provably achieve capacity for binary symmetric channels and are part of 5G NR — the first provably capacity-achieving practical codes.

What is the relationship between information theory and cryptography?

Shannon's 1949 "Communication Theory of Secrecy Systems" founded information-theoretic cryptography. Perfect secrecy (Shannon security) requires the ciphertext reveals zero information about the plaintext — achievable only with one-time pads (keys as long as messages). Computational security (used in practice) assumes computationally bounded adversaries. Information-theoretic security includes: authentication codes that are provably secure without computational assumptions, secret sharing schemes, and physical-layer security exploiting channel noise properties.

What is Fisher information?

Fisher information I(θ) = E[(∂log p(X;θ)/∂θ)²] measures the amount of information an observable X carries about an unknown parameter θ. The Cramér-Rao bound states that any unbiased estimator's variance is at least 1/I(θ) — Fisher information limits estimation precision. Fisher information is related to KL divergence (it's the second derivative of KL at θ=0) and appears in natural gradient methods for neural network optimization, which follow the steepest descent in the space of probability distributions rather than parameter space.

What is rate-distortion theory?

Rate-distortion theory characterizes the trade-off between compression rate R (bits per source symbol) and allowed reconstruction distortion D. The rate-distortion function R(D) = min_{p(x̂|x): E[d(x,x̂)]≤D} I(X;X̂) gives the minimum rate needed to achieve distortion at most D. It proves that all lossy compression systems face a fundamental bound — the rate-distortion function — that cannot be beaten. JPEG, MP3, and video codecs are practical attempts to approach this theoretical limit.

What is the connection between information theory and statistical mechanics?

Information entropy and thermodynamic entropy are deeply connected. Boltzmann's H-theorem (his entropy formula) and Shannon's entropy formula are identical in form — Jaynes showed that statistical mechanics maximizes entropy (uncertainty) subject to known constraints (average energy), which is exactly Bayesian maximum entropy inference applied to physical systems. The free energy in statistical mechanics equals the minimum description length in information theory. This connection makes information theory the proper language for statistical mechanics, turning it from physics into inference about systems with incomplete knowledge.