Central Limit Theorem — Why Averages Follow a Bell Curve

The Central Limit Theorem is arguably the most important theorem in statistics. It says that the mean of a large sample from any distribution with finite variance will be approximately normally distributed — regardless of whether the underlying distribution is uniform, exponential, heavily skewed, or otherwise non-normal. This single fact underpins the z-test, t-test, confidence intervals, regression, and most of classical inferential statistics.

1. Formal Statement

Classical CLT: Let X₁, X₂, ..., Xₙ be i.i.d. random variables with mean μ = E[Xᵢ] and variance σ² = Var(Xᵢ) < ∞.

Define the standardized sample mean:
  Zₙ = (X̄ₙ - μ) / (σ/√n), where X̄ₙ = (X₁+...+Xₙ)/n

Then:
  Zₙ ⟶ N(0, 1) in distribution as n → ∞

Equivalently:
  √n (X̄ₙ - μ) ⟶ N(0, σ²)

Key consequences:
• X̄ₙ is approximately N(μ, σ²/n) for large n
• 95% confidence interval: x̄ ± 1.96·σ/√n
• If σ is unknown: x̄ ± t_(n-1, 0.025)·s/√n (t-distribution, with s the sample standard deviation)

Rule of thumb: n ≥ 30 is often "large enough" for well-behaved distributions. Heavy-tailed or highly skewed distributions may need n ≥ 100 or more.
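The statement can be checked empirically. The sketch below is only an illustration: the choice of Exponential(1) data, the sample size n = 50, the seed, and the function name standardized_means are all assumptions made for this example. It draws many samples, standardizes each sample mean, and reports how often Zₙ lands within ±1.96.

```python
import math
import random
import statistics

def standardized_means(n, trials, rng):
    """Draw `trials` samples of size n from Exponential(1); standardize each sample mean."""
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(rate=1)
    zs = []
    for _ in range(trials):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append((xbar - mu) / (sigma / math.sqrt(n)))
    return zs

rng = random.Random(0)  # fixed seed so the run is repeatable
z = standardized_means(n=50, trials=5000, rng=rng)
inside = sum(abs(v) <= 1.96 for v in z) / len(z)

print(f"mean of Z_n:          {statistics.fmean(z):.3f}")
print(f"stdev of Z_n:         {statistics.stdev(z):.3f}")
print(f"share within +/-1.96: {inside:.3f}")
```

If the normal approximation is reasonable at this n, the mean should be near 0, the standard deviation near 1, and the share within ±1.96 near 0.95, even though the underlying data are strongly skewed.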

2. Proof via Characteristic Functions

The slickest proof uses characteristic functions (Fourier transforms of probability distributions). The characteristic function of a random variable X is φ_X(t) = E[e^{itX}].

Proof sketch (Lévy's continuity theorem):
1. Let Yᵢ = (Xᵢ - μ)/σ be standardized. Then E[Yᵢ] = 0, Var(Yᵢ) = 1, and Zₙ = (Y₁+...+Yₙ)/√n.
2. Characteristic function of the standardized sum: φ_{Zₙ}(t) = [φ_Y(t/√n)]ⁿ (since the Yᵢ are i.i.d.).
3. Taylor-expand log φ_Y(t) around t = 0:
   log φ_Y(t) = log(1 + it·E[Y] - (t²/2)·E[Y²] + O(t³))
   Since E[Y] = 0 and E[Y²] = Var(Y) = 1: log φ_Y(t) = -t²/2 + O(t³).
   (The O(t³) remainder assumes a finite third moment. With only finite variance the remainder is o(t²), and the rest of the argument is unchanged.)
4. Substitute, for fixed t:
   log φ_{Zₙ}(t) = n · log φ_Y(t/√n) = n · (-(t/√n)²/2 + O((t/√n)³)) = -t²/2 + O(1/√n) → -t²/2 as n → ∞
5. Therefore φ_{Zₙ}(t) → e^{-t²/2}, which is exactly the characteristic function of N(0,1).
6. By Lévy's continuity theorem, Zₙ → N(0,1) in distribution. □
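The limit in steps 4 and 5 can be watched numerically for one concrete case. As an assumption for this sketch, take each Yᵢ to be +1 or -1 with probability 1/2, so that φ_Y(t) = cos(t) and φ_{Zₙ}(t) = cos(t/√n)ⁿ. The function name phi_zn and the test point t = 1.5 are illustrative choices.

```python
import math

def phi_zn(t, n):
    """Characteristic function of Z_n when each Y_i is +1 or -1 with probability 1/2.

    For that choice phi_Y(t) = cos(t), so phi_{Z_n}(t) = cos(t / sqrt(n)) ** n.
    """
    return math.cos(t / math.sqrt(n)) ** n

t = 1.5
limit = math.exp(-t * t / 2)  # characteristic function of N(0, 1) at t
for n in (10, 100, 1000, 10000):
    value = phi_zn(t, n)
    print(f"n={n:>5}  phi_Zn(t)={value:.6f}  limit={limit:.6f}  gap={abs(value - limit):.2e}")
```

Because this particular distribution is symmetric, its third moment vanishes and the gap shrinks faster than the general O(1/√n) rate in the proof sketch.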

3. Convergence Rate: Berry-Esseen Theorem

The CLT says convergence happens, but not how fast. The Berry-Esseen theorem provides a quantitative bound:

Berry-Esseen (1941/1942):
  sup_x |P(Zₙ ≤ x) - Φ(x)| ≤ C · ρ / (σ³ √n)
where:
  ρ = E[|X - μ|³] (third absolute central moment)
  σ² = Var(X)
  Φ = standard normal CDF
  C ≤ 0.4748 (best known constant, Shevtsova 2011)

Example: Bernoulli(p) variable
  σ² = p(1-p), ρ = p(1-p)(p² + (1-p)²)
  Max error ≤ 0.4748 × (p² + (1-p)²) / √(p(1-p)) × 1/√n ≈ 1/(2√n) for p near 0.5
  At n = 100: max CDF error ≤ 0.05 (5%)
  At n = 1000: max CDF error ≤ 0.016 (1.6%)

Practical implication: n = 30 works well for symmetric unimodal distributions; for skewed distributions like the Exponential, n = 100 or more is more reliable.
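For a Bernoulli(p) variable the bound can be compared with the exact worst-case CDF error, because the sum is binomial and its CDF is computable. The sketch below computes ρ = E|X - p|³ directly from the definition, which gives p(1-p)(p² + (1-p)²), and uses the constant 0.4748 quoted above. The function names are illustrative, and the binomial probabilities are evaluated in log space to avoid overflow.

```python
import math
from statistics import NormalDist

C = 0.4748  # constant quoted in the text

def berry_esseen_bound(p, n):
    """Berry-Esseen bound for Bernoulli(p), with rho = E|X - p|^3."""
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * (p ** 2 + (1 - p) ** 2)
    return C * rho / (sigma ** 3 * math.sqrt(n))

def binomial_pmf(n, k, p):
    """P(K = k) for K ~ Binomial(n, p), computed in log space (requires 0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def exact_max_cdf_error(p, n):
    """sup_x |P(Z_n <= x) - Phi(x)| for the standardized Binomial(n, p) sum."""
    phi = NormalDist().cdf
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        x = (k - mean) / sd
        # The step CDF jumps at x, so compare Phi(x) with the value just before the jump...
        worst = max(worst, abs(cdf - phi(x)))
        cdf += binomial_pmf(n, k, p)
        # ...and with the value at the jump.
        worst = max(worst, abs(cdf - phi(x)))
    return worst

for n in (100, 1000):
    print(n, round(exact_max_cdf_error(0.5, n), 4), round(berry_esseen_bound(0.5, n), 4))
```

The exact error should sit below the bound in every row. The bound is a worst-case guarantee over all distributions with the same ρ/σ³, so it is not expected to be tight for any particular one.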

4. The Galton Board

The Galton board (quincunx), invented by Sir Francis Galton around 1876, is a physical demonstration of the CLT. A ball falls through a triangular array of pins; at each pin it deflects left or right with equal probability. The accumulated balls at the bottom follow a binomial distribution that, with many rows, approximates a normal distribution (N(0,1) once the positions are standardized).

Mathematical connection:
With n rows, the ball's horizontal position is the sum of n independent steps, each +1 (right) or -1 (left) with equal probability.
The number of rightward deflections is K ~ Binomial(n, 0.5), and the position is 2K - n.
By the CLT: Binomial(n, p) is approximately N(np, np(1-p)) for large n, and the standardized count (K - np)/√(np(1-p)) → N(0, 1) in distribution.
Pascal's triangle: entry C(n, k) = number of paths that reach bin k after n rows.
Height of bin k ∝ C(n, k), which matches the bell shape of the normal distribution.
The Galton board makes convergence to the normal distribution visceral: each ball's final position is the sum of n independent random variables.
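A Galton board is easy to simulate. The sketch below is a toy model under stated assumptions: 12 rows, 4000 balls, a fixed seed, and a bin index equal to the number of rightward deflections. The name galton_board is made up for this example. It prints a text histogram and compares the empirical mean and variance of the bin index with the binomial values n/2 and n/4.

```python
import random
from collections import Counter

def galton_board(rows, balls, rng):
    """Drop `balls` balls through `rows` rows of pins.

    Returns a Counter mapping bin index (number of rightward deflections,
    between 0 and rows) to the number of balls that landed there.
    """
    bins = Counter()
    for _ in range(balls):
        rights = sum(rng.random() < 0.5 for _ in range(rows))
        bins[rights] += 1
    return bins

rng = random.Random(1)  # fixed seed so the run is repeatable
rows, balls = 12, 4000
bins = galton_board(rows, balls, rng)

for k in range(rows + 1):
    print(f"{k:>2} {'#' * (bins[k] // 40)}")

mean = sum(k * c for k, c in bins.items()) / balls
var = sum(c * (k - mean) ** 2 for k, c in bins.items()) / balls
print(f"mean={mean:.2f} (binomial: {rows / 2}), variance={var:.2f} (binomial: {rows / 4})")
```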

More generally, a quantity that results from the sum of many small, independent contributions tends to be approximately normally distributed. This is a common explanation for why heights, measurement errors, IQ scores, blood pressure, and many other quantities look roughly bell-shaped.

5. Sampling Distributions in Statistics

The CLT is the foundation for sampling distributions — the distributions of statistics computed from samples:

Sample mean X̄: ~ N(μ, σ²/n) by CLT. Standard error = σ/√n.
Sample proportion p̂: ~ N(p, p(1-p)/n) for large n. Used in A/B testing.
Difference of means X̄₁ − X̄₂: ~ N(μ₁−μ₂, σ₁²/n₁ + σ₂²/n₂). Basis of two-sample t-test.
Confidence intervals: X̄ ± z_{α/2} · σ/√n contains the true μ in approximately 100(1−α)% of repeated experiments (for example, 95% when α = 0.05).
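The repeated-experiments reading of a confidence interval can be simulated. In the sketch below the Uniform(0, 1) data, n = 40, the seed, and the function name coverage are assumptions for illustration, and σ is treated as known so that the z-interval from the list above applies directly.

```python
import math
import random
from statistics import NormalDist

def coverage(n, trials, alpha, rng):
    """Share of intervals x_bar +/- z * sigma / sqrt(n) that contain mu, for Uniform(0, 1) data."""
    mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of Uniform(0, 1)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        hits += abs(xbar - mu) <= half_width
    return hits / trials

rng = random.Random(2)  # fixed seed so the run is repeatable
cov = coverage(n=40, trials=4000, alpha=0.05, rng=rng)
print(f"empirical coverage of the 95% interval: {cov:.3f}")
```

The printed share should be close to 0.95, within simulation noise.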

The "standard error" vs "standard deviation" confusion: Standard deviation σ measures the spread of individual observations. Standard error SE = σ/√n measures the spread of the sample mean across repeated experiments. SE shrinks as 1/√n: doubling your sample size divides the uncertainty in the mean by √2 ≈ 1.41, a reduction of about 29%, and halving the uncertainty takes four times as much data.
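The 1/√n scaling is simple arithmetic. The value σ = 10 and the function name standard_error below are arbitrary choices for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for n i.i.d. observations with standard deviation sigma."""
    return sigma / math.sqrt(n)

sigma = 10.0
for n in (25, 50, 100):
    print(f"n={n:>3}  SE={standard_error(sigma, n):.3f}")

# Doubling n multiplies SE by 1/sqrt(2); quadrupling n halves it.
ratio_double = standard_error(sigma, 50) / standard_error(sigma, 25)
ratio_quadruple = standard_error(sigma, 100) / standard_error(sigma, 25)
print(f"SE ratio after doubling n:    {ratio_double:.4f}")
print(f"SE ratio after quadrupling n: {ratio_quadruple:.4f}")
```

The first ratio is 1/√2 ≈ 0.707, so doubling the sample removes roughly 29% of the standard error.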

6. When the CLT Fails

The CLT has precise conditions. Violations matter in practice:

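Heavy-tailed distributions (infinite variance): If Var(X) = ∞ (e.g., Pareto with tail index α ≤ 2, Cauchy distribution), the classical CLT does not apply. For tail index below 2, suitably normalized sums converge to a non-normal Lévy-stable distribution instead. Financial returns and internet traffic often have power-law tails; when those tails are heavy enough for the variance to be infinite, sample means do not converge to a normal distribution at the usual √n rate.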
Dependent observations: The classical CLT requires i.i.d. samples, which correlated time series data (stocks, climate records) violate. The CLT does generalize under weak dependence (for example, mixing processes), typically with σ² replaced by a long-run variance that accounts for autocorrelation.
Non-identically distributed: The Lindeberg and Lyapunov CLTs cover independent but non-identically distributed variables. Roughly, if no single variable contributes more than a negligible share of the total variance, the CLT still holds.

The Cauchy distribution has no mean or variance. The sample mean of n standard Cauchy random variables is itself Cauchy(0,1) distributed for every n, so averaging doesn't help. This is the extreme example of what happens when the CLT's assumptions fail.
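The failure of averaging for Cauchy data shows up clearly in simulation. The sketch below is illustrative: the function names, the seed, and the use of the interquartile range as a spread measure are choices made here (the variance does not exist, so it cannot be used). It generates standard Cauchy draws by inverse-CDF sampling and measures the spread of the sample mean for several n.

```python
import math
import random
import statistics

def cauchy_sample_mean(n, rng):
    """Mean of n standard Cauchy draws, generated as tan(pi * (U - 1/2)) with U uniform on [0, 1)."""
    return sum(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)) / n

def iqr_of_means(n, trials, rng):
    """Interquartile range of the sample mean across `trials` repeated experiments."""
    means = [cauchy_sample_mean(n, rng) for _ in range(trials)]
    q1, _median, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(3)  # fixed seed so the run is repeatable
results = {n: iqr_of_means(n, trials=2000, rng=rng) for n in (1, 10, 100)}
for n, iqr in results.items():
    print(f"n={n:>3}  IQR of sample mean = {iqr:.2f}")
```

A standard Cauchy variable has quartiles at -1 and +1, so its interquartile range is 2. The printed values should stay near 2 for every n instead of shrinking like 1/√n.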

7. Extensions: Multivariate and Functional CLT

Multivariate CLT: Let X₁, X₂, ..., Xₙ be i.i.d. random vectors in ℝᵈ with mean vector μ and covariance matrix Σ. Then:
  √n (X̄ₙ - μ) → N_d(0, Σ) (d-dimensional normal)

The multivariate normal N_d(μ, Σ) with invertible Σ has density:
  f(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-½(x-μ)ᵀ Σ⁻¹ (x-μ))

Applications: joint distribution of sample means of correlated variables, delta method for non-linear functions of sample means, multivariate regression.

Functional CLT (Donsker's Theorem, 1951): For i.i.d. Xᵢ with mean 0 and variance σ², define the partial-sum process:
  S_n(t) = (X₁+...+X_{⌊nt⌋}) / (σ√n), 0 ≤ t ≤ 1
Then S_n(·) → W(·) in distribution, where W is standard Brownian motion. The convergence holds in the Skorokhod space D[0,1], or in C[0,1] if the partial sums are linearly interpolated.

Implication: Brownian motion is the universal limit of scaled random walks. It is the foundation for stochastic differential equations, the Black-Scholes model, change-point detection, and the Brownian bridge (Kolmogorov-Smirnov distribution).
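The functional CLT can be probed with a simulated random walk. The sketch below is a rough check under assumptions chosen for illustration: steps of +1 or -1 (so the mean is 0 and σ = 1), n = 200 steps, 2000 paths, a fixed seed, and the name scaled_walk. It compares the second moments of S_n(t) with the Brownian motion values Var W(t) = t and Cov(W(s), W(t)) = min(s, t).

```python
import math
import random
import statistics

def scaled_walk(n, times, rng):
    """Return [S_n(t) for t in times] for one walk with +1/-1 steps (mean 0, sigma = 1)."""
    partial = [0]
    for _ in range(n):
        partial.append(partial[-1] + (1 if rng.random() < 0.5 else -1))
    return [partial[math.floor(n * t)] / math.sqrt(n) for t in times]

rng = random.Random(4)  # fixed seed so the run is repeatable
times = (0.25, 0.5, 1.0)
paths = [scaled_walk(200, times, rng) for _ in range(2000)]

# The steps have mean 0, so second moments about zero estimate variances and covariances.
variances = [statistics.fmean([p[i] ** 2 for p in paths]) for i in range(len(times))]
cov_half_one = statistics.fmean([p[1] * p[2] for p in paths])

for t, v in zip(times, variances):
    print(f"t={t:<4}  Var S_n(t) = {v:.3f}  (Brownian motion: {t})")
print(f"Cov(S_n(0.5), S_n(1)) = {cov_half_one:.3f}  (Brownian motion: 0.5)")
```

Matching variances and covariances is only a partial check, since Donsker's theorem concerns convergence of the whole path, but it shows the Brownian covariance structure emerging from a plain coin-flip walk.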

The CLT thus connects probability theory, statistical inference, stochastic processes, and mathematical physics into one unifying framework. The universality of the normal distribution isn't a coincidence — it's the mathematical consequence of averaging.
