Central Limit Theorem — Why Averages Follow a Bell Curve
The Central Limit Theorem is arguably the most important theorem in
statistics. It says that the mean of a large sample of independent
draws from any distribution with finite variance is approximately
normally distributed, regardless of whether the underlying
distribution is uniform, exponential, heavily skewed, or otherwise
non-normal. This single fact underpins the z-test, t-test, confidence
intervals, regression, and most of classical inferential statistics.
1. Formal Statement
Classical CLT: Let X₁, X₂, ..., Xₙ be i.i.d. random variables with
mean μ = E[Xᵢ] and variance σ² = Var(Xᵢ) < ∞. Define the standardized
sample mean:

    Zₙ = (X̄ₙ - μ) / (σ/√n),   where X̄ₙ = (X₁ + ... + Xₙ)/n

Then:

    Zₙ ⟶ N(0, 1) in distribution as n → ∞

Equivalently:

    √n (X̄ₙ - μ) ⟶ N(0, σ²)

Key consequences:
• X̄ₙ is approximately N(μ, σ²/n) for large n
• 95% confidence interval: x̄ ± 1.96·σ/√n
• If σ is unknown: x̄ ± t_(n-1, 0.025)·s/√n (t-distribution)

Rule of thumb: n ≥ 30 is often "large enough" for well-behaved
distributions. Heavy-tailed or highly skewed distributions may need
n ≥ 100 or more.
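The statement can be checked empirically. The sketch below is an illustration only: it assumes Exponential(1) data (a skewed distribution with μ = 1 and σ = 1), and the function names, seed, and replication count are arbitrary choices. It estimates how often the standardized mean Zₙ falls inside ±1.96, which should approach 0.95 if Zₙ is close to N(0, 1).

```python
import math
import random


def standardized_mean(n, rng):
    """Draw n Exponential(1) values and return Z_n = (mean - mu) / (sigma / sqrt(n))."""
    mu, sigma = 1.0, 1.0  # Exponential(1) has mean 1 and standard deviation 1
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))


def coverage(n, reps=5000, seed=0):
    """Fraction of replications with |Z_n| <= 1.96."""
    rng = random.Random(seed)
    hits = sum(abs(standardized_mean(n, rng)) <= 1.96 for _ in range(reps))
    return hits / reps


for n in (2, 30, 200):
    print(n, round(coverage(n), 3))
```

The printed fractions are simulation estimates, so they fluctuate with the seed; the point is only that they settle near 0.95 as n grows.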
2. Proof via Characteristic Functions
The slickest proof uses
characteristic functions (Fourier transforms of
probability distributions). The characteristic function of a random
variable X is φ_X(t) = E[e^{itX}].
Proof sketch (via Lévy's continuity theorem):

1. Let Yᵢ = (Xᵢ - μ)/σ be the standardized variables, so E[Yᵢ] = 0 and
   Var(Yᵢ) = 1. Then Zₙ = (Y₁ + ... + Yₙ)/√n.

2. Characteristic function of the standardized sum, using the fact
   that the Yᵢ are i.i.d.:
       φ_{Zₙ}(t) = [φ_Y(t/√n)]ⁿ

3. Taylor-expand φ_Y around t = 0. A finite second moment gives
       φ_Y(t) = 1 + it·E[Y] - (t²/2)·E[Y²] + o(t²)
   Since E[Y] = 0 and E[Y²] = 1, and log(1 + z) = z + O(z²):
       log φ_Y(t) = -t²/2 + o(t²)
   (The remainder improves to O(t³) only if E|Y|³ < ∞, which the CLT
   does not require.)

4. Substitute t/√n for t, with t fixed:
       log φ_{Zₙ}(t) = n · log φ_Y(t/√n)
                     = n · (-t²/(2n) + o(1/n))
                     = -t²/2 + o(1) → -t²/2 as n → ∞

5. Therefore φ_{Zₙ}(t) → e^{-t²/2} for every t, which is exactly the
   characteristic function of N(0, 1).

6. By Lévy's continuity theorem, Zₙ → N(0, 1) in distribution. □
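The limit in the proof can be watched numerically for one concrete case. The sketch below assumes Y = X - 1 with X ~ Exponential(1), chosen only because its characteristic function has the closed form e^{-it}/(1 - it); the function names are illustrative. It evaluates φ_Y(t/√n)ⁿ and measures the gap to e^{-t²/2}.

```python
import cmath
import math


def phi_y(t):
    """Characteristic function of Y = X - 1, where X ~ Exponential(1).

    X has characteristic function 1 / (1 - it); shifting by -1 multiplies
    it by e^{-it}. Y has mean 0 and variance 1, as the proof requires.
    """
    return cmath.exp(-1j * t) / (1 - 1j * t)


def phi_zn(t, n):
    """Characteristic function of Z_n = (Y_1 + ... + Y_n) / sqrt(n)."""
    return phi_y(t / math.sqrt(n)) ** n


def gap(t, n):
    """Distance between phi_{Z_n}(t) and the N(0, 1) characteristic function."""
    return abs(phi_zn(t, n) - math.exp(-t * t / 2))


for n in (1, 10, 100, 10000):
    print(n, gap(1.5, n))
```

The gap shrinks as n grows. For this particular distribution the third moment is finite, so the decay is on the order of 1/√n; that rate is a property of the example, not of the CLT in general.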
3. Convergence Rate: Berry-Esseen Theorem
The CLT says convergence happens, but not how fast. The Berry-Esseen
theorem provides a quantitative bound:
Berry-Esseen (1941/1942):

    sup_x |P(Zₙ ≤ x) - Φ(x)| ≤ C · ρ / (σ³ √n)

where:
    ρ = E[|X - μ|³]  (third absolute central moment, assumed finite)
    σ² = Var(X)
    Φ = standard normal CDF
    C ≤ 0.4748  (best known constant, Shevtsova 2011)

Example: Bernoulli(p) variable
    σ² = p(1-p),  ρ = p(1-p)·[p² + (1-p)²]
    Max error ≤ 0.4748 × [p² + (1-p)²] / √(p(1-p)) × 1/√n
              ≈ 1/(2√n) for p near 0.5
    At n = 100:  max CDF error ≤ 0.048 (about 5%)
    At n = 1000: max CDF error ≤ 0.016 (1.6%)

Practical implication: n = 30 often works well for symmetric unimodal
distributions; for skewed distributions like the Exponential, n = 100
or more is more reliable.
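For Bernoulli data the bound can be compared with the exact worst-case CDF error, because the standardized sum is a rescaled Binomial(n, p). The sketch below is illustrative: it uses the constant 0.4748 quoted above and the third absolute central moment E|X - p|³ = p(1-p)[p² + (1-p)²], and the function names are made up for this example.

```python
import math
from statistics import NormalDist

C = 0.4748  # constant quoted in the text


def berry_esseen_bound(p, n):
    """Berry-Esseen upper bound for the mean of n Bernoulli(p) draws."""
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * (p ** 2 + (1 - p) ** 2)  # E|X - p|^3
    return C * rho / (sigma ** 3 * math.sqrt(n))


def exact_max_cdf_error(p, n):
    """sup_x |P(Z_n <= x) - Phi(x)| for the standardized Binomial(n, p) sum."""
    phi = NormalDist().cdf
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        z = (k - mean) / sd
        worst = max(worst, abs(cdf - phi(z)))  # just below the jump at k
        cdf += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        worst = max(worst, abs(cdf - phi(z)))  # at the jump itself
    return worst


for n in (100, 1000):
    print(n, exact_max_cdf_error(0.5, n), berry_esseen_bound(0.5, n))
```

Because the binomial CDF is a step function, the supremum only needs to be checked on both sides of each jump. The exact error stays below the bound, which is a worst-case guarantee rather than a sharp estimate.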
4. The Galton Board
The Galton board (quincunx), invented by Sir Francis
Galton around 1876, is a physical demonstration of the CLT. A ball
falls through a triangular array of pins; at each pin it deflects left
or right with equal probability. The accumulated balls at the bottom
follow a binomial distribution that, with many rows, approximates a
normal distribution (N(0,1) after standardization).
Mathematical connection:
    With n rows, the ball's horizontal position is the sum of n
    independent steps, each +1 (right) or -1 (left) with probability 1/2.
    Number of right deflections: K ~ Binomial(n, 0.5); position = 2K - n.
    By the CLT: Binomial(n, p) ≈ N(np, np(1-p)) for large n.
    Standardized: (K - np)/√(np(1-p)) → N(0, 1) in distribution.
    Pascal's triangle: entry C(n, k) = number of paths reaching peg (n, k).
    Height of bin k ∝ C(n, k), which traces out the bell shape of the
    normal distribution.

The Galton board makes convergence to the normal distribution visceral:
each ball's final position is the sum of n independent random variables.
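A Galton board is easy to simulate. The sketch below is a minimal illustration with assumed parameters (12 rows, 20,000 balls, a fixed seed): each ball's bin is its number of right deflections, and the observed bin frequencies are printed next to the exact binomial probabilities C(n, k)/2ⁿ and the matching normal density with mean n/2 and standard deviation √n/2.

```python
import math
import random
from collections import Counter


def galton_counts(rows, balls, seed=0):
    """Drop `balls` balls through `rows` rows of pins; count balls per bin."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(balls):
        k = sum(rng.random() < 0.5 for _ in range(rows))  # right deflections
        counts[k] += 1
    return counts


def binomial_pmf(rows, k):
    """Exact probability of landing in bin k: C(rows, k) / 2^rows."""
    return math.comb(rows, k) / 2 ** rows


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


rows, balls = 12, 20000
counts = galton_counts(rows, balls)
for k in range(rows + 1):
    print(k,
          round(counts[k] / balls, 4),
          round(binomial_pmf(rows, k), 4),
          round(normal_pdf(k, rows / 2, math.sqrt(rows) / 2), 4))
```

Even with only 12 rows the three columns are close in the central bins; the agreement in the tails improves as rows are added.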
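More generally, a quantity that arises as the sum of many small,
independent contributions, none of which dominates, tends to be
approximately normally distributed. This is the usual explanation for
why heights, measurement errors, blood pressure, and many other
naturally occurring quantities are roughly bell-shaped. (IQ scores are
bell-shaped partly by construction, since the scale is normed to a
normal distribution.)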
5. Sampling Distributions in Statistics
The CLT is the foundation for
sampling distributions — the distributions of
statistics computed from samples:
• Sample mean X̄: approximately N(μ, σ²/n) by the CLT. Standard error
  = σ/√n.
• Sample proportion p̂: approximately N(p, p(1-p)/n) for large n.
  Used in A/B testing.
• Difference of means X̄₁ − X̄₂: approximately N(μ₁−μ₂, σ₁²/n₁ +
  σ₂²/n₂) for independent samples. Basis of the two-sample t-test.
• Confidence intervals: X̄ ± z_{α/2} · σ/√n contains the true μ in
  about 100(1−α)% of repeated experiments.
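As a sketch of how the sample-proportion result is used in A/B testing, the code below implements a pooled two-proportion z-test and a normal-approximation confidence interval. The conversion counts in the usage lines are hypothetical numbers invented for illustration, and the function names are not from any particular library.

```python
import math
from statistics import NormalDist


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def proportion_ci(conv, n, alpha=0.05):
    """Normal-approximation (Wald) interval: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    p_hat = conv / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


# Hypothetical data: 100/1000 conversions in arm A, 130/1000 in arm B.
z, p_value = two_proportion_z(100, 1000, 130, 1000)
print(round(z, 2), round(p_value, 3))
print(proportion_ci(100, 1000))
```

For these made-up counts z is roughly 2.10 and the two-sided p-value roughly 0.035. The Wald interval relies on the normal approximation and can be poor when p̂ is near 0 or 1 or when n is small.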
The "standard error" vs "standard deviation" confusion:
Standard deviation σ measures the spread of individual observations.
Standard error SE = σ/√n measures the spread of the
sample mean across repeated experiments. SE shrinks as 1/√n:
doubling your sample size divides the uncertainty in the mean by
√2 ≈ 1.41, a reduction of about 29%, and halving it takes four times
as much data.
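The distinction can be made concrete by simulation. The sketch below assumes Uniform(0, 1) data, for which σ = √(1/12); the function name, seed, and replication count are arbitrary. It estimates the spread of the sample mean across repeated experiments and compares it with σ/√n.

```python
import math
import random
import statistics


def sd_of_sample_means(n, reps=4000, seed=0):
    """Empirical standard deviation of the mean of n Uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.uniform(0, 1) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)


sigma = math.sqrt(1 / 12)  # standard deviation of a single Uniform(0, 1) draw
for n in (25, 50, 100):
    print(n, round(sd_of_sample_means(n), 4), round(sigma / math.sqrt(n), 4))
```

The spread of individual observations stays at σ regardless of n, while the spread of the mean tracks σ/√n: going from n = 25 to n = 100 roughly halves it.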
6. When the CLT Fails
The CLT has precise conditions. Violations matter in practice:
Heavy-tailed distributions (infinite variance): If
Var(X) = ∞ (e.g., Pareto with tail index α ≤ 2, or the Cauchy
distribution), the classical CLT does not apply. For tail index
α < 2, suitably normalized sums converge to a non-normal
Lévy-stable distribution instead. Financial returns and internet
traffic are often modeled with power-law tails; when the variance is
infinite, their sample means do not converge to a normal distribution.
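Dependent observations: The classical CLT requires
i.i.d. samples. Correlated time series data (stocks, climate
records) violate independence, and the usual σ/√n standard error can
be badly wrong. The CLT does generalize under weak dependence, for
example to stationary processes satisfying mixing conditions, but the
limiting variance then includes the autocovariances.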
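Non-identically distributed: The Lindeberg-Feller and Lyapunov CLTs
cover the independent but non-identically distributed (i.n.i.d.)
case: roughly, if no single variable contributes more than a
negligible share of the total variance, the normalized sum is still
asymptotically normal.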
The Cauchy distribution has no mean or variance: the sample mean of
n i.i.d. standard Cauchy random variables is itself Cauchy(0,1)
distributed for every n. Averaging doesn't help. This is the
extreme counter-example to the CLT.
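The Cauchy counter-example is easy to see in simulation. The sketch below draws standard Cauchy values through the inverse CDF tan(π(U - 1/2)) and measures the interquartile range (IQR) of the sample mean for several n; the function names, seed, and replication count are illustrative. The IQR is used because the variance does not exist, and a standard Cauchy has IQR exactly 2.

```python
import math
import random
import statistics


def cauchy_draw(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr_of_sample_means(n, reps=2000, seed=0):
    """Interquartile range of the mean of n standard Cauchy draws."""
    rng = random.Random(seed)
    means = [sum(cauchy_draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (1, 10, 500):
    print(n, round(iqr_of_sample_means(n), 2))
```

For a finite-variance distribution the IQR of the mean would shrink like 1/√n. Here it stays near 2 however large n gets.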
7. Extensions: Multivariate and Functional CLT
Multivariate CLT: Let X₁, X₂, ..., Xₙ be i.i.d. random vectors in ℝᵈ
with mean vector μ and covariance matrix Σ. Then:

    √n (X̄ₙ - μ) → N_d(0, Σ)   (d-dimensional normal)

The multivariate normal N_d(μ, Σ) has density (for positive-definite Σ):

    f(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-½ (x-μ)ᵀ Σ⁻¹ (x-μ))

Applications: joint distribution of sample means of correlated
variables, delta method for non-linear functions of sample means,
multivariate regression.
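A two-dimensional check is sketched below under assumed inputs: the vectors are X = (U, U + V) with U, V independent Exponential(1), so μ = (1, 2) and Σ = [[1, 1], [1, 2]] with Σ⁻¹ = [[2, -1], [-1, 1]]. If Z = √n (X̄ₙ - μ) is close to N₂(0, Σ), then ZᵀΣ⁻¹Z is close to chi-square with 2 degrees of freedom, whose 95% quantile is -2·ln(0.05) ≈ 5.99. All names and parameters here are illustrative.

```python
import math
import random

MU = (1.0, 2.0)  # mean of (U, U + V) for independent Exponential(1) U, V


def scaled_mean_error(n, rng):
    """Return Z = sqrt(n) * (sample mean - MU) for n correlated 2-d draws."""
    s1 = s2 = 0.0
    for _ in range(n):
        u, v = rng.expovariate(1.0), rng.expovariate(1.0)
        s1 += u
        s2 += u + v
    root_n = math.sqrt(n)
    return root_n * (s1 / n - MU[0]), root_n * (s2 / n - MU[1])


def inside_95_ellipse(z):
    """True if z^T Sigma^{-1} z is below the chi-square(2) 95% quantile."""
    z1, z2 = z
    d2 = 2 * z1 * z1 - 2 * z1 * z2 + z2 * z2  # uses Sigma^{-1} = [[2,-1],[-1,1]]
    return d2 <= -2 * math.log(0.05)


def ellipse_coverage(n=100, reps=4000, seed=0):
    rng = random.Random(seed)
    return sum(inside_95_ellipse(scaled_mean_error(n, rng)) for _ in range(reps)) / reps


print(round(ellipse_coverage(), 3))
```

The covariance of Z equals Σ for every n, so covariance alone would not test the theorem. The ellipse coverage depends on the joint shape of the distribution, which is what the multivariate CLT describes.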

Functional CLT (Donsker's Theorem, 1951):
    Define the partial-sum process for i.i.d. Xᵢ with mean μ and
    variance σ²:

        S_n(t) = (X₁ + ... + X_{⌊nt⌋} - ⌊nt⌋·μ) / (σ√n),   0 ≤ t ≤ 1

    Then S_n(·) → W(·) in distribution, where W is standard Brownian
    motion. The convergence holds in the Skorokhod space D[0,1] for this
    step-function version, or in C[0,1] if S_n is linearly interpolated
    between the jump points.

Implication: Brownian motion is the universal limit of scaled random
walks. Foundation for stochastic differential equations, the
Black-Scholes model, change-point detection, and the Brownian bridge
(Kolmogorov-Smirnov distribution).
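One signature of Brownian motion is its covariance, Cov(W(s), W(t)) = min(s, t). The sketch below checks that signature for a scaled random walk, assuming ±1 steps (so μ = 0 and σ = 1) and evaluating the walk at t = 0.5 and t = 1; the names and parameters are illustrative. It is a consistency check on second moments, not a demonstration of convergence of the whole path.

```python
import math
import random


def scaled_walk(n, rng, times=(0.5, 1.0)):
    """Return S_n(t) = (X_1 + ... + X_floor(nt)) / sqrt(n) for +/-1 steps."""
    partial = [0]
    for _ in range(n):
        partial.append(partial[-1] + (1 if rng.random() < 0.5 else -1))
    return [partial[math.floor(n * t)] / math.sqrt(n) for t in times]


def second_moments(n=200, reps=4000, seed=0):
    """Estimate Var(S_n(0.5)), Var(S_n(1)) and Cov(S_n(0.5), S_n(1))."""
    rng = random.Random(seed)
    draws = [scaled_walk(n, rng) for _ in range(reps)]
    var_half = sum(a * a for a, _ in draws) / reps  # the walk has mean zero
    var_one = sum(b * b for _, b in draws) / reps
    cov = sum(a * b for a, b in draws) / reps
    return var_half, var_one, cov


print(second_moments())
```

The three estimates should sit near 0.5, 1.0, and 0.5, matching Var(W(t)) = t and Cov(W(0.5), W(1)) = 0.5.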
The CLT thus connects probability theory, statistical inference,
stochastic processes, and mathematical physics into one unifying
framework. The universality of the normal distribution isn't a
coincidence — it's the mathematical consequence of averaging.