Spotlight #43 – Statistics & Probability: Distributions, Hypothesis Testing, Bayesian Inference and Markov Chains

Statistics is the mathematics of uncertainty. Whether you are measuring a physical constant, analysing clinical trial outcomes, or training a machine-learning model, you are drawing inferences from finite noisy data under a probabilistic model. This Spotlight develops the quantitative tools — distributions and moments, the Central Limit Theorem, frequentist and Bayesian inference, and Markov chain convergence — that underpin every branch of quantitative science.

Probability theory began as a calculus of gambling outcomes (Pascal, Fermat, Huygens, 1654–1657) and was placed on a rigorous axiomatic foundation by Kolmogorov in 1933. Today it is the language shared by physics (quantum measurement, statistical mechanics), engineering (signal detection, reliability), biology (population genetics, sequencing), economics (option pricing, auction theory) and computer science (algorithms, machine learning). This Spotlight summarises six core topics, each illustrated by an interactive simulation.

1. Probability Distributions

A probability distribution is a function that assigns a probability (or probability density) to each possible outcome. The two main families are discrete (countable outcomes) and continuous (outcomes on an interval).

Discrete Distributions

Continuous Distributions

Moments and Moment-Generating Function

E[X]     = ∫ x f(x) dx          (mean / first raw moment)
Var[X]   = E[X²] − (E[X])²      (variance)
Skew[X]  = E[(X−µ)³] / σ³      (skewness)
Kurt[X]  = E[(X−µ)⁴] / σ⁴ − 3  (excess kurtosis)

MGF: M_X(t) = E[e^{tX}]
  → k-th moment = d^k M_X/dt^k |_{t=0}
  → For Normal: M(t) = exp(µt + ½σ²t²)
  → If X,Y independent: M_{X+Y}(t) = M_X(t) · M_Y(t)
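
The moment formulas above can be checked numerically. The following is a minimal sketch, assuming the outcomes of a fair six-sided die are treated as a complete, equally weighted population; the function name `sample_moments` is illustrative and not part of the original text.

```python
import math

def sample_moments(xs):
    """Return (mean, variance, skewness, excess kurtosis) using 1/n formulas."""
    n = len(xs)
    mean = sum(xs) / n
    # Central moments m_k = (1/n) * sum (x - mean)^k
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    sigma = math.sqrt(m2)
    skew = m3 / sigma ** 3
    excess_kurt = m4 / sigma ** 4 - 3.0
    return mean, m2, skew, excess_kurt

# Outcomes of a fair die, each equally likely
die = [1, 2, 3, 4, 5, 6]
mean, var, skew, kurt = sample_moments(die)
print(f"mean={mean:.4f} var={var:.4f} skew={skew:.4f} excess kurtosis={kurt:.4f}")
```

The symmetric die has zero skewness and negative excess kurtosis, reflecting tails that are lighter than those of a Normal distribution.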

2. Law of Large Numbers and Central Limit Theorem

Two fundamental convergence theorems govern the behaviour of sample means as sample size n grows:

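The Law of Large Numbers (LLN) says that the sample mean X̅_n converges to the population mean µ as n → ∞, provided µ exists.

The Central Limit Theorem (CLT) says more: for i.i.d. summands with finite variance (and more generally under the Lindeberg condition, which requires that no single summand dominates the variance), the standardised mean √n(X̅_n − µ)/σ converges in distribution to N(0,1). Remarkably, the parent distribution can be discrete, bounded, or heavily skewed; as long as its variance is finite, the standardised sum approaches a Gaussian.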

Central Limit Theorem

Let X₁, X₂, …, X_n be i.i.d.  with mean µ, variance σ² < ∞.

S_n = X₁ + … + X_n

Then  (S_n − nµ) / (σ√n)  → N(0,1)  in distribution

Equivalently, for large n:  X̅_n = S_n/n  is approximately N(µ, σ²/n)

Standard error of the mean: SE = σ / √n
  → Halving the standard error requires 4× as much data
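
The σ/√n scaling can be observed in a small simulation. This is a sketch under stated assumptions: the parent distribution is Exponential(1) (mean 1, standard deviation 1, strongly skewed), and the seed, sample sizes and number of trials are arbitrary illustrative choices.

```python
import math
import random
import statistics

def sample_means(n, trials, rng):
    """Means of `trials` independent samples, each of n Exponential(1) draws."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)]

rng = random.Random(0)      # fixed seed so the run is repeatable
trials = 4000               # illustrative choice, not from the text
mu, sigma = 1.0, 1.0        # mean and sd of Exponential(1)

results = {}
for n in (25, 100):
    means = sample_means(n, trials, rng)
    predicted_se = sigma / math.sqrt(n)
    observed_se = statistics.pstdev(means)
    # Fraction of standardised means inside the central 95 % Normal range
    inside = sum(abs((m - mu) / predicted_se) < 1.96 for m in means) / trials
    results[n] = (predicted_se, observed_se, inside)
    print(f"n={n:4d}  predicted SE={predicted_se:.4f}  "
          f"observed SE={observed_se:.4f}  within ±1.96: {inside:.3f}")
```

Quadrupling n from 25 to 100 should roughly halve the observed spread of the sample means, even though individual draws are far from Normal.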

The CLT is why the Normal distribution dominates statistics: any quantity that is the sum of many small independent contributions with finite variance will be approximately Normal, regardless of the underlying distribution of those contributions. This helps explain the bell-curve shape of measurement errors (Gauss 1809), the approximately Normal distribution of heights and IQ scores, and the thermal fluctuations of macroscopic systems.

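CLT limitations: convergence can be slow for skewed or heavy-tailed distributions, and the theorem does not apply at all when the variance is infinite (e.g., the Cauchy distribution, which has no mean: its sample mean never settles down, however large n becomes). The Berry–Esséen theorem bounds the maximum error in the Normal approximation of the CDF as O(n^{−1/2}), with a constant proportional to ρ/σ³, where ρ = E|X − µ|³ is the third absolute central moment of the parent distribution.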

3. Hypothesis Testing and p-Values

Frequentist hypothesis testing asks: “If the null hypothesis H₀ were true, how surprising would the observed data be?” The surprise is quantified by the p-value: the probability of observing a test statistic at least as extreme as the one observed, under H₀.

The Neyman–Pearson Framework

Error Types & Statistical Power

                    H₀ is TRUE          H₀ is FALSE
Reject H₀           Type I error (α)    Correct (power = 1−β)
Fail to reject H₀   Correct             Type II error (β)

Power = P(reject H₀ | H₀ is false) = 1 − β

One-sample z-test:  z = (X̅ − µ₀) / (σ / √n)
One-sample t-test:  t = (X̅ − µ₀) / (s / √n),  df = n − 1

Power increases with: larger n, larger effect size, larger α, smaller σ
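
The z-test and the power statement can be made concrete with the standard library. The following is a minimal sketch, assuming a two-sided test with known σ; the numbers (µ₀ = 100, σ = 15, observed mean 103) are hypothetical, and the function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)

def z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test; returns (z, p_value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    return z, p

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of the two-sided z-test when the true mean is mu0 + effect."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect / (sigma / math.sqrt(n))
    # Reject when |Z + shift| > z_crit, with Z ~ N(0, 1)
    upper = 1.0 - STD_NORMAL.cdf(z_crit - shift)
    lower = STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower

z, p = z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
power_small_n = z_test_power(effect=3.0, sigma=15.0, n=100)
power_large_n = z_test_power(effect=3.0, sigma=15.0, n=400)
print(f"z={z:.2f}  p={p:.4f}")
print(f"power at n=100: {power_small_n:.3f}, at n=400: {power_large_n:.3f}")
```

With the effect size held fixed, quadrupling n raises the power from roughly one half to well above 0.9, which illustrates the first item in the list of power drivers.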

Multiple Comparisons

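When testing m independent hypotheses at level α, the probability of at least one false positive is 1−(1−α)^m, which approaches 1 as m grows. This is the familywise error rate (FWER) problem. Solutions include tightening the per-test threshold, for example with the Bonferroni correction, which tests each hypothesis at level α/m.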
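
A quick numerical check shows how fast the familywise error rate grows with m. The sketch below also evaluates the rate after a Bonferroni adjustment (per-test level α/m), a standard correction used here purely as an illustration; the function names are hypothetical.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_level(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m

alpha = 0.05
for m in (1, 10, 100):
    uncorrected = familywise_error_rate(alpha, m)
    corrected = familywise_error_rate(bonferroni_level(alpha, m), m)
    print(f"m={m:3d}  FWER uncorrected={uncorrected:.4f}  with Bonferroni={corrected:.4f}")
```

The corrected rate stays at or below α for every m, at the cost of reduced power for each individual test.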

4. Bayesian Inference

Bayesian inference treats probability as a measure of degree of belief rather than long-run frequency. It provides a principled mechanism for updating beliefs in the light of new evidence via Bayes’ theorem:

Bayes’ Theorem

P(θ | data) = P(data | θ) × P(θ) / P(data)

P(θ | data)  = posterior distribution over parameter θ
P(data | θ)  = likelihood of data given θ
P(θ)         = prior distribution (encodes existing knowledge)
P(data)      = marginal likelihood (normalising constant)

Conjugate priors (prior and posterior belong to the same family):
  Beta(α, β) prior + Binomial(n, θ) data with k successes  →  Beta(α+k, β+n−k)
  N(µ₀, σ₀²) prior + N(θ, σ²) data                         →  N(µ_n, σ_n²)  (posterior Normal)
  Gamma(α, β) prior + Poisson(θ) counts x₁, …, x_n         →  Gamma(α+∑x_i, β+n)
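
The Beta and Binomial pairing reduces Bayesian updating to simple arithmetic on two parameters. The following is a minimal sketch, assuming a uniform Beta(1, 1) prior and hypothetical data of 7 successes in 10 trials; the function names are illustrative.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing `successes` in `trials`."""
    return alpha + successes, beta + (trials - successes)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1.0, 1.0)  # uniform prior on theta
posterior = beta_binomial_update(*prior, successes=7, trials=10)

# Updating in two batches gives the same posterior as one combined update
step1 = beta_binomial_update(*prior, successes=3, trials=4)
step2 = beta_binomial_update(*step1, successes=4, trials=6)

print("posterior parameters:", posterior)
print("posterior mean:", round(beta_mean(*posterior), 4))
```

The batch check illustrates that the posterior from one experiment can serve as the prior for the next without changing the final answer.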

Credible Intervals vs Confidence Intervals

A frequentist 95 % confidence interval (CI) does not mean there is 95 % probability that θ lies in the interval for the current sample — θ is fixed (not random in the frequentist framework). The correct interpretation is: if we repeated the experiment infinitely many times and computed an interval each time, 95 % of those intervals would contain the true θ.
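
The repeated-sampling interpretation can be checked directly by simulation. This is a sketch under stated assumptions: Normal data with known σ, a z-interval with critical value 1.96, and arbitrary illustrative values for the true mean, sample size and number of repeated experiments.

```python
import math
import random

def z_interval(xs, sigma, z_crit=1.96):
    """95 % confidence interval for the mean when sigma is known."""
    n = len(xs)
    mean = sum(xs) / n
    half_width = z_crit * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

def coverage(true_mu, sigma, n, experiments, rng):
    """Fraction of repeated experiments whose interval contains true_mu."""
    hits = 0
    for _ in range(experiments):
        xs = [rng.gauss(true_mu, sigma) for _ in range(n)]
        lo, hi = z_interval(xs, sigma)
        hits += lo <= true_mu <= hi
    return hits / experiments

rng = random.Random(1)  # fixed seed so the run is repeatable
rate = coverage(true_mu=10.0, sigma=2.0, n=20, experiments=5000, rng=rng)
print(f"empirical coverage: {rate:.3f}")
```

Each individual interval either contains the true mean or it does not; the 95 % figure describes the long-run fraction of intervals that do.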

A Bayesian 95 % credible interval does mean exactly what it says: P(θ ∈ CI | data) = 0.95. This is typically the more natural interpretation for practitioners.

Markov Chain Monte Carlo (MCMC)

For complex models where the posterior cannot be computed analytically, MCMC algorithms generate samples from the posterior. The Metropolis–Hastings algorithm proposes candidate θ′ from a proposal distribution q(θ′|θ), then accepts with probability min(1, [p(θ′|data) q(θ|θ′)] / [p(θ|data) q(θ′|θ)]). Over many iterations the chain converges to the target posterior. Modern variants (HMC, NUTS) use gradient information to propose more efficient moves in high-dimensional parameter spaces.
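
The acceptance rule can be sketched in a few lines. The example below is a random-walk Metropolis sampler, the special case of Metropolis–Hastings in which the proposal is symmetric so the q terms cancel. The target is assumed to be an unnormalised Beta(8, 4) posterior (chosen because its exact mean, 8/12, is known); the step size, chain length and burn-in are illustrative choices.

```python
import math
import random

def log_target(theta):
    """Unnormalised log posterior: Beta(8, 4), i.e. theta^7 (1 - theta)^3."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside (0, 1)
    return 7.0 * math.log(theta) + 3.0 * math.log(1.0 - theta)

def metropolis(log_p, start, step, n_samples, rng):
    """Random-walk Metropolis with a Gaussian proposal of width `step`."""
    theta = start
    log_p_theta = log_p(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_p_prop = log_p(proposal)
        # Symmetric proposal: accept with probability min(1, p(prop)/p(theta))
        log_ratio = log_p_prop - log_p_theta
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            theta, log_p_theta = proposal, log_p_prop
        samples.append(theta)  # a rejected move repeats the current state
    return samples

rng = random.Random(42)
samples = metropolis(log_target, start=0.5, step=0.2, n_samples=20000, rng=rng)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(f"MCMC posterior mean: {posterior_mean:.3f} (exact: {8 / 12:.3f})")
```

Working with log densities avoids numerical underflow, and only ratios of the target are needed, so the normalising constant P(data) never has to be computed.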

5. Markov Chains and Stationary Distributions

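A discrete-time Markov chain is a sequence of random variables X_0, X_1, … with the Markov property: P(X_{n+1}=j | X_n=i, X_{n−1}, …, X_0) = P_ij, so the next state depends only on the current state. Writing the transition probability as P_ij with no dependence on n assumes the chain is time-homogeneous.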

Transition Matrix & Stationary Distribution

Transition matrix P: P_ij = P(X_{n+1}=j | X_n=i),  ∑_j P_ij = 1

n-step transition:  (P^n)_ij = P(X_n=j | X_0=i)   (entry of the n-th matrix power)
Chapman-Kolmogorov: P^{m+n} = P^m · P^n

Stationary distribution π: π P = π,  ∑_i π_i = 1
  → solve (P^T − I) π^T = 0 subject to ∑π_i = 1

Detailed balance (for reversible chains): π_i P_ij = π_j P_ji
  → sufficient but not necessary for π to be stationary
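
For a small chain the stationary distribution can be found by repeatedly applying π ← πP. The following is a minimal sketch, assuming an irreducible, aperiodic chain (so the iteration converges); the two-state transition matrix is a hypothetical example.

```python
def step(pi, P):
    """One update pi <- pi P for a row vector pi and row-stochastic matrix P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi P from the uniform distribution until it stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = step(pi, P)
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("power iteration did not converge")

# Hypothetical two-state chain; each row sums to 1
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
print("stationary distribution:", [round(x, 6) for x in pi])

# Detailed balance check: pi_0 P_01 should equal pi_1 P_10
flow_01 = pi[0] * P[0][1]
flow_10 = pi[1] * P[1][0]
```

For this matrix the result agrees with the closed form (5/6, 1/6), and the two probability flows match, so the chain is reversible.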

PageRank

Google’s original PageRank algorithm (Brin & Page, 1998) models a random surfer on the web graph. At each page the surfer follows a random link with probability d (≈ 0.85) or teleports to a random page with probability 1−d. The stationary distribution of this Markov chain assigns higher probability to pages that are linked to by other high-probability pages. The ranking vector is the leading eigenvector (eigenvalue 1) of the damped transition matrix, often called the Google matrix.

PageRank Power Iteration

PR_i = (1 − d)/N + d · ∑_{j→i} PR_j / L_j

  d   = damping factor (≈ 0.85)
  N   = total number of pages
  L_j = number of outbound links from page j

Iterated to convergence: start from uniform PR_i = 1/N
  Power iteration updates all PR simultaneously each step
  Convergence criterion: ||PR_{new} − PR_{old}||₁ < 10⁻⁶
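
The update rule and stopping criterion above translate directly into code. This is a sketch on a hypothetical four-page graph, and it assumes every page has at least one outbound link (pages with no outbound links would need separate handling that the formula above does not specify).

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=1000):
    """Power iteration for PageRank; `links` maps each page to its outbound links."""
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}           # uniform start
    for _ in range(max_iter):
        new = {p: (1.0 - d) / n for p in pages}  # teleportation term
        for j, outs in links.items():
            share = d * pr[j] / len(outs)        # d * PR_j / L_j
            for i in outs:
                new[i] += share
        delta = sum(abs(new[p] - pr[p]) for p in pages)  # L1 change
        pr = new
        if delta < tol:
            break
    return pr

# Hypothetical web graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.4f}")
```

Page D has no inbound links, so its score is just the teleportation share (1 − d)/N, while page C, linked from three pages, ranks highest.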

6. Maximum Likelihood Estimation and Error Propagation

Maximum likelihood estimation (MLE) finds the parameter value θ̂ that makes the observed data most probable under the assumed model: θ̂ = arg max_θ L(θ; x), where L(θ; x) = P(X=x; θ) is the likelihood function (for continuous data, the probability density evaluated at x). In practice we maximise log L (the log-likelihood) for numerical stability.
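
As an illustration, consider i.i.d. Exponential(rate) data, for which the log-likelihood is n·log(rate) − rate·∑x and the maximiser has the closed form n/∑x. The sketch below compares that closed form with a brute-force grid search; the exponential model, the data values and the grid are all assumptions made for the example.

```python
import math

def exp_log_likelihood(rate, xs):
    """log L(rate; xs) for i.i.d. Exponential(rate): n log(rate) - rate * sum(xs)."""
    return len(xs) * math.log(rate) - rate * sum(xs)

def exp_mle(xs):
    """Closed-form maximiser of the exponential log-likelihood: n / sum(xs)."""
    return len(xs) / sum(xs)

data = [0.5, 1.2, 0.3, 2.0, 1.0]  # hypothetical waiting times
rate_hat = exp_mle(data)

# Brute-force check: evaluate the log-likelihood on a grid of candidate rates
grid = [0.01 * k for k in range(1, 501)]
rate_grid = max(grid, key=lambda r: exp_log_likelihood(r, data))

print(f"closed-form MLE: {rate_hat:.3f}, grid maximiser: {rate_grid:.3f}")
```

The two answers should agree to within the grid spacing, confirming that setting the score to zero locates the maximum of the log-likelihood.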

MLE, Fisher Information and Cramér–Rao Bound

Score function:  s(θ) = d/dθ log L(θ; x)
MLE condition:   s(θ̂) = 0

Fisher information of a single observation (n = 1 in L):
                   I(θ) = −E[d²/dθ² log L(θ)]
                        = E[s(θ)²]

Cramér–Rao bound (unbiased θ̂, n i.i.d. observations):  Var(θ̂) ≥ 1 / (n · I(θ))
  → MLE is asymptotically efficient: Var(θ̂_MLE) ≈ 1 / (n · I(θ)) for large n

Error propagation (delta method):
  If g(θ) is a smooth function of θ:
  Var(g(θ̂)) ≈ [g'(θ)]² · Var(θ̂)
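
A worked example of the delta method: propagating the uncertainty of a measured radius into the area of a circle, g(r) = πr², so g'(r) = 2πr. The measurement values are hypothetical, and the Monte Carlo cross-check assumes Normally distributed errors in r.

```python
import math
import random
import statistics

def delta_method_sd(g_prime_at_theta, sd_theta):
    """First-order propagated standard deviation: |g'(theta)| * sd(theta_hat)."""
    return abs(g_prime_at_theta) * sd_theta

# Hypothetical measurement: radius r = 10.0 with standard error 0.1
r, sd_r = 10.0, 0.1
sd_area_delta = delta_method_sd(2.0 * math.pi * r, sd_r)

# Monte Carlo cross-check: push Normal errors in r through g directly
rng = random.Random(7)
areas = [math.pi * rng.gauss(r, sd_r) ** 2 for _ in range(20000)]
sd_area_mc = statistics.pstdev(areas)

print(f"delta method sd: {sd_area_delta:.3f}, Monte Carlo sd: {sd_area_mc:.3f}")
```

The approximation works well here because the relative error in r is small (1 %), so g is nearly linear over the range of plausible values.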

Chi-squared goodness-of-fit:
  χ² = ∑_i (O_i − E_i)² / E_i,  df = bins − 1 − estimated params
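
The goodness-of-fit statistic is straightforward to compute by hand or in code. The sketch below tests hypothetical counts from 120 rolls of a die against the fair-die expectation of 20 per face; the critical value 11.07 (df = 5, α = 0.05) is the standard tabulated figure.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 16, 25, 19, 20]   # hypothetical counts from 120 rolls
expected = [20] * 6                   # fair die: 120 / 6 per face

stat = chi_squared(observed, expected)
df = len(observed) - 1 - 0            # bins - 1 - estimated parameters (none here)
CRITICAL_5PCT_DF5 = 11.07             # tabulated chi-squared critical value

reject_fairness = stat > CRITICAL_5PCT_DF5
print(f"chi-squared = {stat:.2f}, df = {df}, reject H0 at 5 %: {reject_fairness}")
```

No parameters were estimated from the data, so only one degree of freedom is lost (the counts must sum to the total number of rolls).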

MLE is the method of choice when the likelihood is tractable; under regularity conditions it is consistent (converges to the true value as n→∞), asymptotically Normal, and asymptotically attains the Cramér–Rao bound. When the likelihood cannot be maximised in closed form, numerical optimisation or approximate Bayesian methods such as MCMC and variational inference provide computationally accessible alternatives; conjugate priors help only when the likelihood belongs to a matching family.

Statistics as Epistemic Infrastructure

From the birthday paradox (which surprises most people because human intuition greatly underestimates collision probabilities) to p-hacking and the replication crisis (which arise from misunderstanding what p-values mean), statistics has a reputation for being easy to misapply. The tools in this Spotlight — rigorous distribution theory, the CLT, careful hypothesis testing, Bayesian updating, and proper error propagation — are the antidote.

The interactive simulations let you watch the CLT convergence in real time with skewed and heavy-tailed parent distributions, explore how a Bayesian prior gets washed out by sufficient data, and build intuition for why the standard error of an estimate shrinks as 1/√n. Statistical literacy is foundational to every quantitative discipline; these concepts appear in quantum measurement uncertainty, genomic association studies, gravitational-wave detection significance, and deep learning generalisation bounds alike.