Probability & Statistical Inference — Bayesian Reasoning, the Central Limit Theorem and Markov Chains

Most of science is inference under uncertainty. You cannot observe the true state of a system — only noisy measurements of it. Probability theory gives you the mathematics to reason consistently about what those measurements imply. This guide walks through seven interactive simulations, from Bayes' theorem to Markov chains and ordinary least squares — the toolkit every scientist should understand.

Probability as Belief: The Bayesian Perspective

In the frequentist view, probability is the limiting frequency of an event over infinite trials. In the Bayesian view, probability is a degree of belief — a measure of how confident you are given available evidence. Both interpretations are mathematically consistent, but Bayesian reasoning allows you to update beliefs as evidence arrives.

Bayes' theorem is just the definition of conditional probability rearranged. But that rearrangement carries enormous practical weight: it tells you how to update a prior belief P(H) — before seeing data — into a posterior belief P(H|E) after observing evidence E.

Bayes' Theorem

P(H | E) = P(E | H) · P(H) / P(E)

Posterior = Likelihood × Prior / Evidence

P(E) = Σ P(E | Hᵢ) · P(Hᵢ)   [law of total probability]

Medical test example:

Disease prevalence P(D) = 0.01 (1 in 100)

Test sensitivity P(+|D) = 0.99, specificity P(−|¬D) = 0.99

P(D|+) = 0.99×0.01 / (0.99×0.01 + 0.01×0.99) = 50%

A positive test for a rare disease is only 50% reliable!

Base rate neglect is one of the most common reasoning errors. People routinely ignore P(H) when interpreting diagnostic tests, court evidence, and risk assessments — overweighting the likelihood P(E|H) and forgetting how rare the hypothesis is. Bayesian reasoning is the antidote: always update from the prior.

The Central Limit Theorem

The central limit theorem (CLT) is arguably the most important result in statistics. It states: regardless of the shape of the population distribution, the distribution of the sample mean X̄ of n independent identically-distributed random variables approaches a normal distribution as n → ∞, with mean μ and standard deviation σ/√n.

This explains why the normal distribution appears everywhere in measurement science: you are always measuring a sum (or average) of many small independent contributions. Light intensity from a thermal source, reaction times in psychology, height in a population — all are sums of many micro-effects, so all follow approximately normal distributions.

Central Limit Theorem — Key Results

E[X̄] = μ                          (unbiased estimator)

Var[X̄] = σ²/n                     (variance shrinks with n)

SE[X̄] = σ/√n                     (standard error)

Z-score: Z = (X̄ − μ) / (σ/√n) → N(0,1)

Birthday paradox P(≥1 match in n people):

P = 1 − 365!/((365−n)! · 365ⁿ)

P(23) ≈ 0.507   P(50) ≈ 0.970   P(70) ≈ 0.999

Markov Chains: Memory-Free Processes

A Markov chain is a sequence of random variables where the probability of the next state depends only on the current state — not on the history. This Markov property ("memorylessness") is a powerful modelling assumption: weather transitions, PageRank, gene expression states, gambler's ruin, and recommendation systems are all Markov models.

Under mild conditions (irreducibility and aperiodicity), every finite Markov chain converges to a unique stationary distribution π — a probability vector that satisfies π = πP where P is the transition matrix. The convergence rate is governed by the second-largest eigenvalue of P: |λ₂| ≪ 1 means fast mixing.

Markov Chain — Stationary Distribution

Transition matrix P: P[i][j] = P(state j | state i)

Each row sums to 1: Σⱼ P[i][j] = 1

Stationary condition: π · P = π

Power iteration: πₜ₊₁ = πₜ · P (converges to π*)

Mixing time: τ ∝ 1/|ln(λ₂)|   (λ₂ = second eigenvalue)

Detailed balance (reversible): πᵢ Pᵢⱼ = πⱼ Pⱼᵢ

Linear Regression: Ordinary Least Squares

Given n data points (xᵢ, yᵢ), ordinary least squares (OLS) finds the line y = mx + b that minimises the sum of squared residuals SSE = Σ(yᵢ − mxᵢ − b)². The solution has a closed form — no iterative optimisation needed. And because OLS is equivalent to maximum likelihood under Gaussian noise, it is also the correct Bayesian estimate with a flat prior.

The coefficient of determination R² measures how much of the variance in y is explained by x: R² = 1 − SSE/SST where SST = Σ(yᵢ − ȳ)². R² = 1 means a perfect fit; R² = 0 means the model explains nothing. Crucially, R² cannot tell you whether the model is correct — only whether it fits. A non-linear relationship can produce R² ≈ 0 for a linear fit even when there is a perfect functional relationship.

Ordinary Least Squares — Closed-Form Solution

Slope: m = (n Σxᵢyᵢ − Σxᵢ Σyᵢ) / (n Σxᵢ² − (Σxᵢ)²)

Intercept: b = (Σyᵢ − m Σxᵢ) / n = ȳ − m·x̄

Pearson r = m · σₓ / σᵧ

R² = r²           (for simple linear regression)

SSE = Σ(yᵢ − mxᵢ − b)² (sum of squared residuals)

SST = Σ(yᵢ − ȳ)²        (total sum of squares)

R² = 1 − SSE/SST

Building Intuition: Surprises in Probability

Many probability results violate naive intuition. Three canonical cases:

The law of large numbers vs the central limit theorem. These two results are often confused. The LLN says X̄ → μ as n → ∞ (the average converges to the true mean). The CLT says the fluctuations of X̄ around μ converge to a normal distribution — it quantifies the rate of convergence in the LLN. The CLT is the deeper result: it tells you not just that you converge, but exactly how fast and in what shape.

Monte Carlo Methods: Simulation as Integration

Monte Carlo refers to any algorithm that uses random sampling to compute a deterministic quantity. The most famous example — estimating π — samples uniform points in a unit square and counts those inside the inscribed circle. But the real power of Monte Carlo is in high-dimensional integration: while quadrature rules grow exponentially in cost with dimension, Monte Carlo's error is always O(1/√n) regardless of dimension. For d ≥ 5 integrals, Monte Carlo is usually the only practical method.

The bootstrap is a Monte Carlo procedure for uncertainty quantification: resample your dataset with replacement B times, compute your statistic (mean, median, correlation, regression slope) on each resample, and the empirical distribution of those values approximates the true sampling distribution of the statistic. No parametric assumptions needed.

Algorithms & Methods

Bayes' Theorem Prior / Likelihood / Posterior Central Limit Theorem Standard Error σ/√n Markov Transition Matrix Power Iteration (Stationary Dist.) Ordinary Least Squares Pearson Correlation r Coefficient of Determination R² Monte Carlo Integration Bootstrap Resampling K-Means++ Initialisation Birthday Paradox Formula Gambler's Ruin (Absorbing Markov)