Probability as Belief: The Bayesian Perspective
In the frequentist view, probability is the limiting frequency of an event over infinite trials. In the Bayesian view, probability is a degree of belief — a measure of how confident you are given available evidence. Both interpretations are mathematically consistent, but Bayesian reasoning allows you to update beliefs as evidence arrives.
Bayes' theorem is just the definition of conditional probability rearranged. But that rearrangement carries enormous practical weight: it tells you how to update a prior belief P(H) — before seeing data — into a posterior belief P(H|E) after observing evidence E.
Bayes' Theorem
P(H | E) = P(E | H) · P(H) / P(E)
Posterior = Likelihood × Prior / Evidence
P(E) = Σ P(E | Hᵢ) · P(Hᵢ) [law of total probability]
Medical test example:
Disease prevalence P(D) = 0.01 (1 in 100)
Test sensitivity P(+|D) = 0.99, specificity P(−|¬D) = 0.99
P(D|+) = 0.99×0.01 / (0.99×0.01 + 0.01×0.99) = 50%
A positive test for a rare disease is only 50% reliable!
Bayesian Inference
Interactive Bayes' theorem: prior / likelihood / posterior curves updated live. Medical test scenario with prevalence, sensitivity, and specificity sliders. See why a positive result for a rare disease is often misleading.
Bootstrap Resampling
Construct confidence intervals for any statistic by resampling your data with replacement. See the bootstrap distribution converge to the true sampling distribution as the number of resamples grows.
Base rate neglect is one of the most common reasoning errors. People routinely ignore P(H) when interpreting diagnostic tests, court evidence, and risk assessments — overweighting the likelihood P(E|H) and forgetting how rare the hypothesis is. Bayesian reasoning is the antidote: always update from the prior.
The Central Limit Theorem
The central limit theorem (CLT) is arguably the most important result in statistics. It states: regardless of the shape of the population distribution, the distribution of the sample mean X̄ of n independent identically-distributed random variables approaches a normal distribution as n → ∞, with mean μ and standard deviation σ/√n.
This explains why the normal distribution appears everywhere in measurement science: you are always measuring a sum (or average) of many small independent contributions. Light intensity from a thermal source, reaction times in psychology, height in a population — all are sums of many micro-effects, so all follow approximately normal distributions.
Central Limit Theorem
5 source distributions (uniform, exponential, bimodal, Poisson, right-skewed). Sample-size slider n=1..200. Animated accumulation of sample-mean histogram with normal approximation overlay. Live empirical μ/σ vs theoretical σ/√n.
Birthday Paradox
Exact P(n) curve + Monte Carlo simulation (Uint8Array(365) for speed). Two views: probability graph and room-ring with 365-day slots. The 50% crossover happens at just 23 people — most find this impossible to believe until they see it animate.
Central Limit Theorem — Key Results
E[X̄] = μ (unbiased estimator)
Var[X̄] = σ²/n (variance shrinks with n)
SE[X̄] = σ/√n (standard error)
Z-score: Z = (X̄ − μ) / (σ/√n) → N(0,1)
Birthday paradox P(≥1 match in n people):
P = 1 − 365!/((365−n)! · 365ⁿ)
P(23) ≈ 0.507 P(50) ≈ 0.970 P(70) ≈ 0.999
Markov Chains: Memory-Free Processes
A Markov chain is a sequence of random variables where the probability of the next state depends only on the current state — not on the history. This Markov property ("memorylessness") is a powerful modelling assumption: weather transitions, PageRank, gene expression states, gambler's ruin, and recommendation systems are all Markov models.
Under mild conditions (irreducibility and aperiodicity), every finite Markov chain converges to a unique stationary distribution π — a probability vector that satisfies π = πP where P is the transition matrix. The convergence rate is governed by the second-largest eigenvalue of P: |λ₂| ≪ 1 means fast mixing.
Markov Chains
Animated state-transition diagram, random walk, and live comparison of empirical vs theoretical stationary distribution (power iteration). Presets: weather, PageRank, gambler's ruin, genetics, random 4-state. Edit transition probabilities live.
Monte Carlo π
Classic dart-throwing estimation of π: random points in a unit square, count those inside the quarter-circle. Live convergence graph, σ-error band, and comparison with the exact value. Extension mode: high-precision with variance reduction.
Markov Chain — Stationary Distribution
Transition matrix P: P[i][j] = P(state j | state i)
Each row sums to 1: Σⱼ P[i][j] = 1
Stationary condition: π · P = π
Power iteration: πₜ₊₁ = πₜ · P (converges to π*)
Mixing time: τ ∝ 1/|ln(λ₂)| (λ₂ = second eigenvalue)
Detailed balance (reversible): πᵢ Pᵢⱼ = πⱼ Pⱼᵢ
Linear Regression: Ordinary Least Squares
Given n data points (xᵢ, yᵢ), ordinary least squares (OLS) finds the line y = mx + b that minimises the sum of squared residuals SSE = Σ(yᵢ − mxᵢ − b)². The solution has a closed form — no iterative optimisation needed. And because OLS is equivalent to maximum likelihood under Gaussian noise, it is also the correct Bayesian estimate with a flat prior.
The coefficient of determination R² measures how much of the variance in y is explained by x: R² = 1 − SSE/SST where SST = Σ(yᵢ − ȳ)². R² = 1 means a perfect fit; R² = 0 means the model explains nothing. Crucially, R² cannot tell you whether the model is correct — only whether it fits. A non-linear relationship can produce R² ≈ 0 for a linear fit even when there is a perfect functional relationship.
Linear Regression (OLS)
640×400 scatter plot: click to add/drag/delete points, live OLS line, slope m/intercept b/R²/Pearson r/SSE. Residuals as dashed verticals. 5 preset datasets: random, perfect, outlier, no correlation, quadratic (showing limits of linear fit).
K-Means Clustering
K-Means++ initialisation with Voronoi zones, elbow method K=2..8, click to add points. Three data generators: blobs, moons, rings. Centroid trajectory trails. See how the elbow criterion finds the true number of clusters.
Ordinary Least Squares — Closed-Form Solution
Slope: m = (n Σxᵢyᵢ − Σxᵢ Σyᵢ) / (n Σxᵢ² − (Σxᵢ)²)
Intercept: b = (Σyᵢ − m Σxᵢ) / n = ȳ − m·x̄
Pearson r = m · σₓ / σᵧ
R² = r² (for simple linear regression)
SSE = Σ(yᵢ − mxᵢ − b)² (sum of squared residuals)
SST = Σ(yᵢ − ȳ)² (total sum of squares)
R² = 1 − SSE/SST
Building Intuition: Surprises in Probability
Many probability results violate naive intuition. Three canonical cases:
- Birthday paradox: only 23 people needed for a 50% chance of a shared birthday. Our brains anchor to "how many pairs match me?" (1/365 each), ignoring the O(n²) pairs.
- Medical testing: a 99%-accurate test for a disease affecting 1% of the population gives a positive predictive value of only ~50% — equal to a coin flip. The base rate dominates.
- Gambler's ruin: in a fair game (p = 0.5), a player with finite capital will eventually go broke with probability 1, no matter how long they gamble. The Markov chain is a random walk with absorbing barriers — and the absorbing state is ruin.
The law of large numbers vs the central limit theorem. These two results are often confused. The LLN says X̄ → μ as n → ∞ (the average converges to the true mean). The CLT says the fluctuations of X̄ around μ converge to a normal distribution — it quantifies the rate of convergence in the LLN. The CLT is the deeper result: it tells you not just that you converge, but exactly how fast and in what shape.
Monte Carlo Methods: Simulation as Integration
Monte Carlo refers to any algorithm that uses random sampling to compute a deterministic quantity. The most famous example — estimating π — samples uniform points in a unit square and counts those inside the inscribed circle. But the real power of Monte Carlo is in high-dimensional integration: while quadrature rules grow exponentially in cost with dimension, Monte Carlo's error is always O(1/√n) regardless of dimension. For d ≥ 5 integrals, Monte Carlo is usually the only practical method.
The bootstrap is a Monte Carlo procedure for uncertainty quantification: resample your dataset with replacement B times, compute your statistic (mean, median, correlation, regression slope) on each resample, and the empirical distribution of those values approximates the true sampling distribution of the statistic. No parametric assumptions needed.