Bayes' Theorem in Practice — Prior, Likelihood, Posterior
First published in 1763 in a posthumous essay by the Reverend Thomas Bayes, and independently developed by Pierre-Simon Laplace, the theorem tells us how to update a probability estimate when new evidence arrives. Today it underpins spam filters, medical diagnosis, search engines, machine learning, and scientific parameter estimation: anywhere we need to reason quantitatively about uncertainty.
1. Conditional Probability
The conditional probability P(A|B) is the probability of event A given that B has occurred. If we restrict the sample space to outcomes where B holds, P(A|B) is the fraction of those outcomes that also satisfy A:
P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0
This can be rearranged to the product rule: P(A ∩ B) = P(A|B) · P(B). The same joint probability can also be written P(A ∩ B) = P(B|A) · P(A). Setting these equal gives Bayes' theorem.
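As a concrete check of the definition and the product rule, the following minimal sketch enumerates the outcomes of two fair dice. The events A and B are hypothetical choices made for illustration only.

```javascript
// Enumerate the 36 equally likely outcomes of two fair six-sided dice.
const outcomes = [];
for (let d1 = 1; d1 <= 6; d1++) {
  for (let d2 = 1; d2 <= 6; d2++) outcomes.push([d1, d2]);
}

// Probability of an event = fraction of outcomes satisfying its predicate.
const prob = (event) => outcomes.filter(event).length / outcomes.length;

const A = ([d1, d2]) => d1 + d2 === 8; // the sum is 8
const B = ([d1]) => d1 % 2 === 0;      // the first die is even
const AandB = (o) => A(o) && B(o);

// P(A|B) = P(A ∩ B) / P(B) and P(B|A) = P(A ∩ B) / P(A)
const pAgivenB = prob(AandB) / prob(B);
const pBgivenA = prob(AandB) / prob(A);

console.log(pAgivenB.toFixed(4)); // 0.1667
console.log(pBgivenA.toFixed(4)); // 0.6000
```

Both conditional probabilities are built from the same joint probability P(A ∩ B) = 3/36, which is exactly the symmetry that Bayes' theorem exploits.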
2. Deriving Bayes' Theorem
Start from the symmetry of the joint probability, writing H for A and E for B:
P(H|E) · P(E) = P(H ∩ E) = P(E|H) · P(H)
∴ P(H|E) = P(E|H) · P(H) / P(E)
That is all. The theorem is nothing more than the product rule applied twice. Its power comes from choosing what to condition on:
- H is the hypothesis we care about (the model, the parameter value, the diagnosis).
- E is the evidence (the observed data, the test result, the measurement).
The normalising constant P(E) is computed via the law of total probability by summing over a set of mutually exclusive and exhaustive hypotheses Hi:
P(E) = Σi P(E|Hi) · P(Hi)
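For a finite set of hypotheses the whole update fits in a few lines. The sketch below is one possible implementation; the function name bayesUpdate and the three-urn example are assumptions introduced for illustration.

```javascript
// Discrete Bayes update over mutually exclusive, exhaustive hypotheses.
// priors[i]      = P(Hi)
// likelihoods[i] = P(E|Hi)
function bayesUpdate(priors, likelihoods) {
  if (priors.length !== likelihoods.length) {
    throw new Error('priors and likelihoods must have the same length');
  }
  // Numerators P(E|Hi) · P(Hi)
  const joint = priors.map((p, i) => p * likelihoods[i]);
  // Law of total probability: P(E) = Σi P(E|Hi) · P(Hi)
  const evidence = joint.reduce((sum, x) => sum + x, 0);
  if (evidence === 0) {
    throw new Error('the evidence has zero probability under every hypothesis');
  }
  return { evidence, posterior: joint.map((x) => x / evidence) };
}

// Hypothetical example: three urns, chosen with equal probability, containing
// 10%, 50% and 90% red balls. One ball is drawn and it is red.
const { evidence, posterior } = bayesUpdate([1 / 3, 1 / 3, 1 / 3], [0.1, 0.5, 0.9]);
console.log(evidence.toFixed(3));                // 0.500
console.log(posterior.map((p) => p.toFixed(3))); // [ '0.067', '0.333', '0.600' ]
```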
3. Prior, Likelihood, Posterior, Evidence
Prior
P(H): what we believed about H before seeing the data. It encodes domain knowledge, previous experiments, or ignorance.
Likelihood
P(E|H): how probable the evidence is if hypothesis H is true. This is what the model predicts for the data.
Posterior
P(H|E): our updated belief after incorporating the evidence, i.e. the prior reweighted by the likelihood. It becomes the new prior for the next update.
The denominator P(E) is often called the marginal likelihood or model evidence. It is the same for all hypotheses and acts as a normalisation constant ensuring the posterior sums/integrates to 1.
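The idea that a posterior becomes the next prior can be checked numerically. In this sketch, assuming a hypothetical coin that is either fair or biased towards heads, two sequential updates give the same result as one batch update with the joint likelihood of both independent flips.

```javascript
// Normalise prior × likelihood over a discrete set of hypotheses.
const update = (prior, likelihood) => {
  const joint = prior.map((p, i) => p * likelihood[i]);
  const evidence = joint.reduce((s, x) => s + x, 0);
  return joint.map((x) => x / evidence);
};

// Hypothetical setup: the coin is either fair (P(heads) = 0.5)
// or biased (P(heads) = 0.8), with equal prior probability.
const pHeads = [0.5, 0.8];
const prior = [0.5, 0.5];

// Sequential: the posterior after the first flip is the prior for the second.
const afterFirst = update(prior, pHeads);       // observed heads
const afterSecond = update(afterFirst, pHeads); // observed heads again

// Batch: a single update with the likelihood of two heads in a row.
const batch = update(prior, pHeads.map((p) => p * p));

console.log(afterFirst.map((p) => p.toFixed(3)));  // [ '0.385', '0.615' ]
console.log(afterSecond.map((p) => p.toFixed(3))); // [ '0.281', '0.719' ]
console.log(batch.map((p) => p.toFixed(3)));       // [ '0.281', '0.719' ]
```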
4. Example: Medical Test Accuracy
A disease affects 1% of the population. A test has 99% sensitivity (P(+|disease) = 0.99) and 98% specificity (P(−|no disease) = 0.98, so P(+|no disease) = 0.02). What is the probability of actually having the disease given a positive test result?
P(+) = P(+|D)·P(D) + P(+|¬D)·P(¬D) = 0.99×0.01 + 0.02×0.99 = 0.0099 + 0.0198 = 0.0297
P(D|+) = 0.99×0.01 / 0.0297 ≈ 0.333
Despite a highly accurate test, a positive result means only a 33% chance of actually having the disease — because the disease is rare (low prior). This is the base rate fallacy: ignoring the prior leads to wildly overconfident diagnoses. The Bayesian framework makes this explicit.
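The same calculation as a small function makes it easy to see how strongly the answer depends on the prior. This is a sketch: the second-test scenario assumes that the two tests err independently given the true disease status, and the 10% prevalence is a hypothetical value.

```javascript
// P(disease | positive test) from prevalence, sensitivity and specificity.
function posteriorGivenPositive(prevalence, sensitivity, specificity) {
  const truePositive = sensitivity * prevalence;
  const falsePositive = (1 - specificity) * (1 - prevalence);
  return truePositive / (truePositive + falsePositive);
}

const first = posteriorGivenPositive(0.01, 0.99, 0.98);
console.log(first.toFixed(3)); // 0.333

// A second positive result (assumed independent of the first given disease
// status): the first posterior becomes the new prior.
const second = posteriorGivenPositive(first, 0.99, 0.98);
console.log(second.toFixed(3)); // 0.961

// The same test at a hypothetical 10% prevalence.
console.log(posteriorGivenPositive(0.10, 0.99, 0.98).toFixed(3)); // 0.846
```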
5. Conjugate Priors — Closed-Form Updates
When the prior and posterior belong to the same distributional family, the prior is called conjugate to the likelihood. This allows Bayesian updating without numerical integration.
Beta-Binomial: Estimating a Coin's Bias
Suppose we flip a coin n times and observe k heads. The likelihood is Binomial(n, θ). If the prior on the bias θ is Beta(α, β), the posterior is also Beta:
θ | k ~ Beta(α + k, β + n − k)
// Posterior mean = (α+k)/(α+β+n)
// Uniform prior: α=β=1 → posterior mean = (k+1)/(n+2) [Laplace's rule of succession, i.e. add-one smoothing]
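The conjugacy follows from multiplying the two kernels and collecting exponents. A short derivation, dropping every factor that does not depend on θ:

```latex
\begin{aligned}
p(\theta \mid k)
  &\propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{Binomial likelihood}}
     \;\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{Beta prior}} \\
  &= \theta^{(\alpha+k)-1}\,(1-\theta)^{(\beta+n-k)-1}
\end{aligned}
```

The last line is the kernel of a Beta(α + k, β + n − k) density, so the update amounts to adding the number of heads to α and the number of tails to β.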
Other Conjugate Pairs
- Gaussian likelihood + Gaussian prior → Gaussian posterior (linear Kalman filter)
- Poisson likelihood + Gamma prior → Gamma posterior (event rate estimation)
- Categorical likelihood + Dirichlet prior → Dirichlet posterior (language models)
- Exponential likelihood + Gamma prior → Gamma posterior (survival analysis)
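The Gamma-Poisson pair from the list above updates just as simply as the Beta-Binomial one. The sketch below uses the shape/rate parameterisation of the Gamma distribution; the function names and the ticket counts are hypothetical.

```javascript
// Gamma(shape, rate) prior on a Poisson event rate λ.
// Observing counts x1..xn (one per unit interval) gives the posterior
// Gamma(shape + Σ xi, rate + n).
function gammaPoissonUpdate({ shape, rate }, counts) {
  const total = counts.reduce((s, x) => s + x, 0);
  return { shape: shape + total, rate: rate + counts.length };
}

// Mean of a Gamma distribution in the shape/rate parameterisation.
const gammaMean = ({ shape, rate }) => shape / rate;

// Hypothetical data: support tickets per hour over five hours.
const prior = { shape: 2, rate: 1 }; // prior mean: 2 events per hour
const posterior = gammaPoissonUpdate(prior, [3, 5, 4, 6, 2]);

console.log(posterior);                       // { shape: 22, rate: 6 }
console.log(gammaMean(posterior).toFixed(3)); // 3.667
```

The posterior mean (3.667) sits between the prior mean (2) and the sample mean (4), weighted by how much data has been seen.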
6. Bayesian vs Frequentist
The debate is philosophical, but the practical consequences are real:
- Frequentist (classical) statistics treats probability as the long-run frequency in repeated experiments. Parameters are fixed but unknown; confidence intervals and p-values describe properties of the procedure, not the parameter.
- Bayesian statistics treats probability as a degree of belief that can attach to single events and to parameters. The posterior directly answers "given what I observed, how probable is each value of θ?"
In practice, with large data and relatively uninformative priors, Bayesian and frequentist point estimates converge. The two approaches diverge most with sparse data (where the prior matters), multiple comparisons (where hierarchical Bayesian models shrink noisy estimates toward a shared mean), and sequential experiments (where Bayesian updating is natural).
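The convergence of point estimates can be seen by comparing the maximum-likelihood proportion with the posterior mean under a uniform Beta(1, 1) prior. The data sets below are hypothetical and share the same observed proportion.

```javascript
// Frequentist point estimate: maximum-likelihood proportion.
const mle = (k, n) => k / n;

// Bayesian point estimate: posterior mean under a Beta(alpha, beta) prior.
const posteriorMean = (k, n, alpha = 1, beta = 1) => (alpha + k) / (alpha + beta + n);

// Hypothetical data sets, each with 75% heads.
const experiments = [
  { k: 3, n: 4 },
  { k: 75, n: 100 },
  { k: 7500, n: 10000 },
];

for (const { k, n } of experiments) {
  console.log(n, mle(k, n).toFixed(4), posteriorMean(k, n).toFixed(4));
}
// 4 0.7500 0.6667
// 100 0.7500 0.7451
// 10000 0.7500 0.7500
```

With four flips the prior pulls the estimate noticeably towards 0.5; with ten thousand flips the two estimates agree to four decimal places.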
7. JavaScript: Bayesian Coin-Flip and Naive Bayes
Sequential Bayesian Updating of a Coin Bias
// Beta distribution for coin bias θ ∈ [0,1]
// Analytical update: after observing [heads, tails], increment α and β.
class BetaCoinInference {
  constructor(alpha = 1, beta = 1) { // uniform prior
    this.alpha = alpha;
    this.beta = beta;
  }
  observe(heads, tails) {
    this.alpha += heads;
    this.beta += tails;
  }
  get mean() { return this.alpha / (this.alpha + this.beta); }
  get variance() {
    const s = this.alpha + this.beta;
    return (this.alpha * this.beta) / (s * s * (s + 1));
  }
  get credible95() {
    // Rough central 95% interval using a Normal approximation to the Beta.
    // This is not an exact HPD interval and is least accurate for small α or β.
    const mu = this.mean;
    const std = Math.sqrt(this.variance);
    return [Math.max(0, mu - 1.96*std), Math.min(1, mu + 1.96*std)];
  }
}
const coin = new BetaCoinInference();
coin.observe(3, 1);
console.log(coin.mean.toFixed(3)); // 0.667 (prior α=β=1, data 3H 1T → α=4, β=2)
coin.observe(7, 9);
console.log(coin.mean.toFixed(3)); // 0.500 (total 10H 10T → α=11, β=11)
console.log(coin.credible95); // [~0.30, ~0.70]
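The Normal approximation used in credible95 can be compared against a direct numerical integration of the Beta density. The following is a sketch of one simple approach (midpoint rule on a uniform grid); the function name and the default grid size are assumptions, and the result is itself only a numerical approximation.

```javascript
// Central credible interval for Beta(alpha, beta), found by accumulating the
// unnormalised density x^(alpha-1) · (1-x)^(beta-1) over a uniform grid.
function betaCredibleInterval(alpha, beta, mass = 0.95, steps = 20000) {
  const h = 1 / steps;
  const density = new Array(steps);
  let total = 0;
  for (let i = 0; i < steps; i++) {
    const x = (i + 0.5) * h; // midpoint of cell i, strictly inside (0, 1)
    density[i] = Math.pow(x, alpha - 1) * Math.pow(1 - x, beta - 1);
    total += density[i];
  }
  const tail = (1 - mass) / 2;
  let cumulative = 0;
  let lower = null;
  for (let i = 0; i < steps; i++) {
    cumulative += density[i] / total;
    const x = (i + 0.5) * h;
    if (lower === null && cumulative >= tail) lower = x;
    if (cumulative >= 1 - tail) return [lower, x];
  }
  return [lower, 1]; // only reached through floating-point round-off
}

// Posterior after 10 heads and 10 tails with a uniform prior: Beta(11, 11).
console.log(betaCredibleInterval(11, 11).map((v) => v.toFixed(3)));
```

For a symmetric, moderately concentrated posterior such as Beta(11, 11) the two intervals should be close; the difference grows for skewed posteriors, for example after only a handful of flips.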
Naive Bayes Spam Classifier
// Naive Bayes: P(spam|words) ∝ P(spam) × Π P(word|spam)
// Assumes word occurrences are conditionally independent given class.
class NaiveBayes {
  constructor() {
    // Maps rather than plain objects, so that words such as "constructor"
    // cannot collide with properties inherited from Object.prototype.
    this.counts   = { spam: new Map(), ham: new Map() };
    this.totals   = { spam: 0, ham: 0 };
    this.docCount = { spam: 0, ham: 0 };
    this.vocab    = new Set(); // distinct words seen in either class
  }
  train(text, label) {
    this.docCount[label]++;
    for (const word of text.toLowerCase().split(/\W+/)) {
      if (!word) continue;
      const counts = this.counts[label];
      counts.set(word, (counts.get(word) || 0) + 1);
      this.totals[label]++;
      this.vocab.add(word);
    }
  }
  _logP(word, label) {
    // Laplace smoothing: add 1 to every word count
    const count = (this.counts[label].get(word) || 0) + 1;
    return Math.log(count / (this.totals[label] + this.vocab.size));
  }
  classify(text) {
    const words = text.toLowerCase().split(/\W+/).filter(Boolean);
    const total = this.docCount.spam + this.docCount.ham;
    // Work in log space: sums of logs avoid underflow from long products.
    const logSpam = Math.log(this.docCount.spam / total)
      + words.reduce((s, w) => s + this._logP(w, 'spam'), 0);
    const logHam = Math.log(this.docCount.ham / total)
      + words.reduce((s, w) => s + this._logP(w, 'ham'), 0);
    return logSpam > logHam ? 'spam' : 'ham';
  }
}
const nb = new NaiveBayes();
nb.train('free money win prize lottery', 'spam');
nb.train('click here to claim your reward', 'spam');
nb.train('meeting tomorrow at 3pm in the office', 'ham');
nb.train('please review the attached report', 'ham');
console.log(nb.classify('free prize money claim')); // 'spam'
console.log(nb.classify('office meeting agenda review')); // 'ham'
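The classifier above only compares unnormalised log scores. If an actual posterior probability is wanted, the scores can be normalised with the log-sum-exp trick. The helper below is a standalone sketch; its name and the example scores are hypothetical and not part of the classifier above.

```javascript
// Convert unnormalised log scores, log P(class) + Σ log P(word|class),
// into posterior probabilities that sum to 1.
function posteriorFromLogScores(logScores) {
  const labels = Object.keys(logScores);
  const max = Math.max(...labels.map((l) => logScores[l]));
  // Subtracting the maximum keeps Math.exp from underflowing to 0 for every class.
  const shifted = labels.map((l) => Math.exp(logScores[l] - max));
  const norm = shifted.reduce((s, x) => s + x, 0);
  return Object.fromEntries(labels.map((l, i) => [l, shifted[i] / norm]));
}

// Hypothetical log scores for a long message. Exponentiating them directly
// would give 0 for both classes in double precision.
const probs = posteriorFromLogScores({ spam: -1050.2, ham: -1052.5 });
console.log(probs.spam.toFixed(3)); // 0.909
console.log(probs.ham.toFixed(3));  // 0.091
```

Because of the independence assumption, these probabilities tend to be pushed towards 0 or 1, so they are better read as a ranking signal than as well-calibrated estimates.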
8. Applications across Fields
Kalman Filtering (Navigation)
The Kalman filter is an optimal Bayesian estimator for linear Gaussian systems. Each measurement update step applies Bayes' theorem — multiplying a Gaussian prior by a Gaussian likelihood to get a Gaussian posterior — using closed-form matrix equations. GPS receivers, aircraft autopilots, and robot localization all run Kalman filters.
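In one dimension the measurement update reduces to a few scalar operations. The sketch below shows only that step (no prediction step, no matrices); the function name and the numbers are hypothetical.

```javascript
// One-dimensional Kalman measurement update:
// Gaussian prior × Gaussian likelihood → Gaussian posterior.
function kalmanUpdate(prior, measurement) {
  // Kalman gain: how much to trust the measurement relative to the prior.
  const gain = prior.variance / (prior.variance + measurement.variance);
  return {
    mean: prior.mean + gain * (measurement.value - prior.mean),
    variance: (1 - gain) * prior.variance,
  };
}

// Hypothetical numbers: predicted position 10 m (variance 4),
// sensor reading 12 m (variance 1).
const posterior = kalmanUpdate({ mean: 10, variance: 4 }, { value: 12, variance: 1 });
console.log(posterior.mean.toFixed(2));     // 11.60
console.log(posterior.variance.toFixed(2)); // 0.80
```

The posterior mean lies closer to the more precise source (the sensor), and the posterior variance is smaller than both input variances.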
Bayesian Neural Networks
Instead of learning a single weight vector, a Bayesian neural network learns a posterior distribution over weights. This can give better-calibrated uncertainty estimates and a natural form of regularisation: under a zero-mean Gaussian prior on the weights, the MAP estimate coincides with L2-regularised maximum likelihood. Because the exact posterior is intractable for realistic networks, variational inference and Monte Carlo dropout are used as practical approximations.
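The link between a Gaussian prior and L2 regularisation is a one-line consequence of taking logs. In the sketch below, D denotes the training data, w the weights, and σ² the prior variance; these symbols are assumptions introduced for the derivation.

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
  &= \arg\max_{w}\; \bigl[\log p(D \mid w) + \log p(w)\bigr],
  \qquad p(w) = \mathcal{N}(w \mid 0, \sigma^{2} I) \\
  &= \arg\min_{w}\; \Bigl[-\log p(D \mid w) + \tfrac{1}{2\sigma^{2}}\lVert w \rVert_{2}^{2}\Bigr]
\end{aligned}
```

The penalty strength is 1/(2σ²), so a tighter prior (smaller σ²) corresponds to stronger weight decay.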
A/B Testing and Conversion Rate Optimisation
Instead of waiting for a fixed sample size and running a frequentist test, Bayesian A/B testing continuously updates the posterior over the conversion rates θA and θB. You can compute P(θB > θA | data) at any time, and the posterior remains a coherent summary of the evidence whenever you choose to stop. Note that stopping as soon as this probability crosses a threshold does not by itself control the frequentist false-positive rate, so the stopping rule should still be chosen with care.
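With Beta posteriors on both conversion rates, P(θB > θA | data) can be approximated by discretising the two densities on a grid. The sketch below assumes independent uniform Beta(1, 1) priors; the function names, the grid size, and the visitor counts are hypothetical.

```javascript
// Beta(alpha, beta) probability mass on the midpoints of a uniform grid over (0, 1).
function betaGrid(alpha, beta, steps) {
  const logDensity = [];
  for (let i = 0; i < steps; i++) {
    const x = (i + 0.5) / steps;
    logDensity.push((alpha - 1) * Math.log(x) + (beta - 1) * Math.log(1 - x));
  }
  // Shift by the maximum before exponentiating to avoid underflow with large counts.
  const max = Math.max(...logDensity);
  const weights = logDensity.map((v) => Math.exp(v - max));
  const total = weights.reduce((s, v) => s + v, 0);
  return weights.map((v) => v / total);
}

// P(θB > θA | data) under independent Beta(1, 1) priors on both rates.
function probBBeatsA(a, b, steps = 2000) {
  const pA = betaGrid(1 + a.conversions, 1 + a.visitors - a.conversions, steps);
  const pB = betaGrid(1 + b.conversions, 1 + b.visitors - b.conversions, steps);
  let cdfA = 0; // P(θA lies in a cell strictly below the current one)
  let prob = 0;
  for (let j = 0; j < steps; j++) {
    prob += pB[j] * (cdfA + pA[j] / 2); // both rates in the same cell: count half
    cdfA += pA[j];
  }
  return prob;
}

// Hypothetical experiment: 100/1000 conversions for A, 130/1000 for B.
const pBetter = probBBeatsA(
  { conversions: 100, visitors: 1000 },
  { conversions: 130, visitors: 1000 },
);
console.log(pBetter.toFixed(3));
```

A Monte Carlo estimate from Beta samples is a common alternative; the grid version is used here because it is deterministic and needs no random-number helper.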