Hypothesis Testing — p-values, Type I/II Errors, and Statistical Power

Null hypothesis significance testing (NHST) is the most widely used — and most widely misunderstood — framework for statistical inference in science. From clinical drug trials to particle physics to A/B tests, researchers use p-values to decide whether observations are "statistically significant." Understanding what p-values actually mean, why α=0.05 is an arbitrary convention with consequences, and how to design studies with adequate power is essential for interpreting scientific literature critically.

1. The NHST Framework

NHST as practiced today blends Fisher's p-values with the Neyman-Pearson decision framework. The procedure has five steps:

Formulate the null hypothesis H₀: The default, conservative claim — often "no effect" or "no difference." Example: H₀: the new drug doesn't change blood pressure (μ_treatment = μ_control).
Formulate the alternative hypothesis H₁: The claim you want to find evidence for. Example: H₁: the drug lowers blood pressure (μ_treatment < μ_control).
Choose a significance level α: The maximum acceptable probability of a false positive. Convention sets α = 0.05 in most social and medical science; particle physics uses α ≈ 3 × 10⁻⁷ (5σ, one-sided) for discoveries.
Collect data and compute a test statistic: A number that summarizes how "extreme" the data is relative to H₀.
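Compute the p-value: P(test statistic at least as extreme as observed | H₀ is true), where "extreme" is measured in the direction of H₁. For an upper-tailed test this is P(T ≥ t_obs | H₀). If p < α, reject H₀.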
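The five steps above can be traced in a short sketch. The blood-pressure values, the known population SD, and the choice of a one-sided z-test are all illustrative assumptions, not data from any real trial.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical changes in systolic blood pressure (mmHg) after treatment.
# These values are made up for illustration.
changes = [-6.1, -2.3, -8.4, 1.2, -4.7, -3.9, -7.5, 0.4, -5.2, -2.8]

mu_0 = 0.0      # H0: mean change is zero (no effect)
sigma = 4.0     # assumed known population SD, so a z-test applies
alpha = 0.05    # significance level chosen before looking at the data

# Test statistic: how many standard errors the sample mean lies from mu_0.
n = len(changes)
z = (mean(changes) - mu_0) / (sigma / sqrt(n))

# H1 says the drug lowers blood pressure, so "extreme" means a very
# negative z; the p-value is the lower-tail probability under H0.
p_value = NormalDist().cdf(z)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p_value:.4f}: {decision}")
```

With an unknown population SD, the same steps would use a t statistic and the t distribution instead.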

Rejecting H₀ ≠ proving H₁: Hypothesis testing is an asymmetric procedure. You either "reject H₀" or "fail to reject H₀." You never "accept H₀" or "prove H₁." A non-significant result means only that you didn't find sufficient evidence against H₀ — not that H₀ is true.

2. What is a p-value?

The p-value is the probability of observing data at least as extreme as what was observed, assuming H₀ is true. This is frequently and consequentially misinterpreted:

p-value IS:
  P(data as extreme or more extreme | H₀ true)

p-value IS NOT:
  P(H₀ is true | data)               ← posterior probability (Bayesian)
  P(results were due to chance)      ← common misconception
  Effect size                        ← p and effect size are separate!
  The probability of false discovery (without prior information)

Correct interpretation example:
  "If blood pressure doesn't differ between groups (H₀), there is a 3%
  probability of observing a mean difference as large as or larger than
  what we found, by random sampling variation alone."

The p-value is a property of the data and the null hypothesis — not of the alternative hypothesis or the world. A small p-value means only that the data are unlikely under H₀; it says nothing about whether H₁ is true or how large the effect is.
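The definition can be checked by brute force: simulate many datasets in a world where H₀ holds and count how often the result is at least as extreme as the one observed. The sketch below assumes a normal population with known SD and a hypothetical observed mean; all numbers are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)

# Assumed setup: n observations from a normal population with known SD.
n, sigma, mu_0 = 25, 10.0, 100.0
observed_mean = 104.0   # hypothetical observed sample mean

# Simulate the sampling distribution of the mean in a world where H0 holds.
n_sims = 20_000
null_means = [
    sum(random.gauss(mu_0, sigma) for _ in range(n)) / n
    for _ in range(n_sims)
]

# One-sided p-value: fraction of null-world means at least as large as observed.
p_sim = sum(m >= observed_mean for m in null_means) / n_sims

# Analytic value for comparison: upper tail of the z statistic.
z = (observed_mean - mu_0) / (sigma / sqrt(n))
p_exact = 1 - NormalDist().cdf(z)

print(f"simulated p = {p_sim:.4f}, analytic p = {p_exact:.4f}")
```

Every simulated dataset here was generated with H₀ true, which is why the result cannot be read as the probability that H₀ is true.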

3. Type I and Type II Errors

Decision table:

                    H₀ is TRUE            H₀ is FALSE
  Reject H₀       | Type I Error        | Correct Rejection
                  | (α, false positive) | (Power = 1−β)
  Fail to Reject  | Correct             | Type II Error
                  | (1−α)               | (β, false negative)

Type I error (α): Rejecting H₀ when H₀ is actually true
  "False positive", e.g., concluding a drug works when it doesn't
  Controlled directly by the choice of α (conventionally 0.05)

Type II error (β): Failing to reject H₀ when H₀ is false
  "False negative", e.g., missing a real drug effect in the study
  Convention: β ≤ 0.20 (80% power), β ≤ 0.10 (90% power)

α and β are inversely related (for fixed N):
  decreasing α → increasing β (being more conservative misses more real effects)

The appropriate balance between Type I and Type II errors depends on the consequences. In drug safety testing, failing to detect a harmful side effect (Type II) may be worse than a false alarm; in particle physics, reducing α to 5σ prevents "discoveries" that disappear (Type I) from dominating the literature.
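Both error rates can be estimated by simulation. The sketch below assumes a one-sided z-test of H₀: μ = 0 with known σ = 1 and n = 25; the true effect of 0.5 used for the H₀-false case is a hypothetical choice.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, n_sims=5_000):
    """Fraction of simulated one-sided z-tests of H0: mu = 0 that reject."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(n_sims):
        sample_mean = sum(random.gauss(true_mean, sigma) for _ in range(n)) / n
        z = sample_mean / (sigma / sqrt(n))
        rejections += z > z_crit
    return rejections / n_sims

# H0 true: every rejection is a false positive (Type I error).
type_1_rate = rejection_rate(true_mean=0.0)

# H0 false: every rejection is correct, every non-rejection is a Type II error.
power = rejection_rate(true_mean=0.5)
type_2_rate = 1 - power

# A stricter alpha lowers the Type I rate but costs power at the same n.
power_strict = rejection_rate(true_mean=0.5, alpha=0.01)

print(f"Type I rate = {type_1_rate:.3f}, Type II rate = {type_2_rate:.3f}")
print(f"power at alpha=0.05: {power:.3f}, at alpha=0.01: {power_strict:.3f}")
```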

4. Statistical Power

Statistical power = 1 − β = probability of correctly rejecting H₀ when H₁ is true. Power depends on four quantities:

Power = f(α, n, σ, δ)
  α : significance level (higher α → higher power)
  n : sample size (larger n → higher power)
  σ : population standard deviation (smaller σ → higher power)
  δ : true effect size = |μ₁ − μ₀| (larger effect → higher power)

For a one-sided, one-sample z-test:
  Power = Φ(δ√n/σ − z_α)
  where Φ = standard normal CDF and z_α = upper-tail critical value (1.645 for α = 0.05)

Required n for target power (1−β), significance α, effect size d = δ/σ:
  n ≈ (z_α + z_β)² / d²          [one-sample, one-sided]
  n ≈ 2(z_{α/2} + z_β)² / d²     [two-sample equal groups, two-sided, n per group]

Example: d = 0.5 (medium effect), α = 0.05, power = 0.80:
  n ≈ 2 × (1.96 + 0.84)² / 0.25 ≈ 63 per group
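A minimal sketch of these calculations using only the standard library. It takes the one-sided z-test power as Φ(δ√n/σ − z_α), with z_α the upper-tail critical value, and uses the normal approximation for sample size; the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test: Phi(delta*sqrt(n)/sigma - z_alpha)."""
    z_alpha = norm.inv_cdf(1 - alpha)
    return norm.cdf(delta * sqrt(n) / sigma - z_alpha)

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group, two-sample two-sided test."""
    z_half_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    return ceil(2 * (z_half_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.5))                              # 63, matching the example
print(round(z_test_power(delta=0.5, sigma=1.0, n=25), 3))
```

Because the formula uses normal quantiles, an exact calculation based on the t distribution gives a slightly larger n per group.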

Power analysis before data collection determines whether a study can detect the expected effect. Studies with less than 80% power are likely to miss real but small effects. When they do reach significance, the estimated effect tends to be inflated (the "winner's curse"), so the finding often fails to replicate.
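The inflation of significant estimates in underpowered studies can be seen in a small simulation. The true effect (0.2 SD), sample size (20), and known σ = 1 below are hypothetical values chosen to give low power.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(2)

true_d, n, alpha = 0.2, 20, 0.05          # small true effect, small sample
z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided test, sigma = 1 assumed known

all_estimates, significant_estimates = [], []
for _ in range(20_000):
    estimate = sum(random.gauss(true_d, 1.0) for _ in range(n)) / n
    all_estimates.append(estimate)
    if estimate * sqrt(n) > z_crit:        # z = estimate / (1 / sqrt(n))
        significant_estimates.append(estimate)

power = len(significant_estimates) / len(all_estimates)
mean_all = sum(all_estimates) / len(all_estimates)
mean_significant = sum(significant_estimates) / len(significant_estimates)

print(f"power = {power:.2f}")
print(f"mean estimate, all studies:         {mean_all:.2f}")
print(f"mean estimate, significant studies: {mean_significant:.2f}")
```

Averaged over all simulated studies the estimate is unbiased; the overestimate appears only after filtering on significance, which is what publication of significant results does.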

5. Common Tests

Test                        Use case                          Statistic
─────────────────────────────────────────────────────────────────────────────
One-sample t-test           Single group vs. known mean       t = (x̄−μ₀)/(s/√n)
Two-sample t-test           Compare two independent groups    t = (x̄₁−x̄₂)/se_diff
Paired t-test               Before-after, matched pairs       t = d̄/(s_d/√n)
Chi-square (goodness)       Observed vs. expected counts      χ² = Σ(O−E)²/E
Chi-square (independence)   Two categorical variables         χ² = Σ(O−E)²/E
ANOVA (F-test)              ≥3 groups, continuous outcome     F = MS_between/MS_within
Mann-Whitney U              Non-normal / ordinal data         Rank-sum statistic
Pearson correlation         Linear relationship test          t = r√(n−2)/√(1−r²)

Degrees of freedom determine the null distribution: t(n−1), χ²(k−1), F(k−1, N−k), etc.
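Three of the statistics in the table are simple enough to compute by hand. The sketch below returns each statistic with its degrees of freedom; the helper names and input data are made up for illustration, and converting a statistic to a p-value would additionally need the t or χ² distribution from a statistics library.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(xs, mu_0):
    """t = (x_bar - mu_0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(xs)
    return (mean(xs) - mu_0) / (stdev(xs) / sqrt(n)), n - 1

def chi_square_gof(observed, expected):
    """chi2 = sum((O - E)^2 / E), with k - 1 degrees of freedom."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

def pearson_t(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2), n - 2

# Made-up inputs, for illustration only.
t, df_t = one_sample_t([5.1, 4.9, 5.6, 5.8, 5.3, 5.5], mu_0=5.0)
chi2, df_chi2 = chi_square_gof(observed=[18, 22, 20, 40], expected=[25, 25, 25, 25])
t_r, df_r = pearson_t(r=0.6, n=27)

print(f"one-sample t = {t:.2f} (df = {df_t})")
print(f"chi-square   = {chi2:.2f} (df = {df_chi2})")
print(f"Pearson t    = {t_r:.2f} (df = {df_r})")
```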

6. Effect Sizes

A p-value indicates whether an observed effect is distinguishable from sampling noise, but not whether it is meaningful. Effect size measures the magnitude of the effect independently of sample size:

Cohen's d: (μ₁ − μ₂) / σ_pooled. Conventions: small = 0.2, medium = 0.5, large = 0.8.
r (correlation): small = 0.1, medium = 0.3, large = 0.5.
η² (eta squared): proportion of variance explained in ANOVA. η² = SS_between / SS_total.
Odds ratio / Risk ratio: natural effect measure for binary clinical outcomes.
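As a small example of the first measure, Cohen's d can be estimated from two samples by dividing the mean difference by the pooled sample SD. The function name and the scores below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_1, group_2):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(group_1), len(group_2)
    pooled_var = (
        (n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)
    ) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / sqrt(pooled_var)

# Made-up scores for two groups.
treatment = [23, 25, 28, 30, 26, 27]
control = [20, 22, 25, 27, 23, 24]

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")   # above 0.8, so "large" by the conventions above
```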

With n = 1,000,000, a trivial effect (d = 0.01) will typically produce p < 0.001. Always report effect sizes alongside p-values. A medical treatment that reduces systolic blood pressure by 0.2 mmHg (d ≈ 0.01 if the population SD is roughly 15 to 20 mmHg) can be "statistically highly significant" in a large enough trial yet clinically meaningless.
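The dependence of p on sample size is easy to see numerically. This sketch assumes a two-sample z-test with equal groups and treats n as the size of each group; it reports the p-value obtained when the observed standardized difference equals d exactly.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p(d, n_per_group):
    """Two-sided p-value when the observed standardized difference equals d."""
    z = d * sqrt(n_per_group / 2)
    return 2 * NormalDist().cdf(-abs(z))

# Same trivial effect, increasing sample size.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: p = {two_sample_p(d=0.01, n_per_group=n):.3g}")
```

The effect size is identical in all three rows; only the sample size, and therefore the p-value, changes.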

7. The Replication Crisis

A 2015 study (Open Science Collaboration) attempted to replicate 100 published psychology studies. Only 36% of the replications produced a significant result (p < 0.05) in the original direction, and 39% were subjectively judged to have replicated the original finding. Similar findings emerged in cancer biology, economics, and nutrition research. Contributing factors:

p-hacking: Testing multiple hypotheses but only reporting significant ones; stopping data collection when p crosses 0.05.
HARKing (Hypothesizing After Results are Known): Presenting post-hoc exploratory analysis as pre-specified confirmatory testing.
Publication bias: Journals preferentially publish p < 0.05 results; null results disappear into "file drawers."
Underpowered studies: Small samples produce unreliable estimates even when p is significant.
Multiple comparisons: Testing 20 independent true null hypotheses at α = 0.05 yields 1 false positive on average, and a 1 − 0.95²⁰ ≈ 64% chance of at least one.

Solutions: Pre-registration of study design and analysis plan before data collection; Bonferroni correction or Benjamini-Hochberg false discovery rate control for multiple comparisons; Bayesian analysis as an alternative framework; reporting effect sizes with confidence intervals alongside p-values; open data and code for reproducibility.
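The two multiple-comparison corrections named above can be sketched in a few lines. The function names and the ten p-values are hypothetical; in practice a statistics library would normally be used instead of a hand-written version.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Chance of at least one false positive in 20 independent tests of true nulls.
family_wise_error = 1 - (1 - 0.05) ** 20

# Made-up p-values from ten hypothetical tests.
p_values = [0.001, 0.008, 0.012, 0.019, 0.041, 0.049, 0.090, 0.210, 0.450, 0.800]

print(f"uncorrected FWER for 20 tests: {family_wise_error:.2f}")
print("uncorrected rejections:", sum(p < 0.05 for p in p_values))
print("Bonferroni rejections: ", sum(bonferroni(p_values)))
print("BH rejections:         ", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the more conservative of the two: it guards against any false positive, while Benjamini-Hochberg tolerates a controlled proportion of false discoveries and so retains more power.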
