Hypothesis Testing — p-values, Type I/II Errors, and Statistical Power
Null hypothesis significance testing (NHST) is the most widely used — and most widely misunderstood — framework for statistical inference in science. From clinical drug trials to particle physics to A/B tests, researchers use p-values to decide whether observations are "statistically significant." Understanding what p-values actually mean, why α=0.05 is an arbitrary convention with consequences, and how to design studies with adequate power is essential for interpreting scientific literature critically.
1. The NHST Framework
The Neyman-Pearson hypothesis testing framework proceeds in five steps:
- Formulate the null hypothesis H₀: The default, conservative claim — often "no effect" or "no difference." Example: H₀: the new drug doesn't change blood pressure (μ_treatment = μ_control).
- Formulate the alternative hypothesis H₁: The claim you want to find evidence for. Example: H₁: the drug lowers blood pressure (μ_treatment < μ_control).
- Choose a significance level α: The maximum acceptable probability of a false positive. Convention sets α = 0.05 in most social and medical science; particle physics uses α ≈ 3 × 10⁻⁷ (5σ, one-sided) for discoveries.
- Collect data and compute a test statistic: A number that summarizes how "extreme" the data is relative to H₀.
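- Compute the p-value: the probability, assuming H₀ is true, of a test statistic at least as extreme as the one observed, in the direction specified by H₁. For an upper-tailed test this is P(test statistic ≥ observed | H₀ is true). If p < α, reject H₀; otherwise fail to reject it, which is not the same as accepting H₀.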
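The five steps above can be sketched for the blood pressure example. This is a minimal illustration, not a recommended analysis: the readings are made up, the function names are hypothetical, and the statistic is compared against a standard normal distribution (for samples this small a t distribution would normally be used).

```python
from math import sqrt
from statistics import NormalDist, mean, variance

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def sigma_to_alpha(n_sigma):
    """One-sided tail probability beyond n_sigma standard deviations."""
    return STD_NORMAL.cdf(-n_sigma)

def one_sided_z_test(treatment, control, alpha=0.05):
    """Test H0: mu_treatment = mu_control against H1: mu_treatment < mu_control.

    Assumption: the statistic is approximately standard normal under H0.
    """
    # Step 4: difference in sample means divided by its standard error.
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    z = (mean(treatment) - mean(control)) / se
    # Step 5: H1 points to the lower tail, so "extreme" means a small z.
    p_value = STD_NORMAL.cdf(z)
    return z, p_value, p_value < alpha

# Made-up systolic blood pressure readings (mmHg), for illustration only.
treatment = [128, 131, 125, 122, 130, 127, 124, 126, 129, 123]
control = [134, 138, 131, 129, 136, 133, 135, 130, 137, 132]

z, p_value, reject = one_sided_z_test(treatment, control)
print(f"z = {z:.2f}, p = {p_value:.2g}, reject H0: {reject}")
print(f"5 sigma corresponds to alpha = {sigma_to_alpha(5):.2e}")
```

Because H₁ here is one-sided (the drug lowers blood pressure), the p-value is the lower-tail probability; a two-sided alternative would double the smaller tail instead.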
2. What is a p-value?
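The p-value is the probability of observing data at least as extreme as what was observed, assuming H₀ is true. It is frequently misinterpreted, with real consequences. In particular, it is not the probability that H₀ is true, and 1 − p is not the probability that H₁ is true.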
The p-value is a property of the data and the null hypothesis — not of the alternative hypothesis or the world. A small p-value means only that the data are unlikely under H₀; it says nothing about whether H₁ is true or how large the effect is.
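To see why a p-value below α is not the probability that H₀ is true, Bayes' rule can be applied under stated assumptions. The sketch below is illustrative only: the function name is hypothetical, and both the prior share of true effects and the power are assumed inputs, not quantities a p-value provides.

```python
def prob_null_given_significant(prior_h1, power, alpha=0.05):
    """P(H0 is true | p < alpha), by Bayes' rule.

    prior_h1: assumed fraction of tested hypotheses for which H1 is true.
    power:    assumed probability of p < alpha when H1 is true.
    """
    false_positives = alpha * (1 - prior_h1)  # H0 true, yet p < alpha
    true_positives = power * prior_h1         # H1 true and p < alpha
    return false_positives / (false_positives + true_positives)

for prior in (0.5, 0.1, 0.01):
    share = prob_null_given_significant(prior_h1=prior, power=0.8)
    print(f"prior P(H1) = {prior:<4}: P(H0 | p < 0.05) = {share:.2f}")
```

With the same α and the same power, the share of significant results that come from a true H₀ changes substantially with the prior, which is information the p-value alone does not contain.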
3. Type I and Type II Errors
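A Type I error (false positive) is rejecting H₀ when it is true; its probability is α. A Type II error (false negative) is failing to reject H₀ when H₁ is true; its probability is β.
The appropriate balance between Type I and Type II errors depends on the consequences. In drug safety testing, failing to detect a harmful side effect (Type II) may be worse than a false alarm; in particle physics, the 5σ threshold keeps "discoveries" that later disappear (Type I) from dominating the literature.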
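Both error rates can be estimated by simulating many studies. The following sketch assumes a one-sided z-test of H₀: μ = 0 with known σ; the function name, sample size, and effect of 0.5σ are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def rejection_rate(true_shift, n=25, sigma=1.0, alpha=0.05, trials=4000, seed=0):
    """Fraction of simulated studies in which a one-sided z-test rejects H0: mu = 0.

    Each study draws n observations from Normal(true_shift, sigma). Sigma is
    treated as known, which keeps the test exact but is rarely true in practice.
    """
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha)  # reject when z exceeds this
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_shift, sigma) for _ in range(n)]
        z = (sum(sample) / n) / (sigma / sqrt(n))
        rejections += z > critical_z
    return rejections / trials

# H0 true: every rejection is a Type I error.
type_1_rate = rejection_rate(true_shift=0.0)
# H1 true (shift of 0.5 sigma): every non-rejection is a Type II error.
type_2_rate = 1 - rejection_rate(true_shift=0.5)

print(f"Type I rate  = {type_1_rate:.3f}")
print(f"Type II rate = {type_2_rate:.3f}")
```

The simulated Type I rate should land near the chosen α, while the Type II rate depends on the assumed effect and sample size.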
4. Statistical Power
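Statistical power = 1 − β = probability of correctly rejecting H₀ when H₁ is true. Power depends on four quantities: the significance level α, the true effect size, the sample size n, and the variability of the measurements.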
Power analysis before data collection determines whether a study is adequately powered to detect the expected effect. Studies with less than 80% power (a common convention) are likely to miss real but small effects. When they do reach significance, the estimated effect tends to be inflated (the "winner's curse"), so the finding often fails to replicate.
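A power analysis of this kind can be approximated in a few lines. The sketch below assumes a two-sided, two-sample comparison with equal group sizes and a normal approximation, with the effect expressed in standard deviation units (d); the function names are hypothetical, and an exact t-based calculation gives slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for standardized effect d.

    Ignores the negligible chance of rejecting in the wrong tail.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(abs(d) * sqrt(n_per_group / 2) - z_crit)

def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Smallest per-group sample size reaching the target power (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    n = n_per_group_for_power(d)
    print(f"d = {d}: n per group = {n}, power at that n = {power_two_sample(d, n):.3f}")
```

The required sample size scales with 1/d², so halving the expected effect roughly quadruples the number of participants needed.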
5. Common Tests
6. Effect Sizes
A p-value indicates whether the data are hard to explain by sampling noise alone under H₀, but not whether the effect is meaningful. Effect size measures the magnitude of the effect independently of sample size:
- Cohen's d: (μ₁ − μ₂) / σ_pooled. Conventions: small = 0.2, medium = 0.5, large = 0.8.
- r (correlation): small = 0.1, medium = 0.3, large = 0.5.
- η² (eta squared): proportion of variance explained in ANOVA. η² = SS_between / SS_total.
- Odds ratio / Risk ratio: natural effect measure for binary clinical outcomes.
With n = 1,000,000, even a completely trivial effect (d = 0.01) will typically produce p < 0.001. Always report effect sizes alongside p-values. A medical treatment that reduces systolic blood pressure by 0.2 mmHg (d ≈ 0.01, given a typical standard deviation of roughly 15 to 20 mmHg) can be "statistically highly significant" in a large enough trial yet clinically meaningless.
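The interplay between effect size and sample size can be checked numerically. The sketch below computes Cohen's d from two samples and then shows how the p-value for a fixed d = 0.01 shrinks as n grows. It assumes n counts observations per group and uses a normal approximation; the function names and the small samples are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def cohens_d(group_1, group_2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    pooled_var = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / sqrt(pooled_var)

def two_sided_p(d, n_per_group):
    """Two-sided p-value if the observed standardized difference equals d."""
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * NormalDist().cdf(-z)

print(f"d for two small made-up samples: {cohens_d([5, 7, 9], [4, 6, 8]):.2f}")
for n in (100, 10_000, 1_000_000):
    print(f"d = 0.01, n per group = {n:>9,}: p = {two_sided_p(0.01, n):.3g}")
```

The effect size stays fixed across the loop; only the sample size changes, which is why the p-value alone cannot indicate practical importance.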
7. The Replication Crisis
A 2015 study (Open Science Collaboration) attempted to replicate 100 published psychology studies. Only 36% of the replications produced a significant result (p < 0.05) in the same direction as the original, and 39% were subjectively rated as having replicated the original finding. Similar findings emerged in cancer biology, economics, and nutrition research. Contributing factors:
- p-hacking: Testing multiple hypotheses but only reporting significant ones; stopping data collection when p crosses 0.05.
- HARKing (Hypothesizing After Results are Known): Presenting post-hoc exploratory analysis as pre-specified confirmatory testing.
- Publication bias: Journals preferentially publish p < 0.05 results; null results disappear into "file drawers."
- Underpowered studies: Small samples produce unreliable estimates even when p is significant.
- Multiple comparisons: Testing 20 independent hypotheses at α = 0.05 when every H₀ is true yields one false positive on average, and a 1 − 0.95²⁰ ≈ 64% chance of at least one.
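The multiple comparisons arithmetic in the last item can be reproduced directly. This sketch assumes the tests are independent and that every null hypothesis is true; the function names are hypothetical.

```python
def expected_false_positives(m, alpha=0.05):
    """Expected number of false positives among m tests when every H0 is true."""
    return m * alpha

def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) among m independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(f"m = {m:>3}: expected false positives = {expected_false_positives(m):.2f}, "
          f"P(at least one) = {familywise_error_rate(m):.2f}")
```

The chance of at least one spurious "significant" result grows quickly with the number of tests, which is why reporting only the significant ones is so misleading.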