Bootstrap Resampling: Statistical Inference Without Assumptions

Most classical statistics rests on assumptions: that the data follow a normal distribution, that the sample is large enough for the central limit theorem to kick in, or that a formula for the standard error of your particular statistic even exists. Bootstrap resampling, introduced by Bradley Efron in 1979, sidesteps most of these requirements. The core idea is elegant: if you cannot draw more samples from the unknown population, pretend your observed data is the population and draw from it instead, repeatedly and with replacement, using a computer. Each synthetic resample yields an estimate of your statistic, and the spread of those estimates approximates the sampling variability you care about. The result is a general-purpose tool for constructing confidence intervals, standard errors, and hypothesis tests that works for means, medians, ratios, correlations, regression coefficients, and most other quantities you can compute from a sample.

The core algorithm: sampling with replacement

Suppose you have a dataset of n observations, x1, x2, ..., xn, and you want to estimate the uncertainty in some statistic T — say, the median or the interquartile range. The nonparametric bootstrap proceeds as follows:

  1. Compute the observed statistic Tobs from the original data.
  2. Draw a bootstrap sample x*1, x*2, ..., x*n by sampling n values uniformly at random from the original dataset, with replacement. This means each observation can appear zero, one, or many times in a single resample.
  3. Compute the statistic T* from the bootstrap sample.
  4. Repeat steps 2–3 a large number of times B (typically 1 000 to 10 000), yielding a collection of bootstrap statistics T*1, T*2, ..., T*B.
  5. Summarise the distribution of T* to obtain confidence intervals or standard errors.
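
The five steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than a reference implementation: the function name bootstrap_statistics, the seed, and the ten data values are all made up for the example.

```python
import random
import statistics

def bootstrap_statistics(data, statistic, n_resamples=2000, seed=0):
    """Return a list of B bootstrap replicates of `statistic` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Step 2: draw n values uniformly at random, with replacement.
        resample = rng.choices(data, k=n)
        # Step 3: recompute the statistic on the resample.
        replicates.append(statistic(resample))
    return replicates

# Illustrative data, invented for this example.
data = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.8, 7.1, 9.5, 12.3]

t_obs = statistics.median(data)                          # Step 1
t_star = bootstrap_statistics(data, statistics.median)   # Steps 2-4
se_boot = statistics.stdev(t_star)                       # Step 5: bootstrap standard error

print(f"observed median = {t_obs:.2f}, bootstrap SE = {se_boot:.2f}")
```

Passing the statistic as a function keeps the resampling loop unchanged when you switch from the median to, say, a trimmed mean or an interquartile range.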

The theoretical justification is that the empirical distribution of the data (the distribution that places probability 1/n on each observed value) is the natural nonparametric estimate of the true population distribution. Resampling from it therefore mimics resampling from the true distribution, at least as well as the data represent it. As n grows, the empirical distribution converges to the true distribution, and for sufficiently smooth statistics the bootstrap distribution of T* converges to the true sampling distribution of T.

On average, each bootstrap resample contains about 1 − 1/e ≈ 63.2% of the distinct original observations, because any given observation is left out of a resample with probability (1 − 1/n)^n, which tends to 1/e ≈ 36.8% as n grows. This fact underpins the out-of-bag error estimate in random forests, where the unselected observations form a natural validation set for each tree.
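
The 63.2% figure can be checked both exactly and by simulation. The sketch below compares the exact expected fraction, 1 − (1 − 1/n)^n, with a simulated average; the function names and the choice of sample sizes are illustrative assumptions.

```python
import math
import random

def expected_unique_fraction(n):
    """Exact expected fraction of distinct observations in one resample of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_unique_fraction(n, n_resamples=500, seed=0):
    """Average fraction of distinct indices over many simulated resamples."""
    rng = random.Random(seed)
    total_unique = 0
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        total_unique += len(set(indices))
    return total_unique / (n_resamples * n)

limit = 1 - math.exp(-1)  # the large-n limit, about 0.632

for n in (10, 100, 1000):
    exact = expected_unique_fraction(n)
    simulated = simulated_unique_fraction(n)
    print(f"n={n}: exact={exact:.4f}, simulated={simulated:.4f}")
print(f"limit = {limit:.4f}")
```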

Constructing bootstrap confidence intervals

Several methods exist for turning the B bootstrap statistics into a confidence interval. The simplest is the percentile method: sort the bootstrap statistics and take the 100(α/2)th and 100(1 − α/2)th percentiles as the interval endpoints. For a 95% interval with 1 000 bootstrap resamples, this means taking the 25th and 975th ordered values:

CI = [T*(25), T*(975)]

The percentile method is simple but can be biased when the bootstrap distribution is not centred on the observed statistic. The basic (or pivotal) method corrects for this by reflecting around Tobs:

CI = [2Tobs − T*(1-α/2), 2Tobs − T*(α/2)]
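
As a rough sketch, both the percentile and the basic interval can be read off the same sorted list of bootstrap statistics. The helper name percentile_and_basic_ci and the sample values are invented for illustration, and the simple "kth ordered value" rule is only one of several quantile conventions.

```python
import random
import statistics

def percentile_and_basic_ci(data, statistic, alpha=0.05, n_resamples=1000, seed=0):
    """Return (t_obs, percentile interval, basic interval). Hypothetical helper."""
    rng = random.Random(seed)
    n = len(data)
    t_obs = statistic(data)
    t_star = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_resamples))

    # With B = 1000 and alpha = 0.05 these are the 25th and 975th ordered
    # values, i.e. zero-based indices 24 and 974 in the sorted list.
    lo_idx = max(round(n_resamples * alpha / 2) - 1, 0)
    hi_idx = min(round(n_resamples * (1 - alpha / 2)) - 1, n_resamples - 1)
    q_lo, q_hi = t_star[lo_idx], t_star[hi_idx]

    percentile = (q_lo, q_hi)
    # Basic (pivotal) interval: reflect the bootstrap quantiles around t_obs.
    basic = (2 * t_obs - q_hi, 2 * t_obs - q_lo)
    return t_obs, percentile, basic

# Right-skewed illustrative sample, invented for this example.
sample = [0.3, 0.5, 0.8, 0.9, 1.1, 1.2, 1.4, 1.9, 2.2, 2.4, 3.0, 4.7, 6.1, 8.9, 15.2]

t_obs, percentile_ci, basic_ci = percentile_and_basic_ci(sample, statistics.fmean)
print("mean:", round(t_obs, 3))
print("percentile CI:", percentile_ci)
print("basic CI:", basic_ci)
```

The two intervals always have the same width; they differ only in where they sit relative to Tobs, which matters when the bootstrap distribution is skewed.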

The most statistically sophisticated common method is the bias-corrected and accelerated (BCa) interval, which applies two corrections. The bias-correction z0 accounts for the fraction of bootstrap statistics that fall below Tobs: z0 = Φ^(−1)(#{T*b < Tobs} / B), where Φ^(−1) is the inverse of the standard normal cumulative distribution function. The acceleration a measures how the standard error of the statistic changes with the parameter value and is usually estimated via the jackknife. These corrections shift the percentile cutpoints and produce intervals with better coverage properties, particularly for skewed or biased statistics.
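
The following is a minimal sketch of a BCa interval using the usual textbook formulas: z0 from the fraction of replicates below Tobs, and a from the skewness of the leave-one-out (jackknife) values. The function name, the clamping of the fraction away from 0 and 1, and the sample data are assumptions made for this example, not part of any particular library.

```python
import random
import statistics
from statistics import NormalDist

def bca_interval(data, statistic, alpha=0.05, n_resamples=2000, seed=0):
    """Return (t_obs, z0, a, (lower, upper)). Hypothetical helper."""
    data = list(data)
    rng = random.Random(seed)
    n = len(data)
    t_obs = statistic(data)
    t_star = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_resamples))
    norm = NormalDist()

    # Bias correction z0. The fraction is clamped away from 0 and 1
    # (a guard added in this sketch) so the inverse CDF stays finite.
    below = sum(t < t_obs for t in t_star)
    frac = min(max(below / n_resamples, 1 / (n_resamples + 1)),
               n_resamples / (n_resamples + 1))
    z0 = norm.inv_cdf(frac)

    # Acceleration a from the jackknife values of the statistic.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = statistics.fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_level(level):
        # Map a nominal level (e.g. 0.025) to its BCa-adjusted level.
        z = norm.inv_cdf(level)
        return norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def ordered_value(p):
        idx = min(max(round(p * n_resamples) - 1, 0), n_resamples - 1)
        return t_star[idx]

    lower = ordered_value(adjusted_level(alpha / 2))
    upper = ordered_value(adjusted_level(1 - alpha / 2))
    return t_obs, z0, a, (lower, upper)

# Right-skewed illustrative sample, invented for this example.
sample = [0.3, 0.5, 0.8, 0.9, 1.1, 1.2, 1.4, 1.9, 2.2, 2.4, 3.0, 4.7, 6.1, 8.9, 15.2]
t_obs, z0, a, bca_ci = bca_interval(sample, statistics.fmean)
print(f"mean={t_obs:.3f}, z0={z0:.3f}, a={a:.3f}, BCa CI={bca_ci}")
```

When z0 and a are both zero the adjusted levels reduce to α/2 and 1 − α/2, so the BCa interval coincides with the plain percentile interval.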

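The bootstrap-t (studentised) method goes further: it standardises each bootstrap statistic as (T*b − Tobs) / se*b, where se*b is a standard error estimated from that same resample, then uses the quantiles of the standardised values in place of normal or t quantiles. In theory it is often the most accurate of these methods, but it is also the most computationally intensive: unless a formula for the standard error is available, it requires a nested bootstrap within each outer resample, and it can be unstable when the standard error estimate is noisy.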
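
A nested-bootstrap version of the studentised interval might look like the sketch below. The function name, the resample counts (400 outer, 50 inner), and the data are illustrative assumptions; real analyses often use more resamples or an analytic standard error in place of the inner loop.

```python
import random
import statistics

def bootstrap_t_interval(data, statistic, alpha=0.05, n_outer=400, n_inner=50, seed=0):
    """Return (lower, upper) for a bootstrap-t interval. Hypothetical helper."""
    rng = random.Random(seed)
    n = len(data)
    t_obs = statistic(data)
    outer_stats, t_values = [], []
    for _ in range(n_outer):
        resample = rng.choices(data, k=n)
        t_b = statistic(resample)
        # Inner bootstrap: standard error of the statistic for this resample.
        inner = [statistic(rng.choices(resample, k=n)) for _ in range(n_inner)]
        se_b = statistics.stdev(inner)
        outer_stats.append(t_b)
        if se_b > 0:  # skip degenerate resamples (a guard added in this sketch)
            t_values.append((t_b - t_obs) / se_b)

    se_hat = statistics.stdev(outer_stats)  # SE of the original statistic
    t_values.sort()
    m = len(t_values)
    q_lo = t_values[max(round(m * alpha / 2) - 1, 0)]
    q_hi = t_values[min(round(m * (1 - alpha / 2)) - 1, m - 1)]
    # Note the reversal: the upper quantile sets the lower endpoint.
    return t_obs - q_hi * se_hat, t_obs - q_lo * se_hat

# Right-skewed illustrative sample, invented for this example.
sample = [0.3, 0.5, 0.8, 0.9, 1.1, 1.2, 1.4, 1.9, 2.2, 2.4, 3.0, 4.7, 6.1, 8.9, 15.2]
bt_ci = bootstrap_t_interval(sample, statistics.fmean)
print("bootstrap-t CI for the mean:", bt_ci)
```

The cost is roughly n_outer × n_inner evaluations of the statistic, which is why the method is usually reserved for cheap statistics or cases where accuracy matters most.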

When to use bootstrap and when to use classical methods

Bootstrap resampling shines in situations where classical theory is absent or unreliable. If you need a confidence interval for a correlation coefficient, the classical interval uses Fisher's z-transformation and assumes bivariate normality, so it can be inaccurate for small samples or heavy-tailed data. The bootstrap does not require the normality assumption. Similarly, for statistics like the trimmed mean, Spearman's rank correlation, the ratio of two sample means, or the area under an ROC curve, closed-form standard errors are either unavailable or depend on large-sample approximations. Bootstrap delivers an interval directly.

Classical methods retain advantages when their assumptions hold. For a sample from a normal population, the t-interval for the mean, built on the standard error s / √n, has exact coverage even for small samples. The F-test in regression and ANOVA has an exact distribution under normality and typically gives more power than a bootstrap test in those scenarios. A rule of thumb: use classical theory when assumptions are plausible and established formulas exist; use bootstrap when assumptions are questionable, the statistic is complex, or you simply want a robust check on classical results.

Bootstrap is not a panacea. It estimates sampling variability but cannot compensate for a biased estimator or a sample that does not represent the population. If the data suffer from selection bias, neither bootstrap nor any other resampling method can fix the fundamental problem.

Block bootstrap and dependent data

The standard bootstrap assumes independent and identically distributed observations. This assumption is violated by time series, spatial data, clustered data, or repeated measurements on the same subjects. Naive resampling destroys the correlation structure and produces invalid intervals.

Several variants address dependence. The block bootstrap divides the series into blocks of length l and resamples whole blocks rather than individual observations; the moving block bootstrap draws from all overlapping blocks, while the non-overlapping block bootstrap draws from a partition of the series into disjoint blocks. If the block length is chosen to exceed the correlation range, resampled blocks are approximately independent and the resampled series retains the short-range correlation structure within each block. The stationary bootstrap uses random block lengths drawn from a geometric distribution so that the resampled series is itself stationary.
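
To make the block idea concrete, here is a small sketch of a moving block resampler applied to a simulated autocorrelated series. Everything specific in it is an assumption for illustration: the function name, the AR(1) series with coefficient 0.8, and the block length of 20.

```python
import math
import random
import statistics

def moving_block_resample(series, block_length, rng):
    """Draw overlapping blocks with replacement and glue them into a series of length n."""
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    starts = range(n - block_length + 1)      # every overlapping block start
    n_blocks = math.ceil(n / block_length)
    resampled = []
    for _ in range(n_blocks):
        s = rng.choice(starts)
        resampled.extend(series[s:s + block_length])
    return resampled[:n]                       # trim the overshoot

# Simulated AR(1) series: x_t = 0.8 * x_{t-1} + noise (illustrative only).
rng = random.Random(42)
series = [0.0]
for _ in range(299):
    series.append(0.8 * series[-1] + rng.gauss(0, 1))

block_means = [statistics.fmean(moving_block_resample(series, 20, rng))
               for _ in range(500)]
iid_means = [statistics.fmean(rng.choices(series, k=len(series)))
             for _ in range(500)]

se_block = statistics.stdev(block_means)
se_iid = statistics.stdev(iid_means)
print(f"SE of mean, block bootstrap: {se_block:.3f}")
print(f"SE of mean, naive bootstrap: {se_iid:.3f}")
```

For a positively autocorrelated series like this one, the naive standard error is too small because it treats neighbouring, highly similar observations as independent pieces of information.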

For clustered data — pupils within schools, patients within hospitals — the cluster bootstrap resamples entire clusters with replacement, preserving the within-cluster correlation. This is now standard practice in educational and medical research. The choice of resampling unit should match the unit of randomisation: if treatments were assigned at the school level, resample schools, not pupils.
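
A cluster bootstrap can be sketched by resampling cluster identifiers rather than individual rows. The school labels and scores below are invented, and cluster_bootstrap is a hypothetical helper name.

```python
import random
import statistics

def cluster_bootstrap(clusters, statistic, n_resamples=1000, seed=0):
    """Resample whole clusters with replacement and pool their members."""
    rng = random.Random(seed)
    ids = list(clusters)
    replicates = []
    for _ in range(n_resamples):
        chosen = rng.choices(ids, k=len(ids))   # the same school may be drawn twice
        pooled = [x for cid in chosen for x in clusters[cid]]
        replicates.append(statistic(pooled))
    return replicates

# Invented test scores for pupils in five schools.
schools = {
    "A": [48, 52, 50, 51],
    "B": [61, 58, 60, 63],
    "C": [70, 72, 69, 71],
    "D": [79, 82, 80, 81],
    "E": [90, 88, 91, 92],
}

cluster_means = cluster_bootstrap(schools, statistics.fmean)
se_cluster = statistics.stdev(cluster_means)

# Naive comparison: resample pupils as if they were independent.
all_scores = [x for scores in schools.values() for x in scores]
rng = random.Random(1)
naive_means = [statistics.fmean(rng.choices(all_scores, k=len(all_scores)))
               for _ in range(1000)]
se_naive = statistics.stdev(naive_means)

print(f"cluster bootstrap SE: {se_cluster:.2f}")
print(f"naive pupil-level SE: {se_naive:.2f}")
```

Because most of the variation here is between schools, the pupil-level resampling understates the uncertainty in the overall mean.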

Real-world applications

Bootstrap resampling has become a standard tool across quantitative disciplines.

Frequently Asked Questions

What is bootstrap resampling?

Bootstrap resampling is a computational technique that repeatedly draws samples with replacement from the observed data to approximate the sampling distribution of any statistic, enabling confidence interval construction without distributional assumptions.

What does 'with replacement' mean in bootstrapping?

Sampling with replacement means each data point can be selected more than once in a single bootstrap sample. After drawing one observation, it is returned to the pool, so the same value may appear multiple times in the resample.

How many bootstrap samples do you need?

For rough estimates, 200–500 resamples often suffice. For reliable 95% confidence intervals, 1 000–2 000 is standard. For very accurate tail percentiles such as a 99% CI, 10 000 or more resamples are recommended.

What is the percentile bootstrap confidence interval?

The percentile method takes the 2.5th and 97.5th percentiles of the B bootstrap statistic values as the lower and upper bounds of a 95% confidence interval. It requires no normality assumption and is the simplest bootstrap interval to compute.

What is the BCa bootstrap interval?

The bias-corrected and accelerated (BCa) interval adjusts for bias and skewness in the bootstrap distribution. It applies two correction factors — the bias-correction z0 and the acceleration a — to shift and stretch the percentile cutpoints, giving better coverage than the raw percentile method.

Can bootstrap handle small samples?

Bootstrap works better than many parametric methods on small samples, but it still relies on the sample representing the population. With very few observations (n less than 10), bootstrap intervals can be unreliable because the empirical distribution is a poor proxy for the true distribution.

What is the difference between bootstrap and permutation tests?

Bootstrap resampling estimates uncertainty around an estimate by resampling with replacement. Permutation tests assess statistical significance by shuffling group labels without replacement, generating a null distribution to compute p-values. Both are resampling methods but answer different questions.
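
The contrast can be seen by running both procedures on the same two-group data. In this sketch the permutation test produces a p-value for the null hypothesis of no group difference, while the bootstrap produces a percentile interval for the size of the difference. The function names and the measurements are invented for illustration.

```python
import random
import statistics

def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels without replacement
        diff = statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)

def bootstrap_diff_ci(a, b, alpha=0.05, n_resamples=2000, seed=0):
    """Percentile interval for the difference in means, resampling each group separately."""
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(a, k=len(a))) - statistics.fmean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lo = diffs[max(round(n_resamples * alpha / 2) - 1, 0)]
    hi = diffs[min(round(n_resamples * (1 - alpha / 2)) - 1, n_resamples - 1)]
    return lo, hi

# Invented measurements for two groups.
group_a = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5, 16.0, 14.4]
group_b = [10.2, 11.8, 12.5, 11.1, 10.9, 12.0, 11.5, 13.1]

p_value = permutation_p_value(group_a, group_b)
diff_ci = bootstrap_diff_ci(group_a, group_b)
print(f"permutation p-value: {p_value:.4f}")
print(f"bootstrap 95% CI for the difference: {diff_ci}")
```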

When should you NOT use the bootstrap?

Bootstrap can fail when the statistic depends on extreme tails (e.g. sample maximum), when the distribution has infinite variance, or when observations are strongly dependent without using a block bootstrap variant. It also cannot fix bias caused by a non-representative sample.

What is the parametric bootstrap?

The parametric bootstrap fits a distributional model (e.g. normal, Poisson) to the data, then generates synthetic samples from that fitted model rather than resampling from the raw data. It can be more efficient if the assumed model is correct.
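
As a rough sketch, a parametric bootstrap under an assumed exponential model fits the rate by maximum likelihood and then simulates fresh samples from the fitted model. The exponential assumption, the function name, and the waiting times are all choices made for this example.

```python
import math
import random
import statistics

def parametric_bootstrap_exponential(data, statistic, n_resamples=2000, seed=0):
    """Simulate samples from a fitted exponential model and recompute the statistic."""
    rng = random.Random(seed)
    n = len(data)
    rate_hat = 1 / statistics.fmean(data)   # maximum-likelihood estimate of the rate
    return [
        statistic([rng.expovariate(rate_hat) for _ in range(n)])
        for _ in range(n_resamples)
    ]

# Invented waiting times (e.g. minutes between events).
waiting_times = [0.4, 1.2, 0.7, 3.5, 2.1, 0.2, 5.8, 1.6, 0.9, 2.7, 0.5, 4.1]

replicates = parametric_bootstrap_exponential(waiting_times, statistics.median)
se_median = statistics.stdev(replicates)
model_median = math.log(2) * statistics.fmean(waiting_times)  # median implied by the fit

print(f"sample median: {statistics.median(waiting_times):.2f}")
print(f"median implied by the fitted model: {model_median:.2f}")
print(f"parametric bootstrap SE of the median: {se_median:.2f}")
```

Unlike the nonparametric version, the simulated samples can contain values that never appeared in the data, which helps with small samples but makes the result only as good as the assumed model.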

How does bootstrapping relate to the central limit theorem?

The central limit theorem guarantees approximate normality for the sample mean at large n, enabling standard error formulas. Bootstrap estimates the sampling distribution directly, making it useful when the CLT applies poorly — small samples, non-normal data, or statistics other than the mean.

Conclusion

Bootstrap resampling is one of the most practically powerful ideas in modern statistics. By replacing mathematical derivation with computational simulation — sampling from the observed data as if it were the population — it makes uncertainty quantification accessible for virtually any estimator, regardless of its distributional complexity. From the sample median to machine learning generalisation error to phylogenetic tree reliability, the bootstrap provides a consistent, principled approach. Its limitations are real — it cannot conjure information the data do not contain, and it requires care with dependent observations — but within its domain it is extraordinarily versatile. As data analysis grows more complex and parametric assumptions grow less defensible, the bootstrap's role in rigorous, assumption-light inference will only continue to expand.