Statistical Foundations
Sample Size Determination
How to compute the number of observations needed to estimate means, proportions, and treatment effects with specified precision and power, including corrections for finite populations and complex designs.
Why This Matters
Before collecting data, you need to know how much to collect. Too little and your estimates are imprecise or your test has no power. Too much and you waste resources. Sample size determination answers the question: what is the minimum that achieves the precision or power you need?
This is not just a survey statistics problem. A/B tests, clinical trials, and ML evaluation benchmarks all require sample size calculations. If you run an A/B test with too few users, you will fail to detect real effects. If you evaluate a model on too few test points, your accuracy estimate has wide confidence intervals.
Mental Model
Sample size formulas all follow the same logic. You specify how precise you want your estimate to be (the margin of error $E$) or how small an effect you want to detect (the effect size $\delta$). You specify a confidence level or significance level. Then you solve for $n$.
The key tradeoff: a smaller margin of error requires a larger $n$, and $n$ grows as $1/E^2$. Halving your margin of error quadruples your sample size.
Formal Setup
Sample Size for Estimating a Mean
Sample Size for a Mean
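To estimate a population mean $\mu$ with margin of error $E$ at confidence level $1 - \alpha$, using an SRS from an infinite (or very large) population:
$$n = \frac{z_{\alpha/2}^2 \, \sigma^2}{E^2}$$
where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal and $\sigma^2$ is the population variance.
For 95% confidence ($z_{0.025} = 1.96$), this simplifies to $n \approx 3.84 \, \sigma^2 / E^2$.
The formula comes directly from the requirement that the half-width of the confidence interval for $\mu$ be at most $E$: we need $z_{\alpha/2} \, \sigma / \sqrt{n} \le E$.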
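The calculation can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `sample_size_mean` and the planning values (standard deviation 10, margins 1 and 0.5) are hypothetical and chosen only to show the quadrupling effect.

```python
import math
from statistics import NormalDist


def sample_size_mean(sigma, margin, confidence=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= margin.

    Assumes an SRS from an effectively infinite population and a
    planning value `sigma` for the population standard deviation.
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 quantile
    return math.ceil((z * sigma / margin) ** 2)  # round up to a whole unit


# Hypothetical planning values, for illustration only.
n_full = sample_size_mean(sigma=10.0, margin=1.0)
n_half = sample_size_mean(sigma=10.0, margin=0.5)
print(n_full, n_half)  # halving the margin roughly quadruples n
```

Rounding up rather than to the nearest integer keeps the half-width at or below the target.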
Sample Size for Estimating a Proportion
Sample Size for a Proportion
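To estimate a population proportion $p$ with margin of error $E$ at confidence level $1 - \alpha$:
$$n = \frac{z_{\alpha/2}^2 \, p(1 - p)}{E^2}$$
Since $p$ is unknown before the study, use the conservative choice $p = 0.5$, which maximizes $p(1 - p)$. This gives the largest (most conservative) sample size:
$$n = \frac{z_{\alpha/2}^2}{4 E^2}$$
For 95% confidence and, for example, $E = 0.03$: $n = 1.96^2 / (4 \times 0.03^2) \approx 1067.1$, which rounds up to $n = 1068$.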
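A short Python sketch of the proportion calculation follows. The function name `sample_size_proportion` and the margins of three and five percentage points are illustrative assumptions, not values taken from a specific study.

```python
import math
from statistics import NormalDist


def sample_size_proportion(margin, p=0.5, confidence=0.95):
    """Sample size to estimate a proportion to within `margin`.

    The default p = 0.5 is the conservative choice, since it
    maximizes p * (1 - p).
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(z ** 2 * p * (1.0 - p) / margin ** 2)


# Illustrative margins of error: 3 and 5 percentage points.
n_3pt = sample_size_proportion(margin=0.03)
n_5pt = sample_size_proportion(margin=0.05)
print(n_3pt, n_5pt)
```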
Finite Population Correction
Finite Population Correction (FPC)
When the population size $N$ is not much larger than the sample size, apply the finite population correction:
$$n = \frac{n_0}{1 + n_0 / N}$$
where $n_0$ is the sample size from the infinite-population formula. This is always smaller than $n_0$. When $N$ is large relative to $n_0$, the correction is negligible.
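As a rough sketch, the correction can be applied as a post-processing step. The code below assumes the common form $n = n_0 / (1 + n_0 / N)$ of the correction; the helper name `apply_fpc` and the population sizes are hypothetical.

```python
import math


def apply_fpc(n0, population_size):
    """Finite population correction, assuming n = n0 / (1 + n0 / N).

    `n0` is the sample size from the infinite-population formula.
    """
    return math.ceil(n0 / (1.0 + n0 / population_size))


n0 = 385  # e.g. a conservative proportion sample size at +/- 5 points
n_small_pop = apply_fpc(n0, population_size=2_000)
n_large_pop = apply_fpc(n0, population_size=1_000_000)
print(n_small_pop, n_large_pop)
```

With a population of 2,000 the requirement drops noticeably; with a population of one million it does not change after rounding.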
Design Effect Adjustment
For complex sampling designs (stratified, cluster, multi-stage), the effective sample size differs from the nominal sample size. Multiply the SRS-based sample size by the design effect:
$$n_{\text{complex}} = \text{DEFF} \times n_{\text{SRS}}$$
A cluster design with DEFF = 2 requires twice as many observations as SRS to achieve the same precision.
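The adjustment is a single multiplication, sketched below together with its inverse (the effective sample size of a design that has already been fielded). The function names and the DEFF values are illustrative assumptions; in practice DEFF is estimated from prior surveys with a similar design.

```python
import math


def adjust_for_design(n_srs, deff):
    """Nominal sample size needed under a complex design."""
    return math.ceil(n_srs * deff)


def effective_sample_size(n_nominal, deff):
    """SRS-equivalent sample size of a complex-design sample."""
    return n_nominal / deff


n_srs = 385  # hypothetical SRS-based requirement
n_cluster = adjust_for_design(n_srs, deff=2.0)
print(n_cluster, effective_sample_size(n_cluster, deff=2.0))
```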
Power Analysis
Power of a Test
The power of a hypothesis test is $1 - \beta$, where $\beta$ is the probability of a Type II error (failing to reject the null when the alternative is true). Power depends on the sample size, the significance level $\alpha$, the effect size, and the population variance.
Sample Size for a Two-Sample t-Test
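To detect a difference $\delta = \mu_1 - \mu_2$ between two group means with (two-sided) significance level $\alpha$ and power $1 - \beta$, assuming equal variances $\sigma^2$ and equal group sizes:
$$n = \frac{2 (z_{\alpha/2} + z_{\beta})^2 \, \sigma^2}{\delta^2} \quad \text{per group}$$
For significance level 0.05 and power 0.80 ($z_{0.025} = 1.96$, $z_{0.20} \approx 0.84$):
$$n \approx \frac{15.7 \, \sigma^2}{\delta^2} \quad \text{per group}$$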
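A minimal sketch of the per-group calculation, using the normal approximation $n = 2 (z_{\alpha/2} + z_{\beta})^2 \sigma^2 / \delta^2$ with a two-sided test. The function name and the effect sizes are hypothetical; an exact calculation based on the t distribution would give a slightly larger answer for small samples.

```python
import math
from statistics import NormalDist


def sample_size_two_sample(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n to detect a mean difference `delta` (normal approximation).

    Assumes a two-sided test, equal variances, and equal group sizes.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_beta = std_normal.inv_cdf(power)  # upper beta quantile, beta = 1 - power
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical effects of one half and one quarter of a standard deviation.
n_medium = sample_size_two_sample(delta=0.5, sigma=1.0)
n_small = sample_size_two_sample(delta=0.25, sigma=1.0)
print(n_medium, n_small)
```

Halving the detectable effect roughly quadruples the per-group requirement, the same inverse-square behavior as for the margin of error.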
Main Theorems
Minimum Sample Size for Mean Estimation
Statement
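Let $X_1, \dots, X_n$ be an SRS from a population with mean $\mu$ and variance $\sigma^2$. For the $(1 - \alpha)$ confidence interval $\bar{X} \pm z_{\alpha/2} \, \sigma / \sqrt{n}$ to have half-width at most $E$, it is necessary and sufficient that:
$$n \ge \frac{z_{\alpha/2}^2 \, \sigma^2}{E^2}$$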
Intuition
The width of the confidence interval shrinks as $1/\sqrt{n}$. To make the half-width at most $E$, you solve for $n$. The result scales linearly with variance and inversely with the square of the margin of error.
Proof Sketch
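The confidence interval for $\mu$ under SRS has half-width $z_{\alpha/2} \, \sigma / \sqrt{n}$. Set this equal to $E$ and solve: $z_{\alpha/2} \, \sigma / \sqrt{n} = E$ implies $\sqrt{n} = z_{\alpha/2} \, \sigma / E$ implies $n = z_{\alpha/2}^2 \, \sigma^2 / E^2$.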
Why It Matters
This is the foundation for all sample size planning. Every other formula (for proportions, for power analysis, for complex designs) is a variation on this basic calculation. The $1/E^2$ dependence is the key quantitative insight: precision is expensive.
Failure Mode
The formula requires knowing $\sigma^2$ before the study. In practice, $\sigma^2$ must be estimated from pilot data, prior studies, or conservative guesses. If $\sigma^2$ is underestimated, the actual margin of error will exceed $E$. The normal approximation also requires $n$ to be large enough for the CLT to apply. For highly skewed populations, larger $n$ is needed.
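To make the cost of a bad variance guess concrete, the sketch below plans a study with one standard deviation and then evaluates the margin of error that results if the true value is larger. All numbers (planned standard deviation 10, true standard deviation 15, target margin 1) are hypothetical.

```python
import math
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # upper 2.5% quantile, about 1.96


def planned_n(sigma_guess, margin):
    """Sample size planned from a guessed standard deviation."""
    return math.ceil((Z_95 * sigma_guess / margin) ** 2)


def realized_margin(n, sigma_true):
    """Margin of error actually achieved with n observations."""
    return Z_95 * sigma_true / math.sqrt(n)


n = planned_n(sigma_guess=10.0, margin=1.0)
margin_if_right = realized_margin(n, sigma_true=10.0)
margin_if_wrong = realized_margin(n, sigma_true=15.0)
print(n, round(margin_if_right, 3), round(margin_if_wrong, 3))
```

The realized margin scales with the ratio of the true to the guessed standard deviation, so a guess that is 50% too low leaves the margin about 50% too wide.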
Why "More Data" Has Diminishing Returns
The standard error decreases as $1/\sqrt{n}$. Going from $n = 100$ to $n = 400$ halves the standard error. Going from $n = 400$ to $n = 1600$ halves it again. Each halving of the standard error requires quadrupling the sample size. At some point, the cost of additional data exceeds the value of the incremental precision.
This is why sample size determination matters: it tells you the point beyond which additional data is not worth collecting.
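The diminishing returns are easy to tabulate. The sketch below uses a hypothetical unit standard deviation and a sequence of sample sizes that quadruple at each step.

```python
import math

sigma = 1.0  # hypothetical population standard deviation
sample_sizes = [100, 400, 1600, 6400]  # each step quadruples n

# Standard error of the sample mean at each sample size.
standard_errors = [sigma / math.sqrt(n) for n in sample_sizes]

for n, se in zip(sample_sizes, standard_errors):
    print(f"n = {n:5d}  standard error = {se:.4f}")
```

The first 100 observations bring the standard error down to 0.1; the next 6,300 only bring it down to 0.0125.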
Common Confusions
Sample size depends on population variance, not population size
A common misconception is that you need to sample a fixed percentage of the population (e.g., 10%). The sample size formula depends on $\sigma^2$ and $E$, not on $N$ (except through the FPC, which is negligible when $n \ll N$). A poll of 1000 people gives roughly the same precision whether the population is 1 million or 300 million.
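The poll claim can be checked numerically. The sketch below computes the 95% margin of error for a proportion, assuming the worst case $p = 0.5$ and a finite population factor of $\sqrt{1 - n/N}$ on the standard error; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error_proportion(n, population_size, p=0.5, confidence=0.95):
    """Margin of error for a proportion under SRS without replacement.

    Assumes the finite population factor sqrt(1 - n / N).
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    se_srs = math.sqrt(p * (1.0 - p) / n)
    fpc = math.sqrt(1.0 - n / population_size)
    return z * se_srs * fpc


moe_1m = margin_of_error_proportion(1000, 1_000_000)
moe_300m = margin_of_error_proportion(1000, 300_000_000)
print(f"{moe_1m:.4f}  {moe_300m:.4f}")
```

Both margins come out at roughly three percentage points and differ only in the fifth decimal place.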
Power is not the same as significance
Significance level $\alpha$ controls the false positive rate. Power $1 - \beta$ controls the false negative rate. You can have a "significant" result (small $p$-value) with low power if the effect happened to be large in your sample. Planning for power means ensuring you can detect a specified effect size.
The conservative p = 0.5 is sometimes very conservative
Using $p = 0.5$ for proportion estimation gives the maximum sample size. If you have prior knowledge that $p \approx 0.05$ (a rare event), then $p(1 - p) = 0.0475$, which is about 5 times smaller than 0.25. Using $p = 0.5$ would give you about 5 times as many samples as you need. Use prior information when available.
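A quick numerical comparison, as a sketch: the code below computes the required sample size under the conservative $p = 0.5$ and under a prior guess of $p = 0.05$. The margin of error of one percentage point is a hypothetical choice for illustration.

```python
import math
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # upper 2.5% quantile, about 1.96


def n_for_proportion(p, margin):
    """Sample size for a proportion at 95% confidence."""
    return math.ceil(Z_95 ** 2 * p * (1.0 - p) / margin ** 2)


margin = 0.01  # hypothetical: plus or minus one percentage point
n_conservative = n_for_proportion(0.5, margin)
n_informed = n_for_proportion(0.05, margin)
print(n_conservative, n_informed, round(n_conservative / n_informed, 2))
```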
Summary
- Sample size for means: $n = z_{\alpha/2}^2 \, \sigma^2 / E^2$. For proportions: $n = z_{\alpha/2}^2 \, p(1 - p) / E^2$
- Halving the margin of error quadruples the required sample size
- Power analysis for a two-sample test: $n = 2 (z_{\alpha/2} + z_{\beta})^2 \, \sigma^2 / \delta^2$ per group
- Finite population correction matters only when $n/N$ is not small
- Design effects from complex surveys multiply the required sample size
- The formula requires prior knowledge of $\sigma^2$ (or $p$), which is the hard part
Exercises
Problem
You want to estimate the average income in a city to within 2000 USD at 95% confidence. A pilot study suggests the standard deviation is about 15,000. What sample size do you need? If the city has 50,000 residents, how does the FPC change your answer?
Problem
You are designing an A/B test. You expect the control group conversion rate to be $p_1$ and want to detect a lift to $p_2$ with 80% power at significance level $\alpha$. How many users per group do you need? What if you want to detect a lift to a different rate $p_2'$?
References
Canonical:
- Cochran, Sampling Techniques (1977), Chapter 4
- Cohen, Statistical Power Analysis for the Behavioral Sciences (1988), Chapters 2, 6
Current:
- Lohr, Sampling: Design and Analysis (2021), Chapter 2
- Ryan, Sample Size Determination and Power (2013), Chapters 1-4
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
- Survey sampling methods: the designs that generate the data
- Types of bias in statistics: what goes wrong when sampling is non-random
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)