Survey Sampling
Surveys are how epidemiologists estimate population quantities—prevalence, coverage, exposure—without measuring everyone. The sampling design determines both how to draw the sample and how to weight it so estimates generalize to the target population.
Common sampling designs
- Simple random sampling (SRS) — every unit has an equal probability of selection, and every subset of size $n$ is equally likely. The baseline against which other designs are compared.
- Stratified sampling — partition the population into strata (e.g., age groups, regions) and sample within each. Guarantees representation of every stratum and, when strata are internally homogeneous, lowers variance.
- Cluster sampling — randomly select groups (clusters, e.g., clinics or villages) and survey units within them. Cheaper for dispersed populations but usually less efficient because units within a cluster are correlated.
- Probability-proportional-to-size (PPS) — select clusters with probability proportional to their size, so larger clusters are more likely to be chosen. Combined with a fixed number of units sampled within each selected cluster, this gives every unit the same overall selection probability and yields an efficient, self-weighting design.
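The PPS idea can be sketched with systematic selection, one common way to draw clusters with probability proportional to size. This is a minimal illustration, not a production sampler: the cluster sizes, the helper name `pps_systematic`, and the fixed take of 10 units per cluster are all assumptions made for the example.

```python
import numpy as np

def pps_systematic(sizes, n, rng):
    """Systematic PPS: pick n clusters with probability proportional to size.

    Assumes no cluster is larger than the sampling interval sum(sizes) / n,
    so each cluster can be selected at most once.
    """
    sizes = np.asarray(sizes, dtype=float)
    cum = np.cumsum(sizes)                     # running total of cluster sizes
    step = cum[-1] / n                         # sampling interval
    start = rng.uniform(0, step)               # random start in [0, step)
    points = start + step * np.arange(n)       # n equally spaced selection points
    idx = np.searchsorted(cum, points, side="right")
    return np.minimum(idx, len(sizes) - 1)     # guard against float round-off

sizes = np.array([120, 80, 200, 50, 150, 100, 300])  # hypothetical cluster sizes
n_clusters, m_per_cluster = 2, 10                    # hypothetical design choices

rng = np.random.default_rng(0)
chosen = pps_systematic(sizes, n_clusters, rng)

# First-stage inclusion probability of each cluster: n * size / total
pi_cluster = n_clusters * sizes / sizes.sum()
# Second stage: a fixed number m of units per chosen cluster
pi_unit = pi_cluster * m_per_cluster / sizes

print(chosen)    # indices of the selected clusters
print(pi_unit)   # the same value for every cluster: a self-weighting design
```

The cluster size cancels between the two stages, which is why every unit ends up with the same overall selection probability regardless of which cluster it belongs to.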
Weighting and post-stratification
When selection probabilities differ, each sampled unit represents a different number of population units. The design weight is the inverse of the selection probability, $w_i = 1/\pi_i$. A weighted mean estimates the population mean:
$$\bar{y}_w = \frac{\sum_{i \in s} w_i y_i}{\sum_{i \in s} w_i}$$
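As a small illustration, the weighted mean can be computed directly from the selection probabilities. The helper name `weighted_mean` and the four data points are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def weighted_mean(y, pi):
    """Weighted mean using design weights w = 1 / pi."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)   # inverse selection probabilities
    return np.sum(w * y) / np.sum(w)

# Hypothetical sample: two units drawn with probability 0.1, two with 0.2
y = [10.0, 11.0, 12.0, 13.0]
pi = [0.1, 0.1, 0.2, 0.2]

est = weighted_mean(y, pi)
print(est)          # weighted estimate: 335 / 30
print(np.mean(y))   # unweighted mean for comparison: 11.5
```

The units drawn with the lower probability carry twice the weight, so the estimate is pulled toward their values relative to the unweighted mean.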
Post-stratification further adjusts weights so that the weighted sample totals within post-strata (e.g., age-by-sex cells) match known population totals, correcting for coverage and non-response imbalance.
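A minimal sketch of this adjustment, assuming a single post-stratification variable with known population totals. The respondents, the cell labels, and the totals below are made up for illustration.

```python
import pandas as pd

# Hypothetical respondents with design weights and a post-stratification cell
sample = pd.DataFrame({
    "cell": ["young", "young", "young", "old", "old"],
    "w":    [10.0, 10.0, 10.0, 10.0, 10.0],
    "y":    [1.0, 0.0, 1.0, 0.0, 0.0],
})
# Known population totals per cell (assumed, e.g. from a census)
pop_totals = pd.Series({"young": 20.0, "old": 30.0})

# Scale the weights within each cell so they sum to the known total
cell_sums = sample.groupby("cell")["w"].transform("sum")
sample["w_ps"] = sample["w"] * sample["cell"].map(pop_totals) / cell_sums

est = (sample["w_ps"] * sample["y"]).sum() / sample["w_ps"].sum()
print(sample.groupby("cell")["w_ps"].sum())  # now matches pop_totals
print(est)                                   # post-stratified estimate of the mean of y
```

Here the "young" cell is over-represented among respondents, so its weights shrink and the "old" weights grow until each cell's weights sum to its population total.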
Missing data and imputation
Non-response leaves gaps. Beyond weighting adjustments, imputation fills in missing values—single imputation (e.g., mean or regression) or, preferably, multiple imputation, which fills each gap several times to propagate the added uncertainty into standard errors.
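The way multiple imputation propagates uncertainty can be sketched as follows. This is a deliberately simplified illustration: the data are simulated, values are assumed to be missing completely at random, and the imputation model (draws from a normal distribution fitted to the observed values) is an assumption of the example. The pooled variance uses Rubin's rules, adding the between-imputation variance to the average within-imputation variance.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(10, 2, size=200)      # hypothetical complete data
missing = rng.random(200) < 0.3      # about 30% missing completely at random
y_obs = y[~missing]

m = 20                               # number of imputed data sets
estimates, variances = [], []
for _ in range(m):
    # Simplified imputation model: draw from a normal fitted to observed values.
    # (A proper MI would also draw the model parameters, not fix them.)
    filled = y.copy()
    filled[missing] = rng.normal(y_obs.mean(), y_obs.std(ddof=1), missing.sum())
    estimates.append(filled.mean())
    variances.append(filled.var(ddof=1) / len(filled))

# Rubin's rules: combine within- and between-imputation variance
q_bar = np.mean(estimates)                    # pooled point estimate
within = np.mean(variances)                   # average within-imputation variance
between = np.var(estimates, ddof=1)           # between-imputation variance
total_var = within + (1 + 1 / m) * between

print(q_bar, np.sqrt(total_var))              # pooled estimate and standard error
```

A single imputation would report only the within-imputation variance; the between-imputation term is what reflects the uncertainty added by the missing values.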
Worked conceptual example
Suppose a district has 800 urban and 200 rural residents (1000 total) and we take a stratified sample of 80 urban and 20 rural (a 10% sample in each stratum). Selection probabilities are equal at $\pi = 0.1$, so weights are $w = 10$ for everyone. This is a self-weighting design in which the unweighted sample mean already estimates the population mean. If instead we oversampled rural residents (say 40 of 200, $\pi = 0.2$) for precision there, their weight would be $w = 5$ versus $w = 10$ for urban, and we would weight by $w_i$ to avoid over-representing rural values.
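The oversampled version of this example can be checked numerically. The sketch below assumes a simulated outcome (urban values centred on 10, rural on 12); only the stratum sizes and sampling fractions come from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical outcome: urban values centred on 10, rural on 12
urban = rng.normal(10, 1, 800)
rural = rng.normal(12, 1, 200)
true_mean = np.concatenate([urban, rural]).mean()

# Oversampled design: 80 of 800 urban (pi = 0.1), 40 of 200 rural (pi = 0.2)
s_urban = rng.choice(urban, 80, replace=False)
s_rural = rng.choice(rural, 40, replace=False)
y = np.concatenate([s_urban, s_rural])
w = np.concatenate([np.full(80, 1 / 0.1), np.full(40, 1 / 0.2)])  # weights 10 and 5

unweighted = y.mean()                      # treats rural as 1/3 of the population
weighted = np.sum(w * y) / np.sum(w)       # restores the true 80/20 split
print(true_mean, unweighted, weighted)
```

The weights sum to the population size of 1000, and the weighted mean equals 0.8 times the urban sample mean plus 0.2 times the rural sample mean, whereas the unweighted mean gives rural residents a one-third share and is pulled toward their higher values.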
In R, the survey package handles such designs directly: build a design object with svydesign and call svymean for design-consistent estimates and standard errors.
In code
Draw an SRS and a stratified sample.
R
set.seed(1)
pop <- data.frame(id = 1:1000,
                  stratum = rep(c("urban", "rural"), c(800, 200)),
                  y = c(rnorm(800, 10), rnorm(200, 12)))
# Simple random sample of 100
srs <- pop[sample(nrow(pop), 100), ]
# Stratified: 10% within each stratum
strat <- do.call(rbind, lapply(split(pop, pop$stratum),
  function(d) d[sample(nrow(d), 0.1 * nrow(d)), ]))
# Design-based estimate with the survey package:
# library(survey)
# des <- svydesign(ids = ~1, strata = ~stratum, fpc = ~rep(...), data = strat)
# svymean(~y, des)
Python
import numpy as np, pandas as pd
rng = np.random.default_rng(1)
pop = pd.DataFrame({
    "stratum": ["urban"] * 800 + ["rural"] * 200,
    "y": np.concatenate([rng.normal(10, 1, 800), rng.normal(12, 1, 200)]),
})
srs = pop.sample(n=100, random_state=1) # SRS
strat = pop.groupby("stratum", group_keys=False).apply( # stratified 10%
    lambda d: d.sample(frac=0.1, random_state=1))
print(srs["y"].mean(), strat["y"].mean())
10.34350267409585 10.312518620234567
Julia
using Random, Statistics
Random.seed!(1)
y = vcat(randn(800) .+ 10, randn(200) .+ 12)
stratum = vcat(fill("urban", 800), fill("rural", 200))
srs = y[randperm(1000)[1:100]] # SRS
urban = findall(==("urban"), stratum)
rural = findall(==("rural"), stratum)
strat = vcat(y[shuffle(urban)[1:80]], y[shuffle(rural)[1:20]]) # stratified 10%
println(mean(srs), " ", mean(strat))
Why it matters for statistics
Ignoring the sampling design can bias estimates and misstate uncertainty; with clustered samples it typically understates it. Weights and stratification are not optional bookkeeping but part of the estimator. Sound survey sampling is what lets a few hundred measurements speak validly for an entire population.