Statistical Inference
Statistical inference is how we reason from a limited sample back to the population that produced it. Every poll, clinical trial, and epidemiological estimate rests on this leap from “what we saw” to “what is true.”
The core pipeline
The logic of inference flows in a loop:
- Population: the full set of units we care about (all adults in a country, all possible patients, every mosquito in a region). Often it is conceptual or infinite.
- Parameter ($\theta$): a fixed but unknown number describing the population, such as a mean $\mu$, a proportion $p$, or a rate $\lambda$.
- Sample: a finite subset of units we actually observe, $X_1, \dots, X_n$.
- Statistic / estimator: a function of the sample used to guess the parameter, written $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$. For example, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ estimates $\mu$.
The goal of inference is to use the observable estimate $\hat{\theta}$ to say something rigorous about the unobservable parameter $\theta$.
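The four pieces of the pipeline can be lined up in a few lines of Python. This is a minimal sketch, assuming a small, fully enumerated population of made-up heights; in practice the population is rarely available, which is exactly why inference is needed. The names `population`, `theta`, `sample`, and `theta_hat` are illustrative and do not come from any library.

```python
import random
import statistics

rng = random.Random(0)  # seeded so the sketch is reproducible

# Population: hypothetical heights, enumerated here only for illustration
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Parameter: a fixed number describing the population (normally unknown)
theta = statistics.fmean(population)

# Sample: the finite subset of units we actually observe
sample = rng.sample(population, 25)

# Estimator: a function of the sample used to guess the parameter
theta_hat = statistics.fmean(sample)

print(f"parameter theta     = {theta:.2f}")
print(f"estimate  theta_hat = {theta_hat:.2f}")
```

The estimate will generally be close to the parameter but not equal to it, and only the estimate would be visible in a real study.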
The data-generating process
Behind any dataset we imagine a data-generating process (DGP): a probabilistic mechanism that produces the data. Formally we assume the observations are draws from a distribution indexed by the parameter,
$$X_1, \dots, X_n \sim P_\theta.$$
The DGP is a model of reality. Inference asks: given data that plausibly came from $P_\theta$, which values of $\theta$ are credible? Choosing a DGP makes the problem tractable and makes its assumptions explicit.
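One common way to make "credible" concrete is to score candidate parameter values by how well they explain the observed data, for instance with the log-likelihood. The sketch below is only an illustration under stated assumptions: the DGP is taken to be normal with a known standard deviation, and `true_theta`, `candidates`, and `log_likelihood` are hypothetical names chosen for this example rather than anything the text prescribes.

```python
import math
import random

rng = random.Random(1)

# Assumed DGP for illustration: Normal(theta, sigma) with sigma known
sigma = 10.0
true_theta = 170.0  # hidden from the analyst in real life
data = [rng.gauss(true_theta, sigma) for _ in range(25)]

def log_likelihood(theta, xs, sigma):
    """Sum of Normal(theta, sigma) log-densities over the observations."""
    const = -0.5 * math.log(2 * math.pi * sigma**2)
    return sum(const - (x - theta) ** 2 / (2 * sigma**2) for x in xs)

# Score a few candidate parameter values against the same data
candidates = [150.0, 160.0, 170.0, 180.0, 190.0]
scores = {theta: log_likelihood(theta, data, sigma) for theta in candidates}
best = max(scores, key=scores.get)

for theta, ll in scores.items():
    print(f"theta = {theta:5.1f}  log-likelihood = {ll:8.2f}")
print("best-supported candidate:", best)
```

Under this normal model the log-likelihood is highest for the candidate closest to the sample mean, so candidates far from the data receive much lower scores.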
A statistic is random
The crucial insight: because the sample is random, any statistic computed from it is also a random variable. Draw a different sample and you get a different $\hat{\theta}$. This sample-to-sample variability is not a nuisance to be ignored; it is exactly what lets us quantify uncertainty.
A good estimator has its distribution centered near $\theta$ (low bias) and tightly concentrated (low variance). The distribution of $\hat{\theta}$ across hypothetical repeated samples is called its sampling distribution.
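Bias and variance can both be read off a simulated sampling distribution. The sketch below is a hedged illustration, not part of the original example: it assumes a normal DGP and compares two estimators of the population variance, one dividing by n (`statistics.pvariance`) and one dividing by n - 1 (`statistics.variance`). The parameter values and the number of repetitions `reps` are arbitrary choices for this example.

```python
import random
import statistics

rng = random.Random(2)
mu, sigma, n, reps = 170.0, 10.0, 25, 5_000
true_var = sigma**2  # the parameter being estimated in this sketch

div_n, div_n_minus_1 = [], []
for _ in range(reps):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    div_n.append(statistics.pvariance(xs))          # divides by n
    div_n_minus_1.append(statistics.variance(xs))   # divides by n - 1

# Summarize each simulated sampling distribution
for name, est in [("divide by n", div_n), ("divide by n-1", div_n_minus_1)]:
    bias = statistics.fmean(est) - true_var   # center relative to the truth
    spread = statistics.stdev(est)            # concentration around the center
    print(f"{name:>14}: bias = {bias:6.2f}, sd = {spread:5.2f}")
```

In expectation, dividing by n underestimates the variance by $\sigma^2/n$, while dividing by n - 1 is unbiased but slightly more spread out, which illustrates that low bias and low variance can pull in different directions.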
Worked example
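Suppose the DGP is $X_i \sim \mathcal{N}(\mu = 170, \sigma^2 = 10^2)$ (adult heights in cm). The parameter of interest is $\mu = 170$, which in real life we would not know.
We draw a single sample of $n = 25$ people and compute $\bar{X}$. We might get a value a little above 170. A different 25 people might give a value a little below it. Neither equals $\mu$ exactly, yet both cluster around it. If we could repeat the sampling many times, the collection of $\bar{X}$ values would average to $\mu = 170$ and have standard deviation $\sigma/\sqrt{n} = 10/\sqrt{25} = 2$.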
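The center and spread of the sample mean follow from linearity of expectation and, for the variance, from independence of the draws. A short derivation, assuming independent observations with common mean $\mu$ and variance $\sigma^2$, and plugging in $\sigma = 10$ and $n = 25$ at the end:

```latex
\begin{aligned}
\mathbb{E}[\bar{X}]
  &= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]
   = \frac{1}{n}\cdot n\mu
   = \mu, \\
\operatorname{Var}(\bar{X})
  &= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\sigma^{2}}{n^{2}}
   = \frac{\sigma^{2}}{n}, \\
\operatorname{sd}(\bar{X})
  &= \frac{\sigma}{\sqrt{n}}
   = \frac{10}{\sqrt{25}}
   = 2 .
\end{aligned}
```

The first line does not need independence; the second does, because the variance of a sum equals the sum of the variances only when the covariance terms vanish.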
Simulation
We define a DGP, draw many samples, and watch the sample means scatter around the true parameter.
R
set.seed(1)
mu <- 170; sigma <- 10; n <- 25
# Draw 10,000 samples, compute the mean of each
means <- replicate(10000, mean(rnorm(n, mu, sigma)))
mean(means) # ~170: the sample mean is unbiased for mu
sd(means) # ~2.0: matches sigma / sqrt(n)
Python
import numpy as np
rng = np.random.default_rng(1)
mu, sigma, n = 170, 10, 25
# Each row is a sample; average across columns
samples = rng.normal(mu, sigma, size=(10_000, n))
means = samples.mean(axis=1)
print(means.mean()) # ~170
print(means.std()) # ~2.0 (= sigma / sqrt(n))
169.97170187520274
1.993564883254646
Julia
using Random, Statistics, Distributions
Random.seed!(1)
mu, sigma, n = 170, 10, 25
dgp = Normal(mu, sigma)
means = [mean(rand(dgp, n)) for _ in 1:10_000]
mean(means) # ~170
std(means) # ~2.0 (= sigma / sqrt(n))
Why it matters for statistics
Inference is the foundation of the entire discipline: estimation, hypothesis testing, and confidence intervals all describe the behavior of a random statistic relative to a fixed parameter. Recognizing that $\hat{\theta}$ has a distribution, not just a value, is what separates a point guess from a scientific claim with quantified uncertainty. In epidemiology, this is how a prevalence estimate from a survey becomes a defensible statement about a whole population.