Statistical Inference

Statistical inference is how we reason from a limited sample back to the population that produced it. Every poll, clinical trial, and epidemiological estimate rests on this leap from “what we saw” to “what is true.”

The core pipeline

The logic of inference follows a chain:

$$\text{Population} \;\to\; \text{Parameter } \theta \;\to\; \text{Sample} \;\to\; \text{Estimate } \hat{\theta}$$

The goal of inference is to use the observable estimate $\hat{\theta}$ to say something rigorous about the unobservable parameter $\theta$.

The data-generating process

Behind any dataset we imagine a data-generating process (DGP): a probabilistic mechanism that produces the data. Formally we assume the observations are draws from a distribution indexed by the parameter,

$$X_1, \dots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta).$$

The DGP is a model of reality. Inference asks: given data that plausibly came from $f(x \mid \theta)$, which values of $\theta$ are credible? Choosing a DGP makes the problem tractable and makes its assumptions explicit.

A statistic is random

The crucial insight: because the sample is random, any statistic computed from it is also a random variable. Draw a different sample and you get a different $\hat{\theta}$. This sample-to-sample variability is not a nuisance to be ignored; it is exactly what lets us quantify uncertainty.
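
As a minimal sketch of this point, the snippet below draws two independent samples from one DGP and computes the same statistic on each. The Normal DGP, the values `mu=170`, `sigma=10`, `n=25`, the seed, and the helper name `draw_sample` are all illustrative assumptions, not part of any fixed recipe.

```python
import numpy as np

def draw_sample(rng, mu=170.0, sigma=10.0, n=25):
    """Draw one iid sample of size n from an assumed Normal(mu, sigma) DGP."""
    return rng.normal(mu, sigma, size=n)

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# Same DGP, same statistic (the sample mean), two independent samples
theta_hat_a = draw_sample(rng).mean()
theta_hat_b = draw_sample(rng).mean()

print(theta_hat_a, theta_hat_b)  # two different estimates of the same mu
```

Nothing about the population changed between the two draws; only the sample did, and the estimate moved with it.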

The distribution of $\hat{\theta}$ across hypothetical repeated samples is called its sampling distribution. A good estimator has a sampling distribution that is centered near $\theta$ (low bias) and tightly concentrated (low variance).
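
Bias and variance can be approximated by simulation when the DGP is known. The sketch below is one way to do it, under assumed values (a Normal DGP with `sigma = 10`, `n = 25`, 20,000 repetitions): it compares two estimators of the variance $\sigma^2$, one dividing by $n$ and one dividing by $n - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
sigma, n, reps = 10.0, 25, 20_000
true_var = sigma**2  # the parameter being estimated here: sigma^2 = 100

# Each row is one sample from the assumed Normal(170, 10) DGP
samples = rng.normal(170.0, sigma, size=(reps, n))

# Two estimators of the variance, applied to every sample
var_n = samples.var(axis=1, ddof=0)   # divides by n
var_n1 = samples.var(axis=1, ddof=1)  # divides by n - 1

for name, est in [("divide by n", var_n), ("divide by n-1", var_n1)]:
    bias = est.mean() - true_var  # center of the sampling distribution vs. truth
    print(f"{name}: bias ~ {bias:.2f}, variance ~ {est.var():.1f}")
```

In theory the divide-by-$n$ estimator has bias $-\sigma^2/n = -4$ here, while the divide-by-$(n-1)$ estimator is unbiased but slightly more variable, so the two criteria can pull in different directions.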

Worked example

Suppose the DGP is $X \sim \text{Normal}(\mu = 170,\ \sigma = 10)$ (adult heights in cm). The parameter of interest is $\mu = 170$, which in real life we would not know.

We draw a single sample of $n = 25$ people and compute $\bar{X}$. We might get $\bar{X} = 168.4$. A different 25 people might give $\bar{X} = 171.2$. Neither equals $170$ exactly, yet both cluster around it. If we could repeat the sampling many times, the collection of $\bar{X}$ values would average to $\mu$ and have standard deviation $\sigma / \sqrt{n} = 10/5 = 2$.
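
The mean and standard deviation quoted for $\bar{X}$ follow from linearity of expectation and from independence of the draws. A short derivation, assuming the $X_i$ are iid with mean $\mu$ and finite variance $\sigma^2$:

```latex
\begin{aligned}
\mathbb{E}[\bar{X}] &= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i] = \mu, \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{\sigma^2}{n}, \\
\operatorname{SD}(\bar{X}) &= \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2.
\end{aligned}
```

The variance step is where independence is used: without it, covariance terms between the $X_i$ would also appear in the sum.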

Simulation

We define a DGP, draw many samples, and watch the sample means scatter around the true parameter.

R

set.seed(1)
mu <- 170; sigma <- 10; n <- 25

# Draw 10,000 samples, compute the mean of each
means <- replicate(10000, mean(rnorm(n, mu, sigma)))

mean(means)  # ~170: the sample mean is unbiased, so the simulated average lands near mu
sd(means)    # ~2.0: matches sigma / sqrt(n)

Python

import numpy as np
rng = np.random.default_rng(1)
mu, sigma, n = 170, 10, 25

# Each row is a sample; average across columns
samples = rng.normal(mu, sigma, size=(10_000, n))
means = samples.mean(axis=1)

print(means.mean())  # ~170
print(means.std())   # ~2.0  (= sigma / sqrt(n))

Output:

169.97170187520274
1.993564883254646

Julia

using Random, Statistics, Distributions
Random.seed!(1)
mu, sigma, n = 170, 10, 25

dgp = Normal(mu, sigma)
means = [mean(rand(dgp, n)) for _ in 1:10_000]

mean(means)  # ~170
std(means)   # ~2.0  (= sigma / sqrt(n))

Why it matters for statistics

Inference is the foundation of the entire discipline: estimation, hypothesis testing, and confidence intervals all describe the behavior of a random statistic relative to a fixed parameter. Recognizing that $\hat{\theta}$ has a distribution, not just a value, is what separates a point guess from a scientific claim with quantified uncertainty. In epidemiology, this is how a prevalence estimate from a survey becomes a defensible statement about a whole population.
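
To make "a random statistic relative to a fixed parameter" concrete, here is a hedged sketch of a confidence-interval coverage check. It assumes the Normal DGP of the worked example, treats $\sigma$ as known (a simplification), and uses the approximate normal quantile 1.96 for a nominal 95% interval. The parameter stays fixed; the intervals move from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
mu, sigma, n, reps = 170.0, 10.0, 25, 10_000
z = 1.96  # approximate 97.5% quantile of the standard normal

# Each row is one sample; each sample yields one interval
samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)

half_width = z * sigma / np.sqrt(n)  # sigma treated as known (an assumption)
covered = (means - half_width <= mu) & (mu <= means + half_width)

coverage = covered.mean()
print(coverage)  # fraction of intervals that contain the fixed mu
```

The printed fraction should sit close to the nominal 0.95. That long-run frequency over repeated samples is what the confidence level describes.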