Sampling Distributions

A statistic computed from a sample is itself random: draw a new sample and you get a new value. The distribution of that statistic over repeated samples — its sampling distribution — is the bridge from a single estimate to a statement about uncertainty.

Figure: The sampling distribution of the mean narrows as the sample size $n$ increases (standard error $\sigma/\sqrt{n}$).

The idea

Fix a population and a sample size $n$. Any statistic, say the sample mean $\bar{X}$, changes from sample to sample. If you could repeatedly draw fresh samples of size $n$ and recompute $\bar{X}$ each time, the histogram of those values would approximate the sampling distribution of $\bar{X}$.

It is a distribution of an estimator, not of raw data. Its spread measures how precise the estimate is.
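The same recipe works for any statistic, not just the mean. The sketch below is a minimal illustration under stated assumptions: the exponential population, the choice of the median as the statistic, the seed, and the helper name `sampling_distribution` are all hypothetical choices made for this example.

```python
import numpy as np

def sampling_distribution(statistic, draw_sample, n, reps=5000):
    """Compute `statistic` on `reps` fresh samples of size n and return the values."""
    return np.array([statistic(draw_sample(n)) for _ in range(reps)])

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

def draw(n):
    # Assumed population: exponential with mean 1 (true median = ln 2, about 0.693).
    return rng.exponential(scale=1.0, size=n)

medians_small = sampling_distribution(np.median, draw, n=20)
medians_large = sampling_distribution(np.median, draw, n=200)

# The array of medians is the (approximate) sampling distribution of the median;
# its standard deviation is the precision of the median as an estimator.
for label, values in (("n= 20", medians_small), ("n=200", medians_large)):
    print(f"{label}: center={values.mean():.3f} spread={values.std(ddof=1):.3f}")
```

The values in `medians_small` and `medians_large` are estimates, not raw data, and the spread for the larger sample size should come out noticeably smaller.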

Sampling distribution of the mean

For an iid sample from a population with mean $\mu$ and standard deviation $\sigma$, the sample mean has

$$\mathbb{E}[\bar{X}] = \mu, \qquad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}, \qquad \operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$

So $\bar{X}$ is centered on the truth $\mu$ (it is unbiased), and its standard deviation, called the standard error, shrinks like $1/\sqrt{n}$. Larger samples give tighter, more reliable estimates.
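These formulas follow from linearity of expectation and from independence. The derivation below is the standard argument; the notation $X_1, \dots, X_n$ for the individual observations is introduced here and does not appear elsewhere in the text.

```latex
\begin{aligned}
\mathbb{E}[\bar{X}]
  &= \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
   = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]
   = \frac{1}{n}\cdot n\mu = \mu, \\[4pt]
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right) \\
  &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
     \qquad \text{(independence: covariances vanish)} \\
  &= \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}, \\[4pt]
\operatorname{SD}(\bar{X})
  &= \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\end{aligned}
```

Only the variance step uses independence; the expectation result holds for any sample whose observations each have mean $\mu$.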

Worked example

A population has $\mu = 50$ and $\sigma = 10$. For samples of size $n = 25$:

$$\mathbb{E}[\bar{X}] = 50, \qquad \operatorname{SD}(\bar{X}) = \frac{10}{\sqrt{25}} = 2.$$

Quadrupling the sample to $n = 100$ halves the standard error to $10/\sqrt{100} = 1$.
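The arithmetic of the worked example can be checked with a few lines of Python. This is a small sketch, not part of the original example: the helper names `standard_error` and `required_n` are hypothetical, and `required_n` simply inverts the formula to ask how large a sample a target standard error would need.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean for an iid sample of size n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sigma / math.sqrt(n)

def required_n(sigma: float, target_se: float) -> int:
    """Smallest n with sigma / sqrt(n) <= target_se (inverts the formula above)."""
    if target_se <= 0:
        raise ValueError("target_se must be positive")
    return math.ceil((sigma / target_se) ** 2)

sigma = 10.0
for n in (25, 100, 400):
    print(f"n={n:>3}  SE={standard_error(sigma, n):.2f}")

# Halving the standard error again, from 1 to 0.5, takes another fourfold increase in n.
print("n needed for SE = 0.5:", required_n(sigma, 0.5))
```

Because the standard error scales as $1/\sqrt{n}$, each halving of the error costs four times as much data.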

Simulation

Draw many samples, compute a mean from each, and inspect the distribution of those means. Its spread should match $\sigma/\sqrt{n}$ and narrow as $n$ grows.

R

set.seed(7)
mu <- 50; sigma <- 10
for (n in c(25, 100)) {
  means <- replicate(10000, mean(rnorm(n, mu, sigma)))
  cat("n =", n, " mean of means =", round(mean(means), 2),
      " SD of means =", round(sd(means), 3),
      " theory SE =", round(sigma / sqrt(n), 3), "\n")
}

Python

import numpy as np
np.random.seed(7)

mu, sigma = 50, 10
for n in (25, 100):
    means = np.array([np.random.normal(mu, sigma, n).mean()
                      for _ in range(10000)])
    print(f"n={n:>3} mean={means.mean():.2f} "
          f"SD={means.std(ddof=1):.3f} theory={sigma/np.sqrt(n):.3f}")

Output:

n= 25 mean=50.00 SD=1.993 theory=2.000
n=100 mean=50.00 SD=1.006 theory=1.000

Julia

using Random, Statistics
Random.seed!(7)

mu, sigma = 50.0, 10.0
for n in (25, 100)
    means = [mean(randn(n) .* sigma .+ mu) for _ in 1:10000]
    println("n=$n mean=", round(mean(means), digits=2),
            " SD=", round(std(means), digits=3),
            " theory=", round(sigma / sqrt(n), digits=3))
end

Why it matters for statistics

Every standard error, $p$-value, and confidence interval is a statement about a sampling distribution. Understanding that a statistic has a distribution, with a known center and a spread that shrinks with $n$, is what turns a lone number into statistical inference. The central limit theorem tells us the shape of that distribution for means: approximately normal for large $n$, provided the population variance is finite.
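To make the link to confidence intervals concrete, the sketch below checks how often the interval $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ contains $\mu$. It is an illustration under assumptions, not a general recipe: it reuses the worked example's $\mu = 50$, $\sigma = 10$, $n = 25$, assumes a normal population, treats $\sigma$ as known, and uses an arbitrary seed and repetition count.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
mu, sigma, n, reps = 50.0, 10.0, 25, 20000

# Each row is one sample of size n; take one mean per row.
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Standard error from the formula sigma / sqrt(n), with sigma assumed known.
se = sigma / np.sqrt(n)

# An interval mean +/- 1.96*SE contains mu exactly when |mean - mu| <= 1.96*SE.
covered = np.abs(means - mu) <= 1.96 * se
coverage = covered.mean()

print(f"coverage of mean +/- 1.96*SE: {coverage:.3f} (nominal 0.95)")
```

The 95% figure is a property of the sampling distribution of $\bar{X}$: it describes how the interval behaves over repeated samples, not a probability attached to any single computed interval.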