Hierarchical (Multilevel) Models
Real data are usually nested: patients within hospitals, cases within regions, measurements within individuals. Hierarchical models estimate group-level quantities while letting each group borrow strength from all the others, striking a principled balance between lumping everyone together and treating every group in isolation.
Three ways to handle groups
Suppose you want the disease rate in each of small areas. Complete pooling ignores the grouping and reports one overall rate for everyone, which is stable but hides real between-area differences. No pooling fits a separate estimate to each area’s data alone, which is unbiased but wildly noisy when some areas have only a handful of cases. Partial pooling is the middle path: a multilevel model that estimates each area while assuming the area rates themselves come from a common distribution.
Varying intercepts
The core device is to let a parameter vary by group and give those group-level values their own distribution. A varying-intercept model writes the outcome for observation in group as The group intercepts are drawn from a normal distribution with overall mean and between-group variance , while is the within-group variance. You can extend this to varying slopes, , where the pair is drawn from a shared bivariate distribution so that group-specific effects also pool toward a common trend.
Shrinkage
The partial-pooling estimate for a group is a precision-weighted compromise between that group’s own mean and the overall mean . For a group with observations the estimate is The weight is the reliability of the group’s data: when a group is large or between-group variation is big, and the estimate barely moves from its raw value. When a group is small or noisy, is near zero and the estimate is pulled hard toward the overall mean. This is shrinkage, and it is exactly why the small groups in the figure travel farthest.
Connection to Bayesian inference
A hierarchical model is a Bayesian model in which the group-level distribution acts as a prior on the , and the data update it into a posterior for each group. The shrinkage formula above is the posterior mean of under that prior. When the hyperparameters are themselves estimated from the data (rather than fixed by hand) and then plugged in, the procedure is called empirical Bayes; the closely related James–Stein estimator shrinks several group means toward their grand mean and provably beats the raw means in total squared error. Fully Bayesian versions place priors on the hyperparameters too and are typically fit with MCMC.
Worked example: disease rates across small areas
Imagine measured rates in a dozen districts, where some districts contribute many cases and others only a few. The no-pooling rates for the small districts swing far above and below the truth simply from sampling noise. An empirical-Bayes fit estimates the between-district spread , forms each , and shrinks every raw rate toward the grand mean in proportion to how little data it rests on. The result is a set of stabilized rates that are usually closer to the truth, especially for the sparse districts.
In code
R
set.seed(1)
J <- 12
n <- sample(3:60, J, replace = TRUE)
true <- rnorm(J, 5, 1.6)
raw <- true + rnorm(J, 0, 3 / sqrt(n)) # no-pooling means
grand <- mean(raw)
tau2 <- var(raw) # between-group variance (rough)
w <- tau2 / (tau2 + (3^2) / n) # shrinkage weights
shrunk <- grand + w * (raw - grand) # partial-pooling estimates
data.frame(n, raw = round(raw, 2), shrunk = round(shrunk, 2))
Python
import numpy as np
rng = np.random.default_rng(1)
J = 12
n = rng.integers(3, 60, size=J)
true = rng.normal(5, 1.6, size=J)
sigma = 3.0
raw = true + rng.normal(0, sigma / np.sqrt(n)) # no-pooling means
grand = raw.mean()
tau2 = raw.var(ddof=1) # between-group variance
w = tau2 / (tau2 + sigma**2 / n) # shrinkage weights
shrunk = grand + w * (raw - grand) # partial pooling
print(round(grand, 3)) # grand mean ~4.7
for i in np.argsort(n)[:3]: # smallest groups shrink most
print(n[i], round(raw[i], 2), round(shrunk[i], 2), round(w[i], 2))
# e.g. 4 8.03 5.28 0.19 (raw far out, shrunk pulled back hard)
5.038
4 6.99 5.85 0.42
11 6.79 6.2 0.66
17 4.1 4.33 0.75
Julia
using Random, Statistics
Random.seed!(1)
J = 12
n = rand(3:60, J)
true_a = 5 .+ 1.6 .* randn(J)
sigma = 3.0
raw = true_a .+ (sigma ./ sqrt.(n)) .* randn(J)
grand = mean(raw)
tau2 = var(raw)
w = tau2 ./ (tau2 .+ sigma^2 ./ n)
shrunk = grand .+ w .* (raw .- grand) # partial-pooling estimates
Why it matters
Multilevel models are the standard tool whenever data arrive in clusters, from small-area disease mapping to hospital performance comparisons to repeated measures on the same subject. By shrinking unreliable estimates toward the group average, they avoid overreacting to noise in sparse groups while still surfacing genuine differences, giving more accurate and more honest group-level answers than either extreme of pooling.