Markov Chain Monte Carlo
Many posteriors have no closed form, so we cannot write down their mean or quantiles directly. Markov chain Monte Carlo (MCMC) sidesteps this by drawing samples from the target distribution and summarizing them.
The idea
The goal of Bayesian inference is to learn a posterior , but for most models the normalizing constant is an intractable integral. MCMC builds a Markov chain — a sequence where each state depends only on the previous one — whose stationary distribution is exactly the target . Run the chain long enough and its states behave like (correlated) draws from the posterior, so sample averages estimate posterior expectations. Crucially, the chain only needs the target up to a constant, so the unknown normalizer never has to be computed.
Metropolis–Hastings
The Metropolis–Hastings algorithm turns any proposal mechanism into a valid sampler. From the current state , propose from a proposal density and accept it with probability
where is the (unnormalized) target. If the proposal is rejected, the chain stays at . With a symmetric proposal such as a Normal random walk, and the ratio reduces to : always move to higher density, and sometimes move downhill.
Gibbs sampling is a special case for multi-parameter models: cycle through the parameters, drawing each from its full conditional distribution given the current values of the others, accepting every draw.
Diagnostics
Because samples are correlated and the chain starts away from the target, we check convergence and mixing before trusting the output.
- Trace plots. Plot each parameter against iteration; a well-mixed chain looks like a stationary “fuzzy caterpillar” with no drift.
- Burn-in / warm-up. Discard early iterations before the chain reaches the target region.
- Thinning. Optionally keep every -th draw to reduce storage and autocorrelation (though it discards information).
- Potential scale reduction factor . Run several chains from different starts; compares within-chain to between-chain variance and should be near (values above ~1.01 signal non-convergence).
- Effective sample size. Autocorrelation means draws carry the information of fewer independent ones; the effective sample size quantifies how many.
Modern samplers reduce random-walk inefficiency by using gradient information: Hamiltonian Monte Carlo and its adaptive variant NUTS power Stan, PyMC, and Turing.jl.
Worked example: sampling a Poisson rate
Suppose we observe counts and model them as Poisson with rate , using a prior. This case is actually conjugate — the posterior is with mean — which lets us check the sampler against the truth. We run a random-walk Metropolis sampler on (so proposals stay positive) targeting the unnormalized posterior . After discarding burn-in, the sample mean should land near and a 95% interval near the analytic quantiles .
In code
A Metropolis sampler from scratch, with the analytic answer for comparison.
R
set.seed(1)
y <- c(2, 3, 1, 4, 2); n <- length(y)
# Unnormalized log-posterior for lambda > 0 (Gamma(2,1) prior, Poisson likelihood)
log_post <- function(lam) {
if (lam <= 0) return(-Inf)
sum(dpois(y, lam, log = TRUE)) + dgamma(lam, 2, 1, log = TRUE)
}
N <- 20000; lam <- 1; draws <- numeric(N)
for (i in 1:N) {
prop <- lam * exp(rnorm(1, 0, 0.3)) # random walk on log scale
logA <- log_post(prop) - log_post(lam) + log(prop) - log(lam) # + Jacobian
if (log(runif(1)) < logA) lam <- prop
draws[i] <- lam
}
draws <- draws[-(1:2000)] # burn-in
c(mean = mean(draws), quantile(draws, c(0.025, 0.975)))
# mean ~2.33, ~[1.28, 3.72]; analytic: Gamma(14, 6)
qgamma(c(0.025, 0.975), 14, 6) # 1.28 3.72
# plot(draws, type = "l"); hist(draws) # trace + histogram
Python
import numpy as np
from scipy.stats import poisson, gamma
np.random.seed(1)
y = np.array([2, 3, 1, 4, 2]); n = len(y)
def log_post(lam):
if lam <= 0:
return -np.inf
return poisson.logpmf(y, lam).sum() + gamma.logpdf(lam, 2, scale=1)
N, lam = 20000, 1.0
draws = np.empty(N)
for i in range(N):
prop = lam * np.exp(np.random.normal(0, 0.3)) # random walk on log scale
logA = log_post(prop) - log_post(lam) + np.log(prop) - np.log(lam)
if np.log(np.random.rand()) < logA:
lam = prop
draws[i] = lam
draws = draws[2000:] # burn-in
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
# ~2.33 [~1.28, ~3.72]
print(gamma.ppf([0.025, 0.975], 14, scale=1/6)) # analytic: 1.28 3.72
2.338860902947516 [1.27653574 3.74541504]
[1.27565505 3.70506599]
Julia
using Distributions, Random, Statistics
Random.seed!(1)
y = [2, 3, 1, 4, 2]; n = length(y)
log_post(lam) = lam <= 0 ? -Inf :
sum(logpdf.(Poisson(lam), y)) + logpdf(Gamma(2, 1), lam)
N, lam = 20000, 1.0
draws = zeros(N)
for i in 1:N
prop = lam * exp(0.3 * randn()) # random walk on log scale
logA = log_post(prop) - log_post(lam) + log(prop) - log(lam)
if log(rand()) < logA
lam = prop
end
draws[i] = lam
end
draws = draws[2001:end] # burn-in
println(mean(draws), " ", quantile(draws, [0.025, 0.975]))
# ~2.33 [~1.28, ~3.72]
println(quantile(Gamma(14, 1/6), [0.025, 0.975])) # 1.28 3.72
In practice you would reach for a probabilistic-programming system — Stan, PyMC, or Turing.jl — which implements NUTS, adapts the proposal, and reports and effective sample size automatically. See the simulation toolkit for sampler tooling and reproducibility for seeding and archiving runs.
Why it matters
MCMC made Bayesian inference practical for realistic models by replacing intractable integrals with samples. It is the computational backbone behind fitting hierarchical models, transmission models, and phylogenies, wherever the posterior cannot be written in closed form. Understanding its diagnostics is what separates a trustworthy fit from a chain that merely looks busy.