A Simulation Toolkit

Simulation is the statistician’s laboratory: when you can’t derive a result on paper, you can generate fake data from a known process and watch what an estimator does. Because you control the truth, you can check whether a method recovers it. This page collects the practical tools for doing that in R, Python, and Julia.

Start from a known data-generating process

The core move is to invent a data-generating process (DGP) with parameters you choose, draw data from it, and then apply your method. Any gap between your estimate and the known truth is bias, variance, or a bug.

set.seed(1)                      # reproducibility first
n <- 100
beta_true <- 2.0
x <- rnorm(n)
y <- beta_true * x + rnorm(n)    # truth is beta = 2
coef(lm(y ~ x - 1))              # does OLS recover ~2?

Core verbs, across three languages

Most simulation work is some mix of: draw random numbers, repeat many times, optimize, root-find, integrate, and differentiate. Here is the same toolbox in each language.

Task	R	Python (NumPy/SciPy)	Julia
Set seed	`set.seed(1)`	`np.random.default_rng(1)`	`Random.seed!(1)`
Normal draws	`rnorm(n)`	`rng.normal(size=n)`	`randn(n)`
Binomial draws	`rbinom(n, k, p)`	`rng.binomial(k, p, n)`	`rand(Binomial(k, p), n)`
Repeat an expression	`replicate(B, expr)`	list comp / `np.array([... for _ in range(B)])`	`[expr for _ in 1:B]`
Map over inputs	`sapply(xs, f)`	`np.vectorize(f)(xs)` / comprehension	`map(f, xs)` / `f.(xs)`
Minimize (multivar.)	`optim(par, fn)`	`scipy.optimize.minimize`	`optimize(f, x0)` (Optim.jl)
Minimize (1-D)	`optimize(f, interval)`	`scipy.optimize.minimize_scalar`	`optimize(f, a, b)` (Optim.jl)
Find a root	`uniroot(f, interval)`	`scipy.optimize.brentq`	`find_zero(f, x0)` (Roots.jl)
Numerical integral	`integrate(f, a, b)`	`scipy.integrate.quad`	`quadgk(f, a, b)` (QuadGK.jl)
Derivative	`deriv()` / `numDeriv::grad`	`scipy.misc`/autograd, or finite diff	`ForwardDiff.derivative`

import numpy as np
from scipy import optimize, integrate
rng = np.random.default_rng(1)

optimize.minimize_scalar(lambda t: (t - 3)**2)      # minimize
optimize.brentq(lambda t: t**2 - 2, 0, 2)           # root: sqrt(2)
integrate.quad(lambda t: np.exp(-t**2), 0, 1)       # integral

using Random, Distributions, Optim, QuadGK, ForwardDiff, Roots
Random.seed!(1)

optimize(t -> (t[1] - 3)^2, [0.0])                  # minimize
find_zero(t -> t^2 - 2, 1.0)                        # root: sqrt(2)
quadgk(t -> exp(-t^2), 0, 1)                        # integral
ForwardDiff.derivative(t -> t^3, 2.0)               # exact derivative = 12

Worked mini simulation study: coverage of a confidence interval

Question: for the sample mean of $n$ normal observations, does the usual 95% t-interval actually contain the true mean 95% of the time? We can check by simulation.

The recipe is the workhorse pattern — generate, fit, record, repeat:

set.seed(2026)

mu_true    <- 5
n          <- 20
n_sims     <- 10000
conf_level <- 0.95

covered <- replicate(n_sims, {
  y  <- rnorm(n, mean = mu_true, sd = 1)   # 1. generate from known DGP
  ci <- t.test(y, conf.level = conf_level)$conf.int  # 2. fit / estimate
$  ci[1] <= mu_true && mu_true <= ci[2]     # 3. record: did it cover?
})

mean(covered)   # should be ~0.95
#> [1] 0.9503

The same study in Python:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2026)
mu_true, n, n_sims = 5, 20, 10_000

covered = np.empty(n_sims, dtype=bool)
for i in range(n_sims):
    y = rng.normal(mu_true, 1.0, size=n)          # generate
    lo, hi = stats.t.interval(0.95, n - 1,        # fit
                              loc=y.mean(),
                              scale=stats.sem(y))
    covered[i] = lo <= mu_true <= hi              # record

covered.mean()   # ~0.95

And in Julia:

using Random, Statistics, Distributions

Random.seed!(2026)
mu_true, n, n_sims = 5.0, 20, 10_000

covered = falses(n_sims)
for i in 1:n_sims
    y  = rand(Normal(mu_true, 1.0), n)            # generate
    m  = mean(y); se = std(y) / sqrt(n)
    tc = quantile(TDist(n - 1), 0.975)            # fit
    covered[i] = (m - tc*se) <= mu_true <= (m + tc*se)   # record
end

mean(covered)   # ~0.95

Why this works

Repeating the draw-and-estimate loop many times approximates the estimator’s sampling distribution, and the average of an indicator (like “did the interval cover?”) converges to its true probability by the law of large numbers. That is the engine behind every coverage check, power calculation, and bias estimate you’ll run by simulation.

Practical tips:

Seed once, at the top, and report the seed.
Vary one thing at a time (e.g. loop over n) to see how behavior scales.
Start small (n_sims <- 100) to debug, then crank it up for the final numbers.
Vectorize or use replicate/comprehensions rather than growing objects in a loop.

A Simulation Toolkit

Start from a known data-generating process

Core verbs, across three languages

Worked mini simulation study: coverage of a confidence interval

Why this works

Related