Instrumental Variables

When an unmeasured confounder distorts the relationship between an exposure and an outcome, ordinary regression estimates the wrong thing. Instrumental variables (IV) exploit a special variable $Z$ to recover the causal effect even when confounding cannot be measured or adjusted for.

The problem: confounding bias

Suppose we want the causal effect $\beta$ of an exposure $X$ on an outcome $Y$, but an unmeasured confounder $U$ influences both:

$$Y = \beta X + U + \varepsilon, \qquad X = \alpha U + \text{(other causes)}.$$

Because $X$ and the error term share $U$, the regressor is correlated with the disturbance ($\operatorname{Cov}(X, U) \ne 0$). Ordinary least squares (OLS) is then biased and inconsistent: it estimates a mixture of the causal effect and the confounding association, not $\beta$.
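The size of that mixture can be written down for the model above. As a short worked step, assuming additionally that $\varepsilon$ is uncorrelated with $X$ and that the other causes of $X$ are uncorrelated with $U$ (assumptions made here for illustration), the OLS slope converges to

$$\hat\beta_{\text{OLS}} \;\to\; \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = \beta + \frac{\operatorname{Cov}(X, U)}{\operatorname{Var}(X)} = \beta + \frac{\alpha \operatorname{Var}(U)}{\operatorname{Var}(X)}.$$

The second term is the confounding bias. It vanishes only when $\alpha = 0$, that is, when $U$ does not influence the exposure, and it does not shrink as the sample grows.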

The instrument and its three assumptions

An instrument $Z$ is a variable that lets us isolate the part of $X$ that is “as good as randomly assigned.” It must satisfy three assumptions:

  1. Relevance. $Z$ is associated with the exposure: $\operatorname{Cov}(Z, X) \ne 0$. This is testable.
  2. Independence (exogeneity). $Z$ is independent of the confounders: $\operatorname{Cov}(Z, U) = 0$. Not directly testable.
  3. Exclusion restriction. $Z$ affects $Y$ only through $X$; there is no direct path $Z \to Y$. Not directly testable.

Intuitively, $Z$ nudges $X$ without touching $U$ or $Y$ by any other route, so the induced change in $Y$ can be attributed to $X$ alone.
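The contrast between the testable and untestable assumptions can be illustrated with simulated data, where the confounder is visible to us even though it would not be in practice. The sketch below is only an illustration under assumed coefficients (0.8 for the instrument, 1.0 for the confounder); the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)                  # confounder (observable only in a simulation)
z = rng.normal(size=n)                  # instrument, drawn independently of u
x = 0.8 * z + 1.0 * u + rng.normal(size=n)

# Relevance: checkable in real data, because both Z and X are observed.
corr_zx = np.corrcoef(z, x)[0, 1]

# Independence: checkable here only because the simulation exposes U.
corr_zu = np.corrcoef(z, u)[0, 1]

print(f"corr(Z, X) = {corr_zx:.3f}")
print(f"corr(Z, U) = {corr_zu:.3f}")
```

In a real analysis only the first correlation can be computed; the second has to be argued from subject-matter knowledge or study design.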

Estimators

The Wald ratio

With a single instrument, take covariances of the outcome equation with $Z$. Under independence and exclusion the confounder and direct terms drop out:

$$\operatorname{Cov}(Z, Y) = \beta \,\operatorname{Cov}(Z, X) + \underbrace{\operatorname{Cov}(Z, U)}_{=\,0} + \underbrace{\operatorname{Cov}(Z, \varepsilon)}_{=\,0}.$$

Solving for $\beta$ gives the Wald (ratio) estimator:

$$\hat\beta = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, X)}.$$

For a binary instrument this equals the difference in mean outcome divided by the difference in mean exposure across the two groups.
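The binary-instrument identity can be checked numerically. The following sketch assumes a coin-flip instrument and the same illustrative coefficients as elsewhere in this article; it computes the covariance ratio and the ratio of mean differences on the same sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

u = rng.normal(size=n)                    # unmeasured confounder
z = rng.integers(0, 2, size=n)            # binary instrument (0 or 1)
x = 0.8 * z + u + rng.normal(size=n)      # exposure
y = 2.0 * x + u + rng.normal(size=n)      # outcome, true effect = 2

# Wald ratio written with sample covariances
cov_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Same estimator written as a ratio of group mean differences
dy = y[z == 1].mean() - y[z == 0].mean()
dx = x[z == 1].mean() - x[z == 0].mean()
diff_means = dy / dx

print(cov_ratio, diff_means)
```

The two numbers agree up to floating-point error, because for a binary $Z$ each sample covariance is the corresponding mean difference multiplied by the same factor, which cancels in the ratio.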

Two-stage least squares (2SLS)

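With one or more instruments, the standard estimator is 2SLS:

  1. First stage. Regress the exposure $X$ on the instrument(s) $Z$ and keep the fitted values $\hat X$.
  2. Second stage. Regress the outcome $Y$ on $\hat X$; the coefficient on $\hat X$ is the 2SLS estimate of $\beta$.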

With a single instrument, 2SLS is algebraically identical to the Wald ratio. In matrix form, with instrument matrix $Z$,

$$\hat\beta_{2\text{SLS}} = \left(\hat X^\top \hat X\right)^{-1} \hat X^\top Y, \qquad \hat X = Z (Z^\top Z)^{-1} Z^\top X.$$
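The matrix formula can be sketched directly with NumPy. This is a minimal illustration, not a replacement for a dedicated IV routine: the function name `two_stage_least_squares` is hypothetical, an intercept column is assumed in both the regressor and instrument matrices, and least-squares solves are used in place of explicit matrix inverses.

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """Point estimate (Xhat' Xhat)^{-1} Xhat' y with Xhat = Z (Z'Z)^{-1} Z' X."""
    # Stage 1: project every column of X onto the column space of Z.
    first_stage_coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ first_stage_coef
    # Stage 2: regress y on the fitted values.
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 20_000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * x + u + rng.normal(size=n)      # true effect = 2

const = np.ones(n)
X = np.column_stack([const, x])           # intercept + exposure
Z = np.column_stack([const, z])           # intercept + instrument

beta_2sls = two_stage_least_squares(y, X, Z)
wald = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(beta_2sls[1], wald)
```

With one instrument the slope `beta_2sls[1]` matches the Wald ratio up to floating-point error. The sketch returns point estimates only; standard errors taken from a naive second-stage regression are not valid, because they are based on residuals computed with $\hat X$ rather than $X$.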

Weak-instrument bias

If relevance is only barely satisfied ($\operatorname{Cov}(Z, X)$ near zero), the denominator is small and estimates become unstable, biased toward the OLS estimate, with poor confidence-interval coverage. A common rule of thumb is a first-stage $F$-statistic above 10; weaker instruments demand caution.
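The first-stage $F$-statistic in that rule of thumb can be computed by comparing the first-stage regression with an intercept-only model. The sketch below assumes homoskedastic errors and uses a hypothetical helper `first_stage_f`; the "weak" coefficient of 0.02 is an arbitrary value chosen for illustration.

```python
import numpy as np

def first_stage_f(x, Z):
    """F-statistic for H0: all instrument coefficients are zero in x ~ 1 + Z."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    Z = np.asarray(Z, dtype=float).reshape(n, -1)   # accept one or several instruments
    k = Z.shape[1]
    design = np.column_stack([np.ones(n), Z])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    rss_full = np.sum((x - design @ coef) ** 2)     # residual SS with instruments
    rss_null = np.sum((x - x.mean()) ** 2)          # residual SS, intercept only
    return ((rss_null - rss_full) / k) / (rss_full / (n - k - 1))

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)
z = rng.normal(size=n)

x_strong = 0.8 * z + u + rng.normal(size=n)     # instrument clearly moves the exposure
x_weak = 0.02 * z + u + rng.normal(size=n)      # instrument barely moves the exposure

f_strong = first_stage_f(x_strong, z)
f_weak = first_stage_f(x_weak, z)
print(f"strong: F = {f_strong:.1f}, weak: F = {f_weak:.1f}")
```

Under these assumed coefficients the strong instrument gives an $F$ far above 10, while the weak one would typically fall well below it, although any single sample can deviate.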

Worked simulation

We generate data where OLS is badly confounded, then show that 2SLS recovers the true effect $\beta = 2$.

R

set.seed(1)
n <- 5000
U <- rnorm(n)                          # unmeasured confounder
Z <- rnorm(n)                          # instrument
X <- 0.8 * Z + 1.0 * U + rnorm(n)      # exposure depends on Z and U
beta <- 2
Y <- beta * X + 2.0 * U + rnorm(n)     # U confounds X and Y

# Naive OLS: biased upward (U inflates the X-Y association)
# Large-sample value: 2 + 2 * Cov(X, U) / Var(X) = 2 + 2 / 2.64
coef(lm(Y ~ X))["X"]                   # ~ 2.76

# Wald ratio / manual 2SLS
cov(Z, Y) / cov(Z, X)                  # ~ 2.00

# 2SLS via AER
# install.packages("AER")
library(AER)
coef(ivreg(Y ~ X | Z))["X"]           # ~ 2.00

Python

import numpy as np
from linearmodels.iv import IV2SLS
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
U = rng.normal(size=n)
Z = rng.normal(size=n)
X = 0.8 * Z + 1.0 * U + rng.normal(size=n)
beta = 2
Y = beta * X + 2.0 * U + rng.normal(size=n)

# Naive OLS: biased upward; large-sample value is 2 + 2 / 2.64
sm.OLS(Y, sm.add_constant(X)).fit().params[1]      # ~ 2.76

# Wald ratio / manual 2SLS
np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]            # ~ 2.00

# 2SLS via linearmodels: IV2SLS(dependent, exog, endog, instruments)
res = IV2SLS(Y, np.ones(n), X, Z).fit()
res.params["endog"]                                 # ~ 2.00

Julia

using Random, Statistics

Random.seed!(1)
n = 5000
U = randn(n)
Z = randn(n)
X = 0.8 .* Z .+ 1.0 .* U .+ randn(n)
beta = 2
Y = beta .* X .+ 2.0 .* U .+ randn(n)

# Naive OLS slope: biased upward (~2.76, i.e. 2 + 2 / 2.64)
Xo = hcat(ones(n), X)
(Xo \ Y)[2]

# Wald ratio (~2.00)
cov(Z, Y) / cov(Z, X)

# Manual 2SLS via least squares (\)
Zm  = hcat(ones(n), Z)
Xhat = Zm * (Zm \ X)            # stage 1 fitted exposure
Xh  = hcat(ones(n), Xhat)
(Xh \ Y)[2]                     # stage 2 slope ~ 2.00

Why it matters for statistics

Instrumental variables extend causal estimation beyond the reach of adjustment: they identify effects when the confounders are unknown or unmeasured, which is the usual predicament in observational epidemiology and economics. Understanding the relevance, independence, and exclusion assumptions, and the fact that the last two cannot be directly tested, is central to judging when an IV analysis is credible, and it underpins related designs such as Mendelian randomization and natural experiments.