Population Stratification and PCA Control
The most notorious confounder in association studies is ancestry: if the groups being compared differ in genetic background, the genome lights up with associations that have nothing to do with the trait. Detecting and correcting this “population stratification” is what makes a GWAS trustworthy.
The confounding mechanism
Population stratification arises when three things line up: the sample contains subpopulations, those subpopulations differ in allele frequency, and they also differ in the trait for reasons unrelated to that allele. Then any SNP whose frequency tracks ancestry becomes correlated with the trait, producing a spurious association. The classic illustration is a “chopsticks gene”: in a sample mixing two ancestries, any variant more common in one group appears associated with chopstick use, purely through ancestry as the confounder. This is exactly the confounding structure that good experimental design tries to break by balancing groups.
Detecting inflation
If stratification is inflating associations, test statistics across the whole genome are systematically too large. This shows up as an early, uniform lift of the QQ plot off its diagonal, quantified by the genomic inflation factor
the ratio of the observed median chi-square statistic to the null median expected for one degree of freedom. A clean study has ; values well above (say at moderate sample sizes) flag confounding that must be corrected before trusting any hit.
Correcting with principal components
The standard correction is to include the top principal components of the genotype matrix as covariates in every SNP regression. Stack the standardized genotypes into an matrix (people by SNPs, each column centered and scaled). The principal components are the leading eigenvectors of the genotype covariance (equivalently the top singular vectors of ):
Because ancestry is the largest source of correlated allele-frequency variation, the leading eigenvectors are precisely the axes of ancestry, and each person’s scores on them place them along those axes. Adding the top few PCs as covariates lets the regression absorb ancestry, leaving to reflect within-ancestry association only. This is why stratification correction is, at heart, an eigenvalue/eigenvector computation on the genotype matrix — the same population structure that PCA reveals is what it removes.
Simulated example
Simulate two subpopulations of 300 people each. A causal-looking SNP has effect-allele frequency in population A and in population B, and the trait mean is higher in A for reasons unrelated to genotype. Regressing the trait on this SNP alone yields a strongly “significant” but entirely spurious association; adding the first principal component of the genome-wide genotypes as a covariate collapses it back toward null.
R
set.seed(42)
n <- 600; pop <- rep(c("A", "B"), each = 300)
# Genome-wide SNPs with ancestry-differentiated frequencies
freq <- ifelse(pop == "A", 0.7, 0.2)
G <- sapply(1:200, function(j) rbinom(n, 2, ifelse(pop=="A",
runif(1,.5,.8), runif(1,.1,.4))))
snp <- rbinom(n, 2, freq) # candidate SNP, no true effect
y <- ifelse(pop == "A", 1, 0) + rnorm(n) # trait differs by pop only
summary(lm(y ~ snp))$coefficients["snp", 4] # tiny p: SPURIOUS
$
pcs <- prcomp(scale(G))$x[, 1:2] # top 2 PCs = ancestry
$summary(lm(y ~ snp + pcs))$coefficients["snp", 4] # p ~ non-significant
$```
### Python
```python
import numpy as np, statsmodels.api as sm
from sklearn.decomposition import PCA
rng = np.random.default_rng(42)
n = 600; pop = np.array([0]*300 + [1]*300) # 0=A, 1=B
G = np.column_stack([rng.binomial(2, np.where(pop==0,
rng.uniform(.5,.8), rng.uniform(.1,.4))) for _ in range(200)]).astype(float)
snp = rng.binomial(2, np.where(pop==0, 0.7, 0.2)).astype(float) # no true effect
y = (pop == 0).astype(float) + rng.normal(size=n) # trait differs by pop
Zs = (G - G.mean(0)) / G.std(0)
print(sm.OLS(y, sm.add_constant(snp)).fit().pvalues[1]) # tiny: spurious
pcs = PCA(n_components=2).fit_transform(Zs)
X = sm.add_constant(np.column_stack([snp, pcs]))
print(sm.OLS(y, X).fit().pvalues[1]) # non-significant
4.130440298128386e-11
0.8336821049074644
Julia
using Statistics, GLM, DataFrames, LinearAlgebra, Random, Distributions
Random.seed!(42)
n = 600; pop = vcat(fill(0, 300), fill(1, 300))
G = Float64.(hcat([rand.(Binomial.(2, ifelse.(pop.==0, 0.65, 0.25))) for _ in 1:200]...))
snp = Float64.(rand.(Binomial.(2, ifelse.(pop.==0, 0.7, 0.2)))) # no true effect
y = Float64.(pop .== 0) .+ randn(n)
df = DataFrame(y=y, snp=snp)
println(coeftable(lm(@formula(y ~ snp), df)).cols[4][2]) # tiny p: spurious
Z = (G .- mean(G, dims=1)) ./ std(G, dims=1)
pcs = eigen(Symmetric(cov(Z))).vectors[:, end:-1:end-1] # top 2 eigenvectors
scores = Z * pcs
df2 = DataFrame(y=y, snp=snp, pc1=scores[:,1], pc2=scores[:,2])
println(coeftable(lm(@formula(y ~ snp + pc1 + pc2), df2)).cols[4][2]) # non-sig
Why it matters
Uncontrolled stratification produced a generation of unreplicable candidate-gene claims; PCA-based correction is the standard that helped GWAS become highly reproducible. The idea generalizes: whenever a high-dimensional dataset hides latent group structure, projecting out the leading eigenvectors is a principled way to adjust for it. Reporting and PC-adjusted results is now a basic expectation of credible genetic analysis.