Population Structure and F_ST
Population structure means a sample is not one randomly mating population but a mixture of subpopulations with different allele frequencies. Ignoring it is dangerous: it deflates heterozygosity, mimics inbreeding, and — most importantly for epidemiology — creates spurious genotype–disease associations when ancestry happens to correlate with exposure or risk.
The Wahlund effect
Suppose a total population is subdivided into subpopulations that each mate randomly internally but rarely exchange migrants. Within each subpopulation, genotype frequencies follow Hardy–Weinberg equilibrium, but when you pool the subpopulations and compute genotype frequencies as if they were one unit, you find fewer heterozygotes than the Hardy–Weinberg expectation at the pooled allele frequency, $2\bar{p}(1-\bar{p})$, would predict. This deficit of heterozygotes, caused purely by lumping together subpopulations with different allele frequencies, is the Wahlund effect. It arises because the variance in allele frequency among subpopulations subtracts directly from the pooled heterozygosity: the deficit equals twice that variance.
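The deficit can be checked numerically. The following is a minimal sketch, assuming two equal-sized subpopulations that are each in Hardy–Weinberg equilibrium; the function name `pooled_genotype_freqs` and the allele frequencies 0.7 and 0.3 are illustrative choices, not part of any library.

```python
def pooled_genotype_freqs(p_list, weights=None):
    """Pool subpopulations that are each in Hardy-Weinberg equilibrium.

    p_list  : allele-A frequency in each subpopulation
    weights : relative subpopulation sizes (assumption: equal if omitted)
    Returns (freq_AA, freq_Aa, freq_aa) in the pooled sample.
    """
    k = len(p_list)
    if weights is None:
        weights = [1.0 / k] * k
    f_AA = sum(w * p * p for w, p in zip(weights, p_list))
    f_Aa = sum(w * 2 * p * (1 - p) for w, p in zip(weights, p_list))
    f_aa = sum(w * (1 - p) ** 2 for w, p in zip(weights, p_list))
    return f_AA, f_Aa, f_aa


p_list = [0.7, 0.3]  # illustrative allele frequencies
f_AA, f_Aa, f_aa = pooled_genotype_freqs(p_list)

# Allele frequency of the pooled sample, recovered from genotype frequencies.
p_bar = f_AA + f_Aa / 2
# Heterozygosity expected if the pooled sample were one random-mating unit.
expected_het = 2 * p_bar * (1 - p_bar)
# Variance of allele frequency among the (equal-sized) subpopulations.
var_p = sum((p - p_bar) ** 2 for p in p_list) / len(p_list)

print(f"observed heterozygotes: {f_Aa:.3f}")
print(f"expected under HWE:     {expected_het:.3f}")
print(f"deficit:                {expected_het - f_Aa:.3f}")
print(f"2 * Var(p):             {2 * var_p:.3f}")
```

With these inputs the observed heterozygote frequency falls below the pooled Hardy–Weinberg expectation, and the shortfall matches twice the among-subpopulation variance in allele frequency.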
Defining F_ST
Wright's fixation index $F_{ST}$ quantifies how much of the genetic variation is due to differences among subpopulations. Write the expected heterozygosity of a population with allele frequency $p$ as $H(p) = 2p(1-p)$. Define:
- $H_S$: the mean expected heterozygosity within subpopulations, averaged over the $k$ subpopulations, $H_S = \frac{1}{k}\sum_{i=1}^{k} 2p_i(1-p_i)$.
- $H_T$: the expected heterozygosity of the pooled population, computed from the mean allele frequency $\bar{p}$, so $H_T = 2\bar{p}(1-\bar{p})$.
Then
$$F_{ST} = \frac{H_T - H_S}{H_T}.$$
Because subpopulation differentiation always makes $H_S \le H_T$, $F_{ST}$ lies between $0$ and $1$: it is the fraction of total heterozygosity "lost" to structure. The quantity $H_T - H_S$ is exactly twice the variance in allele frequency among subpopulations (the Wahlund variance), so $F_{ST} = \sigma_p^2 / [\bar{p}(1-\bar{p})]$, which is why $F_{ST}$ can be read as a standardized among-group variance.
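A short derivation makes the link between $H_T - H_S$ and the among-subpopulation variance explicit, including the constant factor. It is a sketch under the assumption of $k$ equal-sized subpopulations; the symbols $k$, $p_i$, and $\sigma_p^2$ (the variance of the $p_i$ around $\bar{p}$) are notation introduced here for illustration.

```latex
\begin{aligned}
H_S &= \frac{1}{k}\sum_{i=1}^{k} 2p_i(1-p_i)
     = 2\bar{p} - \frac{2}{k}\sum_{i=1}^{k} p_i^2, \\
\sigma_p^2 &= \frac{1}{k}\sum_{i=1}^{k} p_i^2 - \bar{p}^2
  \quad\Longrightarrow\quad
  H_S = 2\bar{p} - 2\left(\sigma_p^2 + \bar{p}^2\right)
      = 2\bar{p}(1-\bar{p}) - 2\sigma_p^2
      = H_T - 2\sigma_p^2, \\
F_{ST} &= \frac{H_T - H_S}{H_T}
        = \frac{2\sigma_p^2}{2\bar{p}(1-\bar{p})}
        = \frac{\sigma_p^2}{\bar{p}(1-\bar{p})}.
\end{aligned}
```

Since $\sigma_p^2 \ge 0$, the same lines show that $H_S \le H_T$, with equality only when every subpopulation has the same allele frequency.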
Interpreting the magnitude
As a rough guide (following Wright), $F_{ST}$ below $0.05$ indicates little differentiation, $0.05$–$0.15$ moderate, $0.15$–$0.25$ great, and above $0.25$ very great differentiation. Human continental groups sit around $0.10$–$0.15$; strongly isolated demes or bottlenecked pathogen populations can be far higher.
Worked example
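Take two subpopulations of equal size with allele frequencies $p_1 = 0.7$ and $p_2 = 0.3$ at a biallelic locus.
Within-subpopulation heterozygosities:
$$2(0.7)(0.3) = 0.42, \qquad 2(0.3)(0.7) = 0.42,$$
so $H_S = 0.42$. The mean allele frequency is $\bar{p} = 0.5$, giving
$$H_T = 2(0.5)(0.5) = 0.5.$$
Therefore
$$F_{ST} = \frac{0.5 - 0.42}{0.5} = 0.16.$$
An $F_{ST}$ of $0.16$ signals substantial differentiation: $16\%$ of the total heterozygosity is accounted for by the frequency difference between the two groups.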
Why it matters for association studies
Structure is a leading source of confounding in genetic epidemiology.
- If disease prevalence and allele frequency both vary across ancestral groups, pooling them produces a spurious correlation — the mechanism behind population stratification in a GWAS.
- Methods that estimate and adjust for ancestry (principal components computed from genome-wide markers, mixed models) work precisely by controlling the structure that $F_{ST}$ measures; the leading principal-component axes are the top eigenvectors of the genotype covariance matrix.
- Beyond confounding, $F_{ST}$ across the genome reflects demographic history: migration homogenizes allele frequencies ($F_{ST} \to 0$), while isolation and drift differentiate them.
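The stratification mechanism in the first bullet can be made concrete with expected frequencies alone. The sketch below assumes two equal-sized ancestral groups in which the allele has no effect on disease; the allele frequencies, prevalences, and helper names (`allele_by_disease_table`, `odds_ratio`) are hypothetical values chosen for illustration.

```python
def allele_by_disease_table(p_allele, prevalence):
    """Joint frequencies of allele (A/a) by status (case/control).

    Allele and disease are independent within the group, so each
    cell is just the product of the two marginal frequencies.
    """
    return {
        ("A", "case"): p_allele * prevalence,
        ("A", "control"): p_allele * (1 - prevalence),
        ("a", "case"): (1 - p_allele) * prevalence,
        ("a", "control"): (1 - p_allele) * (1 - prevalence),
    }


def odds_ratio(table):
    """Odds ratio for carrying allele A among cases versus controls."""
    return (table[("A", "case")] * table[("a", "control")]) / (
        table[("A", "control")] * table[("a", "case")]
    )


# Hypothetical groups: both allele frequency and prevalence differ by ancestry.
groups = {
    "group_1": {"p_allele": 0.7, "prevalence": 0.20},
    "group_2": {"p_allele": 0.3, "prevalence": 0.05},
}

tables = {name: allele_by_disease_table(**params) for name, params in groups.items()}

# Pool the two groups with equal weight, ignoring ancestry.
cells = list(tables["group_1"].keys())
pooled = {cell: sum(t[cell] for t in tables.values()) / len(tables) for cell in cells}

or_within = {name: odds_ratio(t) for name, t in tables.items()}
or_pooled = odds_ratio(pooled)

for name, value in or_within.items():
    print(f"{name}: odds ratio = {value:.3f}")
print(f"pooled:  odds ratio = {or_pooled:.3f}")
```

Within each group the odds ratio is 1 by construction, yet the pooled table gives an odds ratio of roughly 1.75. The apparent association comes entirely from ancestry being correlated with both allele frequency and prevalence.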
In code
R
p <- c(0.7, 0.3) # allele freqs in two equal-sized subpops
het <- function(p) 2 * p * (1 - p)
HS <- mean(het(p)) # 0.42
pbar <- mean(p) # 0.5
HT <- het(pbar) # 0.5
FST <- (HT - HS) / HT # 0.16
c(HS = HS, HT = HT, FST = FST)
Python
import numpy as np
p = np.array([0.7, 0.3]) # two equal-sized subpopulations
het = lambda p: 2 * p * (1 - p)
HS = het(p).mean() # 0.42
pbar = p.mean() # 0.5
HT = 2 * pbar * (1 - pbar) # 0.5
FST = (HT - HS) / HT # 0.16
print(HS, HT, FST)
0.42000000000000004 0.5 0.15999999999999992
Julia
using Statistics # standard library; provides mean
p = [0.7, 0.3] # two equal-sized subpopulations
het(p) = 2 .* p .* (1 .- p)
HS = mean(het(p)) # 0.42
pbar = mean(p) # 0.5
HT = 2 * pbar * (1 - pbar) # 0.5
FST = (HT - HS) / HT # 0.16
println((HS, HT, FST))
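The snippets above hard-code two equal-sized subpopulations. As a hedged extension, the sketch below wraps the same heterozygosity-ratio calculation in a function that accepts any number of subpopulations and optional relative sizes. The function name `fst` and the `weights` argument are assumptions made for illustration, and this is the simple frequency-based definition used in this section, not a sample-size-corrected estimator.

```python
import math


def fst(p, weights=None):
    """F_ST = (H_T - H_S) / H_T from subpopulation allele frequencies.

    p       : allele frequencies, one per subpopulation
    weights : relative subpopulation sizes (assumption: equal if omitted)
    Returns NaN when the pooled locus is monomorphic (H_T = 0).
    """
    k = len(p)
    if weights is None:
        weights = [1.0] * k
    total = sum(weights)
    w = [x / total for x in weights]  # normalize so the weights sum to 1

    # Weighted mean within-subpopulation expected heterozygosity.
    HS = sum(wi * 2 * pi * (1 - pi) for wi, pi in zip(w, p))
    # Expected heterozygosity at the weighted mean allele frequency.
    pbar = sum(wi * pi for wi, pi in zip(w, p))
    HT = 2 * pbar * (1 - pbar)

    if HT == 0:
        return math.nan
    return (HT - HS) / HT


print(fst([0.7, 0.3]))            # the worked example, equal sizes
print(fst([0.7, 0.3], [3, 1]))    # same frequencies, unequal sizes
print(fst([0.5, 0.5, 0.5]))       # no differentiation
print(fst([1.0, 0.0]))            # alternative alleles fixed
```

Weighting by subpopulation size changes both $\bar{p}$ and $H_T$, so the same pair of allele frequencies can give a different value when the groups are unequal in size.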
Why it matters
$F_{ST}$ turns a qualitative worry ("are my samples really one population?") into a single number linking heterozygosity, among-group variance, and demographic history. That number is both a lens on migration and isolation and a red flag for confounding, which is why quantifying and adjusting for structure is a non-negotiable step before trusting any genotype–phenotype association.