Population Structure and F_ST

Population structure means a sample is not one randomly mating population but a mixture of subpopulations with different allele frequencies. Ignoring it is dangerous: it deflates heterozygosity, mimics inbreeding, and — most importantly for epidemiology — creates spurious genotype–disease associations when ancestry happens to correlate with exposure or risk.

The Wahlund effect

Suppose a total population is subdivided into subpopulations that each mate randomly internally but rarely exchange migrants. Within each subpopulation, genotype frequencies follow Hardy–Weinberg equilibrium, but when you pool the subpopulations and compute genotype frequencies as if they were one unit, you find fewer heterozygotes than $2\bar p\bar q$ would predict, where $\bar p$ is the mean allele frequency across subpopulations and $\bar q = 1 - \bar p$. This deficit of heterozygotes, caused purely by lumping together subpopulations with different allele frequencies, is the Wahlund effect. It arises because the variance in allele frequency among subpopulations subtracts directly from the pooled heterozygosity: for equal-sized subpopulations, the deficit is twice that variance.
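The heterozygote deficit can be checked numerically. The following is a minimal sketch, assuming two equal-sized subpopulations that are each in Hardy–Weinberg equilibrium; the allele frequencies 0.8 and 0.2 and the helper name `genotype_freqs` are illustrative choices, not part of the text above.

```python
import numpy as np

def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies (AA, Aa, aa) for allele frequency p."""
    q = 1.0 - p
    return np.array([p * p, 2.0 * p * q, q * q])

# Hypothetical allele frequencies in two equal-sized subpopulations.
sub_p = np.array([0.8, 0.2])

# Pooling equal-sized groups averages their genotype frequencies.
pooled = np.mean([genotype_freqs(p) for p in sub_p], axis=0)

p_bar = sub_p.mean()
expected_het = 2.0 * p_bar * (1.0 - p_bar)  # what HWE predicts for the pooled sample
observed_het = pooled[1]                    # heterozygote frequency actually present
var_p = sub_p.var()                         # population variance (ddof=0)

print(f"observed heterozygosity: {observed_het:.3f}")
print(f"expected under HWE:      {expected_het:.3f}")
print(f"deficit: {expected_het - observed_het:.3f}  vs  2*Var(p): {2.0 * var_p:.3f}")
```

With these inputs the deficit and twice the variance agree, which is the Wahlund effect in miniature.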

Defining F_ST

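Wright’s fixation index $F_{ST}$ quantifies how much of the genetic variation is due to differences among subpopulations. Write the expected heterozygosity of a population with allele frequency $p$ as $H = 2p(1-p)$. Define:

$H_S$: the mean expected heterozygosity within subpopulations, that is, the average of $2p_i(1-p_i)$ over subpopulations with allele frequencies $p_i$;
$H_T$: the expected heterozygosity of the total population, $2\bar p(1-\bar p)$, computed from the mean allele frequency $\bar p$.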

Then
$$F_{ST} = \frac{H_T - H_S}{H_T} .$$
Because subpopulation differentiation always makes $H_S \le H_T$, $F_{ST}$ lies between $0$ and $1$: it is the fraction of total heterozygosity “lost” to structure. For equal-sized subpopulations, the quantity $H_T - H_S$ equals twice the variance in allele frequency among subpopulations (the Wahlund variance), so $F_{ST} = \sigma_p^2 / [\bar p(1-\bar p)]$, which is why $F_{ST}$ can be read as a standardized among-group variance.
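The link between $F_{ST}$ and the among-subpopulation variance can be verified in a few lines. This sketch assumes $K$ equal-sized subpopulations with allele frequencies $p_i$, takes $H_S$ as the mean within-subpopulation heterozygosity and $H_T$ as the heterozygosity at the mean allele frequency $\bar p$, and writes $\sigma_p^2$ for the population variance of the $p_i$; the symbols $K$, $p_i$ and $\sigma_p^2$ are introduced here for illustration.

```latex
\begin{align*}
H_S &= \frac{1}{K}\sum_{i=1}^{K} 2p_i(1-p_i)
     = 2\bar p - \frac{2}{K}\sum_{i=1}^{K} p_i^2
     = 2\bar p - 2\left(\bar p^{2} + \sigma_p^2\right)
     = 2\bar p(1-\bar p) - 2\sigma_p^2, \\
H_T - H_S &= 2\sigma_p^2, \\
F_{ST} &= \frac{2\sigma_p^2}{2\bar p(1-\bar p)}
        = \frac{\sigma_p^2}{\bar p(1-\bar p)} .
\end{align*}
```

The second step uses $\frac{1}{K}\sum_i p_i^2 = \bar p^{2} + \sigma_p^2$, the usual decomposition of a mean square into squared mean plus variance.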

Interpreting the magnitude

As a rough guide (following Wright), $F_{ST}$ below $0.05$ indicates little differentiation, $0.05$–$0.15$ moderate, $0.15$–$0.25$ great, and above $0.25$ very great differentiation. Human continental groups sit around $0.10$–$0.15$; strongly isolated demes or bottlenecked pathogen populations can be far higher.
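The qualitative bands above can be written as a small lookup. This is only a sketch: the function name `wright_category` is hypothetical, and the handling of the exact boundary values (which band 0.05, 0.15 and 0.25 fall into) is an assumption, since the guide itself is approximate.

```python
def wright_category(fst: float) -> str:
    """Map an F_ST value onto Wright's rough qualitative guide.

    Boundary handling is an assumption of this sketch; the guide is
    a rule of thumb, not a sharp classification.
    """
    if not 0.0 <= fst <= 1.0:
        raise ValueError("F_ST is expected to lie in [0, 1]")
    if fst < 0.05:
        return "little"
    if fst < 0.15:
        return "moderate"
    if fst <= 0.25:
        return "great"
    return "very great"

# Illustrative values only.
for value in (0.02, 0.10, 0.16, 0.40):
    print(f"F_ST = {value:.2f}: {wright_category(value)} differentiation")
```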

Worked example

Take two subpopulations of equal size with allele frequencies $p_1 = 0.7$ and $p_2 = 0.3$ at a biallelic locus.

Within-subpopulation heterozygosities:
$$H_1 = 2(0.7)(0.3) = 0.42, \qquad H_2 = 2(0.3)(0.7) = 0.42,$$
so $H_S = \tfrac{1}{2}(0.42 + 0.42) = 0.42$. The mean allele frequency is $\bar p = \tfrac{1}{2}(0.7 + 0.3) = 0.5$, giving
$$H_T = 2(0.5)(0.5) = 0.5 .$$
Therefore
$$F_{ST} = \frac{0.5 - 0.42}{0.5} = \frac{0.08}{0.5} = 0.16 .$$
An $F_{ST}$ of $0.16$ signals substantial differentiation: $16\%$ of the total heterozygosity is accounted for by the frequency difference between the two groups.

Why it matters for association studies

Structure is a leading source of confounding in genetic epidemiology.
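A small expected-value calculation shows how this confounding can arise. The sketch below assumes two equal-sized subpopulations with allele frequencies 0.7 and 0.3 and disease risks of 20% and 5% that do not depend on genotype; the risk figures are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical inputs (not from any real dataset).
weights = np.array([0.5, 0.5])        # subpopulation shares of the sample
allele_freq = np.array([0.7, 0.3])    # frequency of allele A in each subpopulation
prevalence = np.array([0.20, 0.05])   # disease risk, independent of genotype

def odds(x):
    """Convert a probability (or array of probabilities) to odds."""
    return x / (1.0 - x)

# Share of cases and of controls drawn from each subpopulation.
case_share = weights * prevalence
case_share /= case_share.sum()
control_share = weights * (1.0 - prevalence)
control_share /= control_share.sum()

# Because risk ignores genotype, the allele frequency among cases (or controls)
# is just the ancestry-weighted mix of the subpopulation frequencies.
p_cases = float(case_share @ allele_freq)
p_controls = float(control_share @ allele_freq)
pooled_or = odds(p_cases) / odds(p_controls)

# Within a subpopulation, cases and controls share the same allele frequency,
# so each stratum-specific allelic odds ratio is 1 by construction.
stratified_or = odds(allele_freq) / odds(allele_freq)

print(f"allele A frequency in cases:    {p_cases:.3f}")
print(f"allele A frequency in controls: {p_controls:.3f}")
print(f"pooled allelic odds ratio:      {pooled_or:.2f}")
print(f"within-subpopulation odds ratios: {stratified_or}")
```

The pooled odds ratio departs from 1 even though the allele has no effect in either subpopulation: the apparent association comes entirely from cases being drawn disproportionately from the higher-risk, higher-frequency group.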

In code

R

p <- c(0.7, 0.3)             # allele freqs in two equal-sized subpops
het <- function(p) 2 * p * (1 - p)
HS <- mean(het(p))           # 0.42
pbar <- mean(p)              # 0.5
HT <- het(pbar)              # 0.5
FST <- (HT - HS) / HT        # 0.16
c(HS = HS, HT = HT, FST = FST)

Python

import numpy as np
p = np.array([0.7, 0.3])          # two equal-sized subpopulations
het = lambda p: 2 * p * (1 - p)
HS = het(p).mean()                # 0.42
pbar = p.mean()                   # 0.5
HT = 2 * pbar * (1 - pbar)        # 0.5
FST = (HT - HS) / HT              # 0.16
print(HS, HT, FST)
# Prints approximately 0.42 0.5 0.16; the trailing digits reflect floating-point rounding, e.g.:
# 0.42000000000000004 0.5 0.15999999999999992

Julia

using Statistics                  # provides mean()
p = [0.7, 0.3]                    # two equal-sized subpopulations
het(p) = 2 .* p .* (1 .- p)
HS = mean(het(p))                 # 0.42
pbar = mean(p)                    # 0.5
HT = 2 * pbar * (1 - pbar)        # 0.5
FST = (HT - HS) / HT              # 0.16
println((HS, HT, FST))

Why it matters

$F_{ST}$ turns a qualitative worry (“are my samples really one population?”) into a single number linking heterozygosity, among-group variance, and demographic history. That number is both a lens on migration and isolation and a red flag for confounding, which is why quantifying and adjusting for structure is a non-negotiable step before trusting any genotype–phenotype association.