Linkage Disequilibrium
Linkage disequilibrium (LD) is the non-random association of alleles at different loci — knowing the allele carried at one site tells you something about the allele at another. It is what lets a genotyping chip “tag” millions of untyped variants with a few hundred thousand markers, and it is the reason a genetic association can point near, but not exactly at, the causal variant.
Defining the association
Consider two biallelic loci with alleles $A/a$ at the first and $B/b$ at the second. Let $p_A$ and $p_B$ be the allele frequencies and $p_{AB}$ the frequency of the haplotype carrying both $A$ and $B$. If the alleles were associated at random, we would expect $p_{AB} = p_A p_B$. The coefficient of linkage disequilibrium measures the departure from that independence:
$$D = p_{AB} - p_A p_B.$$
When $D = 0$ the loci are in linkage equilibrium; the sign and size of $D$ capture the direction and strength of the association, much like a covariance between the two alleles treated as random variables.
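The covariance reading can be checked numerically. The sketch below uses made-up haplotype frequencies (the values and the helper name `allele_freq` are illustrative assumptions, not taken from the text) and computes $D$ twice: once from its definition, and once as the covariance of two 0/1 indicators for carrying $A$ and carrying $B$.

```python
# Illustrative haplotype frequencies (hypothetical values, must sum to 1).
hap_freq = {("A", "B"): 0.5, ("A", "b"): 0.1, ("a", "B"): 0.2, ("a", "b"): 0.2}


def allele_freq(hap_freq, locus, allele):
    """Marginal frequency of `allele` at position `locus` (0 or 1)."""
    return sum(f for hap, f in hap_freq.items() if hap[locus] == allele)


p_A = allele_freq(hap_freq, 0, "A")
p_B = allele_freq(hap_freq, 1, "B")

# D from its definition: haplotype frequency minus the product of marginals.
D = hap_freq[("A", "B")] - p_A * p_B

# The same number as Cov(X, Y) = E[XY] - E[X] E[Y], where X = 1 if the
# haplotype carries A and Y = 1 if it carries B (0 otherwise).
e_xy = sum(f * (hap[0] == "A") * (hap[1] == "B") for hap, f in hap_freq.items())
e_x = sum(f * (hap[0] == "A") for hap, f in hap_freq.items())
e_y = sum(f * (hap[1] == "B") for hap, f in hap_freq.items())
cov_xy = e_xy - e_x * e_y

print(round(D, 6), round(cov_xy, 6))  # 0.08 0.08
```

Both routes give the same value because $E[XY] = p_{AB}$, $E[X] = p_A$, and $E[Y] = p_B$.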
Normalized measures
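The raw $D$ is hard to compare across loci because its range depends on the allele frequencies. Lewontin’s $D'$ rescales $D$ by the maximum value it could take given those frequencies:
$$D' = \frac{D}{D_{\max}},$$
where $D_{\max} = \min(p_A p_b,\; p_a p_B)$ if $D > 0$ and $D_{\max} = \min(p_A p_B,\; p_a p_b)$ if $D < 0$, with $p_a = 1 - p_A$ and $p_b = 1 - p_B$. The most widely used measure is the squared correlation coefficient between the two loci:
$$r^2 = \frac{D^2}{p_A\, p_a\, p_B\, p_b}.$$
Here $r^2$ ranges from $0$ (independence) to $1$ (perfect correlation, where one locus perfectly predicts the other), and it is directly related to the power to detect an association at a tag SNP: testing the tag instead of the causal variant shrinks the effective sample size by roughly a factor of $r^2$.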
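The two normalized measures can disagree sharply, which a small function makes concrete. This is a minimal sketch: the name `ld_measures` and the input values are illustrative assumptions, and the sign convention (a negative $D'$ when $D$ is negative) follows the plain ratio $D / D_{\max}$ rather than the absolute-value variant some texts use.

```python
def ld_measures(p_AB, p_A, p_B):
    """Return (D, D', r^2) for two biallelic loci.

    Assumes both loci are polymorphic, i.e. 0 < p_A < 1 and 0 < p_B < 1,
    so that none of the denominators is zero.
    """
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    D = p_AB - p_A * p_B
    # The largest attainable |D| depends on the sign of D.
    if D >= 0:
        D_max = min(p_A * p_b, p_a * p_B)
    else:
        D_max = min(p_A * p_B, p_a * p_b)
    D_prime = D / D_max
    r2 = D**2 / (p_A * p_a * p_B * p_b)
    return D, D_prime, r2


# Hypothetical case: a rare allele A (10%) that is only ever found on
# haplotypes carrying the common allele B (50%).
D, D_prime, r2 = ld_measures(p_AB=0.1, p_A=0.1, p_B=0.5)
print(D_prime, round(r2, 3))
```

In this example $D'$ reaches 1 because no $Ab$ haplotype exists, yet $r^2$ is only about 0.11: the allele frequencies differ so much that $B$ is a poor predictor of $A$. This is one reason $r^2$, rather than $D'$, is the usual choice for judging how well one variant tags another.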
Decay under recombination
Recombination breaks up haplotypes and erodes LD over the generations. If $c$ is the recombination fraction between the two loci (the probability of a crossover between them per meiosis), then the disequilibrium decays geometrically:
$$D_t = (1 - c)^t D_0.$$
Tightly linked loci (small $c$) retain LD for many generations, while unlinked loci ($c = 1/2$) lose half of any remaining $D$ each generation. This decay is why LD blocks are local: over long times only nearby variants stay correlated.
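One way to get a feel for the decay rate is to ask how many generations it takes for $D$ to halve. Setting $(1 - c)^t = 1/2$ gives $t = \ln(1/2) / \ln(1 - c)$. The sketch below assumes the idealized geometric model above (random mating, no drift or selection); the function names are illustrative.

```python
import math


def ld_after(D0, c, t):
    """D after t generations under the geometric model D_t = (1 - c)^t * D0."""
    return D0 * (1.0 - c) ** t


def ld_half_life(c):
    """Generations needed for D to halve; requires 0 < c <= 0.5."""
    return math.log(0.5) / math.log(1.0 - c)


for c in (0.5, 0.1, 0.01, 0.001):
    print(f"c = {c}: half-life of about {ld_half_life(c):.1f} generations")
```

Unlinked loci halve their LD in a single generation, while at $c = 0.01$ the half-life is about 69 generations and at $c = 0.001$ about 693, which is why tightly linked variants can remain correlated for a very long time.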
Worked example
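Suppose two loci have haplotype frequencies
$$p_{AB} = 0.4, \quad p_{Ab} = 0.1, \quad p_{aB} = 0.1, \quad p_{ab} = 0.4.$$
The allele frequencies follow by summing:
$$p_A = p_{AB} + p_{Ab} = 0.5, \qquad p_B = p_{AB} + p_{aB} = 0.5,$$
so $p_a = 1 - p_A = 0.5$ and $p_b = 1 - p_B = 0.5$. The disequilibrium coefficient is
$$D = p_{AB} - p_A p_B = 0.4 - 0.5 \times 0.5 = 0.15.$$
Because $D > 0$, $D_{\max} = \min(p_A p_b,\; p_a p_B) = 0.25$, giving $D' = 0.15 / 0.25 = 0.6$. The squared correlation is
$$r^2 = \frac{D^2}{p_A\, p_a\, p_B\, p_b} = \frac{0.0225}{0.0625} = 0.36.$$
So the two loci share $36\%$ of their variation: a marker at one would capture a good deal, but not all, of an association at the other.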
Why it matters in practice
LD underpins several core methods in statistical genetics.
- Tag SNPs: because nearby variants are correlated, a genotyping array covers the genome by choosing markers in high $r^2$ with their neighbors, and imputation reconstructs untyped variants from LD patterns.
- Fine-mapping: a GWAS association signal is spread across all variants in LD with the causal one, so disentangling them requires modeling the local LD structure.
- Instrument validity: in Mendelian randomization, a genetic instrument can be invalid if it is in LD with a variant affecting the outcome through another pathway.
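The tag SNP idea can be illustrated with a toy greedy selection over a matrix of pairwise $r^2$ values. This is only a sketch under stated assumptions: the function name `greedy_tags`, the threshold of 0.8, and the example matrix are all hypothetical, and the heuristic is a simplified illustration rather than the procedure used by any particular array design or software tool.

```python
def greedy_tags(r2, threshold=0.8):
    """Pick tag SNPs so that every SNP has r^2 >= threshold with some tag.

    `r2` is a symmetric list of lists of pairwise r^2 values with 1.0 on
    the diagonal; `threshold` is assumed to be at most 1.0.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []
    while untagged:
        # Choose the SNP that covers the most still-untagged SNPs
        # (ties go to the lowest index).
        best = max(
            range(n),
            key=lambda i: sum(1 for j in untagged if r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if r2[best][j] >= threshold}
    return tags


# Made-up r^2 matrix: SNPs 0-2 form one LD block, SNPs 3-4 another.
r2_matrix = [
    [1.00, 0.90, 0.85, 0.10, 0.00],
    [0.90, 1.00, 0.70, 0.10, 0.00],
    [0.85, 0.70, 1.00, 0.20, 0.10],
    [0.10, 0.10, 0.20, 1.00, 0.95],
    [0.00, 0.00, 0.10, 0.95, 1.00],
]
print(greedy_tags(r2_matrix))  # [0, 3]
```

Two markers cover all five variants here, one per block. Raising the threshold forces more tags, which mirrors the trade-off between array density and coverage.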
In code
R
pAB <- 0.4; pAb <- 0.1; paB <- 0.1; pab <- 0.4
pA <- pAB + pAb; pB <- pAB + paB
pa <- 1 - pA; pb <- 1 - pB
D <- pAB - pA * pB # 0.15
Dmax <- if (D > 0) min(pA * pb, pa * pB) else min(pA * pB, pa * pb)
Dprime <- D / Dmax # 0.6
r2 <- D^2 / (pA * pa * pB * pb) # 0.36
c(D = D, Dprime = Dprime, r2 = r2)
# decay of D over 10 generations at recombination fraction c = 0.1
c_rec <- 0.1
D * (1 - c_rec)^(0:10)
Python
pAB, pAb, paB, pab = 0.4, 0.1, 0.1, 0.4
pA = pAB + pAb; pB = pAB + paB
pa = 1 - pA; pb = 1 - pB
D = pAB - pA * pB # 0.15
Dmax = min(pA * pb, pa * pB) if D > 0 else min(pA * pB, pa * pb)
Dprime = D / Dmax # 0.6
r2 = D**2 / (pA * pa * pB * pb) # 0.36
print(D, Dprime, r2) # 0.15000000000000002 0.6000000000000001 0.3600000000000001
c_rec = 0.1
print([D * (1 - c_rec)**t for t in range(11)]) # geometric decay
Julia
pAB, pAb, paB, pab = 0.4, 0.1, 0.1, 0.4
pA = pAB + pAb; pB = pAB + paB
pa = 1 - pA; pb = 1 - pB
D = pAB - pA * pB # 0.15
Dmax = D > 0 ? min(pA*pb, pa*pB) : min(pA*pB, pa*pb)
Dprime = D / Dmax # 0.6
r2 = D^2 / (pA * pa * pB * pb) # 0.36
println((D, Dprime, r2))
c_rec = 0.1
[D * (1 - c_rec)^t for t in 0:10] # decay over generations
Why it matters
Linkage disequilibrium is the statistical glue between genotype and phenotype maps. Its magnitude decides how densely we must genotype, how precisely we can localize a causal variant, and whether a genetic instrument is clean, so quantifying $D$, $D'$, and especially $r^2$ is a prerequisite for interpreting essentially any modern association study.