Linkage Disequilibrium

Linkage disequilibrium (LD) is the non-random association of alleles at different loci — knowing the allele carried at one site tells you something about the allele at another. It is what lets a genotyping chip “tag” millions of untyped variants with a few hundred thousand markers, and it is the reason a genetic association can point near, but not exactly at, the causal variant.

Defining the association

Consider two biallelic loci with alleles $A/a$ at the first and $B/b$ at the second. Let $p_A$ and $p_B$ be the allele frequencies and $p_{AB}$ the frequency of the haplotype carrying both $A$ and $B$. If the alleles were associated at random, we would expect $p_{AB} = p_A p_B$. The coefficient of linkage disequilibrium measures the departure from that independence:

$$D = p_{AB} - p_A p_B .$$

When $D = 0$ the loci are in linkage equilibrium; the sign and size of $D$ capture the direction and strength of the association, much like a covariance between the two alleles treated as $0/1$ random variables.
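The covariance analogy can be checked numerically. The sketch below uses a small, made-up sample of ten haplotypes (the `haplotypes` list is purely illustrative, not data from any study), codes each allele as a 0/1 indicator, and compares $D$ with the population covariance of the two indicators. In this coding the two quantities coincide.

```python
# Hypothetical sample of 10 haplotypes, each coded as (A-indicator, B-indicator):
# 1 means the haplotype carries A (or B), 0 means it carries a (or b).
haplotypes = [(1, 1)] * 4 + [(1, 0)] * 1 + [(0, 1)] * 1 + [(0, 0)] * 4

n = len(haplotypes)
p_A = sum(x for x, _ in haplotypes) / n        # frequency of allele A
p_B = sum(y for _, y in haplotypes) / n        # frequency of allele B
p_AB = sum(x * y for x, y in haplotypes) / n   # frequency of the AB haplotype

# Disequilibrium coefficient from its definition.
D = p_AB - p_A * p_B

# Population covariance (divide by n, not n - 1) of the two indicators.
cov = sum((x - p_A) * (y - p_B) for x, y in haplotypes) / n

print(D, cov)  # the two values agree up to floating-point rounding
```

Dividing by $n$ rather than $n - 1$ is deliberate: $D$ is defined on frequencies, so it matches the population covariance, not the unbiased sample estimator.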

Normalized measures

The raw $D$ is hard to compare across loci because its range depends on the allele frequencies. Lewontin’s $D'$ rescales by the maximum value $D$ could take given those frequencies:

$$D' = \frac{D}{D_{\max}}, \qquad D_{\max} = \begin{cases} \min(p_A p_b,\; p_a p_B) & D > 0,\\[2pt] \min(p_A p_B,\; p_a p_b) & D < 0, \end{cases}$$

where $p_a = 1 - p_A$ and $p_b = 1 - p_B$. With this definition $D'$ carries the sign of $D$; many authors report $|D'|$ instead. The most widely used measure is the squared correlation coefficient between the two loci:

$$r^2 = \frac{D^2}{p_A\, p_a\, p_B\, p_b} .$$

Here $r^2$ ranges from $0$ (independence) to $1$ (perfect correlation, where one locus perfectly predicts the other). It also governs power: the sample size needed to detect an association through a tag SNP rather than the causal variant itself grows roughly as $1/r^2$.
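The two normalized measures can disagree sharply when allele frequencies differ between the loci. The following sketch wraps the formulas above in a helper, `ld_measures` (a name chosen here for illustration), and applies it to made-up haplotype frequencies in which the rare allele $A$ only ever occurs with $B$. It assumes both loci are polymorphic and that the four frequencies sum to one.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) from the four haplotype frequencies.

    Illustrative helper: assumes the frequencies sum to 1 and that
    both loci are polymorphic (no allele frequency equal to 0 or 1).
    """
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    p_a = 1.0 - p_A
    p_b = 1.0 - p_B
    if min(p_A, p_a, p_B, p_b) <= 0:
        raise ValueError("both loci must be polymorphic")

    D = p_AB - p_A * p_B
    # D_max depends on the sign of D, as in the piecewise definition.
    if D > 0:
        D_max = min(p_A * p_b, p_a * p_B)
    else:
        D_max = min(p_A * p_B, p_a * p_b)
    D_prime = D / D_max                      # signed; take abs() for |D'|
    r2 = D ** 2 / (p_A * p_a * p_B * p_b)
    return D, D_prime, r2


# Hypothetical case: A is rare (0.1) and always sits on a B haplotype.
D, D_prime, r2 = ld_measures(0.1, 0.0, 0.4, 0.5)
print(D, D_prime, r2)   # D' is about 1, r^2 is only about 0.11

# Hypothetical case with negative D: the signed D' is negative too.
D_neg, D_prime_neg, r2_neg = ld_measures(0.1, 0.4, 0.4, 0.1)
print(D_neg, D_prime_neg, r2_neg)
```

In the first case $D'$ is at its maximum because one of the four haplotypes is absent, yet $r^2$ is low because the allele frequencies are mismatched, so the rare variant is poorly predicted by the common one.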

Decay under recombination

Recombination breaks up haplotypes and erodes LD over the generations. If $c$ is the recombination fraction between the two loci (the probability of a crossover between them per meiosis), then the disequilibrium decays geometrically:

$$D_t = D_0 (1 - c)^t .$$

Tightly linked loci (small $c$) retain LD for many generations, while unlinked loci ($c = 0.5$) lose half of any remaining $D$ each generation. This decay is why LD blocks are local: over long times only nearby variants stay correlated.
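The geometric decay can be reproduced by iterating haplotype frequencies directly. The sketch below assumes the standard deterministic two-locus model (random mating, very large population, no selection, mutation or migration), in which each generation moves a fraction $cD$ of frequency from the coupling haplotypes $AB$, $ab$ to the repulsion haplotypes $Ab$, $aB$. The function names and starting frequencies are illustrative.

```python
import math


def disequilibrium(hap):
    """D = p_AB * p_ab - p_Ab * p_aB, which equals p_AB - p_A * p_B."""
    p_AB, p_Ab, p_aB, p_ab = hap
    return p_AB * p_ab - p_Ab * p_aB


def next_generation(hap, c):
    """One generation of random mating with recombination fraction c."""
    p_AB, p_Ab, p_aB, p_ab = hap
    D = disequilibrium(hap)
    # Recombination shifts c*D from coupling to repulsion haplotypes;
    # allele frequencies are unchanged.
    return (p_AB - c * D, p_Ab + c * D, p_aB + c * D, p_ab - c * D)


def half_life(c):
    """Generations for D to halve; assumes 0 < c <= 0.5."""
    return math.log(0.5) / math.log(1.0 - c)


hap = (0.4, 0.1, 0.1, 0.4)   # illustrative starting frequencies
c = 0.1
D0 = disequilibrium(hap)

simulated = []
for t in range(11):
    simulated.append(disequilibrium(hap))
    hap = next_generation(hap, c)

closed_form = [D0 * (1.0 - c) ** t for t in range(11)]

for t, (s, f) in enumerate(zip(simulated, closed_form)):
    print(t, round(s, 6), round(f, 6))

print(half_life(c))   # roughly 6.6 generations at c = 0.1
```

The simulated and closed-form columns agree up to rounding, and `half_life(0.5)` returns one generation, matching the statement about unlinked loci.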

Worked example

Suppose two loci have haplotype frequencies

$$p_{AB} = 0.4,\quad p_{Ab} = 0.1,\quad p_{aB} = 0.1,\quad p_{ab} = 0.4 .$$

The allele frequencies follow by summing:

$$p_A = p_{AB} + p_{Ab} = 0.5, \qquad p_B = p_{AB} + p_{aB} = 0.5,$$

so $p_a = 0.5$ and $p_b = 0.5$. The disequilibrium coefficient is

$$D = p_{AB} - p_A p_B = 0.4 - (0.5)(0.5) = 0.15 .$$

Because $D > 0$, $D_{\max} = \min(p_A p_b, p_a p_B) = \min(0.25, 0.25) = 0.25$, giving $D' = 0.15 / 0.25 = 0.6$. The squared correlation is

$$r^2 = \frac{0.15^2}{(0.5)(0.5)(0.5)(0.5)} = \frac{0.0225}{0.0625} = 0.36 .$$

So the two loci share $36\%$ of their variation: a marker at one would capture a good deal, but not all, of an association at the other.

Why it matters in practice

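LD underpins several core methods in statistical genetics: choosing tag SNPs for genotyping arrays, localizing the causal variant behind an association signal, and checking that a genetic instrument is not contaminated by correlated neighbouring variants.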

In code

R

pAB <- 0.4; pAb <- 0.1; paB <- 0.1; pab <- 0.4
pA <- pAB + pAb; pB <- pAB + paB
pa <- 1 - pA;    pb <- 1 - pB

D <- pAB - pA * pB                                  # 0.15
Dmax <- if (D > 0) min(pA * pb, pa * pB) else min(pA * pB, pa * pb)
Dprime <- D / Dmax                                  # 0.6
r2 <- D^2 / (pA * pa * pB * pb)                      # 0.36
c(D = D, Dprime = Dprime, r2 = r2)

# decay of D over 10 generations at recombination fraction c = 0.1
c_rec <- 0.1
D * (1 - c_rec)^(0:10)

Python

pAB, pAb, paB, pab = 0.4, 0.1, 0.1, 0.4
pA = pAB + pAb; pB = pAB + paB
pa = 1 - pA;    pb = 1 - pB

D = pAB - pA * pB                                    # 0.15
Dmax = min(pA * pb, pa * pB) if D > 0 else min(pA * pB, pa * pb)
Dprime = D / Dmax                                    # 0.6
r2 = D**2 / (pA * pa * pB * pb)                       # 0.36
print(D, Dprime, r2)
# prints: 0.15000000000000002 0.6000000000000001 0.3600000000000001

c_rec = 0.1
decay = [D * (1 - c_rec)**t for t in range(11)]      # geometric decay
print(decay)

Julia

pAB, pAb, paB, pab = 0.4, 0.1, 0.1, 0.4
pA = pAB + pAb; pB = pAB + paB
pa = 1 - pA;    pb = 1 - pB

D = pAB - pA * pB                                     # 0.15
Dmax = D > 0 ? min(pA*pb, pa*pB) : min(pA*pB, pa*pb)
Dprime = D / Dmax                                    # 0.6
r2 = D^2 / (pA * pa * pB * pb)                        # 0.36
println((D, Dprime, r2))

c_rec = 0.1
[D * (1 - c_rec)^t for t in 0:10]                    # decay over generations

Why it matters

Linkage disequilibrium is the statistical glue between genotype and phenotype maps. Its magnitude decides how densely we must genotype, how precisely we can localize a causal variant, and whether a genetic instrument is clean, so quantifying $D$, $D'$, and especially $r^2$ is a prerequisite for interpreting essentially any modern association study.