The Coalescent
The coalescent looks at a population’s genealogy backward in time: starting from a sample of sequences today, ancestral lineages merge — coalesce — until they reach a single common ancestor. This backward view is the engine of modern pathogen phylodynamics, letting us read effective population sizes, epidemic growth, and divergence times straight off a phylogenetic tree.
Coalescence of two lineages
Consider two sampled gene copies in a Wright–Fisher population with effective size $N_e$ (a diploid population carries $2N_e$ gene copies; here we use the haploid convention with $N_e$ copies for cleaner formulas). Going back one generation, the two lineages share a parent with probability $1/N_e$. The number of generations until they coalesce is therefore geometric with success probability $1/N_e$, and for large $N_e$ it is well approximated by an exponential distribution:

$$T_2 \sim \mathrm{Exp}\!\left(\frac{1}{N_e}\right), \qquad E[T_2] = N_e.$$

So two random lineages typically trace back to a common ancestor about $N_e$ generations ago.
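The quality of the exponential approximation can be checked numerically. The following is a minimal sketch under the haploid convention used here: it compares the exact geometric probability of no coalescence within $t$ generations, $(1 - 1/N_e)^t$, with the exponential survival function $e^{-t/N_e}$. The function names and the value $N_e = 1000$ are illustrative choices, not part of the original text.

```python
import math

def geometric_survival(t, Ne):
    """P(no coalescence in t generations) when each generation coalesces with prob 1/Ne."""
    return (1.0 - 1.0 / Ne) ** t

def exponential_survival(t, Ne):
    """Continuous-time approximation of the same probability: exp(-t / Ne)."""
    return math.exp(-t / Ne)

Ne = 1000  # illustrative effective population size
for t in (100, 1000, 3000):
    g = geometric_survival(t, Ne)
    e = exponential_survival(t, Ne)
    print(f"t={t:5d}  geometric={g:.4f}  exponential={e:.4f}")
```

For an effective size in the thousands the two curves should be nearly indistinguishable; the approximation degrades only when $N_e$ is very small.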
Many lineages
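With $k$ ancestral lineages, any of the $\binom{k}{2}$ pairs could be the next to coalesce. Each pair coalesces at rate $1/N_e$, so the total rate of coalescence is $\binom{k}{2}/N_e$, and the waiting time until the number of lineages drops from $k$ to $k-1$ is exponential:

$$T_k \sim \mathrm{Exp}\!\left(\frac{\binom{k}{2}}{N_e}\right), \qquad E[T_k] = \frac{N_e}{\binom{k}{2}} = \frac{2N_e}{k(k-1)}.$$

Because $\binom{k}{2}$ is large when many lineages remain, the first coalescences going back in time (many lineages) happen fast, and the last one, the merge from two lineages to one, takes by far the longest on average. The waiting times $T_n, \dots, T_2$ are mutually independent, a consequence of the memoryless property of the exponential distribution.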
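To make the per-epoch rates concrete, the sketch below tabulates the total coalescence rate $\binom{k}{2}/N_e$ and the expected waiting time $N_e/\binom{k}{2}$ for each lineage count $k$. The helper `expected_waiting_times` and the values $n = 6$, $N_e = 1000$ are assumptions made for illustration.

```python
from math import comb

def expected_waiting_times(n, Ne):
    """Return {k: E[T_k]} for k = n, ..., 2, where E[T_k] = Ne / C(k, 2)."""
    return {k: Ne / comb(k, 2) for k in range(n, 1, -1)}

Ne = 1000  # illustrative effective population size
for k, t in expected_waiting_times(6, Ne).items():
    rate = comb(k, 2) / Ne  # total coalescence rate with k lineages
    print(f"k={k}: rate={rate:.3f} per generation, E[T_k]={t:.1f} generations")
```

The expected waiting time grows as $k$ falls, which is why the deepest part of a coalescent tree tends to have the longest branches.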
Time to the most recent common ancestor
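The time to the most recent common ancestor (TMRCA) of a sample of $n$ is the sum of the independent waiting times as the lineage count falls from $n$ down to $1$:

$$E[T_{\mathrm{MRCA}}] = \sum_{k=2}^{n} E[T_k] = \sum_{k=2}^{n} \frac{2N_e}{k(k-1)} = 2N_e\left(1 - \frac{1}{n}\right).$$

The sum telescopes because $\frac{1}{k(k-1)} = \frac{1}{k-1} - \frac{1}{k}$. Notice the striking result: even for very large samples the expected TMRCA approaches but never exceeds $2N_e$, because the final two-lineage epoch, with $E[T_2] = N_e$, dominates.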
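The closed form can be checked against the explicit sum. This sketch, with the hypothetical helper `expected_tmrca` and an illustrative $N_e = 1000$, adds up $2N_e/(k(k-1))$ for $k = 2, \dots, n$ and compares the result with $2N_e(1 - 1/n)$ across several sample sizes.

```python
def expected_tmrca(n, Ne):
    """Sum the expected epoch durations E[T_k] = 2*Ne / (k*(k-1)) for k = 2, ..., n."""
    return sum(2.0 * Ne / (k * (k - 1)) for k in range(2, n + 1))

Ne = 1000  # illustrative effective population size
for n in (2, 4, 10, 100, 1000):
    closed_form = 2.0 * Ne * (1.0 - 1.0 / n)
    print(f"n={n:5d}  sum={expected_tmrca(n, Ne):8.1f}  closed form={closed_form:8.1f}")
```

Increasing $n$ from 2 to 1000 should move the expected TMRCA only from $N_e$ toward $2N_e$, which illustrates how little extra depth a larger sample adds.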
Total tree length
The total branch length $L$ of the genealogy sums, over each epoch with $k$ lineages, that epoch's duration times the $k$ lineages present:

$$E[L] = \sum_{k=2}^{n} k\,E[T_k] = \sum_{k=2}^{n} \frac{2N_e}{k-1} = 2N_e \sum_{j=1}^{n-1} \frac{1}{j}.$$

The harmonic sum grows only logarithmically, so total tree length, and with it the expected number of neutral mutations and thus genetic diversity, increases slowly with sample size.
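The logarithmic growth of the tree length can be seen by evaluating the harmonic sum directly. The sketch below assumes the formula $E[L] = 2N_e \sum_{j=1}^{n-1} 1/j$ and compares it with the standard approximation $H_m \approx \ln m + \gamma$, where $\gamma$ is the Euler–Mascheroni constant. The helper name `expected_tree_length` and $N_e = 1000$ are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_tree_length(n, Ne):
    """E[L] = 2*Ne * (1 + 1/2 + ... + 1/(n-1)), in generations."""
    return 2.0 * Ne * sum(1.0 / j for j in range(1, n))

Ne = 1000  # illustrative effective population size
for n in (4, 10, 100, 1000, 10000):
    exact = expected_tree_length(n, Ne)
    approx = 2.0 * Ne * (math.log(n - 1) + EULER_GAMMA)  # H_m ~ ln(m) + gamma
    print(f"n={n:6d}  E[L]={exact:10.1f}  log approximation={approx:10.1f}")
```

A thousandfold increase in sample size multiplies the expected tree length by only a small factor, which is the sense in which diversity saturates with sampling effort.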
Worked example
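Take $N_e = 1000$ and a sample of $n = 4$ lineages. The expected waiting times per epoch are

$$E[T_4] = \frac{1000}{6} \approx 166.7, \qquad E[T_3] = \frac{1000}{3} \approx 333.3, \qquad E[T_2] = \frac{1000}{1} = 1000.$$

Summing gives the expected TMRCA:

$$E[T_{\mathrm{MRCA}}] \approx 166.7 + 333.3 + 1000 = 1500 \text{ generations},$$

which matches $2N_e(1 - 1/n) = 2000 \times \tfrac{3}{4} = 1500$. The expected total tree length is

$$E[L] = 2000\left(1 + \tfrac{1}{2} + \tfrac{1}{3}\right) = 2000 \times \tfrac{11}{6} \approx 3666.7 \text{ generations}.$$

Note how the two-lineage epoch alone ($1000$ generations) accounts for two-thirds of the TMRCA.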
Simulation
R
set.seed(42)
sim_tmrca <- function(n, Ne) {
  total <- 0
  for (k in n:2) {
    rate <- choose(k, 2) / Ne
    total <- total + rexp(1, rate) # waiting time in the k-lineage epoch
  }
  total
}
tmrca <- replicate(10000, sim_tmrca(4, 1000))
mean(tmrca) # ~ 1500, matching 2*Ne*(1 - 1/n)
Python
import numpy as np
rng = np.random.default_rng(42)
def sim_tmrca(n, Ne):
    total = 0.0
    for k in range(n, 1, -1):
        rate = (k * (k - 1) / 2) / Ne  # C(k,2)/Ne
        total += rng.exponential(1 / rate)  # numpy uses scale = 1/rate
    return total
tmrca = np.array([sim_tmrca(4, 1000) for _ in range(10000)])
print(tmrca.mean()) # ~ 1500, matching 2*Ne*(1 - 1/n)
1487.0811141773204
Julia
using Random, Statistics, Distributions
Random.seed!(42)
function sim_tmrca(n, Ne)
    total = 0.0
    for k in n:-1:2
        rate = (k * (k - 1) / 2) / Ne # C(k,2)/Ne
        total += rand(Exponential(1 / rate)) # Exponential takes the scale
    end
    total
end
tmrca = [sim_tmrca(4, 1000) for _ in 1:10_000]
println(mean(tmrca)) # ~ 1500, matching 2*Ne*(1 - 1/n)
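The same simulation idea extends to total tree length. The following is a minimal sketch using only the Python standard library (so it is a different random stream from the numpy version above): each epoch's waiting time is weighted by the $k$ lineages present, and the simulated mean is compared with $2N_e \sum_{j=1}^{n-1} 1/j$. The function name `sim_tree_length`, the seed, and the replicate count are arbitrary choices for illustration.

```python
import random

def sim_tree_length(n, Ne, rng):
    """One coalescent replicate: total branch length = sum over epochs of k * T_k."""
    total = 0.0
    for k in range(n, 1, -1):
        rate = (k * (k - 1) / 2) / Ne       # C(k,2)/Ne
        total += k * rng.expovariate(rate)  # expovariate takes the rate, not the scale
    return total

rng = random.Random(1)  # arbitrary seed for this sketch
reps = 20000
lengths = [sim_tree_length(4, 1000, rng) for _ in range(reps)]
mean_length = sum(lengths) / reps
expected = 2 * 1000 * sum(1 / j for j in range(1, 4))  # 2*Ne*(1 + 1/2 + 1/3)
print(mean_length, expected)
```

With this many replicates the simulated mean should land within a few percent of the analytical value of roughly 3666.7 generations.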
Why it matters
The coalescent is the probabilistic backbone linking sampled sequences to the demographic history that produced them. Because coalescence rates scale inversely with $N_e$, a genealogy inferred from pathogen genomes encodes the effective population size and its changes over an epidemic, and the same tree, read forward with a mutation rate, underlies the molecular clock dating of divergence events.