Hardy–Weinberg Equilibrium
Hardy–Weinberg equilibrium (HWE) is the null model of population genetics: it says how genotype frequencies relate to allele frequencies when nothing interesting is happening. Deviations from it are how we detect inbreeding, natural selection, hidden population structure, and — very practically — genotyping errors in a sequencing pipeline.
The equilibrium
Consider a single biallelic locus with alleles $A$ and $a$. Let $p$ be the frequency of allele $A$ and $q = 1 - p$ the frequency of allele $a$. Under random mating, an individual’s two alleles are like two independent draws from the allele pool, so the genotype frequencies are the terms of $(p + q)^2$:
$$f(AA) = p^2, \qquad f(Aa) = 2pq, \qquad f(aa) = q^2.$$
These are the Hardy–Weinberg proportions, and they are reached after a single generation of random mating regardless of the starting genotype frequencies. Once reached, both the allele frequencies and the genotype frequencies stay constant generation after generation, hence “equilibrium”.
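The one-generation claim can be checked numerically. The following is a minimal sketch, not a general simulator: it assumes an autosomal locus, the same genotype frequencies in both sexes, and an arbitrary illustrative starting point. The names `TRANSMIT_A` and `next_generation` are made up for this example.

```python
from itertools import product

# Probability that a parent of each genotype transmits allele A (Mendelian segregation).
TRANSMIT_A = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}

def next_generation(freqs):
    """Offspring genotype frequencies after one round of random mating."""
    offspring = {"AA": 0.0, "Aa": 0.0, "aa": 0.0}
    for mother, father in product(freqs, repeat=2):
        pair_prob = freqs[mother] * freqs[father]  # random mating: parents pair independently
        a_m, a_f = TRANSMIT_A[mother], TRANSMIT_A[father]
        offspring["AA"] += pair_prob * a_m * a_f
        offspring["Aa"] += pair_prob * (a_m * (1 - a_f) + (1 - a_m) * a_f)
        offspring["aa"] += pair_prob * (1 - a_m) * (1 - a_f)
    return offspring

# Arbitrary starting frequencies that are NOT in Hardy-Weinberg proportions.
start = {"AA": 0.45, "Aa": 0.30, "aa": 0.25}
p = start["AA"] + 0.5 * start["Aa"]   # allele frequency of A
q = 1 - p

gen1 = next_generation(start)  # one generation of random mating
gen2 = next_generation(gen1)   # a second generation, to confirm nothing changes

print(gen1)
print(gen2)
```

Under these assumptions, `gen1` matches $p^2$, $2pq$, $q^2$ up to floating-point rounding, and `gen2` is the same as `gen1`.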
Assumptions
The result holds when the model’s idealizing assumptions are met:
- Random mating (no assortative mating or inbreeding).
- No selection: all genotypes have equal survival and fertility.
- No mutation changing one allele into another.
- No migration (gene flow) from populations with different allele frequencies.
- A large population, so that genetic drift does not perturb the frequencies by chance sampling.
Because the model treats an individual as two independent allele draws, the genotype probability for heterozygotes carries the factor $2$: the ordered outcomes $(A, a)$ and $(a, A)$ are both heterozygous.
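The factor of two can be seen by brute-force enumeration. This short sketch uses an arbitrary illustrative allele frequency (0.6) and lists all four ordered draws, then merges the two orderings that give a heterozygote.

```python
from itertools import product

p = 0.6                       # illustrative allele frequency of A (an assumption)
allele_prob = {"A": p, "a": 1 - p}

genotype_prob = {}
for first, second in product(allele_prob, repeat=2):
    # Sorting makes the genotype label order-independent: ("a", "A") -> "Aa".
    genotype = "".join(sorted((first, second)))
    prob = allele_prob[first] * allele_prob[second]  # independent draws multiply
    genotype_prob[genotype] = genotype_prob.get(genotype, 0.0) + prob

print(genotype_prob)
```

The heterozygote entry receives two contributions of $pq$ each, while each homozygote receives only one term.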
Testing for HWE
Given observed genotype counts in a sample of $N$ individuals, we can test whether the population is consistent with HWE.
Estimating the allele frequency
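Each individual carries two alleles, so with observed counts $n_{AA}$, $n_{Aa}$, $n_{aa}$ (summing to $N$) the allele-frequency estimate is
$$\hat{p} = \frac{2 n_{AA} + n_{Aa}}{2N}, \qquad \hat{q} = 1 - \hat{p}.$$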
The chi-square goodness-of-fit test
Under the null hypothesis of HWE, the expected counts are $E_{AA} = N\hat{p}^2$, $E_{Aa} = 2N\hat{p}\hat{q}$, and $E_{aa} = N\hat{q}^2$. The Pearson chi-square statistic compares observed and expected counts:
$$\chi^2 = \sum_{g \in \{AA,\, Aa,\, aa\}} \frac{(n_g - E_g)^2}{E_g}.$$
There are three genotype categories, but we lose one degree of freedom for the total count constraint and one more for estimating $\hat{p}$ from the data, leaving $3 - 1 - 1 = 1$ degree of freedom for a biallelic locus. A large statistic (small p-value relative to the $\chi^2_1$ distribution) is evidence against HWE.
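The one-degree-of-freedom claim can be sanity-checked by simulation. The sketch below draws many samples from a population that really is in HWE and computes the statistic for each, re-estimating the allele frequency every time. The sample size, allele frequency, replicate count, and seed are all hypothetical choices. If the statistic behaves like a chi-square variable with one degree of freedom, its mean should be close to 1, and roughly 5% of replicates should exceed 3.841, the 95th percentile of that distribution.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
N, p, reps = 500, 0.4, 5000         # hypothetical sample size, allele frequency, replicates
q = 1 - p

# Genotype counts (AA, Aa, aa) for each replicate, drawn under exact HWE.
counts = rng.multinomial(N, [p * p, 2 * p * q, q * q], size=reps)

# Re-estimate the allele frequency from each replicate, as the test does.
p_hat = (2 * counts[:, 0] + counts[:, 1]) / (2 * N)
q_hat = 1 - p_hat
expected = N * np.column_stack([p_hat**2, 2 * p_hat * q_hat, q_hat**2])

chisq = ((counts - expected) ** 2 / expected).sum(axis=1)

mean_stat = chisq.mean()              # a chi-square with 1 df has mean 1
frac_above = (chisq > 3.841).mean()   # share of replicates rejected at the 5% level
print(mean_stat, frac_above)
```

If the statistic were instead referred to a two-degree-of-freedom distribution, the same simulation would show a rejection rate well below the nominal 5%, which is why the degrees of freedom matter.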
Worked example
Suppose we genotype $N = 200$ individuals and observe $n_{AA} = 90$, $n_{Aa} = 60$, $n_{aa} = 50$.
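First estimate the allele frequency:
$$\hat{p} = \frac{2(90) + 60}{2(200)} = \frac{240}{400} = 0.6, \qquad \hat{q} = 1 - \hat{p} = 0.4.$$
Then the expected counts under HWE are
$$E_{AA} = 200\,(0.6)^2 = 72, \qquad E_{Aa} = 200 \cdot 2\,(0.6)(0.4) = 96, \qquad E_{aa} = 200\,(0.4)^2 = 32.$$
The chi-square statistic is
$$\chi^2 = \frac{(90 - 72)^2}{72} + \frac{(60 - 96)^2}{96} + \frac{(50 - 32)^2}{32} = 4.5 + 13.5 + 10.125 = 28.125.$$
Against $\chi^2_1$, the critical value at the 5% level is $3.84$, so $\chi^2 = 28.125$ is highly significant ($p \approx 1.1 \times 10^{-7}$). The sample has far too few heterozygotes and too many homozygotes: the classic signature of a heterozygote deficit.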
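The size of the deficit can be summarized in a single number. One common choice, which the text above does not use and which is offered here only as an illustration, is the fixation index $F = 1 - H_{\text{obs}} / H_{\text{exp}}$, where $H$ is the heterozygote frequency. The sketch takes the genotype counts to be 90, 60 and 50. For a biallelic locus with $\hat{p}$ estimated from the same sample, the Pearson statistic equals $N F^2$, so the last line doubles as a cross-check.

```python
# Genotype counts for the worked example (AA, Aa, aa).
n_AA, n_Aa, n_aa = 90, 60, 50
N = n_AA + n_Aa + n_aa

p_hat = (2 * n_AA + n_Aa) / (2 * N)
q_hat = 1 - p_hat

h_obs = n_Aa / N               # observed heterozygote frequency
h_exp = 2 * p_hat * q_hat      # heterozygote frequency expected under HWE

F = 1 - h_obs / h_exp          # positive: heterozygote deficit; negative: excess
chisq_from_F = N * F**2        # identity that holds for a biallelic locus

print(F, chisq_from_F)
```

A positive $F$ corresponds to a heterozygote deficit and a negative $F$ to an excess. Here $F = 0.375$, and $N F^2$ reproduces the statistic of 28.125.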
What deviations mean
A significant departure from HWE is a signal, not a diagnosis, and the direction is informative.
- Heterozygote deficit (as above) commonly indicates inbreeding, a Wahlund effect from pooling structured subpopulations (see population structure), or genotyping error such as allele dropout.
- Heterozygote excess can indicate outbreeding, overdominant selection, or contamination.
- In genome-wide data, HWE filtering is a routine quality-control step: markers that fail HWE badly in controls usually reflect assay artefacts and are removed before running a GWAS.
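The Wahlund effect mentioned in the first bullet is easy to reproduce numerically. The sketch below assumes two hypothetical subpopulations of equal size, each exactly in HWE but with different allele frequencies (0.9 and 0.3, chosen arbitrarily), and pools them as if they were a single population.

```python
def hw_proportions(p):
    """Hardy-Weinberg genotype frequencies for allele frequency p."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

# Two hypothetical, equally sized subpopulations, each internally in HWE.
p1, p2 = 0.9, 0.3
sub1, sub2 = hw_proportions(p1), hw_proportions(p2)

# Pooling equal-sized groups averages their genotype frequencies.
pooled = {g: 0.5 * (sub1[g] + sub2[g]) for g in sub1}

p_pooled = pooled["AA"] + 0.5 * pooled["Aa"]     # allele frequency in the pooled sample
expected_het = 2 * p_pooled * (1 - p_pooled)     # what HWE predicts for the pool
deficit = expected_het - pooled["Aa"]            # shortfall of heterozygotes

var_p = 0.5 * ((p1 - p_pooled) ** 2 + (p2 - p_pooled) ** 2)  # variance of p across groups
print(pooled, expected_het, deficit, 2 * var_p)
```

Neither subpopulation departs from HWE, yet the pooled sample is short of heterozygotes by twice the variance in allele frequency between the groups. With these particular values the pooled genotype frequencies are 0.45, 0.30 and 0.25.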
In code
R
obs <- c(AA = 90, Aa = 60, aa = 50)
N <- sum(obs)
phat <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * N) # 0.6; [[ ]] drops the element name so it does not leak into later vectors
qhat <- 1 - phat
exp_freq <- c(AA = phat^2, Aa = 2 * phat * qhat, aa = qhat^2)
expected <- exp_freq * N # 72, 96, 32
chisq <- sum((obs - expected)^2 / expected) # 28.125
pval <- pchisq(chisq, df = 1, lower.tail = FALSE) # ~ 1.1e-07
c(chisq = chisq, pval = pval)
# chisq.test uses df = 2 by default (does not know p was estimated),
# so compute the 1-df p-value manually as above.
Python
import numpy as np
from scipy import stats
obs = np.array([90, 60, 50]) # AA, Aa, aa
N = obs.sum()
phat = (2 * obs[0] + obs[1]) / (2 * N) # 0.6
qhat = 1 - phat
expected = np.array([phat**2, 2 * phat * qhat, qhat**2]) * N # [72, 96, 32]
chisq = np.sum((obs - expected)**2 / expected) # 28.125
pval = stats.chi2.sf(chisq, df=1) # ~ 1.1e-07
print(chisq, pval)
28.124999999999993 1.1372725656979712e-07
Julia
using Distributions
obs = [90, 60, 50] # AA, Aa, aa
N = sum(obs)
phat = (2obs[1] + obs[2]) / (2N) # 0.6
qhat = 1 - phat
expected = [phat^2, 2phat*qhat, qhat^2] .* N # [72.0, 96.0, 32.0]
chisq = sum((obs .- expected).^2 ./ expected) # 28.125
pval = ccdf(Chisq(1), chisq) # ~ 1.1e-7
println((chisq, pval))
Why it matters
Hardy–Weinberg equilibrium is the reference point against which almost every population-genetic observation is measured. Because it converts allele frequencies into expected genotype frequencies under a clean set of assumptions, a deviation points toward a candidate cause (mating structure, selection, subdivision, or genotyping error), and its routine use as a quality-control filter keeps spurious markers out of downstream association analyses.