Genome-Wide Association Studies
A genome-wide association study (GWAS) scans millions of genetic variants across the genome to find those statistically associated with a trait or disease. It is the workhorse of modern genetic epidemiology, turning a table of genotypes and phenotypes into a map of the genome’s contribution to human variation.
The basic test
The unit of a GWAS is a single-nucleotide polymorphism (SNP): a position where individuals carry one of two alleles. For each person we encode the genotype as an allele dosage , the number of copies of the effect allele that individual carries at SNP . For each SNP we fit a regression of the phenotype on that dosage plus covariates.
For a quantitative trait (height, blood pressure, LDL cholesterol) we use linear regression:
where the are covariates (age, sex, ancestry components). For a case/control phenotype we use logistic regression, modeling the log-odds of being a case:
The additive coding means is the per-allele effect: the expected change in the trait (or log-odds) for each extra copy of the effect allele.
Effect estimates and p-values
Each SNP regression returns an estimate , its standard error , and a Wald test statistic
from which a p-value is computed by comparing to a normal (or chi-square) reference — a standard hypothesis test run once per SNP. Because a GWAS runs this test hundreds of thousands to millions of times, it is a massive multiple-testing problem.
The genome-wide significance threshold
Naively declaring “significant” at would produce tens of thousands of false positives. The field instead uses a Bonferroni-style threshold that accounts for roughly one million effectively independent tests across the common variation of the human genome:
A SNP is declared genome-wide significant when its . The count is “effective” rather than literal because nearby SNPs are correlated through linkage disequilibrium, so the millions of tested SNPs behave like about a million independent ones.
Manhattan and QQ plots
Two plots summarize a GWAS. The Manhattan plot shows for every SNP against genomic position; real signals appear as skyscraper-like towers of correlated SNPs rising above the horizon line at . The QQ plot compares the observed distribution of to the uniform distribution expected under the global null; points should track the diagonal until a few true hits pull away at the top-right tail.
Genomic inflation factor
Systematic early departure of the whole QQ plot from the diagonal signals confounding. It is quantified by the genomic inflation factor , the ratio of the observed median chi-square statistic to its null expectation:
A well-controlled study has ; warns of inflation.
Confounders
The two classic confounders are:
- Population stratification. If ancestry correlates with both allele frequency and the trait, SNPs that merely track ancestry look associated. The standard fix is to include the top principal components of the genotype matrix as covariates — see population stratification.
- Relatedness. Cryptic relatives and family structure violate the independence assumption. Linear mixed models absorb this by adding a random effect with covariance given by a genetic relatedness matrix.
Worked example
Suppose we regress a standardized quantitative phenotype on a single SNP’s dosage in people. The fit returns with . The test statistic is , giving a two-sided . This clears the nominal bar easily but falls short of genome-wide significance (), so in a real scan it would not yet count as a discovery.
In code
R
set.seed(1)
n <- 500; m <- 5 # 500 people, 5 SNPs
G <- matrix(rbinom(n * m, 2, 0.3), n, m) # dosages 0/1/2, MAF 0.3
y <- 0.4 * G[, 3] + rnorm(n) # SNP 3 truly affects y
# Test every SNP with lm, collect effect + p-value
res <- t(apply(G, 2, function(g) {
s <- summary(lm(y ~ g))$coefficients["g", ]
$ c(beta = s[1], p = s[4])
}))
res
# SNP 3 has the largest |beta| and smallest p (~1e-30)
# For case/control use glm(status ~ g, family = binomial)
Python
import numpy as np, statsmodels.api as sm
rng = np.random.default_rng(1)
n, m = 500, 5
G = rng.binomial(2, 0.3, size=(n, m)).astype(float) # dosages 0/1/2
y = 0.4 * G[:, 2] + rng.normal(size=n) # SNP index 2 real
for j in range(m):
X = sm.add_constant(G[:, j])
fit = sm.OLS(y, X).fit()
print(j, round(fit.params[1], 3), f"{fit.pvalues[1]:.1e}")
# SNP 2 (0-indexed) shows the strongest effect and smallest p
# Logistic: sm.Logit(status, X).fit()
0 0.051 4.6e-01
1 0.044 5.0e-01
2 0.446 5.4e-11
3 0.039 5.6e-01
4 -0.029 6.7e-01
Julia
using GLM, DataFrames, Random, Distributions
Random.seed!(1)
n, m = 500, 5
G = Float64.(rand(Binomial(2, 0.3), n, m)) # dosages 0/1/2
y = 0.4 .* G[:, 3] .+ randn(n) # SNP 3 real
for j in 1:m
df = DataFrame(g = G[:, j], y = y)
fit = lm(@formula(y ~ g), df)
println(j, " β=", round(coef(fit)[2], digits=3),
" p=", coeftable(fit).cols[4][2])
end
# SNP 3 has the largest effect and smallest p
Why it matters
GWAS has identified tens of thousands of robust trait-associated loci, seeding drug targets, biological hypotheses, and downstream tools. Its outputs feed directly into polygenic scores that sum many small effects for prediction, into heritability estimates from genome-wide data, and into Mendelian randomization for causal inference. The study design is also a lesson in disciplined statistics: honest multiple-testing control and careful confounder adjustment are what separate real biology from artifacts.