Multiple Testing and False Discovery Rate
Run one hypothesis test and a 5% false-positive rate is tolerable; run a million and you drown in spurious “discoveries.” Correcting for the number of tests is what keeps large-scale screens — gene expression, imaging, and especially genome-wide scans — honest.
The problem
Suppose you perform $m$ independent tests, and in truth every null is true. If you reject whenever $p \le \alpha$, then each test has probability $\alpha$ of a false positive, so the expected number of false positives $V$ is

$$\mathbb{E}[V] = m\alpha.$$

With $m = 10^6$ and $\alpha = 0.05$ that is 50,000 false alarms. The probability of at least one false positive is $1 - (1 - \alpha)^m$, which is essentially $1$ for even a few hundred tests. Some form of correction is unavoidable.
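The arithmetic of the problem can be checked with a minimal sketch. It assumes $m$ independent tests, every null true, and a per-test level `alpha`; the function names are illustrative and not taken from any library.

```python
import math


def expected_false_positives(m: int, alpha: float) -> float:
    """Expected number of false positives when all m nulls are true."""
    return m * alpha


def prob_any_false_positive(m: int, alpha: float) -> float:
    """P(at least one false positive) = 1 - (1 - alpha)**m for independent tests."""
    # expm1/log1p limit round-off when alpha is tiny or m is very large.
    return -math.expm1(m * math.log1p(-alpha))


alpha = 0.05  # per-test level used in the text's 50,000 figure
for m in (1, 10, 100, 1_000_000):
    print(m, expected_false_positives(m, alpha), prob_any_false_positive(m, alpha))
```

The second function relies on independence; under dependence the probability of at least one false positive can differ, while the expected count $m\alpha$ does not.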
Family-wise error rate and Bonferroni
The family-wise error rate (FWER) is the probability of making one or more false rejections across the whole family of tests. The Bonferroni correction controls it by testing each hypothesis at the stricter level

$$\frac{\alpha}{m},$$

equivalently by multiplying each p-value by $m$ and comparing to $\alpha$. By the union bound, this guarantees $\mathrm{FWER} \le \alpha$ regardless of dependence between tests. It is simple and robust but conservative: when $m$ is huge, the threshold becomes so small that real but modest effects are missed.
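A minimal sketch of the Bonferroni rule in both of its equivalent forms, comparing raw p-values to $\alpha / m$ and reporting adjusted p-values $\min(1, m\,p)$. The function name `bonferroni` is illustrative, and capping the adjusted value at 1 is a common convention assumed here.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return (reject flags, Bonferroni-adjusted p-values) for a list of p-values."""
    m = len(pvalues)
    # Form 1: compare each raw p-value to the stricter per-test level alpha / m.
    reject = [p <= alpha / m for p in pvalues]
    # Form 2: scale each p-value by m (capped at 1) so it can be compared to alpha.
    adjusted = [min(1.0, p * m) for p in pvalues]
    return reject, adjusted


reject, adjusted = bonferroni([0.001, 0.008, 0.02, 0.04, 0.5], alpha=0.05)
print(reject)
print(adjusted)
```

Exact ties at the threshold can be sensitive to floating-point rounding, so the two forms may disagree on a boundary value even though they are mathematically equivalent.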
The false discovery rate
Bonferroni asks “what is the chance of any false positive?” For large screens a gentler question is often more useful: “of the discoveries I declare, what fraction are false?” That is the false discovery rate (FDR), the expected proportion of false positives among rejected hypotheses:
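$$\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right],$$

where $R$ is the number of rejections and $V$ the number of those that are truly null; the $\max(R, 1)$ makes the proportion zero when nothing is rejected. Controlling the FDR at, say, $q = 0.10$ accepts that on average up to about 10% of your hits may be false, in exchange for far more power than FWER control.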
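The quantity inside the expectation, the false discovery proportion of a single experiment, can be sketched as below. It assumes the truth (`is_null`) is known, which only happens in simulations; the function name and the toy data are hypothetical.

```python
def false_discovery_proportion(rejected, is_null):
    """V / max(R, 1): share of rejections that are true nulls (0 if nothing is rejected)."""
    R = sum(1 for r in rejected if r)                          # total rejections
    V = sum(1 for r, n in zip(rejected, is_null) if r and n)   # rejections of true nulls
    return V / max(R, 1)


# Toy data: 10 rejections, exactly one of which is a true null.
rejected = [True] * 10 + [False] * 5
is_null = [False] * 9 + [True] * 6
print(false_discovery_proportion(rejected, is_null))
```

The FDR is the average of this proportion over repeated experiments, so any single experiment can land above or below the target level.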
The Benjamini–Hochberg procedure
Benjamini and Hochberg (BH) control the FDR with a simple sort-and-compare rule. Order the p-values $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$. With $q$ the target FDR level, find the largest rank $k$ for which

$$p_{(k)} \le \frac{k}{m}\, q,$$

and reject all hypotheses with rank $\le k$. Under independence (and many forms of positive dependence) this guarantees $\mathrm{FDR} \le q$. The q-value of a test is the smallest FDR at which it would be declared significant, the FDR analogue of a p-value.
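The step-up rule can be sketched in plain Python as follows. The function name `benjamini_hochberg` is illustrative; the adjusted values it returns are the usual BH-adjusted p-values (a running minimum of $p_{(k)}\, m / k$ taken from the largest rank down), offered here as one common way to compute the "smallest FDR level at which a test is rejected".

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return (reject flags, BH-adjusted p-values), both in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first

    # Step-up rule: largest 1-based rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max, even if it failed its own threshold.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True

    # Adjusted p-values: running minimum of p_(k) * m / k, from rank m down to rank 1.
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return reject, adjusted


reject, adjusted = benjamini_hochberg([0.5, 0.001, 0.04, 0.008, 0.02], q=0.05)
print(reject)
print(adjusted)
```

Sorting indices rather than values keeps the results aligned with the original order of the tests, which matters when each p-value is tied to a named gene or variant.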
Why GWAS uses $5 \times 10^{-8}$
A genome-wide scan tests roughly one million effectively independent variants. The genome-wide significance threshold is simply Bonferroni applied to that count, $\alpha / m = 0.05 / 10^{6} = 5 \times 10^{-8}$, controlling the FWER so that a declared association is very unlikely to be a fluke (see GWAS).
Worked example
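Take five p-values from five tests: $0.001,\ 0.008,\ 0.02,\ 0.04,\ 0.5$, with $m = 5$ and $\alpha = 0.05$ (also used as the FDR level $q$).
Bonferroni: compare each to $\alpha / m = 0.05 / 5 = 0.01$. Only $0.001$ and $0.008$ survive, giving two discoveries.
Benjamini–Hochberg: sorted, the thresholds are $\frac{k}{m}\, q = 0.01,\ 0.02,\ 0.03,\ 0.04,\ 0.05$.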
| rank $k$ | $p_{(k)}$ | BH threshold $(k/m)\,q$ | $p_{(k)} \le$ threshold? |
|---|---|---|---|
| 1 | 0.001 | 0.01 | yes |
| 2 | 0.008 | 0.02 | yes |
| 3 | 0.020 | 0.03 | yes |
| 4 | 0.040 | 0.04 | yes |
| 5 | 0.500 | 0.05 | no |
The largest passing rank is $k = 4$, so BH rejects the first four: more discoveries than Bonferroni, at the cost of controlling only the FDR.
In code
R
p <- c(0.001, 0.008, 0.02, 0.04, 0.5)
p.adjust(p, method = "bonferroni") # 0.005 0.040 0.100 0.200 1.000
p.adjust(p, method = "BH") # 0.005 0.020 0.033 0.050 0.500
which(p.adjust(p, "BH") <= 0.05) # 1 2 3 4
Python
from statsmodels.stats.multitest import multipletests
p = [0.001, 0.008, 0.02, 0.04, 0.5]
print(multipletests(p, alpha=0.05, method="bonferroni")[1])
# [0.005 0.04 0.1 0.2 1. ]
rej, q, *_ = multipletests(p, alpha=0.05, method="fdr_bh")
print(rej) # [ True True True True False]
print(q) # [0.005 0.02 0.0333 0.05 0.5 ]
Julia
p = [0.001, 0.008, 0.02, 0.04, 0.5]
m = length(p)
# Manual Benjamini-Hochberg
o = sortperm(p)
ps = p[o]
thresh = (1:m) ./ m .* 0.05
kmax = something(findlast(ps .<= thresh), 0) # 4 here; 0 when no rank passes
reject = falses(m); reject[o[1:kmax]] .= true
reject # first four true
# Package alternative: using MultipleTesting; adjust(PValues(p), BenjaminiHochberg())
Why it matters
Multiple-testing control is the reason large screens produce reproducible findings instead of a graveyard of false leads. Choosing between FWER control (Bonferroni, when any false positive is costly) and FDR control (BH, when you can tolerate a known fraction of them among many hits) is one of the central design decisions in genomics, neuroimaging, and any high-dimensional analysis.