Detecting Selection with dN/dS
When a pathogen evolves, some mutations rewrite its proteins and some leave them untouched. Comparing how fast these two kinds of change accumulate turns a protein-coding gene into a detector for natural selection.
Synonymous vs. non-synonymous substitutions
The genetic code is redundant: most amino acids are encoded by several codons, so a nucleotide change may or may not alter the protein.
A synonymous substitution changes the codon but not the amino acid it specifies (for example GGA to GGG, both glycine); because the protein is unchanged, it is approximately invisible to selection and evolves close to neutrally.
A non-synonymous substitution changes the encoded amino acid (for example GGA glycine to GAA glutamate); it alters the protein and is therefore exposed to selection, which can remove it or promote it.
The neutral background rate of synonymous change gives us a built-in control against which to measure the fate of non-synonymous change.
Defining , , and
To make the comparison fair we normalize by opportunity, because a gene has more sites where a mutation would be non-synonymous than where it would be synonymous. Let be the number of synonymous substitutions per synonymous site and the number of non-synonymous substitutions per non-synonymous site. The quantity of interest is their ratio, Since estimates the neutral rate, measures how selection has scaled the rate of protein-changing substitutions relative to neutral expectation:
- — purifying (negative) selection: most amino-acid changes are deleterious and are removed, so is suppressed below .
- — neutral evolution: amino-acid changes fix at the same rate as synonymous ones.
- — positive (diversifying) selection: amino-acid changes are favored and fix faster than neutral, a classic signal at antigenic sites and drug-resistance loci in pathogens.
Purifying selection dominates most genes, so a gene-wide well below 1 is typical; values above 1 are the striking exception worth hunting for.
Codon models and where selection hides
Simply counting differences (as below) ignores multiple hits at the same site and the structure of the genetic code, so modern analyses fit codon-substitution models by maximum likelihood, estimating as a model parameter along the branches of a phylogeny. Averaging over a whole gene is conservative: a handful of positively selected codons can be swamped by many conserved ones. Site models allow to vary across codons and test whether a subset of sites has , pinpointing, say, the antigenic residues of an influenza surface protein. Branch models let vary across lineages, testing whether selection intensified on a particular branch — for instance, the emergence of a drug-resistant clade. These tests compare a model that permits against one that does not, a nested-model comparison evaluated with the framework of hypothesis testing.
Worked example
Suppose we align two homologous pathogen sequences of 300 codons (900 nucleotides) and, using the code’s structure, classify the sites and count observed differences:
- Synonymous sites: ; observed synonymous differences: .
- Non-synonymous sites: ; observed non-synonymous differences: .
The crude proportions of differences are Taking these proportions directly as rough distances gives An of about 0.5 points to purifying selection: protein-changing substitutions accumulate at half the neutral rate, so on average amino-acid changes here are mildly deleterious. A rigorous analysis would apply a multiple-hits correction (such as Nei–Gojobori’s Jukes–Cantor step) to convert and into and , but the sign of the signal, , is already clear.
In code
The snippets reproduce the worked counts and the crude ratio, then point to library routines that add the proper corrections.
R
# Crude omega from counts (matches the worked example)
Sd <- 18; S <- 225 # synonymous diffs and sites
Nd <- 27; N <- 675 # non-synonymous diffs and sites
pS <- Sd / S; pN <- Nd / N
omega <- pN / pS
omega # 0.5 -> purifying selection
# Proper estimation from an alignment with seqinr's Nei-Gojobori kaks():
# library(seqinr)
# aln <- read.alignment("genes.fasta", format = "fasta")
# kaks(aln)ks # ka = dN, ks = dS
# (ape can read/handle trees; codon-model fitting lives in tools like codeml/PAML)
Python
# Crude omega from counts
Sd, S = 18, 225 # synonymous diffs, sites
Nd, N = 27, 675 # non-synonymous diffs, sites
pS, pN = Sd / S, Nd / N
omega = pN / pS
print(round(omega, 3)) # 0.5 -> purifying selection
# Proper Nei-Gojobori estimation from an alignment with Biopython:
# from Bio import SeqIO
# from Bio.codonalign.codonseq import cal_dn_ds, CodonSeq
# s1 = CodonSeq(str(rec1.seq)); s2 = CodonSeq(str(rec2.seq))
# dN, dS = cal_dn_ds(s1, s2, method="NG86")
# print(dN / dS)
0.5
Julia
# Crude omega from counts (manual)
Sd, S = 18, 225 # synonymous diffs, sites
Nd, N = 27, 675 # non-synonymous diffs, sites
pS, pN = Sd / S, Nd / N
omega = pN / pS
println(round(omega, digits = 3)) # 0.5 -> purifying selection
# Optional Jukes-Cantor correction before taking the ratio:
jc(p) = -3/4 * log(1 - 4/3 * p) # multiple-hits correction
omega_corrected = jc(pN) / jc(pS)
println(round(omega_corrected, digits = 3)) # ~0.49, same conclusion
Why it matters
converts a protein alignment into a quantitative test for natural selection, using the gene’s own synonymous sites as an internal neutral yardstick. For pathogens this is directly actionable: flags the antigenic sites that let a virus escape immunity and the codons where drug resistance is being selected, telling surveillance and vaccine design exactly which residues are under evolutionary pressure.