Detecting Selection with dN/dS

When a pathogen evolves, some mutations rewrite its proteins and some leave them untouched. Comparing how fast these two kinds of change accumulate turns a protein-coding gene into a detector for natural selection.

Synonymous vs. non-synonymous substitutions

The genetic code is redundant: most amino acids are encoded by several codons, so a nucleotide change may or may not alter the protein. A synonymous substitution changes the codon but not the amino acid it specifies (for example GGA to GGG, both glycine); because the protein is unchanged, it is approximately invisible to selection and evolves close to neutrally. A non-synonymous substitution changes the encoded amino acid (for example GGA glycine to GAA glutamate); it alters the protein and is therefore exposed to selection, which can remove it or promote it. The neutral background rate of synonymous change gives us a built-in control against which to measure the fate of non-synonymous change.

Defining dNd_N, dSd_S, and ω\omega

To make the comparison fair we normalize by opportunity, because a gene has more sites where a mutation would be non-synonymous than where it would be synonymous. Let dSd_S be the number of synonymous substitutions per synonymous site and dNd_N the number of non-synonymous substitutions per non-synonymous site. The quantity of interest is their ratio, ω=dNdS.\omega = \frac{d_N}{d_S}. Since dSd_S estimates the neutral rate, ω\omega measures how selection has scaled the rate of protein-changing substitutions relative to neutral expectation:

Purifying selection dominates most genes, so a gene-wide ω\omega well below 1 is typical; values above 1 are the striking exception worth hunting for.

Codon models and where selection hides

Simply counting differences (as below) ignores multiple hits at the same site and the structure of the genetic code, so modern analyses fit codon-substitution models by maximum likelihood, estimating ω\omega as a model parameter along the branches of a phylogeny. Averaging ω\omega over a whole gene is conservative: a handful of positively selected codons can be swamped by many conserved ones. Site models allow ω\omega to vary across codons and test whether a subset of sites has ω>1\omega > 1, pinpointing, say, the antigenic residues of an influenza surface protein. Branch models let ω\omega vary across lineages, testing whether selection intensified on a particular branch — for instance, the emergence of a drug-resistant clade. These tests compare a model that permits ω>1\omega > 1 against one that does not, a nested-model comparison evaluated with the framework of hypothesis testing.

Worked example

Suppose we align two homologous pathogen sequences of 300 codons (900 nucleotides) and, using the code’s structure, classify the sites and count observed differences:

The crude proportions of differences are pS=SdS=18225=0.080,pN=NdN=27675=0.040.p_S = \frac{S_d}{S} = \frac{18}{225} = 0.080, \qquad p_N = \frac{N_d}{N} = \frac{27}{675} = 0.040. Taking these proportions directly as rough distances gives ω=dNdSpNpS=0.0400.080=0.5.\omega = \frac{d_N}{d_S} \approx \frac{p_N}{p_S} = \frac{0.040}{0.080} = 0.5. An ω\omega of about 0.5 points to purifying selection: protein-changing substitutions accumulate at half the neutral rate, so on average amino-acid changes here are mildly deleterious. A rigorous analysis would apply a multiple-hits correction (such as Nei–Gojobori’s Jukes–Cantor step) to convert pNp_N and pSp_S into dNd_N and dSd_S, but the sign of the signal, ω<1\omega < 1, is already clear.

In code

The snippets reproduce the worked counts and the crude ratio, then point to library routines that add the proper corrections.

R

# Crude omega from counts (matches the worked example)
Sd <- 18; S <- 225   # synonymous diffs and sites
Nd <- 27; N <- 675   # non-synonymous diffs and sites
pS <- Sd / S; pN <- Nd / N
omega <- pN / pS
omega                 # 0.5  -> purifying selection

# Proper estimation from an alignment with seqinr's Nei-Gojobori kaks():
# library(seqinr)
# aln <- read.alignment("genes.fasta", format = "fasta")
# kaks(aln)ka/kaks(aln)ka / kaks(aln)ks   # ka = dN, ks = dS
# (ape can read/handle trees; codon-model fitting lives in tools like codeml/PAML)

Python

# Crude omega from counts
Sd, S = 18, 225   # synonymous diffs, sites
Nd, N = 27, 675   # non-synonymous diffs, sites
pS, pN = Sd / S, Nd / N
omega = pN / pS
print(round(omega, 3))   # 0.5 -> purifying selection

# Proper Nei-Gojobori estimation from an alignment with Biopython:
# from Bio import SeqIO
# from Bio.codonalign.codonseq import cal_dn_ds, CodonSeq
# s1 = CodonSeq(str(rec1.seq)); s2 = CodonSeq(str(rec2.seq))
# dN, dS = cal_dn_ds(s1, s2, method="NG86")
# print(dN / dS)
0.5

Julia

# Crude omega from counts (manual)
Sd, S = 18, 225      # synonymous diffs, sites
Nd, N = 27, 675      # non-synonymous diffs, sites
pS, pN = Sd / S, Nd / N
omega = pN / pS
println(round(omega, digits = 3))   # 0.5 -> purifying selection

# Optional Jukes-Cantor correction before taking the ratio:
jc(p) = -3/4 * log(1 - 4/3 * p)     # multiple-hits correction
omega_corrected = jc(pN) / jc(pS)
println(round(omega_corrected, digits = 3))   # ~0.49, same conclusion

Why it matters

dN/dSd_N/d_S converts a protein alignment into a quantitative test for natural selection, using the gene’s own synonymous sites as an internal neutral yardstick. For pathogens this is directly actionable: ω>1\omega > 1 flags the antigenic sites that let a virus escape immunity and the codons where drug resistance is being selected, telling surveillance and vaccine design exactly which residues are under evolutionary pressure.