Data Structures & Choosing the Right Container

A data structure is just a way of organizing data in memory so that the operations you care about are cheap. It is the practical companion to Big-O notation: the same algorithm can be fast or hopelessly slow depending only on which container you store your data in. Choosing well is often the single easiest way to turn an overnight run into a one-second one.

The key idea is that different containers make different operations cheap, and you pick the one whose cheap operation is the one you do most.

Doing n membership tests against a list scans the whole list each time — O(n) per lookup, O(n²) total — while the same tests against a set or dictionary are O(1) each, O(n) total; at 100,000 items that is a 50,000-fold difference.

The Four You Actually Need

For everyday scientific computing, almost everything is one of these:

Structure	Also called	Cheap operation	Use it when…
Array / vector	list, `numpy` array	read/write by position `x[i]` (`O(1)`); scan (`O(n)`)	you have an ordered sequence and mostly march through it
Hash map	dictionary, `dict`, named list, environment	look up by key (`O(1)`)	you look things up by a label: gene → count, id → record
Hash set	`set`	“have I seen this?” (`O(1)`)	you track membership or unique items
Tree / graph	—	hierarchical / linked relationships	phylogenies, contact networks, taxonomies

The two that most often rescue a slow script are the hash map and hash set, because they turn a linear search into a constant-time lookup.

The Move That Fixes Most Slow Code: List → Set

Here is the O(n²) trap in its most common disguise. You have a collection and you keep asking “is this thing already in it?” If the collection is a list, each check scans the whole list, so n checks cost O(n²):

# SLOW: `x in a_list` scans the list every time -> O(n^2) overall
seen = []
for s in samples:
    if s not in seen:          # O(n) scan, repeated n times
        seen.append(s)

Switch the collection to a set and each check becomes O(1), so the whole loop drops to O(n) — the difference the figure above shows:

# FAST: `x in a_set` is O(1) -> O(n) overall
seen = set()
for s in samples:
    if s not in seen:          # O(1) hash lookup
        seen.add(s)

You can watch the operation counts diverge exactly as the doubling test predicts — the list version quadruples each time n doubles, the set version merely doubles:

def list_lookups(n):
    """n membership tests against a growing list: O(n^2)."""
    data, ops = list(range(n)), 0
    for q in range(n):
        for item in data:          # linear scan until found
            ops += 1
            if item == q:
                break
    return ops

def set_lookups(n):
    """n membership tests against a set: O(n) (each test is O(1))."""
    data = set(range(n))
    return sum(1 for q in range(n) if q in data)

print(f"{'n':>5} | {'list ops':>10} | {'set ops':>8}")
print("-" * 31)
for n in [50, 100, 200, 400]:
    print(f"{n:>5} | {list_lookups(n):>10,} | {set_lookups(n):>8,}")

    n |   list ops |  set ops
-------------------------------
   50 |      1,275 |       50
  100 |      5,050 |      100
  200 |     20,100 |      200
  400 |     80,200 |      400

Hash Maps for Counting and Lookups

The other workhorse is the hash map: a lookup table keyed by something meaningful. Counting k-mers, tallying species, mapping a sample ID to its metadata, or caching a codon → amino-acid table are all one-liners with a dictionary, and every lookup is O(1).

# Count 3-mers in a sequence -- a dict keyed by the k-mer
seq, k = "ATGATCGATCGGATCGATCGATCG", 3
counts = {}
for i in range(len(seq) - k + 1):
    kmer = seq[i:i + k]
    counts[kmer] = counts.get(kmer, 0) + 1

for kmer in sorted(counts):
    print(kmer, counts[kmer])

ATC 5
ATG 1
CGA 3
CGG 1
GAT 5
GGA 1
TCG 5
TGA 1

The same idea across the three languages:

# R: a named list or environment is a hash map; table() counts in one call
counts <- table(substring(seq, 1:(nchar(seq) - k + 1), k:nchar(seq)))
seen <- new.env()                 # an environment is a real hash map
assign("ACGT", TRUE, envir = seen)
exists("ACGT", envir = seen)      # O(1) membership test

# Julia: Dict and Set are first-class and fast
counts = Dict{String,Int}()
counts[kmer] = get(counts, kmer, 0) + 1
seen = Set{String}()
push!(seen, "ACGT"); "ACGT" in seen      # O(1)

# Python: dict/set, plus Counter for the common counting case
from collections import Counter
counts = Counter(seq[i:i+k] for i in range(len(seq) - k + 1))

Trees and Graphs

Some biological data is fundamentally linked rather than flat, and a tree or graph is the honest representation:

Phylogenies are trees; you traverse them with recursion and dynamic programming (Felsenstein’s pruning walks a tree bottom-up).
Contact, food-web, and metabolic networks are graphs; storing them as an adjacency list (a hash map from each node to its neighbours) keeps “who touches whom?” cheap — see ecological networks and network models.

The practical point is the same: represent the relationships directly instead of flattening them into a list you have to search over and over.

Choosing in Practice

Look things up by a label? Hash map (dict / named list). Not a list you search.
Track “have I seen this?” or “unique values”? Hash set. Not a list with x in list.
Ordered data you scan or index by position? Array / vector — and prefer a vectorized operation over a hand-written loop.
Hierarchies or networks? Tree or graph (adjacency list), not a table of pairs.
When a script is mysteriously slow, look for a list being searched inside a loop — swapping it for a set or dict is usually the whole fix.

Big-O Notation & Computational Complexity — why the container changes the complexity
Recursion & Dynamic Programming — algorithms that walk trees and tables
A Simulation Toolkit — where fast lookups keep inner loops cheap
Manipulating Data Frames — the tabular container for analysis-ready data
Ecological Networks and Network Models — graphs in biology
Programming & Computing