Good Programming Practices
Code you write for an analysis is read far more often than it is written — by reviewers, by collaborators, and by you six months from now. A few habits make your code easier to read, harder to break, and cheaper to fix.
Name things well
Names are the cheapest documentation you have. Spend them wisely.
- Functions are verbs / actions: simulate_epidemic, fit_model, load_cases.
- Variables are nouns: case_count, contact_matrix, posterior_draws.
- Booleans are yes/no questions: is_infected, has_converged, should_resample.
# BAD: opaque, abbreviated, ambiguous
d <- read.csv("f.csv")
x <- d[d$v > 0, ]
flag <- nrow(x) > 100
# GOOD: names say what things are
cases <- read.csv("cases.csv")
positive_cases <- cases[cases$viral_load > 0, ]
has_many_positives <- nrow(positive_cases) > 100
# GOOD (Python)
import pandas as pd
cases = pd.read_csv("cases.csv")
positive_cases = cases[cases["viral_load"] > 0]
has_many_positives = len(positive_cases) > 100
# GOOD (Julia)
using CSV, DataFrames
cases = CSV.read("cases.csv", DataFrame)
positive_cases = filter(row -> row.viral_load > 0, cases)
has_many_positives = nrow(positive_cases) > 100
Write small functions that do one thing
If you can’t describe a function without saying “and”, it probably wants to be two functions. Small, single-purpose functions are easy to name, test, and reuse.
# BAD: one function loads, cleans, models, and plots
analyze <- function(path) {
  d <- read.csv(path)
  d <- d[!is.na(d$y), ]
  d$logy <- log(d$y)
  m <- lm(logy ~ x, data = d)
  plot(d$x, d$logy); abline(m)
  return(m)
}
# GOOD: each step is its own verb
load_data <- function(path) read.csv(path)
clean_data <- function(d) transform(d[!is.na(d$y), ], logy = log(y))
fit_model <- function(d) lm(logy ~ x, data = d)
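The same split can be sketched in Python using only the standard library. The function names mirror the R version above; the list-of-dicts row format, the hand-rolled least-squares fit, and the `analyze` wrapper are assumptions made for illustration rather than part of the original example.

```python
import math


def clean_data(rows):
    """Drop rows with a missing y and add a log-transformed column."""
    return [
        {**row, "logy": math.log(row["y"])}
        for row in rows
        if row["y"] is not None
    ]


def fit_model(rows):
    """Ordinary least squares for logy ~ x; returns (intercept, slope)."""
    xs = [row["x"] for row in rows]
    ys = [row["logy"] for row in rows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def analyze(rows):
    """Compose the single-purpose steps; each one can be tested alone."""
    return fit_model(clean_data(rows))
```

Because cleaning and fitting are separate, a test can feed `fit_model` a handful of hand-built rows without touching any file on disk.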
Don’t Repeat Yourself (DRY)
Copy-pasted code drifts: you fix a bug in one copy and forget the other three. When you see the same lines twice, extract a function.
# BAD: same transformation, three times, easy to get out of sync
train_z = (train - train.mean()) / train.std()
valid_z = (valid - train.mean()) / train.std()
test_z = (test - train.mean()) / train.std()
# GOOD: one definition, one place to fix
def standardize(x, center, scale):
    return (x - center) / scale

# Use the training statistics for every split
mu, sigma = train.mean(), train.std()
train_z, valid_z, test_z = (standardize(s, mu, sigma) for s in (train, valid, test))
Avoid magic numbers
A bare 0.05 or 1000 buried in code is a mystery. Give it a name.
# BAD
if (p_value < 0.05) reject <- TRUE
draws <- rnorm(10000)
# GOOD
significance_level <- 0.05
n_draws <- 10000L
reject <- p_value < significance_level
draws <- rnorm(n_draws)
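A Python version of the same idea might look like the sketch below. The constant names and the two helper functions are hypothetical; the point is only that each number is defined once, with a name, and passed around from there.

```python
import random

SIGNIFICANCE_LEVEL = 0.05  # named threshold; the value mirrors the R example
N_DRAWS = 10_000           # number of simulated draws


def should_reject(p_value, significance_level=SIGNIFICANCE_LEVEL):
    """Return True when the p-value falls below the named threshold."""
    return p_value < significance_level


def draw_standard_normals(n_draws=N_DRAWS, seed=None):
    """Draw n_draws samples from a standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
```

Exposing the constants as default arguments keeps the common case short while letting a caller override the threshold explicitly, for example `should_reject(p_value, significance_level=0.01)`.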
Fail loudly
A wrong answer is worse than an error. Check your assumptions and stop early with a clear message rather than silently producing nonsense.
# GOOD: validate inputs up front
estimate_rate <- function(counts, exposure) {
  stopifnot(
    length(counts) == length(exposure),
    all(exposure > 0)
  )
  sum(counts) / sum(exposure)
}
# GOOD (Python)
def estimate_rate(counts, exposure):
    if len(counts) != len(exposure):
        raise ValueError("counts and exposure must be the same length")
    if any(e <= 0 for e in exposure):
        raise ValueError("exposure must be positive")
    return sum(counts) / sum(exposure)
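# GOOD (Julia)
# @assert may be disabled at some optimization levels, so throw explicitly.
function estimate_rate(counts, exposure)
    length(counts) == length(exposure) || throw(ArgumentError("lengths must match"))
    all(>(0), exposure) || throw(ArgumentError("exposure must be positive"))
    sum(counts) / sum(exposure)
end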
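To see what a silent failure looks like, the sketch below contrasts an unchecked computation with a checked one. Both function names are hypothetical, and the computation (a mean of per-unit rates) differs from estimate_rate above; it was chosen because Python's `zip` stops at the shorter input, which makes a length mismatch easy to miss.

```python
def mean_rate_unchecked(counts, exposure):
    """Average of per-unit rates with no validation."""
    # zip() stops at the shorter input, so extra counts are silently dropped.
    rates = [c / e for c, e in zip(counts, exposure)]
    return sum(rates) / len(rates)


def mean_rate_checked(counts, exposure):
    """Same computation, but mismatched or invalid inputs raise immediately."""
    if len(counts) != len(exposure):
        raise ValueError(
            f"counts has {len(counts)} items but exposure has {len(exposure)}"
        )
    if any(e <= 0 for e in exposure):
        raise ValueError("exposure must be positive")
    rates = [c / e for c, e in zip(counts, exposure)]
    return sum(rates) / len(rates)
```

With counts `[1, 3, 90]` and exposure `[2, 2]`, the unchecked version returns 1.0 and ignores the third count entirely, while the checked version raises a ValueError that names both lengths.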
Comment the why, and format consistently
- Comment intent and gotchas, not the obvious (i = i + 1 # add one helps no one).
- Pick a style and let a formatter enforce it: styler/lintr (R), black/ruff (Python), JuliaFormatter.jl (Julia). Consistency removes noise from diffs and lets reviewers focus on substance.
# BAD: explains the code we can already read
x <- x + 1 # increment x
# GOOD: explains why
# Offset by 1 because the assay reports 0-based well indices.
well_index <- well_index + 1
A messy snippet, refactored
# BAD: unnamed steps, magic numbers, repetition, no checks
f <- function(a) {
  b <- a[a[,2] > 0.05,]
  m1 <- mean(b[,1]); m2 <- mean(b[b[,3]==1,1]); m3 <- mean(b[b[,3]==0,1])
  c(m1, m2, m3)
}
# GOOD: named, checked, DRY
group_mean <- function(df, group_value) {
  mean(df$value[df$group == group_value])
}
summarize_by_group <- function(df, min_weight = 0.05) {
  stopifnot(all(c("value", "weight", "group") %in% names(df)))
  kept <- df[df$weight > min_weight, ]
  c(
    overall = mean(kept$value),
    treated = group_mean(kept, 1),
    control = group_mean(kept, 0)
  )
}
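For comparison, here is a rough Python port of the refactored version, assuming the data arrives as a list of dicts with the same three keys. The constant names and the row format are assumptions for this sketch; only the structure (a named threshold, an up-front check, one helper for the repeated mean) is taken from the R code above.

```python
from statistics import mean

MIN_WEIGHT = 0.05  # default threshold, mirroring min_weight in the R version
REQUIRED_KEYS = {"value", "weight", "group"}


def group_mean(rows, group_value):
    """Mean of `value` over rows whose `group` equals group_value."""
    return mean(row["value"] for row in rows if row["group"] == group_value)


def summarize_by_group(rows, min_weight=MIN_WEIGHT):
    """Overall, treated (group 1), and control (group 0) means of kept rows."""
    # Fail loudly if any row lacks a column the summary depends on.
    for row in rows:
        missing = REQUIRED_KEYS - set(row)
        if missing:
            raise ValueError(f"row is missing keys: {sorted(missing)}")
    kept = [row for row in rows if row["weight"] > min_weight]
    return {
        "overall": mean(row["value"] for row in kept),
        "treated": group_mean(kept, 1),
        "control": group_mean(kept, 0),
    }
```

If a group has no rows left after filtering, `statistics.mean` raises StatisticsError instead of returning a placeholder, which is consistent with failing loudly.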
Learn your editor
You spend more time in your text editor than almost any other tool, so fluency there compounds over a career. Pick one and learn it deeply — the specific choice matters less than the investment.
- Vim / Neovim: modal editors built around composable keystrokes for editing at the speed of thought. Some form of vi is preinstalled on essentially every Unix server, so the skill travels everywhere; run vimtutor from a terminal for a 30-minute hands-on start. Neovim is the modernized fork, configured in Lua with first-class Language Server support for completion and diagnostics; a popular starting config is kickstart.nvim.
- Doom Emacs: a fast, batteries-included configuration of Emacs that ships Vim keybindings (via evil-mode), so you get modal editing plus Emacs's ecosystem. Its Org mode is a powerful home for literate, reproducible notebooks and notes.
Prefer a conventional IDE? VS Code, RStudio, and Positron are all excellent, and most IDEs, including these, offer a Vim-keybindings mode so you can borrow the muscle memory without leaving your IDE. Whatever you choose, pair it with version control and a task runner so your editor, history, and pipeline reinforce each other.