Manipulating Data Frames

Most of applied analysis is data wrangling — reshaping, filtering, and summarizing tables before any model is fit. The same handful of verbs shows up in every ecosystem, so this page is a Rosetta stone across R (dplyr and data.table), Python (pandas and Polars), and Julia (DataFrames.jl).

The core verbs

Almost every pipeline is built from six operations.

Operationdplyr (R)data.table (R)pandas (Python)Polars (Python)DataFrames.jl (Julia)
pick columnsselect(df, a, b)dt[, .(a, b)]df[["a", "b"]]df.select("a", "b")select(df, :a, :b)
filter rowsfilter(df, x > 0)dt[x > 0]df[df.x > 0]df.filter(pl.col("x") > 0)subset(df, :x => ByRow(>(0)))
new columnmutate(df, y = 2*x)dt[, y := 2*x]df.assign(y=2*df.x)df.with_columns(y=2*pl.col("x"))transform(df, :x => ByRow(x -> 2x) => :y)
sortarrange(df, x)dt[order(x)]df.sort_values("x")df.sort("x")sort(df, :x)
group + summarize`group_by(df, g)> summarise(m = mean(x))`dt[, .(m = mean(x)), by = g]df.groupby("g").agg(m=("x","mean"))df.group_by("g").agg(m=pl.col("x").mean())
joinleft_join(a, b, by = "id")b[a, on = "id"]a.merge(b, on="id", how="left")a.join(b, on="id", how="left")leftjoin(a, b, on = :id)

A worked pipeline

The same task in each dialect: from weekly case counts, keep the first two weeks, add a per-100k rate, then summarize total cases and mean rate by region.

R — dplyr

library(dplyr)
df <- tibble(
  region = c("A", "A", "A", "B", "B", "B"),
  week   = c(1, 2, 3, 1, 2, 3),
  cases  = c(10, 15, 9, 3, 7, 12),
  pop    = c(1e5, 1e5, 1e5, 5e4, 5e4, 5e4)
)
df |>
  filter(week <= 2) |>
  mutate(rate = cases / pop * 1e5) |>
  group_by(region) |>
  summarise(total = sum(cases), mean_rate = mean(rate))

R — data.table

library(data.table)
dt <- data.table(
  region = c("A", "A", "A", "B", "B", "B"),
  week   = c(1, 2, 3, 1, 2, 3),
  cases  = c(10, 15, 9, 3, 7, 12),
  pop    = c(1e5, 1e5, 1e5, 5e4, 5e4, 5e4)
)
dt[week <= 2
   ][, rate := cases / pop * 1e5
   ][, .(total = sum(cases), mean_rate = mean(rate)), by = region]

Python — pandas

import pandas as pd
df = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "week":   [1, 2, 3, 1, 2, 3],
    "cases":  [10, 15, 9, 3, 7, 12],
    "pop":    [1e5, 1e5, 1e5, 5e4, 5e4, 5e4],
})
out = (
    df[df["week"] <= 2]
    .assign(rate=lambda d: d["cases"] / d["pop"] * 1e5)
    .groupby("region", as_index=False)
    .agg(total=("cases", "sum"), mean_rate=("rate", "mean"))
)
print(out)
  region  total  mean_rate
0      A     25       12.5
1      B     10       10.0

Python — Polars

Polars uses an expression API (pl.col(...)), reads as a clean pipeline, and has a lazy mode (pl.scan_csv(...).collect()) that optimizes the whole query — often much faster on large data.

import polars as pl
df = pl.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "week":   [1, 2, 3, 1, 2, 3],
    "cases":  [10, 15, 9, 3, 7, 12],
    "pop":    [1e5, 1e5, 1e5, 5e4, 5e4, 5e4],
})
out = (
    df.filter(pl.col("week") <= 2)
      .with_columns(rate=pl.col("cases") / pl.col("pop") * 1e5)
      .group_by("region")
      .agg(total=pl.col("cases").sum(), mean_rate=pl.col("rate").mean())
      .sort("region")
)
print(out)
shape: (2, 3)
┌────────┬───────┬───────────┐
│ region ┆ total ┆ mean_rate │
│ ---    ┆ ---   ┆ ---       │
│ str    ┆ i64   ┆ f64       │
╞════════╪═══════╪═══════════╡
│ A      ┆ 25    ┆ 12.5      │
│ B      ┆ 10    ┆ 10.0      │
└────────┴───────┴───────────┘

Julia — DataFrames.jl

using DataFrames, Statistics
df = DataFrame(region = ["A","A","A","B","B","B"],
               week   = [1, 2, 3, 1, 2, 3],
               cases  = [10, 15, 9, 3, 7, 12],
               pop    = [1e5, 1e5, 1e5, 5e4, 5e4, 5e4])
sub = subset(df, :week => ByRow(<=(2)))
transform!(sub, [:cases, :pop] => ByRow((c, p) -> c / p * 1e5) => :rate)
combine(groupby(sub, :region),
        :cases => sum => :total,
        :rate  => mean => :mean_rate)

DataFramesMeta.jl adds a @chain/@rsubset/@transform layer that reads much like dplyr if you prefer piped, macro-based syntax.

Reshaping: long ↔ wide

Tidy analysis usually wants long data (one row per observation); reporting often wants wide.

Which should you use?

Why it matters

Clean, correct data manipulation is the foundation every analysis on this site stands on — a summary, a regression, or a fitted epidemic model is only as good as the table feeding it. Knowing the same verbs across languages lets you read collaborators’ code and pick the right tool for the dataset in front of you.