Data Ingestion & APIs

A great deal of biological data lives in public repositories — sequences in GenBank, occurrence records in GBIF, pathogen genomes in GISAID, gene annotations in Ensembl. You could download it by clicking through a website, but that is slow, error-prone, and impossible to reproduce. An API (Application Programming Interface) lets your code request exactly the data you want, so the pull becomes a scripted, repeatable step in your analysis.

What an API Is

An API is a programmatic front door to a remote database or service. Your code sends a request — a web address (endpoint) plus query parameters describing what you want — and the server sends back structured data, usually as JSON or XML, along with a status code saying whether it worked.
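The anatomy of a request can be sketched without touching the network. This is a minimal illustration, not any particular provider's API: the endpoint URL is hypothetical, and `build_url` and `describe_status` are helper names invented here to show how an endpoint, query parameters, and a status code fit together.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only to show the shape of a request.
ENDPOINT = "https://api.example.org/v1/occurrence/search"


def build_url(endpoint, **params):
    """Join an endpoint and its query parameters into one request URL."""
    return f"{endpoint}?{urlencode(params)}"


def describe_status(code):
    """Map an HTTP status code to a coarse outcome."""
    if 200 <= code < 300:
        return "ok"
    if code == 429:
        return "rate limited: slow down and retry later"
    if 400 <= code < 500:
        return "client error: check the query or credentials"
    if 500 <= code < 600:
        return "server error: retry later"
    return "unexpected"


url = build_url(ENDPOINT, scientificName="Aedes aegypti", limit=2)
print(url)
print(describe_status(200), "|", describe_status(429))
```

Note that `urlencode` escapes the space in the species name, which is one of the small details that is easy to get wrong when pasting URLs together by hand.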

Figure: Your code sends an HTTP request with query parameters and an API key to a remote database such as GenBank or GBIF, which returns JSON or XML plus a status code; you then cache the raw response and parse it into a tidy table, keeping keys out of code and respecting rate limits.

Fetching programmatically beats manual downloads because it is reproducible (the query is code, not a memory of which buttons you clicked), automatable (loop over 500 accessions), and current (re-run to refresh).
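As a rough sketch of what "automatable" means in practice, the snippet below splits 500 accessions into batches and prepares one query per batch. The accession strings are placeholders rather than real records, `batched` is a helper written for this example, and the batch size of 200 is an assumption; the real limit depends on the provider.

```python
def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder accession strings, not real records.
accessions = [f"ACC{i:05d}" for i in range(1, 501)]

# Hypothetical batch size; check the provider's documentation for real limits.
BATCH_SIZE = 200

# One query per batch, using the same parameter names as an Entrez-style fetch.
queries = [
    {"db": "nuccore", "id": ",".join(chunk), "rettype": "fasta"}
    for chunk in batched(accessions, BATCH_SIZE)
]

print(len(queries), [q["id"].count(",") + 1 for q in queries])
```

Because the list of queries is built by code, the same script can be re-run later to refresh the data or extended to a longer accession list without any manual steps.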

Parsing a Response

The response comes back as text you parse into a table. In real code you would fetch it over the network; here we parse a canned response, which is exactly what you should do in examples and tests — never hit a live server from a test:

import json

# a GBIF-style response (fetched elsewhere; parsed offline here)
response = '''{"count": 2, "results": [
  {"scientificName": "Aedes aegypti", "year": 2021, "countryCode": "BR"},
  {"scientificName": "Aedes aegypti", "year": 2022, "countryCode": "US"}
]}'''

data = json.loads(response)
print("records:", data["count"])
for rec in data["results"]:
    print(f'  {rec["scientificName"]:<15} {rec["year"]}  {rec["countryCode"]}')

Output:

records: 2
  Aedes aegypti   2021  BR
  Aedes aegypti   2022  US
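Printing is fine for a quick look, but the goal stated above is a table. The sketch below flattens a canned response into rows with a fixed set of columns and writes them as CSV using only the standard library. The second record deliberately omits `countryCode` (a change made for this illustration) to show why `.get()` with a default is safer than direct indexing; `to_rows` and `COLUMNS` are names chosen here.

```python
import csv
import io
import json

# Canned response; the second record lacks countryCode on purpose.
response = '''{"count": 2, "results": [
  {"scientificName": "Aedes aegypti", "year": 2021, "countryCode": "BR"},
  {"scientificName": "Aedes aegypti", "year": 2022}
]}'''

COLUMNS = ["scientificName", "year", "countryCode"]


def to_rows(payload, columns):
    """Flatten the records into dicts that all share the same columns."""
    return [
        {col: rec.get(col, "") for col in columns}
        for rec in payload.get("results", [])
    ]


rows = to_rows(json.loads(response), COLUMNS)

# Write to an in-memory buffer; swap in open("file.csv", "w", newline="") to save.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Fixing the column list up front means every row has the same shape even when the server omits a field, which real occurrence records often do.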

Most sources have a dedicated client package that handles the requests and parsing for you — reach for those before writing raw HTTP calls:

# R: purpose-built clients for the big repositories
library(rgbif)
occ_search(scientificName = "Aedes aegypti", limit = 500)

library(rentrez)
# ids: a vector of record identifiers defined earlier in your script
entrez_fetch(db = "nuccore", id = ids, rettype = "fasta")

# Python: pygbif, Biopython Entrez, or requests for a generic REST API
from pygbif import occurrences
occurrences.search(scientificName="Aedes aegypti", limit=500)

Keys, Rate Limits, and Etiquette

Pulling data from a shared server comes with responsibilities, and getting these wrong will (rightly) get you throttled or blocked.
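Two of these responsibilities, keeping keys out of code and respecting rate limits, can be sketched as follows. This is an illustration under stated assumptions: the environment variable name `MY_API_KEY`, the `Throttle` class, and the 0.05 second interval are all invented for the example, and the actual limit should come from the provider's documentation.

```python
import os
import time


def get_api_key(var_name="MY_API_KEY"):
    """Read the key from an environment variable (hypothetical name)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


class Throttle:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical interval; real limits vary by provider.
throttle = Throttle(min_interval=0.05)
for accession in ["ACC00001", "ACC00002", "ACC00003"]:
    throttle.wait()
    # A real request for `accession` would go here.
```

Reading the key from the environment keeps it out of version control, and passing `clock` and `sleep` as arguments lets the throttle be tested without actually waiting.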

Reproducibility of a Data Pull

Public databases change over time — sequences are added, taxonomies revised, records corrected — so the same query can return different data next year. To keep an analysis reproducible:

This is the ingestion-side complement to Reproducibility: a result is only reproducible if its inputs can be recovered.
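One way to make a pull recoverable is to save the raw response untouched, alongside a small record of what was asked and when. The sketch below is one possible layout, not a standard: the `cache_raw` function, the file naming by checksum prefix, and the metadata fields are all choices made for this example.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def cache_raw(raw_text, query, cache_dir):
    """Save the unmodified response plus a sidecar file describing the pull."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # The checksum names the file and lets you verify it later.
    digest = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
    raw_path = cache_dir / f"{digest[:12]}.json"
    raw_path.write_text(raw_text, encoding="utf-8")

    meta = {
        "query": query,
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": digest,
    }
    meta_path = cache_dir / f"{digest[:12]}.meta.json"
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return raw_path, meta_path


# Canned response standing in for a real pull.
raw = '{"count": 0, "results": []}'
query = {"scientificName": "Aedes aegypti", "limit": 500}

with tempfile.TemporaryDirectory() as tmp:
    raw_path, meta_path = cache_raw(raw, query, tmp)
    saved_meta = json.loads(meta_path.read_text(encoding="utf-8"))
    round_trip = raw_path.read_text(encoding="utf-8")
```

Downstream analysis then reads from the cached file rather than the live server, so a later change in the database cannot silently alter the results.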

Sources You’ll Meet in Biology

| Source | Holds | Clients |
|---|---|---|
| NCBI / GenBank / Entrez | sequences, genomes, PubMed | rentrez, Biopython Entrez |
| GBIF | species occurrence records | rgbif, pygbif |
| GISAID | influenza / SARS-CoV-2 genomes (credentialed) | EpiCoV portal, feeds |
| Ensembl | gene annotation, comparative genomics | biomaRt, Ensembl REST |
| iNaturalist / OBIS | biodiversity observations | rinat, robis |
| NOAA / Open-Meteo | climate & weather covariates | REST + requests/httr2 |

For any source without a ready client, a generic HTTP library (httr2 in R, requests in Python, HTTP.jl in Julia) plus a JSON parser will do — the pattern in the figure above is always the same.
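The generic pattern might look like the sketch below. It uses the standard library's `urllib` rather than `requests` so that it runs with no extra packages, and it takes the opener as an argument so the example can run offline against a stand-in. The endpoint, `fetch_json`, `FakeResponse`, and `fake_opener` are all hypothetical names made up for this illustration.

```python
import json
import urllib.request
from urllib.parse import urlencode


def fetch_json(endpoint, params, opener=urllib.request.urlopen, timeout=30):
    """Request endpoint?params and return the decoded JSON body."""
    url = f"{endpoint}?{urlencode(params)}"
    with opener(url, timeout=timeout) as resp:
        # urlopen itself raises HTTPError for 4xx/5xx; this guards other cases.
        if resp.status != 200:
            raise RuntimeError(f"request failed with status {resp.status}")
        return json.loads(resp.read().decode("utf-8"))


class FakeResponse:
    """Stand-in for a network response so the example runs offline."""

    status = 200

    def __init__(self, body):
        self._body = body.encode("utf-8")

    def read(self):
        return self._body

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


def fake_opener(url, timeout=None):
    """Return a canned body instead of contacting a server."""
    return FakeResponse(
        '{"count": 1, "results": [{"scientificName": "Aedes aegypti"}]}'
    )


data = fetch_json(
    "https://api.example.org/v1/search",  # hypothetical endpoint
    {"q": "Aedes aegypti"},
    opener=fake_opener,
)
print(data["count"])
```

In real use you would leave `opener` at its default and point `endpoint` at the provider's documented URL; in tests you pass a fake, which keeps the rule about never hitting a live server.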

A Short Checklist