Generalized Linear Models

Generalized linear models (GLMs) are a single framework that unifies linear, logistic, and Poisson regression. By separating the outcome’s distribution from the way its mean depends on predictors, GLMs let epidemiologists model continuous measurements, binary events, and event counts with the same fitting machinery.

Three components

Every GLM is built from three pieces:

The link is what frees the model from the constraints of ordinary linear regression: it maps a bounded mean (a probability in [0,1][0,1], a positive rate) onto the unbounded scale where a linear predictor can roam freely.

Three familiar special cases

Choosing the distribution and link recovers the standard models:

DistributionLink g(μ)g(\mu)Model
Normalidentity, μ\mulinear regression
Binomiallogit, logμ1μ\log\frac{\mu}{1-\mu}logistic regression
Poissonlog, logμ\log\muPoisson regression

So the identity link with a normal outcome is exactly the linear model μ=xβ\mu = x^\top\beta, while the logit link with a binomial outcome is logistic regression.

Poisson regression for counts and rates

For count outcomes we use the Poisson distribution with a log link: logμ=xβμ=exβ.\log\mu = x^\top\beta \quad\Longleftrightarrow\quad \mu = e^{x^\top\beta}. The log link guarantees a positive mean and makes covariate effects multiplicative: eβje^{\beta_j} is a rate ratio. To model a rate rather than a raw count, we add an offset log(exposure)\log(\text{exposure}) with a fixed coefficient of 11: logμ=log(exposure)+xβ,\log\mu = \log(\text{exposure}) + x^\top\beta, so that μ=exposureexβ\mu = \text{exposure}\cdot e^{x^\top\beta}. The exposure is typically person-time or population size, turning the model into one for cases per unit of population.

Fitting: deviance and IRLS

All GLMs are fit by maximum likelihood, using iteratively reweighted least squares (IRLS): each iteration solves a weighted linear regression on a linearized version of the response until the coefficients converge. Goodness of fit is measured by the deviance, twice the gap between the log-likelihood of the fitted model and that of a saturated model; it generalizes the residual sum of squares and drives likelihood-ratio tests between nested models.

Overdispersion

The Poisson model assumes the variance equals the mean. Real count data are often overdispersed: their variance exceeds the mean, inflating precision and shrinking standard errors if ignored. Two common remedies are quasi-Poisson, which multiplies the variance by an estimated dispersion parameter ϕ\phi so that Var(y)=ϕμ\operatorname{Var}(y) = \phi\mu, and the negative binomial, which adds an explicit extra-variance parameter through a different exponential-family distribution.

Worked example: case counts with a population offset

Two districts report new cases over a year. District A has 3030 cases in a population of 10,00010{,}000; District B has 6060 cases in a population of 40,00040{,}000. Fit logμ=β0+β1B+log(pop)\log\mu = \beta_0 + \beta_1\,\text{B} + \log(\text{pop}), where B\text{B} indicates District B. The rates are 30/10000=0.003030/10000 = 0.0030 and 60/40000=0.001560/40000 = 0.0015 per person-year. The rate ratio for B versus A is 0.0015/0.0030=0.50.0015/0.0030 = 0.5, so β^1=log(0.5)0.69\hat\beta_1 = \log(0.5) \approx -0.69 and eβ^1=0.5e^{\hat\beta_1} = 0.5: District B has half the incidence rate. Without the offset, District B’s larger raw count would misleadingly suggest a higher burden.

In code

R

district <- c("A", "B")
cases <- c(30, 60)
pop   <- c(10000, 40000)

fit <- glm(cases ~ district, family = poisson,
           offset = log(pop))
summary(fit)
exp(coef(fit))     # districtB rate ratio ~ 0.5

Python

import numpy as np
import statsmodels.api as sm

cases = np.array([30, 60])
pop   = np.array([10000, 40000])
X = sm.add_constant([0, 1])          # 0 = A, 1 = B

fit = sm.GLM(cases, X, family=sm.families.Poisson(),
             offset=np.log(pop)).fit()
print(np.exp(fit.params))            # B rate ratio ~ 0.5
[0.003 0.5  ]

Julia

using GLM, DataFrames
df = DataFrame(cases = [30, 60],
               district = ["A", "B"],
               logpop = log.([10000, 40000]))

fit = glm(@formula(cases ~ district), df, Poisson(), LogLink();
          offset = df.logpop)
exp.(coef(fit))        # districtB rate ratio ~ 0.5

Why it matters

GLMs give analysts one coherent language for regression across continuous, binary, and count outcomes, so the same ideas of coefficients, standard errors, and deviance carry over unchanged. This unification is why incidence-rate models, dose-response curves, and risk models can all be fit, compared, and interpreted with a single well-understood toolkit.