Logistic Regression
Logistic regression models the probability of a binary outcome, such as diseased versus healthy or exposed versus unexposed. It is the standard tool for risk modeling and case-control studies in epidemiology because its coefficients translate directly into odds ratios.
From probability to log-odds
A linear regression can predict values outside $[0, 1]$, which makes no sense for a probability. Logistic regression instead models the log-odds as a linear function of the predictors:
$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
Solving for $p$ gives the logistic (sigmoid) function, which is bounded between $0$ and $1$:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$
The building blocks here are the odds, $p/(1-p)$, and the logarithm that maps them onto the whole real line.
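The mapping between probabilities and log-odds can be checked numerically. The following is a minimal sketch; the function names `logit` and `sigmoid` are chosen here for illustration and do not come from any particular library.

```python
import math

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

def sigmoid(eta):
    """Inverse of logit: maps a real-valued linear predictor into (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Any real linear predictor yields a valid probability, and logit undoes sigmoid.
for eta in (-5.0, 0.0, 5.0):
    p = sigmoid(eta)
    print(f"eta={eta:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

However large or small the linear predictor, the resulting probability stays strictly inside the unit interval, which is the property a plain linear regression lacks.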
Coefficients are log-odds-ratios
Because the linear predictor lives on the log-odds scale, a one-unit increase in $x_j$ adds $\beta_j$ to the log-odds. Exponentiating recovers a multiplicative effect on the odds, so $e^{\beta_j}$ is an odds ratio:
$$\mathrm{OR}_j = \frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}$$
An odds ratio above $1$ means the predictor increases the odds of the outcome; below $1$ means it decreases them; exactly $1$ means no association.
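A short numerical check makes the point that the odds ratio for a one-unit increase does not depend on the starting value of the predictor. The coefficients below are hypothetical values picked for illustration only.

```python
import math

def odds(eta):
    """Odds p / (1 - p) implied by a linear predictor eta on the log-odds scale."""
    return math.exp(eta)

beta0, beta1 = -1.0, 0.4  # hypothetical coefficients, for illustration only

ratios = []
for x in (0.0, 2.0, 10.0):
    # Ratio of odds at x + 1 to odds at x
    ratio = odds(beta0 + beta1 * (x + 1)) / odds(beta0 + beta1 * x)
    ratios.append(ratio)
    print(f"x={x:4.1f}  odds ratio for x -> x+1: {ratio:.4f}")

print(f"exp(beta1) = {math.exp(beta1):.4f}")
```

Every row prints the same ratio, equal to `exp(beta1)`. The change in probability, by contrast, does depend on where on the curve the increase happens.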
Fitting by maximum likelihood
Each observation $y_i \in \{0, 1\}$ is a Bernoulli trial with success probability $p_i$. The log-likelihood is
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Unlike OLS, there is no closed-form solution, so the maximum-likelihood estimate is found numerically. The standard algorithm is Newton–Raphson, which for this model takes the form of iteratively reweighted least squares (IRLS): repeatedly solving a weighted linear regression until the coefficients converge.
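The IRLS iteration can be sketched in a few lines of NumPy. This is an illustrative implementation under simple assumptions (simulated data, no separation, no safeguards against weights that underflow to zero), not the code any particular library uses; the function name `fit_logistic_irls` is made up for this example.

```python
import numpy as np

def fit_logistic_irls(X, y, max_iter=25, tol=1e-8):
    """Fit logistic regression by IRLS. X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                      # linear predictor (log-odds)
        p = 1 / (1 + np.exp(-eta))          # current fitted probabilities
        w = p * (1 - p)                     # IRLS weights (Bernoulli variance)
        z = eta + (y - p) / w               # working response
        XtW = X.T * w                       # X' W, with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated data with true intercept -0.5 and slope 1.2 (assumed for the demo).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.2 * x))))

beta_hat = fit_logistic_irls(X, y)
print(beta_hat)  # estimates on the log-odds scale
```

Each pass solves one weighted least-squares problem, which is algebraically the same as a Newton–Raphson step on the log-likelihood. At convergence the score `X.T @ (y - p)` is numerically zero.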
Deviance and fit
Model fit is summarized by the deviance, $D = -2\,\ell(\hat\beta)$, the logistic analogue of the residual sum of squares. Comparing the deviance of nested models gives a likelihood-ratio test for whether added predictors improve fit. Coefficient standard errors come from the curvature of the log-likelihood, yielding Wald tests and confidence intervals for each $\beta_j$ (and, after exponentiating, for each odds ratio $e^{\beta_j}$).
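The quantities in this section can be computed by hand on simulated data. The sketch below rests on several assumptions: 0/1 outcomes (so the deviance is minus twice the log-likelihood), a single added predictor (so the likelihood-ratio statistic is compared with a chi-square distribution on 1 degree of freedom), and a normal-approximation multiplier of 1.96 for 95% Wald intervals. The helper names `fit` and `deviance` are made up for this example.

```python
import math
import numpy as np

def fit(X, y, n_iter=25):
    """Newton-Raphson fit; returns coefficients and their estimated covariance."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ beta)))
        info = (X.T * (p * (1 - p))) @ X            # information matrix X'WX
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1 / (1 + np.exp(-(X @ beta)))
    info = (X.T * (p * (1 - p))) @ X                # curvature at the estimate
    return beta, np.linalg.inv(info)

def deviance(X, y, beta):
    """D = -2 * log-likelihood for binary (0/1) outcomes."""
    p = 1 / (1 + np.exp(-(X @ beta)))
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.2 * x))))

X0 = np.ones((500, 1))                     # null model: intercept only
X1 = np.column_stack([np.ones(500), x])    # adds the predictor x

b0, _ = fit(X0, y)
b1, cov1 = fit(X1, y)

lr_stat = deviance(X0, y, b0) - deviance(X1, y, b1)  # drop in deviance
lr_p = math.erfc(math.sqrt(lr_stat / 2))             # chi-square tail, 1 df

se = np.sqrt(np.diag(cov1))                          # Wald standard errors
ci_low, ci_high = b1 - 1.96 * se, b1 + 1.96 * se     # 95% intervals, log-odds

print("LR statistic:", lr_stat, "p-value:", lr_p)
print("slope odds ratio:", np.exp(b1[1]))
print("95% CI for odds ratio:", np.exp(ci_low[1]), np.exp(ci_high[1]))
```

Exponentiating the endpoints of the log-odds interval gives the interval for the odds ratio, which is why such intervals are asymmetric around the point estimate.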
Worked example: reading a coefficient
Suppose a model of infection uses a single predictor, hours of exposure, and returns an intercept estimate $\hat\beta_0$ and a slope estimate $\hat\beta_1$. The slope's odds ratio is $e^{\hat\beta_1}$, here close to $2$: each additional hour of exposure roughly doubles the odds of infection.
Now convert a linear predictor to a probability for someone with $x$ hours of exposure. The linear predictor is $\eta = \hat\beta_0 + \hat\beta_1 x$. Then $\hat p = 1 / (1 + e^{-\eta})$ is the predicted probability of infection.
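The two steps of this example, exponentiating the slope and passing the linear predictor through the sigmoid, can be sketched in Python. The intercept, slope, and exposure below are hypothetical stand-ins chosen only so that the odds ratio comes out close to 2; they are not the figures of the example itself.

```python
import math

# Hypothetical values, assumed for illustration only.
beta0 = -2.0   # intercept on the log-odds scale (assumed)
beta1 = 0.7    # slope per hour of exposure (assumed)
hours = 3.0    # exposure of the person of interest (assumed)

odds_ratio = math.exp(beta1)           # multiplicative change in odds per hour
eta = beta0 + beta1 * hours            # linear predictor (log-odds)
prob = 1 / (1 + math.exp(-eta))        # predicted probability of infection

print(f"odds ratio per hour: {odds_ratio:.3f}")
print(f"linear predictor:    {eta:.3f}")
print(f"predicted probability: {prob:.3f}")
```

With these assumed inputs the odds ratio is about 2.01 and the predicted probability is about 0.525. A doubling of the odds is not a doubling of the probability.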
In code
R
set.seed(1)
x <- rnorm(300)
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))
y <- rbinom(300, 1, p)
fit <- glm(y ~ x, family = binomial)
summary(fit) # coefficients on log-odds scale
exp(coef(fit)) # odds ratios; slope ~ exp(1.2) ~ 3.3
predict(fit, type = "response")[1:5] # fitted probabilities
Python
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(1)
x = rng.normal(size=300)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, p)
X = sm.add_constant(x)
fit = sm.Logit(y, X).fit()
print(fit.params) # log-odds scale
print(np.exp(fit.params)) # odds ratios; slope ~ exp(1.2) ~ 3.3
print(fit.predict(X)[:5]) # fitted probabilities
Optimization terminated successfully.
Current function value: 0.530359
Iterations 6
[-0.52992636 1.46561641]
[0.58864832 4.3302116 ]
[0.49414214 0.6624543 0.48859389 0.08018361 0.68932769]
Julia
using GLM, DataFrames, Random
Random.seed!(1)
x = randn(300)
p = 1 ./ (1 .+ exp.(-(-0.5 .+ 1.2 .* x)))
y = rand.(Bernoulli.(p))
df = DataFrame(x = x, y = Int.(y))
fit = glm(@formula(y ~ x), df, Binomial(), LogitLink())
coef(fit) # log-odds scale
exp.(coef(fit)) # odds ratios; slope ~ exp(1.2) ~ 3.3
predict(fit)[1:5] # fitted probabilities
Why it matters
Logistic regression is the default model whenever the outcome is yes/no, and its odds ratios are the currency of epidemiological risk reporting. It scales from a single exposure to genome-wide association studies (GWAS), where each variant’s effect on disease status is fit as a logistic coefficient across millions of tests.