Logistic Regression

Logistic regression models the probability of a binary outcome, such as diseased versus healthy or exposed versus not. It is the standard tool for risk modeling and case-control studies in epidemiology, because its coefficients translate directly into odds ratios.

Figure: Binary outcomes with the fitted logistic (sigmoid) probability curve.

From probability to log-odds

A linear regression can predict values outside $[0,1]$, which makes no sense for a probability. Logistic regression instead models the log-odds as a linear function of the predictors:

$$\operatorname{logit}(p) = \log\frac{p}{1-p} = x^\top\beta.$$

Solving for $p$ gives the logistic (sigmoid) function, which is bounded between $0$ and $1$:

$$p = \frac{1}{1 + e^{-x^\top\beta}}.$$

The building blocks here are the odds $p/(1-p)$ and the logarithm that maps them onto the whole real line.
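The round trip between probability and log-odds can be sketched in a few lines. This is a minimal illustration, not a numerically hardened implementation; the function names `logit` and `sigmoid` are chosen here for convenience and are not taken from any particular library.

```python
import math

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def sigmoid(eta):
    """Inverse of logit: maps any real linear predictor into (0, 1).

    Note: math.exp can overflow for very large negative eta; a production
    version would guard against that.
    """
    return 1.0 / (1.0 + math.exp(-eta))

# Walk a few probabilities through odds, log-odds, and back again.
for p in (0.1, 0.5, 0.9):
    eta = logit(p)
    print(f"p={p:.1f}  odds={p / (1 - p):.3f}  "
          f"log-odds={eta:+.3f}  back to p={sigmoid(eta):.1f}")
```

Probabilities below one half map to negative log-odds, one half maps to zero, and probabilities above one half map to positive log-odds, which is why a linear predictor can range over the whole real line without ever producing an invalid probability.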

Coefficients are log-odds-ratios

Because the linear predictor lives on the log-odds scale, a one-unit increase in $x_j$ adds $\beta_j$ to the log-odds, holding the other predictors fixed. Exponentiating recovers a multiplicative effect on the odds, so $e^{\beta_j}$ is an odds ratio:

$$\text{OR}_j = e^{\beta_j}.$$

An odds ratio above $1$ means the predictor increases the odds of the outcome; below $1$ means it decreases them; exactly $1$ means no association.
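One way to see why a coefficient is a log-odds-ratio is to check numerically that the ratio of odds at $x+1$ versus $x$ is the same wherever you start. The sketch below uses made-up coefficients (`beta0 = -1.0`, `beta1 = 0.5`), chosen purely for illustration.

```python
import math

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -1.0, 0.5

def odds(x):
    """Odds of the outcome at predictor value x under the logistic model."""
    return math.exp(beta0 + beta1 * x)

def prob(x):
    """Probability of the outcome at predictor value x."""
    return odds(x) / (1.0 + odds(x))

odds_ratio = math.exp(beta1)
print(f"exp(beta1) = {odds_ratio:.4f}")

# The odds ratio for a one-unit step is constant; the change in
# probability for the same step is not.
for x in (0.0, 2.0, 5.0):
    ratio = odds(x + 1) / odds(x)
    delta_p = prob(x + 1) - prob(x)
    print(f"x={x}: odds ratio = {ratio:.4f}, change in probability = {delta_p:.4f}")
```

The odds ratio is the same at every starting value, while the corresponding change in probability depends on where along the curve the step is taken. That is the reason effects are reported on the odds scale.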

Fitting by maximum likelihood

Each observation is a Bernoulli trial with success probability $p_i = 1/(1+e^{-x_i^\top\beta})$. The log-likelihood is

$$\ell(\beta) = \sum_i \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right].$$

Unlike OLS, the maximum likelihood estimate has no closed form, so we maximize $\ell(\beta)$ numerically. The standard algorithm is Newton–Raphson, which for this model takes the form of iteratively reweighted least squares (IRLS): repeatedly solving a weighted linear regression until the coefficients converge.
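To make the IRLS loop concrete, here is a minimal sketch for the one-predictor case, where each weighted least-squares step has a simple closed form. The toy data, the function name `irls_logistic`, and the tolerance are all assumptions made for this illustration; real software adds safeguards (step halving, separation checks) that are omitted here.

```python
import math

# Hypothetical toy data (not from the text): hours of exposure, infection status.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

def irls_logistic(x, y, tol=1e-10, max_iter=50):
    """Fit logit(p) = b0 + b1 * x by IRLS; returns (b0, b1, iterations used)."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        eta = [b0 + b1 * xi for xi in x]                    # linear predictor
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]       # fitted probabilities
        w = [pi * (1.0 - pi) for pi in p]                   # IRLS weights
        # Working response: linearised outcome on the log-odds scale.
        z = [e + (yi - pi) / wi for e, yi, pi, wi in zip(eta, y, p, w)]

        # Weighted least squares of z on x (closed form for one predictor).
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxz = sum(wi * (xi - xbar) * (zi - zbar) for wi, xi, zi in zip(w, x, z))
        new_b1 = sxz / sxx
        new_b0 = zbar - new_b1 * xbar

        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            return b0, b1, iteration
    raise RuntimeError("IRLS did not converge")

b0_hat, b1_hat, n_iter = irls_logistic(x, y)
print(f"intercept = {b0_hat:.4f}, slope = {b1_hat:.4f}, iterations = {n_iter}")
```

At convergence the score equations hold: the residuals $y_i - p_i$ sum to zero, and so do the residuals weighted by $x_i$. Checking this is a convenient test that the loop has found the maximum.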

Deviance and fit

Model fit is summarized by the deviance, $D = -2\ell(\hat\beta)$, the logistic analogue of the residual sum of squares. Comparing the deviances of nested models gives a likelihood-ratio test for whether added predictors improve fit. Coefficient standard errors come from the curvature of the log-likelihood, yielding Wald tests and confidence intervals for each $\beta_j$ (and, after exponentiating, for each odds ratio).
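The deviance comparison and the Wald interval can be worked by hand when the only predictor is binary, because the fitted probabilities are then just the observed group proportions and no iterative fit is needed. The counts below are hypothetical, and the standard error uses the usual 2x2-table formula for a log odds ratio, which is what the curvature of the log-likelihood gives in this special case.

```python
import math

# Hypothetical 2x2 counts (illustrative only, not from the text).
cases_exp, noncases_exp = 30, 70        # exposed group
cases_unexp, noncases_unexp = 15, 85    # unexposed group

def bernoulli_loglik(successes, failures, p):
    """Log-likelihood of grouped binary data with common success probability p."""
    return successes * math.log(p) + failures * math.log(1.0 - p)

# Full model (intercept + exposure): fitted probabilities are group proportions.
p_exp = cases_exp / (cases_exp + noncases_exp)
p_unexp = cases_unexp / (cases_unexp + noncases_unexp)
ll_full = (bernoulli_loglik(cases_exp, noncases_exp, p_exp)
           + bernoulli_loglik(cases_unexp, noncases_unexp, p_unexp))

# Null model (intercept only): one pooled proportion for everyone.
n_cases = cases_exp + cases_unexp
n_noncases = noncases_exp + noncases_unexp
ll_null = bernoulli_loglik(n_cases, n_noncases, n_cases / (n_cases + n_noncases))

dev_full, dev_null = -2.0 * ll_full, -2.0 * ll_null
lr_stat = dev_null - dev_full                      # likelihood-ratio statistic
p_value = math.erfc(math.sqrt(lr_stat / 2.0))      # chi-square tail, 1 df

# Wald interval for the odds ratio, built on the log-odds scale.
log_or = math.log((cases_exp * noncases_unexp) / (noncases_exp * cases_unexp))
se = math.sqrt(1 / cases_exp + 1 / noncases_exp
               + 1 / cases_unexp + 1 / noncases_unexp)
odds_ratio = math.exp(log_or)
ci_low, ci_high = math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)

print(f"deviance: null = {dev_null:.2f}, full = {dev_full:.2f}")
print(f"LR statistic = {lr_stat:.2f}, p = {p_value:.4f}")
print(f"OR = {odds_ratio:.2f}, 95% Wald CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is computed on the log-odds scale and then exponentiated, so it is asymmetric around the odds ratio itself. With more predictors or a continuous one, the same quantities come from a fitted model object rather than from closed-form proportions.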

Worked example: reading a coefficient

Suppose a model of infection uses a single predictor, hours of exposure, and returns $\hat\beta_0 = -2.0$ and $\hat\beta_1 = 0.7$. The slope’s odds ratio is $e^{0.7}\approx 2.01$: each additional hour of exposure roughly doubles the odds of infection.

Now convert the linear predictor to a probability for someone with $3$ hours of exposure. The linear predictor is $x^\top\beta = -2.0 + 0.7(3) = 0.1$. Then

$$p = \frac{1}{1+e^{-0.1}} \approx 0.525,$$

about a $52.5\%$ predicted probability of infection.
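The arithmetic in this worked example is easy to reproduce. The short check below plugs in the same coefficients and exposure time; the variable names are just labels chosen for this sketch.

```python
import math

beta0, beta1 = -2.0, 0.7   # intercept and slope from the worked example
hours = 3                  # hours of exposure

odds_ratio = math.exp(beta1)            # multiplicative effect per extra hour
eta = beta0 + beta1 * hours             # linear predictor on the log-odds scale
p = 1.0 / (1.0 + math.exp(-eta))        # back-transform to a probability

print(f"odds ratio per hour: {odds_ratio:.2f}")   # 2.01
print(f"linear predictor:    {eta:.1f}")          # 0.1
print(f"probability:         {p:.3f}")            # 0.525
```

Changing `hours` shows how quickly the predicted probability moves: each extra hour multiplies the odds by about 2, but the probability itself rises along the S-shaped curve rather than by a fixed amount.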

In code

R

set.seed(1)
x <- rnorm(300)
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))
y <- rbinom(300, 1, p)

fit <- glm(y ~ x, family = binomial)
summary(fit)                 # coefficients on log-odds scale
exp(coef(fit))               # odds ratios; true slope OR is exp(1.2) ~ 3.3, estimate varies by sample
predict(fit, type = "response")[1:5]   # fitted probabilities

Python

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, p)

X = sm.add_constant(x)
fit = sm.Logit(y, X).fit()
print(fit.params)            # log-odds scale
print(np.exp(fit.params))    # odds ratios; true slope OR is exp(1.2) ~ 3.3, estimate varies by sample
print(fit.predict(X)[:5])    # fitted probabilities

Output

Optimization terminated successfully.
         Current function value: 0.530359
         Iterations 6
[-0.52992636  1.46561641]
[0.58864832 4.3302116 ]
[0.49414214 0.6624543  0.48859389 0.08018361 0.68932769]

Julia

using GLM, DataFrames, Random
Random.seed!(1)
x = randn(300)
p = 1 ./ (1 .+ exp.(-(-0.5 .+ 1.2 .* x)))
y = rand.(Bernoulli.(p))
df = DataFrame(x = x, y = Int.(y))

fit = glm(@formula(y ~ x), df, Binomial(), LogitLink())
coef(fit)              # log-odds scale
exp.(coef(fit))        # odds ratios; true slope OR is exp(1.2) ~ 3.3, estimate varies by sample
predict(fit)[1:5]      # fitted probabilities

Why it matters

Logistic regression is the default model whenever the outcome is yes/no, and its odds ratios are the currency of epidemiological risk reporting. It scales from a single exposure to genome-wide association studies (GWAS), where each variant’s effect on disease status is fit as a logistic coefficient across millions of tests.