Linear Regression
Linear regression models a continuous outcome as a straight-line function of one or more predictors. It is the workhorse of quantitative data analysis: estimating how exposures, doses, or covariates shift an average response, and providing the foundation for the whole family of regression models used in epidemiology.
The model
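We assume each outcome is a linear function of predictors plus random noise:
$$y = X\beta + \varepsilon$$
Here $y$ is the vector of $n$ responses, $X$ is the $n \times (p+1)$ design matrix (with a column of ones for the intercept), $\beta$ is the vector of coefficients, and $\varepsilon$ is mean-zero error. For a single observation this reads $y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$.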
Ordinary least squares
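Ordinary least squares (OLS) chooses $\hat\beta$ to minimize the sum of squared residuals:
$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 = \arg\min_{\beta} \lVert y - X\beta \rVert^2$$
where $x_i^\top$ is the $i$-th row of $X$. Setting the gradient to zero gives the normal equations $X^\top X \hat\beta = X^\top y$, whose closed-form solution is
$$\hat\beta = (X^\top X)^{-1} X^\top y$$
This uses standard matrix operations and the inverse of the matrix $X^\top X$, which exists as long as the columns of $X$ are not perfectly collinear (that is, they are linearly independent).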
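The step from the objective to the normal equations can be spelled out as follows. This is the standard derivation; the name $\mathrm{RSS}$ for the sum of squared residuals is introduced here for convenience.

```latex
\begin{aligned}
\mathrm{RSS}(\beta) &= (y - X\beta)^\top (y - X\beta)
  = y^\top y - 2\,\beta^\top X^\top y + \beta^\top X^\top X \beta, \\
\nabla_\beta\, \mathrm{RSS}(\beta) &= -2\,X^\top y + 2\,X^\top X \beta, \\
\nabla_\beta\, \mathrm{RSS}(\hat\beta) = 0
  &\;\Longrightarrow\; X^\top X \hat\beta = X^\top y
  \;\Longrightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y .
\end{aligned}
```

The Hessian $2\,X^\top X$ is positive definite when the columns of $X$ are linearly independent, so the stationary point is the unique minimizer.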
Interpreting coefficients
Each slope $\beta_j$ is the expected change in $y$ per one-unit increase in $x_j$, holding the other predictors fixed. The intercept $\beta_0$ is the expected $y$ when all predictors are zero. The coefficient of determination $R^2$ is the fraction of the outcome’s variance explained by the model, ranging from $0$ to $1$.
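As a minimal sketch of the coefficient of determination, the snippet below computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared`, the four data points, and the fitted line are hypothetical, chosen only for illustration.

```python
def r_squared(y, y_hat):
    """Return 1 - RSS/TSS for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1.0 - rss / tss


# Hypothetical data and a hypothetical fitted line y_hat = 0.5 + 1.4 * x
# (assumed for illustration; not taken from the text).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
y_hat = [0.5 + 1.4 * xi for xi in x]

r2 = r_squared(y, y_hat)
print(f"R^2 = {r2:.2f}")  # prints "R^2 = 0.98"
```

The $0$ to $1$ range is guaranteed for an OLS fit with an intercept, evaluated on the data used for fitting; fitted values from another source can push this quantity below zero.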
Assumptions
OLS is unbiased and efficient when four conditions hold:
- Linearity: the mean of $y$ is truly linear in the predictors.
- Independent errors: the $\varepsilon_i$ are uncorrelated across observations.
- Homoscedasticity: the errors have constant variance $\sigma^2$.
- Normality: the errors are approximately normal (needed mainly for exact small-sample inference).
Residuals are the diagnostic tool: plotting them against fitted values should show a formless band with no trend or fanning.
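The residual check described above is usually done by eye, but a crude numerical stand-in can be sketched as follows. The helper `spread_ratio` is hypothetical and is not a formal test: it compares the average residual size in the upper half of the fitted values with that in the lower half, so values far from 1 hint at fanning. The data are made up, with noise that deliberately grows with the predictor.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def spread_ratio(resid, fitted):
    """Mean |residual| in the upper half of fitted values over the lower half."""
    order = sorted(range(len(fitted)), key=lambda i: fitted[i])
    half = len(order) // 2
    low = sum(abs(resid[i]) for i in order[:half]) / half
    high = sum(abs(resid[i]) for i in order[half:]) / (len(order) - half)
    return high / low


# Hypothetical data: the deviation from the line 2x alternates in sign
# and grows in proportion to x, so the errors are heteroscedastic.
x = [float(i) for i in range(1, 9)]
y = [2.0 * xi + (0.1 * xi if i % 2 else -0.1 * xi) for i, xi in enumerate(x)]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

ratio = spread_ratio(resid, fitted)
print(f"sum of residuals: {sum(resid):.1e}")  # essentially zero with an intercept
print(f"spread ratio: {ratio:.2f}")           # well above 1 here, signalling fanning
```

A ratio near 1 is what a formless band would give; a plot of `resid` against `fitted` remains the more informative diagnostic.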
Inference for coefficients
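The estimated coefficient covariance is $\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}$, where $\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)$, with $\mathrm{RSS}$ the residual sum of squares, $n$ the number of observations, and $p$ the number of predictors excluding the intercept. The square roots of its diagonal are the standard errors $\mathrm{SE}(\hat\beta_j)$. A hypothesis test of $H_0\colon \beta_j = 0$ uses the statistic $t = \hat\beta_j / \mathrm{SE}(\hat\beta_j)$, compared to a $t$ distribution with $n - p - 1$ degrees of freedom. A $95\%$ confidence interval for $\beta_j$ is $\hat\beta_j \pm t_{0.975,\,n-p-1}\,\mathrm{SE}(\hat\beta_j)$.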
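To make these quantities concrete, here is a small sketch for the single-predictor case using only the standard library. The four data points are hypothetical, and the critical value `T_CRIT` is hard-coded from a standard $t$ table for 2 degrees of freedom rather than computed, which is an assumption of this example.

```python
import math

# Hypothetical data (not from the text): four points, one predictor.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n, p = len(x), 1

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx            # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate

# Residual variance estimate: RSS / (n - p - 1)
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
df = n - p - 1
sigma2_hat = rss / df

# Square roots of the diagonal of sigma2_hat * (X'X)^{-1},
# written out for the simple-regression design matrix [1, x].
se_b0 = math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))
se_b1 = math.sqrt(sigma2_hat / sxx)

# t statistic for H0: slope = 0, and a 95% confidence interval.
t_stat = b1 / se_b1
T_CRIT = 4.303  # tabulated 0.975 quantile of t with 2 df (assumed, from a t table)
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

print(f"slope = {b1:.3f}, SE = {se_b1:.3f}, t = {t_stat:.2f}")
print(f"95% CI for slope: ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

With more than one predictor the same standard errors come from the full matrix $\hat\sigma^2 (X^\top X)^{-1}$, and in practice the critical value would come from a statistics library rather than a table.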
Under the normal-error assumption, the OLS coefficient estimates coincide exactly with the maximum likelihood estimates: minimizing squared error is the same as maximizing the Gaussian log-likelihood.
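The equivalence can be read off the Gaussian log-likelihood. The sketch below assumes independent errors $\varepsilon_i \sim N(0, \sigma^2)$ and writes $x_i^\top$ for the $i$-th row of the design matrix.

```latex
\begin{aligned}
\ell(\beta, \sigma^2)
  &= \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{n}{2} \log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 .
\end{aligned}
```

For any fixed $\sigma^2$, only the last term depends on $\beta$, and it is a negative multiple of the sum of squared residuals, so the maximizer over $\beta$ is the OLS solution. The maximum likelihood estimate of the variance is $\mathrm{RSS}/n$, which differs from the unbiased estimate that divides by the residual degrees of freedom.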
Worked example: simple linear regression by hand
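With a single predictor, the estimates have a clean closed form. The slope is the ratio of the covariance of $x$ and $y$ to the variance of $x$:
$$\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
Take the four points $(x_1, y_1), \dots, (x_4, y_4)$. Then $\bar x = \tfrac{1}{4}\sum_i x_i$ and $\bar y = \tfrac{1}{4}\sum_i y_i$. The cross-products give $S_{xy} = \sum_i (x_i - \bar x)(y_i - \bar y)$. The spread gives $S_{xx} = \sum_i (x_i - \bar x)^2$. So $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. The fitted line is $\hat y = \hat\beta_0 + \hat\beta_1 x$.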
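A minimal numeric sketch of these steps follows. The four points are hypothetical, chosen for illustration so that the arithmetic is easy to follow by hand; they are an assumption of this example, not values taken from the text.

```python
# Hypothetical points (x_i, y_i), assumed for illustration only.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 6.0)]
n = len(points)

# Step 1: sample means.
x_bar = sum(x for x, _ in points) / n   # (1 + 2 + 3 + 4) / 4
y_bar = sum(y for _, y in points) / n   # (2 + 3 + 5 + 6) / 4

# Step 2: cross-products of the deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)

# Step 3: spread of x around its mean.
s_xx = sum((x - x_bar) ** 2 for x, _ in points)

# Step 4: slope and intercept.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(f"x_bar = {x_bar:.2f}, y_bar = {y_bar:.2f}")  # x_bar = 2.50, y_bar = 4.00
print(f"S_xy = {s_xy:.2f}, S_xx = {s_xx:.2f}")      # S_xy = 7.00, S_xx = 5.00
print(f"y_hat = {b0:.2f} + {b1:.2f} x")             # y_hat = 0.50 + 1.40 x
```

By hand, the deviations of $x$ are $-1.5, -0.5, 0.5, 1.5$ and those of $y$ are $-2, -1, 1, 2$, so the cross-products sum to $3 + 0.5 + 0.5 + 3 = 7$ and the squared deviations of $x$ sum to $5$.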
In code
R
set.seed(1)
x <- 1:20
y <- 0.5 + 1.1 * x + rnorm(20, sd = 1.5)
fit <- lm(y ~ x)
summary(fit) # coefficients, SEs, t-tests, R^2
confint(fit) # 95% CIs for the coefficients
# closed form via matrix ops
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat # matches coef(fit); estimates of the true values 0.5 and 1.1, up to sampling noise
Python
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(1)
x = np.arange(1, 21)
y = 0.5 + 1.1 * x + rng.normal(0, 1.5, size=20)
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
print(fit.params) # [intercept, slope]; estimates of the true values 0.5 and 1.1, up to sampling noise
print(fit.conf_int()) # 95% CIs
# closed form: beta = (X'X)^-1 X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat) # same as fit.params
# Output:
# [1.1011522 1.04810273]
# [[0.26542075 1.93688366]
#  [0.97833722 1.11786825]]
# [1.1011522 1.04810273]
Julia
using GLM, DataFrames, Random
Random.seed!(1)
x = 1:20
y = 0.5 .+ 1.1 .* x .+ randn(20) .* 1.5
df = DataFrame(x = x, y = y)
fit = lm(@formula(y ~ x), df)
coef(fit) # [intercept, slope]; estimates of the true values 0.5 and 1.1, up to sampling noise
confint(fit) # 95% CIs
# closed form
X = [ones(20) collect(x)]
beta_hat = (X' * X) \ (X' * y) # matches coef(fit)
Why it matters
Linear regression turns messy scatter into an interpretable slope with a standard error, letting analysts quantify associations and adjust for confounders. It is also the conceptual template for logistic regression and the broader class of generalized linear models, which extend the same machinery to binary and count outcomes.