Linear Regression

Linear regression models a continuous outcome as a straight-line function of one or more predictors. It is the workhorse of quantitative data analysis: estimating how exposures, doses, or covariates shift an average response, and providing the foundation for the whole family of regression models used in epidemiology.

Ordinary least squares fits the line that minimizes the squared residuals.

The model

We assume each outcome yiy_i is a linear function of predictors plus random noise: y=Xβ+ε,εN(0,σ2I).y = X\beta + \varepsilon,\qquad \varepsilon\sim\mathcal N(0,\sigma^2 I). Here yy is the n×1n\times 1 vector of responses, XX is the n×pn\times p design matrix (with a column of ones for the intercept), β\beta is the p×1p\times 1 vector of coefficients, and ε\varepsilon is mean-zero error. For a single observation this reads yi=xiβ+εiy_i = x_i^\top\beta + \varepsilon_i.

Ordinary least squares

Ordinary least squares (OLS) chooses β^\hat\beta to minimize the sum of squared residuals: β^=argminβi(yixiβ)2=argminβyXβ2.\hat\beta = \arg\min_\beta \sum_i \left(y_i - x_i^\top\beta\right)^2 = \arg\min_\beta \lVert y - X\beta\rVert^2. Setting the gradient to zero gives the normal equations XXβ^=XyX^\top X\,\hat\beta = X^\top y, whose closed-form solution is β^=(XX)1Xy.\hat\beta = (X^\top X)^{-1}X^\top y. This uses standard matrix operations and the inverse of the p×pp\times p matrix XXX^\top X, which exists as long as the columns of XX are not collinear.

Interpreting coefficients

Each slope β^j\hat\beta_j is the expected change in yy per one-unit increase in xjx_j, holding the other predictors fixed. The intercept β^0\hat\beta_0 is the expected yy when all predictors are zero. The coefficient of determination R2=1i(yiy^i)2i(yiyˉ)2R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} is the fraction of the outcome’s variance explained by the model, ranging from 00 to 11.

Assumptions

OLS is unbiased and efficient when four conditions hold:

Residuals ri=yiy^ir_i = y_i - \hat y_i are the diagnostic tool: plotting them against fitted values should show a formless band with no trend or fanning.

Inference for coefficients

The estimated coefficient covariance is Var^(β^)=σ^2(XX)1\widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}, where σ^2=1npiri2\hat\sigma^2 = \frac{1}{n-p}\sum_i r_i^2. The square roots of its diagonal are the standard errors SE(β^j)\operatorname{SE}(\hat\beta_j). A hypothesis test of H0:βj=0H_0:\beta_j=0 uses the tt statistic t=β^j/SE(β^j)t = \hat\beta_j / \operatorname{SE}(\hat\beta_j), compared to a tt distribution with npn-p degrees of freedom. A confidence interval for βj\beta_j is β^j±t1α/2,npSE(β^j)\hat\beta_j \pm t_{1-\alpha/2,\,n-p}\,\operatorname{SE}(\hat\beta_j).

Under the normal-error assumption, OLS coincides exactly with maximum likelihood: minimizing squared error is the same as maximizing the Gaussian log-likelihood.

Worked example: simple linear regression by hand

With a single predictor, the estimates have a clean closed form. The slope is the ratio of the covariance of xx and yy to the variance of xx: β^1=Cov(x,y)Var(x)=i(xixˉ)(yiyˉ)i(xixˉ)2,β^0=yˉβ^1xˉ.\hat\beta_1 = \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)} = \frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2},\qquad \hat\beta_0 = \bar y - \hat\beta_1\bar x. Take the four points (1,2),(2,2),(3,4),(4,5)(1,2),(2,2),(3,4),(4,5). Then xˉ=2.5\bar x = 2.5 and yˉ=3.25\bar y = 3.25. The cross-products give (xixˉ)(yiyˉ)=(1.5)(1.25)+(0.5)(1.25)+(0.5)(0.75)+(1.5)(1.75)=5.5\sum (x_i-\bar x)(y_i-\bar y) = (-1.5)(-1.25)+(-0.5)(-1.25)+(0.5)(0.75)+(1.5)(1.75) = 5.5. The spread gives (xixˉ)2=2.25+0.25+0.25+2.25=5\sum (x_i-\bar x)^2 = 2.25+0.25+0.25+2.25 = 5. So β^1=5.5/5=1.1\hat\beta_1 = 5.5/5 = 1.1 and β^0=3.251.1(2.5)=0.5\hat\beta_0 = 3.25 - 1.1(2.5) = 0.5. The fitted line is y^=0.5+1.1x\hat y = 0.5 + 1.1\,x.

In code

R

set.seed(1)
x <- 1:20
y <- 0.5 + 1.1 * x + rnorm(20, sd = 1.5)

fit <- lm(y ~ x)
summary(fit)          # coefficients, SEs, t-tests, R^2
confint(fit)          # 95% CIs for the coefficients

# closed form via matrix ops
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat               # matches coef(fit): ~0.5 and ~1.1

Python

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.arange(1, 21)
y = 0.5 + 1.1 * x + rng.normal(0, 1.5, size=20)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
print(fit.params)          # intercept ~0.5, slope ~1.1
print(fit.conf_int())      # 95% CIs

# closed form: beta = (X'X)^-1 X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)            # same as fit.params
[1.1011522  1.04810273]
[[0.26542075 1.93688366]
 [0.97833722 1.11786825]]
[1.1011522  1.04810273]

Julia

using GLM, DataFrames, Random
Random.seed!(1)
x = 1:20
y = 0.5 .+ 1.1 .* x .+ randn(20) .* 1.5
df = DataFrame(x = x, y = y)

fit = lm(@formula(y ~ x), df)
coef(fit)            # intercept ~0.5, slope ~1.1
confint(fit)         # 95% CIs

# closed form
X = [ones(20) collect(x)]
beta_hat = (X' * X) \ (X' * y)   # matches coef(fit)

Why it matters

Linear regression turns messy scatter into an interpretable slope with a standard error, letting analysts quantify associations and adjust for confounders. It is also the conceptual template for logistic regression and the broader class of generalized linear models, which extend the same machinery to binary and count outcomes.