Cox Proportional Hazards Regression
Cox regression relates covariates — treatment, age, viral load — to the time until an event such as death, infection, or clearance. It is the most widely used regression model for censored time-to-event data, and its coefficients translate directly into hazard ratios.
The model
Cox regression is a model for the hazard, the instantaneous event rate among those still at risk. For a subject with covariate vector $x$ it assumes
$$h(t \mid x) = h_0(t)\,\exp(\beta^\top x).$$
Two pieces multiply together:
- a baseline hazard $h_0(t)$ that captures how risk changes over time, shared by everyone;
- a relative risk $\exp(\beta^\top x)$ that scales that baseline up or down according to the covariates.
The model is called semiparametric because the baseline hazard $h_0(t)$ is left completely unspecified (we never write down a formula for it), while the covariate effect has a parametric (log-linear) form. The exponential link keeps the hazard positive for any value of $\beta^\top x$, much as it keeps the odds positive in logistic regression.
Hazard ratios
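The coefficients are interpreted through exponentiation. Compare two subjects differing by one unit in covariate $x_j$ and identical otherwise. The baseline cancels, leaving the hazard ratio
$$\frac{h(t \mid x_j + 1)}{h(t \mid x_j)} = \frac{h_0(t)\,\exp(\beta_j (x_j + 1))}{h_0(t)\,\exp(\beta_j x_j)} = e^{\beta_j},$$
a single number that does not depend on $t$. So $e^{\beta_j}$ is the multiplicative effect of a one-unit increase in $x_j$ on the instantaneous risk:
- $\beta_j > 0$ ($e^{\beta_j} > 1$): higher hazard, shorter survival;
- $\beta_j = 0$ ($e^{\beta_j} = 1$): no effect;
- $\beta_j < 0$ ($e^{\beta_j} < 1$): lower hazard, longer survival.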
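The cancellation of the baseline can be checked numerically. The sketch below is only an illustration: `baseline_hazard` is an arbitrary Weibull-type shape and `beta` is a hypothetical coefficient vector, neither taken from a fitted model. Whatever baseline is plugged in, the ratio of hazards for two subjects who differ by one unit in the first covariate should come out the same at every time point.

```python
import math

def baseline_hazard(t):
    # Hypothetical baseline h0(t): an increasing Weibull-type hazard, for illustration only.
    return 0.5 * math.sqrt(t)

def hazard(t, x, beta):
    # Cox model: h(t | x) = h0(t) * exp(beta . x)
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard(t) * math.exp(linear_predictor)

beta = [0.7, -0.3]      # assumed coefficients, not estimates
x_low = [1.0, 2.0]
x_high = [2.0, 2.0]     # one unit higher in the first covariate only

# Ratio of the two hazards at several time points.
ratios = [hazard(t, x_high, beta) / hazard(t, x_low, beta)
          for t in (0.5, 1.0, 5.0, 20.0)]

print(ratios)               # each entry matches exp(beta[0]) up to rounding
print(math.exp(beta[0]))
```

Swapping in a different `baseline_hazard` changes both hazards but leaves the ratios untouched, which is the sense in which the hazard ratio does not depend on time.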
The proportional-hazards assumption
Because the baseline cancels, the hazard ratio between any two covariate profiles is constant over time; this is the proportional-hazards assumption. Graphically, the hazard for one group is a fixed multiple of the hazard for another at every $t$; their hazard curves never cross.
The assumption can fail — for instance, a surgery with high early risk but a long-term survival benefit has a hazard ratio that starts above 1 and falls below it. Common checks include:
- testing whether scaled Schoenfeld residuals trend with time (a nonzero slope signals a time-varying effect);
- inspecting $\log(-\log \hat{S}(t))$ plots across groups, which should be roughly parallel;
- adding a covariate-by-time interaction and testing whether it is nonzero (a hypothesis test of proportionality).
When the assumption is untenable, remedies include stratifying on the offending variable or fitting explicitly time-varying coefficients.
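The parallel-curves check rests on a simple identity: under proportional hazards with a single covariate, $S(t \mid x) = \exp(-H_0(t)\,e^{\beta x})$, so the $\log(-\log S)$ curves of two groups one unit apart differ by the constant $\beta$. The sketch below illustrates this with an assumed Weibull-type cumulative baseline hazard and a hypothetical `beta`, and contrasts it with a deliberately non-proportional comparison group. It uses exact survival functions rather than estimates from data, so it shows the idea behind the plot rather than the diagnostic itself.

```python
import math

def cum_baseline(t):
    # Assumed cumulative baseline hazard H0(t) = t^1.5 (illustrative only).
    return t ** 1.5

def survival_ph(t, x, beta):
    # Under the Cox model, S(t | x) = exp(-H0(t) * exp(beta * x)).
    return math.exp(-cum_baseline(t) * math.exp(beta * x))

def survival_nonph(t):
    # Comparison group with a different Weibull shape, so hazards are not proportional.
    return math.exp(-(t ** 0.7))

def cloglog(s):
    # Complementary log-log transform, log(-log S).
    return math.log(-math.log(s))

beta = 0.8                      # hypothetical log hazard ratio
times = [0.2, 0.5, 1.0, 2.0]

# Vertical gap between the two groups' log(-log S) curves at each time.
gaps_ph = [cloglog(survival_ph(t, 1, beta)) - cloglog(survival_ph(t, 0, beta))
           for t in times]
gaps_nonph = [cloglog(survival_nonph(t)) - cloglog(survival_ph(t, 0, beta))
              for t in times]

print(gaps_ph)      # constant gap equal to beta: parallel curves
print(gaps_nonph)   # gap drifts with t: curves are not parallel
```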
Estimation by partial likelihood
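How can we estimate $\beta$ without ever specifying $h_0(t)$? Cox’s insight was the partial likelihood. At each observed event time $t_i$, condition on the fact that one event happened among the risk set $R(t_i)$ (everyone still under observation just before $t_i$), and ask which subject it was. Under the model, the probability that the subject who actually failed, with covariates $x_i$, is the one to fail is
$$\frac{h_0(t_i)\,\exp(\beta^\top x_i)}{\sum_{k \in R(t_i)} h_0(t_i)\,\exp(\beta^\top x_k)} = \frac{\exp(\beta^\top x_i)}{\sum_{k \in R(t_i)} \exp(\beta^\top x_k)}.$$
The baseline hazard cancels from every term, so it disappears entirely. Multiplying these contributions over all event times gives the partial likelihood
$$L(\beta) = \prod_{i:\ \text{event}} \frac{\exp(\beta^\top x_i)}{\sum_{k \in R(t_i)} \exp(\beta^\top x_k)},$$
which is maximized over $\beta$ with the usual maximum likelihood machinery, typically Newton-Raphson iterations on $\log L(\beta)$. Only the order of the event times matters, not their spacing, which is precisely why $h_0(t)$ need never be modeled. If one additionally assumes that the baseline hazard is constant, $h_0(t) = \lambda$, the model reduces to exponential regression, which is how it connects back to the exponential distribution.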
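To make the partial likelihood concrete, here is a minimal sketch that evaluates its logarithm for a single covariate and maximizes it by Newton-Raphson. The six-subject data set is made up for illustration, the function name `log_partial_likelihood` is hypothetical, and the code assumes no tied event times. Production libraries add tie handling, multiple covariates, and convergence safeguards.

```python
import math

# Toy data (hypothetical): follow-up time, event indicator (1 = event, 0 = censored), covariate.
times  = [4, 6, 8, 10, 12, 14]
events = [1, 0, 1, 1, 0, 1]
x      = [0, 1, 0, 1, 1, 0]

def log_partial_likelihood(beta):
    """Return (log partial likelihood, gradient, information) at beta; assumes no ties."""
    loglik, grad, info = 0.0, 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through the risk sets
        # Risk set: everyone whose follow-up time is at least t_i.
        risk = [k for k, t_k in enumerate(times) if t_k >= t_i]
        weights = [math.exp(beta * x[k]) for k in risk]
        total = sum(weights)
        mean_x = sum(w * x[k] for w, k in zip(weights, risk)) / total
        mean_x2 = sum(w * x[k] ** 2 for w, k in zip(weights, risk)) / total
        loglik += beta * x[i] - math.log(total)
        grad += x[i] - mean_x              # observed covariate minus risk-set weighted mean
        info += mean_x2 - mean_x ** 2      # risk-set weighted variance
    return loglik, grad, info

# Newton-Raphson on the single coefficient, starting from beta = 0.
beta_hat = 0.0
for _ in range(25):
    _, grad, info = log_partial_likelihood(beta_hat)
    step = grad / info
    beta_hat += step
    if abs(step) < 1e-10:
        break

print(beta_hat, math.exp(beta_hat))  # estimated coefficient and its hazard ratio
```

The event times enter only through the comparison `t_k >= t_i`, so applying any order-preserving transformation to the times would leave the estimate unchanged.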
Worked example: reading a coefficient
Suppose a trial fits a single covariate treatment (0 = placebo, 1 = drug) and reports $\hat\beta \approx -0.42$.
The hazard ratio is $e^{\hat\beta} \approx 0.66$. Patients on the drug face about a 34% lower instantaneous risk of the event at any given time ($1 - 0.66 = 0.34$).
Now suppose a covariate stage has $\hat\beta \approx 0.405$, giving $e^{\hat\beta} \approx 1.5$.
Each one-step increase in disease stage multiplies the hazard by $1.5$, a 50% higher instantaneous risk.
A 95% confidence interval for the hazard ratio that excludes $1$ (equivalently, an interval for $\beta$ that excludes $0$) indicates a statistically significant effect at the 5% level.
Note that a hazard ratio describes the rate of the event, not a difference in mean survival time directly.
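The arithmetic of reading a coefficient can be sketched in a few lines. The helper `hazard_ratio_summary` is hypothetical, the coefficient -0.42 is chosen so that the hazard ratio lands near the 34% reduction of the treatment example, and the standard error of 0.15 is an assumed value used only to show how a Wald interval is built.

```python
import math

def hazard_ratio_summary(beta_hat, se, z=1.96):
    """Hazard ratio and Wald 95% confidence interval from a log-hazard coefficient."""
    hr = math.exp(beta_hat)
    lower = math.exp(beta_hat - z * se)   # exponentiate the interval endpoints for beta
    upper = math.exp(beta_hat + z * se)
    return hr, lower, upper

# Assumed inputs: coefficient near the treatment example, illustrative standard error.
hr, lower, upper = hazard_ratio_summary(-0.42, 0.15)

percent_change = (hr - 1) * 100               # negative means a lower hazard
significant = not (lower <= 1.0 <= upper)     # does the interval exclude HR = 1?

print(round(hr, 2), round(percent_change))
print(round(lower, 2), round(upper, 2), significant)
```

The interval is computed on the coefficient scale and then exponentiated, which is why it is not symmetric around the hazard ratio.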
In code
R
library(survival)
# Built-in ovarian cancer data: futime = time, fustat = event indicator
fit <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian)
summary(fit) # coef, exp(coef) = hazard ratio, and p-values
# Check proportional hazards via scaled Schoenfeld residuals
cox.zph(fit) # a small p-value flags a PH violation
Python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
df = load_rossi() # 'week' = time, 'arrest' = event indicator
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary() # coef and exp(coef) = hazard ratio per covariate
cph.check_assumptions(df) # proportional-hazards diagnostics
Julia
using Survival, StatsModels, DataFrames # StatsModels provides @formula
# EventTime wraps (time, event-occurred?) for each subject
df = DataFrame(time = [4, 6, 8, 10, 12, 14],
status = Bool[1, 0, 1, 1, 0, 1],
x = [0, 1, 0, 1, 1, 0])
df.et = EventTime.(df.time, df.status)
model = coxph(@formula(et ~ x), df)
coef(model) # beta; exp.(coef(model)) gives hazard ratios
Why it matters
Cox regression is the default tool for quantifying how covariates affect survival while accounting for censoring, without committing to a shape for the baseline hazard. Its output — hazard ratios with confidence intervals — is the standard language of clinical and epidemiological reporting, from treatment effects to prognostic factors. Understanding the proportional-hazards assumption and the partial-likelihood machinery is what lets you fit these models responsibly and know when they break.