Survival Analysis
Survival analysis models the time until an event happens: time to death, time to infection, time to clearance of a pathogen, or time to relapse. Its defining feature is censoring — for some subjects the event has not occurred by the end of follow-up, so we know only that their survival time exceeds some value.
Time-to-event data and censoring
Let $T$ be a random time until the event of interest. In most studies we do not observe every $T$ exactly. A subject who is still event-free when the study ends, or who drops out, is right-censored: we know $T$ is larger than the last time we saw them, but not by how much.
Discarding censored subjects biases the results, and so does treating their last-seen time as an event time. The methods below are built to use censored observations correctly: a censored subject contributes information right up to the moment they leave the risk set.
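A small simulation can make the bias concrete. The sketch below is illustrative only: it assumes exponential event times with a true mean of 10 and a fixed end of follow-up at 8 (both values are hypothetical), and compares the two naive estimates of the mean with a censoring-aware estimate that divides total follow-up time by the number of observed events, which is valid under a constant-hazard assumption.

```python
import random

random.seed(0)
TRUE_MEAN = 10.0   # hypothetical true mean survival time
CENSOR_AT = 8.0    # hypothetical end of follow-up
N = 100_000

true_times = [random.expovariate(1 / TRUE_MEAN) for _ in range(N)]
observed = [min(t, CENSOR_AT) for t in true_times]   # last-seen time
event = [t <= CENSOR_AT for t in true_times]         # False = right-censored

# Naive approach 1: drop censored subjects and average the rest.
events_only = [t for t, e in zip(observed, event) if e]
drop_censored = sum(events_only) / len(events_only)

# Naive approach 2: pretend every last-seen time is an event time.
censored_as_event = sum(observed) / N

# Censoring-aware estimate (assumes a constant hazard):
# total time at risk divided by the number of observed events.
censoring_aware = sum(observed) / sum(event)

print(f"drop censored:     {drop_censored:.2f}")
print(f"censored as event: {censored_as_event:.2f}")
print(f"censoring-aware:   {censoring_aware:.2f}")
```

Both naive estimates fall well below the true mean, because the longest survival times are exactly the ones that get censored.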
The core functions
Everything in survival analysis can be written in terms of four interchangeable functions.
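The survival function $S(t)$ is the probability of surviving beyond time $t$:
$$S(t) = P(T > t) = 1 - F(t),$$
where $F(t) = P(T \le t)$ is the cumulative distribution function. It starts at $S(0) = 1$ and decreases toward $0$ as $t \to \infty$.
The probability density $f(t) = F'(t) = -S'(t)$ describes how event times are distributed (see derivatives).
The hazard $h(t)$ is the instantaneous event rate given survival so far:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}.$$
The conditioning on $T \ge t$ is what makes the hazard the natural quantity for time-to-event data: it is the risk faced by those still at risk.
The cumulative hazard $H(t)$ integrates the hazard over time (see integrals):
$$H(t) = \int_0^t h(u)\,du.$$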
The fundamental identity
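These are not four independent objects. Because $h(t) = f(t)/S(t) = -\frac{d}{dt}\log S(t)$, integrating both sides from $0$ to $t$ (and using $S(0) = 1$) gives
$$H(t) = -\log S(t), \qquad S(t) = \exp\{-H(t)\}.$$
Know any one of $S$, $f$, $h$, or $H$ and you know all four.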
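The round trip between the four functions can be checked numerically. The sketch below starts from a made-up hazard that rises linearly in time (the coefficients are arbitrary), integrates it to get the cumulative hazard, exponentiates to get the survival function, differentiates to get the density, and then recovers the hazard as the ratio of density to survival.

```python
import math

def hazard(t):
    # Hypothetical hazard that rises linearly with time.
    return 0.1 + 0.02 * t

def cumulative_hazard(t, steps=10_000):
    # Trapezoidal rule for the integral of the hazard from 0 to t.
    if t == 0:
        return 0.0
    dt = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * dt) for i in range(1, steps))
    return total * dt

def survival(t):
    # Survival is the exponential of minus the cumulative hazard.
    return math.exp(-cumulative_hazard(t))

def density(t, eps=1e-4):
    # Density is minus the derivative of survival (central difference).
    return -(survival(t + eps) - survival(t - eps)) / (2 * eps)

t = 5.0
recovered_hazard = density(t) / survival(t)
print(hazard(t), recovered_hazard)  # the two values should agree closely
```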
Parametric models
The exponential model
The simplest model assumes a constant hazard $h(t) = \lambda$. Then $H(t) = \lambda t$ and, by the identity above,
$$S(t) = e^{-\lambda t},$$
so event times follow the exponential distribution with rate $\lambda$. A constant hazard means the process is memoryless: the risk of the event next month does not depend on how long a subject has already survived.
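Memorylessness is easy to verify directly. In this minimal sketch the rate of 0.2 events per month is a hypothetical value; the check is that the probability of surviving one more month is the same for a new subject as for one who has already survived twelve months.

```python
import math

RATE = 0.2  # hypothetical constant hazard, in events per month

def survival(t, rate=RATE):
    # Exponential survival function.
    return math.exp(-rate * t)

def conditional_survival(extra, already, rate=RATE):
    # Probability of surviving `extra` more months,
    # given survival through `already` months.
    return survival(already + extra, rate) / survival(already, rate)

fresh = survival(1.0)
after_a_year = conditional_survival(1.0, already=12.0)
print(fresh, after_a_year)  # equal up to floating-point rounding
```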
The Weibull generalization
Real hazards often rise or fall over time (mortality climbs with age; post-surgical risk falls as patients recover). The Weibull model relaxes the constant-hazard assumption to a power of time,
$$h(t) = \lambda k\, t^{k-1}, \qquad S(t) = \exp(-\lambda t^{k}),$$
which gives an increasing hazard for shape $k > 1$, a decreasing hazard for $k < 1$, and reduces to the exponential when $k = 1$.
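The effect of the shape parameter can be seen by evaluating the hazard on a few time points. This sketch assumes the parameterization with hazard equal to rate times shape times t^(shape - 1), which is one common convention among several; the rate 0.1 and the three shape values are arbitrary choices for illustration.

```python
import math

def weibull_hazard(t, rate, shape):
    # Assumed parameterization: h(t) = rate * shape * t**(shape - 1).
    return rate * shape * t ** (shape - 1)

def weibull_survival(t, rate, shape):
    # Matching survival function: S(t) = exp(-rate * t**shape).
    return math.exp(-rate * t ** shape)

times = [0.5, 1.0, 2.0, 4.0]
rising = [weibull_hazard(t, 0.1, 1.5) for t in times]   # shape > 1
falling = [weibull_hazard(t, 0.1, 0.5) for t in times]  # shape < 1
flat = [weibull_hazard(t, 0.1, 1.0) for t in times]     # shape = 1

print("shape 1.5:", rising)
print("shape 0.5:", falling)
print("shape 1.0:", flat)
```

With shape 1 the survival function coincides with the exponential model at the same rate.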
The Kaplan–Meier estimator
Often we do not want to assume any parametric form. The Kaplan–Meier (product-limit) estimator is a nonparametric estimate of $S(t)$ that handles censoring automatically.
Order the distinct event times $t_1 < t_2 < \dots < t_k$. At each $t_i$ let $d_i$ be the number of events and $n_i$ the number at risk (still under observation just before $t_i$). The estimate is a product of conditional survival probabilities:
$$\hat S(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right).$$
The result is a right-continuous step function that drops only at observed event times. Censored subjects never cause a drop; they simply reduce the risk set at later times, which is exactly how their partial information enters.
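The product-limit recipe is short enough to write out directly. The following is a minimal sketch, not a replacement for a library implementation: the function name `kaplan_meier` and the five-subject data set are made up for illustration, and a subject censored at an event time is counted as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Return (event time, number at risk, number of events, estimate)
    for each distinct event time, in increasing order."""
    rows = []
    surv = 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n_at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - n_events / n_at_risk   # conditional survival past t
        rows.append((t, n_at_risk, n_events, surv))
    return rows

# Hypothetical data: 1 = event observed, 0 = right-censored.
times = [2, 3, 4, 6, 8]
events = [1, 0, 1, 1, 0]

for t, n, d, s in kaplan_meier(times, events):
    print(f"t={t}  at risk={n}  events={d}  S={s:.3f}")
```

The subject censored at 3 produces no row of its own; it only lowers the number at risk from 5 to 3 by the time of the next event.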
Comparing groups: the log-rank test
To ask whether two groups (say, treated vs. control) have different survival, compare their Kaplan–Meier curves with the log-rank test. At each event time it contrasts the observed number of events in a group with the number expected if the two groups shared a common hazard, then accumulates these into a single chi-squared statistic. It is a hypothesis test of the null that the two survival curves are the same, and it is most powerful when hazards are proportional (the assumption behind Cox regression).
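The observed-versus-expected bookkeeping can be sketched in a few lines. This is an illustrative two-group implementation under the usual hypergeometric variance; the function name `logrank_two_groups` and the small data set are assumptions made for the example, and real analyses should rely on a tested library routine.

```python
import math

def logrank_two_groups(times, events, groups, reference):
    """Two-group log-rank chi-squared statistic and p-value (1 df)."""
    observed = expected = variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = [(g, u, e) for u, e, g in zip(times, events, groups) if u >= t]
        n = len(at_risk)
        n_ref = sum(1 for g, _, _ in at_risk if g == reference)
        d = sum(1 for _, u, e in at_risk if u == t and e)
        d_ref = sum(1 for g, u, e in at_risk
                    if u == t and e and g == reference)
        observed += d_ref
        expected += d * n_ref / n          # events expected under the null
        if n > 1:                          # variance is zero when n == 1
            frac = n_ref / n
            variance += d * frac * (1 - frac) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared(1) upper tail
    return chi2, p_value

# Illustrative data: 1 = event, 0 = right-censored.
times = [3, 5, 5, 7, 9, 12]
events = [1, 1, 1, 0, 1, 0]
groups = ["A", "A", "B", "A", "B", "B"]

chi2, p_value = logrank_two_groups(times, events, groups, reference="A")
print(f"chi-squared = {chi2:.3f}, p = {p_value:.3f}")
```

With only six subjects the chi-squared approximation is rough, so the p-value here is a demonstration of the arithmetic rather than a meaningful inference.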
Worked example: Kaplan–Meier by hand
Six patients are followed (months); a $+$ marks a right-censored time:
$$3,\quad 5,\quad 5,\quad 7^{+},\quad 9,\quad 12^{+}$$
So there are events at 3, 5 (two of them), and 9, with censoring at 7 and 12.
We build the estimate one event time at a time.
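| $t_i$ | at risk $n_i$ | events $d_i$ | $1 - d_i/n_i$ | $\hat S(t_i)$ |
|---|---|---|---|---|
| 3 | 6 | 1 | $5/6 \approx 0.833$ | $0.833$ |
| 5 | 5 | 2 | $3/5 = 0.600$ | $0.833 \times 0.600 = 0.500$ |
| 9 | 2 | 1 | $1/2 = 0.500$ | $0.500 \times 0.500 = 0.250$ |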
Reading off the risk set: at $t = 5$ one patient has already had the event (the one at 3), leaving $n = 5$. By $t = 9$ the events at 3 and 5 (three patients) and the censored patient at 7 have all left, leaving only $n = 2$ at risk. The censored time at 7 produces no drop in $\hat S(t)$; it only shrinks the later risk set.
The resulting curve is
$$\hat S(t) = \begin{cases} 1 & 0 \le t < 3 \\ 0.833 & 3 \le t < 5 \\ 0.500 & 5 \le t < 9 \\ 0.250 & t \ge 9 \end{cases}$$
and because the last observation (at 12) is censored, the estimate stays at $0.250$ rather than falling to zero.
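Because the estimate is a right-continuous step function, reading it at an arbitrary time is a lookup of the most recent event time. The helper below is a small sketch (the name `km_at` is hypothetical) that hard-codes the three steps from the worked example, at event times 3, 5, and 9.

```python
from bisect import bisect_right

event_times = [3, 5, 9]
survival_steps = [5 / 6, 0.5, 0.25]   # estimate just after each event time

def km_at(t):
    # Number of event times at or before t; the drop at an event
    # time applies from that time onward (right-continuity).
    idx = bisect_right(event_times, t)
    return 1.0 if idx == 0 else survival_steps[idx - 1]

for t in [0, 2.9, 3, 4.9, 5, 8, 9, 12]:
    print(t, round(km_at(t), 3))
```

Follow-up ends at 12 months, so values requested beyond that point would be extrapolation rather than something the data support.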
In code
R
library(survival)
# time = follow-up time, status = 1 event, 0 censored
time <- c(3, 5, 5, 7, 9, 12)
status <- c(1, 1, 1, 0, 1, 0)
fit <- survfit(Surv(time, status) ~ 1)
summary(fit)
# survival at t=3 -> 0.833, t=5 -> 0.500, t=9 -> 0.250 (matches the table)
# Two-group comparison with the log-rank test
grp <- c("A","A","B","A","B","B")
survdiff(Surv(time, status) ~ grp) # chi-squared statistic and p-value
Python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
time = np.array([3, 5, 5, 7, 9, 12])
status = np.array([1, 1, 1, 0, 1, 0]) # 1 = event, 0 = censored
kmf = KaplanMeierFitter()
kmf.fit(time, event_observed=status)
print(kmf.survival_function_) # 0.833 at 3, 0.500 at 5, 0.250 at 9
grp = np.array(["A","A","B","A","B","B"])
res = logrank_test(time[grp == "A"], time[grp == "B"],
                   event_observed_A=status[grp == "A"],
                   event_observed_B=status[grp == "B"])
print(res.p_value)
Julia
using Survival
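times = [3, 5, 5, 7, 9, 12]  # named `times` to avoid clashing with Base.time
status = Bool[1, 1, 1, 0, 1, 0] # true = event, false = censored
km = fit(KaplanMeier, times, status)
# km.survival holds the estimate at each recorded time;
# at the event times 3, 5, and 9 it should be 0.833, 0.500, 0.250
println(km.survival)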
Why it matters
Survival analysis is the standard toolkit whenever the outcome is how long until something happens and follow-up is incomplete — clinical trials, infectious-disease natural history, reliability engineering. The survival, hazard, and cumulative-hazard functions are a single object viewed three ways, and the Kaplan–Meier estimator plus the log-rank test give an assumption-light description and comparison of survival curves that respect censoring. These ideas are the foundation for regression models such as Cox proportional hazards.