Survival Analysis

Survival analysis models the time until an event happens: time to death, time to infection, time to clearance of a pathogen, or time to relapse. Its defining feature is censoring — for some subjects the event has not occurred by the end of follow-up, so we know only that their survival time exceeds some value.

Figure: Kaplan–Meier survival curves for two groups with right-censored data.

Time-to-event data and censoring

Let $T \ge 0$ be the random time until the event of interest. In most studies we do not observe every $T$ exactly. A subject who is still event-free when the study ends, or who drops out, is right-censored: we know $T$ is larger than the last time we saw them, but not by how much.

Discarding censored subjects biases the results, and so does treating their last-seen time as an event. The methods below are built to use censored observations correctly: a censored subject contributes information right up to the moment they leave the risk set.
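A small sketch of the two naive shortcuts may make the bias concrete. It reuses the six follow-up times from the worked example later in this article; the function names are hypothetical and the choice of $t = 8$ is only illustrative.

```python
# Follow-up times in months; event=1 means the event was observed,
# event=0 means the subject was right-censored at that time.
times = [3, 5, 5, 7, 9, 12]
events = [1, 1, 1, 0, 1, 0]

def naive_survival_dropping_censored(times, events, t):
    """Fraction surviving past t after discarding every censored subject."""
    kept = [ti for ti, ei in zip(times, events) if ei == 1]
    return sum(ti > t for ti in kept) / len(kept)

def naive_survival_censored_as_events(times, events, t):
    """Fraction surviving past t if every last-seen time is called an event."""
    return sum(ti > t for ti in times) / len(times)

drop = naive_survival_dropping_censored(times, events, 8)
as_event = naive_survival_censored_as_events(times, events, 8)
print(f"drop censored:      S(8) ~ {drop:.3f}")      # 1 of 4 -> 0.250
print(f"censored as events: S(8) ~ {as_event:.3f}")  # 2 of 6 -> 0.333
```

Both shortcuts land below the value of 0.500 that the Kaplan–Meier worked example reports for $5 \le t < 9$, which is the direction of bias one would expect when long, censored follow-up is thrown away or cut short.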

The core functions

Everything in survival analysis can be written in terms of four interchangeable functions.

The survival function is the probability of surviving beyond time $t$:

$$S(t) = \Pr(T > t) = 1 - F(t),$$

where $F$ is the cumulative distribution function. It starts at $S(0)=1$ and decreases toward $0$.

The probability density $f(t) = -S'(t)$ describes how event times are distributed (see derivatives).

The hazard is the instantaneous event rate given survival so far:

$$h(t) = \frac{f(t)}{S(t)} = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t+\Delta t \mid T \ge t)}{\Delta t}.$$

The conditioning on $T \ge t$ is what makes the hazard the natural quantity for time-to-event data: it is the risk faced by those still at risk.

The cumulative hazard integrates the hazard over time (see integrals):

$$H(t) = \int_0^t h(u)\,du.$$

The fundamental identity

These are not four independent objects. Because

$$h(t) = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t),$$

integrating both sides from $0$ to $t$ and using $S(0)=1$ gives

$$H(t) = -\log S(t) \qquad\Longleftrightarrow\qquad S(t) = e^{-H(t)}.$$

Know any one of $S$, $f$, $h$, or $H$ and you know all four.
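The identity can be checked numerically. The sketch below assumes a hypothetical linearly increasing hazard $h(t) = a\,t$ (the slope `a` and all function names are illustrative, not from the text), integrates it to get $H$, and then recovers $S$ and $f$ from $H$ alone.

```python
import math

a = 0.2  # hypothetical slope of a linearly increasing hazard h(t) = a * t

def hazard(t):
    return a * t

def cumulative_hazard(t, steps=1000):
    """Approximate H(t), the integral of h over [0, t], by the midpoint rule."""
    width = t / steps
    return width * sum(hazard((k + 0.5) * width) for k in range(steps))

def survival(t):
    """S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(t))

def density(t):
    """f(t) = h(t) * S(t), which follows from h = f / S."""
    return hazard(t) * survival(t)

t = 4.0
closed_form = math.exp(-a * t**2 / 2)  # S(t) from integrating h analytically
print(survival(t), closed_form)        # the two values should agree closely
```

Starting from the hazard is only one choice; the same round trip works from any of the four functions.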

Parametric models

The exponential model

The simplest model assumes a constant hazard $h(t) = \lambda$. Then $H(t) = \lambda t$ and, by the identity above,

$$S(t) = e^{-\lambda t},$$

so event times follow the exponential distribution with rate $\lambda$. A constant hazard means the process is memoryless: the risk of the event next month does not depend on how long a subject has already survived.
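Memorylessness can be seen directly from $S(t) = e^{-\lambda t}$: the chance of surviving one more month, given survival so far, is $S(s+1)/S(s) = e^{-\lambda}$ whatever $s$ is. A minimal sketch, with a hypothetical rate `lam` chosen only for illustration:

```python
import math

lam = 0.25  # hypothetical constant hazard, in events per month

def survival(t):
    """S(t) = exp(-lam * t) for the exponential model."""
    return math.exp(-lam * t)

def conditional_survival(extra, already):
    """Pr(T > already + extra | T > already) = S(already + extra) / S(already)."""
    return survival(already + extra) / survival(already)

fresh = survival(1.0)                             # new subject, one more month
after_one_year = conditional_survival(1.0, 12.0)  # already survived 12 months
after_five_years = conditional_survival(1.0, 60.0)
print(fresh, after_one_year, after_five_years)    # all close to exp(-lam)
```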

The Weibull generalization

Real hazards often rise or fall over time (mortality climbs with age; post-surgical risk falls as patients recover). The Weibull model relaxes the constant-hazard assumption to a power of time,

$$h(t) = \lambda p\, t^{\,p-1},$$

which gives an increasing hazard for shape $p>1$, a decreasing hazard for $p<1$, and reduces to the exponential when $p=1$.
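The effect of the shape parameter is easy to tabulate. The sketch below evaluates the hazard above for three shapes and also uses $H(t) = \lambda t^{p}$, obtained by integrating that hazard, to get the survival function. The scale `lam` and the evaluation times are hypothetical.

```python
import math

def weibull_hazard(t, lam, p):
    """h(t) = lam * p * t**(p - 1)."""
    return lam * p * t ** (p - 1)

def weibull_survival(t, lam, p):
    """S(t) = exp(-H(t)) with H(t) = lam * t**p."""
    return math.exp(-lam * t ** p)

lam = 0.1  # hypothetical scale parameter
for p in (0.5, 1.0, 2.0):
    hazards = [weibull_hazard(t, lam, p) for t in (1.0, 2.0, 4.0)]
    print(f"p={p}: hazards at t=1,2,4 -> {[round(h, 4) for h in hazards]}")
```

With `p = 1.0` the hazard stays at `lam` for every `t`, which is the exponential model again.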

The Kaplan–Meier estimator

Often we do not want to assume any parametric form. The Kaplan–Meier (product-limit) estimator is a nonparametric estimate of S(t)S(t) that handles censoring automatically.

Order the distinct event times $t_1 < t_2 < \cdots$. At each $t_i$ let $d_i$ be the number of events and $n_i$ the number at risk (still under observation just before $t_i$). The estimate is a product of conditional survival probabilities:

$$\hat S(t) = \prod_{t_i \le t}\left(1 - \frac{d_i}{n_i}\right).$$

The result is a right-continuous step function that drops only at observed event times. Censored subjects never cause a drop; they simply reduce the risk set $n_i$ at later times, which is exactly how their partial information enters.
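The product-limit formula translates almost line for line into code. The following is a minimal from-scratch sketch, not a replacement for a library implementation; the function name `kaplan_meier` is hypothetical, and the data are the six follow-up times used in the worked example later in this article.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_i, n_i, d_i, S_hat(t_i))] for each distinct event time."""
    # d_i: number of observed events at each distinct event time
    deaths = Counter(t for t, e in zip(times, events) if e)
    rows, surv = [], 1.0
    for t in sorted(deaths):
        # n_i: subjects still under observation just before t, i.e. those
        # whose event or censoring time is >= t
        n_at_risk = sum(1 for u in times if u >= t)
        d = deaths[t]
        surv *= 1 - d / n_at_risk  # multiply in the conditional survival
        rows.append((t, n_at_risk, d, surv))
    return rows

times = [3, 5, 5, 7, 9, 12]
events = [1, 1, 1, 0, 1, 0]  # 1 = event, 0 = right-censored
for t, n, d, s in kaplan_meier(times, events):
    print(f"t={t}: at risk={n}, events={d}, S_hat={s:.3f}")
```

Censored times never appear as keys in `deaths`, so they produce no row and no drop; they only stop being counted in `n_at_risk` once `t` passes them.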

Comparing groups: the log-rank test

To ask whether two groups (say, treated vs. control) have different survival, compare their Kaplan–Meier curves with the log-rank test. At each event time it contrasts the observed number of events in a group with the number expected if the two groups shared a common hazard, then accumulates these into a single chi-squared statistic. It is a hypothesis test of the null that the two survival curves are the same, and it is most powerful when hazards are proportional (the assumption behind Cox regression).
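The observed-versus-expected bookkeeping can be sketched for two groups as follows. This assumes the usual hypergeometric variance at each event time and a chi-squared reference distribution with one degree of freedom; the function name and the group labels are hypothetical, and with only six subjects the p-value is purely illustrative.

```python
import math

def logrank_two_groups(times, events, in_group_a):
    """Two-group log-rank test; returns (chi-squared statistic, p-value)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(u >= t for u in times)                    # at risk, both groups
        n_a = sum(u >= t and a for u, a in zip(times, in_group_a))
        d = sum(u == t and e == 1 for u, e in zip(times, events))
        d_a = sum(u == t and e == 1 and a
                  for u, e, a in zip(times, events, in_group_a))
        observed_a += d_a
        expected_a += d * n_a / n     # expected events in A under a common hazard
        if n > 1:                     # variance term is zero when n == 1
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    stat = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-squared, 1 df
    return stat, p_value

times = [3, 5, 5, 7, 9, 12]
events = [1, 1, 1, 0, 1, 0]
grp = ["A", "A", "B", "A", "B", "B"]
stat, p_value = logrank_two_groups(times, events, [g == "A" for g in grp])
print(f"chi-squared = {stat:.3f}, p = {p_value:.3f}")
```

For these data group A has 2 observed events against 1.3 expected, with accumulated variance 0.61, so the statistic is $0.7^2 / 0.61 \approx 0.803$.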

Worked example: Kaplan–Meier by hand

Six patients are followed (months); a + marks a right-censored time:

$$3,\quad 5,\quad 5,\quad 7^{+},\quad 9,\quad 12^{+}.$$

So there are events at 3, 5 (two of them), and 9, with censoring at 7 and 12. We build the estimate one event time at a time.

| $t_i$ | at risk $n_i$ | events $d_i$ | $1-\dfrac{d_i}{n_i}$ | $\hat S(t_i)$ |
|---|---|---|---|---|
| 3 | 6 | 1 | $5/6$ | $0.833$ |
| 5 | 5 | 2 | $3/5$ | $0.833\times0.6=0.500$ |
| 9 | 2 | 1 | $1/2$ | $0.500\times0.5=0.250$ |

Reading off the risk set: at $t=5$ one patient has already had the event (the one at 3), leaving $n_2=5$. By $t=9$ the events at 3 and 5 (three patients) and the censored patient at 7 have all left, leaving only $n_3=2$ at risk. The censored time at 7 produces no drop in $\hat S$; it only shrinks the later risk set.

The resulting curve is

$$\hat S(t) = \begin{cases} 1 & 0 \le t < 3,\\ 0.833 & 3 \le t < 5,\\ 0.500 & 5 \le t < 9,\\ 0.250 & t \ge 9, \end{cases}$$

and because the last observation (at 12) is censored, the estimate stays at $0.250$ rather than falling to zero.

In code

R

library(survival)

# time = follow-up time, status = 1 event, 0 censored
time   <- c(3, 5, 5, 7, 9, 12)
status <- c(1, 1, 1, 0, 1, 0)

fit <- survfit(Surv(time, status) ~ 1)
summary(fit)
# survival at t=3 -> 0.833, t=5 -> 0.500, t=9 -> 0.250 (matches the table)

# Two-group comparison with the log-rank test
grp <- c("A","A","B","A","B","B")
survdiff(Surv(time, status) ~ grp)   # chi-squared statistic and p-value

Python

import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

time   = np.array([3, 5, 5, 7, 9, 12])
status = np.array([1, 1, 1, 0, 1, 0])   # 1 = event, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(time, event_observed=status)
print(kmf.survival_function_)   # 0.833 at 3, 0.500 at 5, 0.250 at 9

grp = np.array(["A","A","B","A","B","B"])
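# durations per group first, then the matching event indicators by keyword
res = logrank_test(time[grp=="A"], time[grp=="B"],
                   event_observed_A=status[grp=="A"],
                   event_observed_B=status[grp=="B"])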
print(res.p_value)

Julia

using Survival

time   = [3, 5, 5, 7, 9, 12]
status = Bool[1, 1, 1, 0, 1, 0]   # true = event, false = censored

km = fit(KaplanMeier, time, status)
# km.survival holds the step estimates, which take the values 0.833, 0.500, 0.250
# (entries for censoring-only times may repeat the previous value)
println(km.survival)

Why it matters

Survival analysis is the standard toolkit whenever the outcome is how long until something happens and follow-up is incomplete: clinical trials, infectious-disease natural history, reliability engineering. The survival, density, hazard, and cumulative-hazard functions are a single object viewed four ways, and the Kaplan–Meier estimator plus the log-rank test give an assumption-light description and comparison of survival curves that respect censoring. These ideas are the foundation for regression models such as Cox proportional hazards.