
Statistical Foundations

Survival Analysis

Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.


Why This Matters

Many important questions in ML and beyond are about when something happens, not just whether it happens. When will a customer churn? How long until a machine fails? When will a patient relapse?

The twist: you often do not observe the event for everyone. A clinical trial ends before all patients have relapsed. A customer is still active at the end of your observation window. This is censoring, and it makes standard regression inapplicable. Survival analysis is the framework for handling time-to-event data with censoring, and it appears throughout ML in churn prediction, reliability engineering, and clinical AI.

Mental Model

Imagine tracking 100 lightbulbs, recording when each burns out. After one year, 70 have burned out but 30 are still working. You cannot simply ignore those 30 (that biases your failure time estimate downward) or pretend they failed at day 365 (also biased). You need to use the partial information: those 30 survived at least 365 days.

Survival analysis makes principled use of this partial information. The Kaplan-Meier estimator and the Cox model both handle censoring correctly.

Formal Setup and Notation

Let $T$ be a non-negative random variable representing the true event time. Let $C$ be a non-negative random variable representing the censoring time. We observe $(Y, \Delta)$, where $Y = \min(T, C)$ is the observed time and $\Delta = \mathbf{1}(T \leq C)$ is the event indicator ($1$ if the event was observed, $0$ if censored).
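As a rough illustration of how the observed pair arises, the sketch below simulates event and censoring times and then forms the observed time and event indicator. The exponential and uniform distributions, the seed, and the sample size are assumptions chosen only for the example.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

n = 10
# Hypothetical choices: exponential event times (mean 10), uniform censoring on [0, 15]
T = [random.expovariate(1 / 10) for _ in range(n)]
C = [random.uniform(0, 15) for _ in range(n)]

# What the analyst actually sees: Y = min(T, C) and Delta = 1(T <= C)
Y = [min(t, c) for t, c in zip(T, C)]
Delta = [1 if t <= c else 0 for t, c in zip(T, C)]

for y, d in zip(Y, Delta):
    label = "event" if d == 1 else "censored"
    print(f"{y:6.2f}  {label}")
```

In real data the lists `T` and `C` are never both available; only `Y` and `Delta` are recorded.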

Definition

Survival Function

The survival function is the probability of surviving beyond time $t$:

$$S(t) = P(T > t) = 1 - F(t)$$

where $F(t) = P(T \leq t)$ is the cumulative distribution function of $T$. The survival function is non-increasing with $S(0) = 1$.

Definition

Hazard Function

The hazard function (or hazard rate) is the instantaneous rate of the event at time $t$, given survival up to $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} = \frac{f(t)}{S(t)}$$

where $f(t)$ is the density of $T$. The hazard can increase, decrease, or be constant over time.

Definition

Cumulative Hazard

The cumulative hazard is:

$$H(t) = \int_0^t h(u) \, du = -\log S(t)$$

This connects the hazard to the survival function via $S(t) = \exp(-H(t))$.
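The identities linking the density, hazard, cumulative hazard, and survival function can be checked numerically. The sketch below assumes a Weibull distribution with hypothetical shape and scale values; it is only one convenient choice, picked because its survival function has a closed form.

```python
import math

# Hypothetical Weibull parameters: shape k, scale lam
k, lam = 1.5, 2.0

def S(t):
    """Survival function of the Weibull distribution."""
    return math.exp(-((t / lam) ** k))

def f(t):
    """Density, f(t) = -dS/dt."""
    return (k / lam) * (t / lam) ** (k - 1) * S(t)

def h(t):
    """Hazard, h(t) = f(t) / S(t)."""
    return f(t) / S(t)

def H(t, steps=20_000):
    """Cumulative hazard by midpoint-rule integration of h on [0, t]."""
    dt = t / steps
    return sum(h((i + 0.5) * dt) for i in range(steps)) * dt

t = 3.0
print(H(t), -math.log(S(t)))   # the two values should agree closely
print(math.exp(-H(t)), S(t))   # and so should these
```

With shape `k > 1` the hazard increases over time, with `k < 1` it decreases, and with `k = 1` it is constant (the exponential distribution).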

Core Definitions

Right censoring is the most common type: you know the subject survived at least until time $C$, but you do not know the true event time. This happens when the study ends, the subject drops out, or they experience a competing event.

The key assumption for most survival methods is non-informative censoring: the censoring mechanism is independent of the event process, conditional on covariates. If sicker patients drop out more, censoring is informative and standard methods can be biased.

Definition

Kaplan-Meier Estimator

Let $t_1 < t_2 < \cdots < t_K$ be the distinct observed event times. At each $t_j$, let $d_j$ be the number of events and $n_j$ the number of subjects at risk (alive and uncensored just before $t_j$). The Kaplan-Meier estimator of the survival function is:

$$\hat{S}(t) = \prod_{j: t_j \leq t} \left(1 - \frac{d_j}{n_j}\right)$$

This is a step function that drops at each observed event time.
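The product formula translates almost line for line into code. The following is a minimal sketch, not a library implementation: the function name `kaplan_meier` and the six-subject dataset are hypothetical. It assumes the usual convention that a subject censored at exactly $t_j$ still counts as at risk at $t_j$.

```python
def kaplan_meier(times, events):
    """Return a list of (t_j, n_j, d_j, S_hat(t_j)) at each distinct event time.

    times  : observed times Y_i
    events : indicators Delta_i (1 = event observed, 0 = censored)
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    table = []
    for tj in event_times:
        # At risk: everyone whose observed time is at least t_j
        n_j = sum(1 for t in times if t >= tj)
        # Events occurring exactly at t_j
        d_j = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        surv *= 1.0 - d_j / n_j
        table.append((tj, n_j, d_j, surv))
    return table

# Hypothetical data: six subjects, two of them censored (at times 3 and 8)
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]

km_table = kaplan_meier(times, events)
for tj, n_j, d_j, s in km_table:
    print(f"t={tj}  at risk={n_j}  events={d_j}  S_hat={s:.4f}")
```

Censored subjects never appear as rows of the table; they only shrink `n_j` at later event times.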

Definition

Cox Proportional Hazards Model

The Cox model specifies the hazard for a subject with covariates $x$ as:

$$h(t \mid x) = h_0(t) \exp(\beta^T x)$$

where $h_0(t)$ is an unspecified baseline hazard and $\beta$ are regression coefficients. The model is semiparametric: $h_0(t)$ is left completely unspecified, and inference on $\beta$ uses partial likelihood.

Main Theorems

Theorem

Consistency of the Kaplan-Meier Estimator

Statement

Under non-informative right censoring with $n$ i.i.d. observations, the Kaplan-Meier estimator is uniformly consistent: for any $\tau$ such that $P(Y \geq \tau) > 0$,

$$\sup_{t \leq \tau} |\hat{S}(t) - S(t)| \xrightarrow{a.s.} 0 \text{ as } n \to \infty$$

Furthermore, $\sqrt{n}(\hat{S}(t) - S(t))$ converges weakly to a mean-zero Gaussian process on $[0, \tau]$, and the pointwise variance of $\hat{S}(t)$ is estimated by Greenwood's formula:

$$\widehat{\text{Var}}(\hat{S}(t)) = \hat{S}(t)^2 \sum_{j: t_j \leq t} \frac{d_j}{n_j(n_j - d_j)}$$
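Greenwood's formula is a running sum that can be accumulated alongside the Kaplan-Meier product. The sketch below assumes the at-risk and event counts are already tabulated; the function name `greenwood` and the counts are illustrative.

```python
import math

def greenwood(rows):
    """rows: list of (n_j, d_j) at successive event times, in time order.

    Returns a list of (S_hat, variance estimate) after each event time.
    """
    surv, acc, out = 1.0, 0.0, []
    for n_j, d_j in rows:
        surv *= 1.0 - d_j / n_j
        if n_j == d_j:
            # S_hat has reached 0 and the Greenwood term divides by zero
            out.append((surv, float("nan")))
            break
        acc += d_j / (n_j * (n_j - d_j))
        out.append((surv, surv ** 2 * acc))
    return out

# Illustrative counts: 5 at risk with 1 event, then 3 at risk with 2 events
results = greenwood([(5, 1), (3, 2)])
for s, var in results:
    print(f"S_hat={s:.3f}  var={var:.4f}  se={math.sqrt(var):.3f}")
```

With so few subjects the standard errors are large relative to the estimates, which is the small-sample behaviour one would expect from the formula.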

Intuition

Each factor $(1 - d_j/n_j)$ in the Kaplan-Meier product estimates the conditional probability of surviving past $t_j$ given survival to $t_j$. By the law of large numbers, each factor converges to the true conditional survival probability. The product of consistent factors is consistent.

Proof Sketch

Write $\hat{S}(t) = \prod_{j: t_j \leq t} (1 - d_j/n_j)$ and take logs: $\log \hat{S}(t) = \sum_{j: t_j \leq t} \log(1 - d_j/n_j)$. Each term is approximately $-d_j/n_j$, which estimates the discrete hazard increment, so $\log \hat{S}(t) \approx -\hat{H}(t)$, where $\hat{H}(t) = \sum_{j: t_j \leq t} d_j/n_j$ is the Nelson-Aalen estimator. By a Glivenko-Cantelli argument applied to the Nelson-Aalen estimator, the estimated cumulative hazard converges uniformly on $[0, \tau]$. Because $\hat{S}(t)$ and $\exp(-\hat{H}(t))$ differ by a term that vanishes as the risk sets grow, $\hat{S}(t)$ converges uniformly as well, via the continuous mapping theorem.
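The approximation behind this argument, that the Kaplan-Meier product and the exponentiated Nelson-Aalen sum are close when each $d_j/n_j$ is small, can be seen on a few numbers. The counts below are hypothetical.

```python
import math

# Hypothetical (n_j, d_j) pairs at successive event times
rows = [(50, 2), (45, 1), (40, 3), (30, 2), (20, 1)]

km = 1.0   # Kaplan-Meier product
na = 0.0   # Nelson-Aalen cumulative hazard
comparison = []
for n_j, d_j in rows:
    km *= 1.0 - d_j / n_j
    na += d_j / n_j
    comparison.append((km, math.exp(-na)))

for km_val, fh_val in comparison:
    print(f"KM={km_val:.4f}  exp(-NA)={fh_val:.4f}  gap={fh_val - km_val:.4f}")
```

Since $\log(1 - x) \leq -x$, the exponentiated Nelson-Aalen value is never below the Kaplan-Meier value, and the gap shrinks as the risk sets grow.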

Why It Matters

The Kaplan-Meier estimator is the nonparametric workhorse of survival analysis. This theorem guarantees it works: given enough data, it accurately recovers the true survival curve even when some observations are censored.

Failure Mode

The estimator becomes unreliable in the right tail where few subjects remain at risk ($n_j$ is small). Confidence intervals widen dramatically. The estimator is undefined beyond the largest observed time if that observation is censored.

Theorem

Cox Partial Likelihood

Statement

Under the Cox model $h(t \mid x) = h_0(t) \exp(\beta^T x)$, and assuming no tied event times, the regression coefficients $\beta$ can be estimated by maximizing the partial likelihood:

$$L(\beta) = \prod_{j=1}^{K} \frac{\exp(\beta^T x_{(j)})}{\sum_{i \in \mathcal{R}_j} \exp(\beta^T x_i)}$$

where $x_{(j)}$ is the covariate vector of the subject who failed at $t_j$ and $\mathcal{R}_j = \{i : Y_i \geq t_j\}$ is the risk set at time $t_j$. The maximum partial likelihood estimator $\hat{\beta}$ is consistent and asymptotically normal.

Intuition

At each event time, the partial likelihood asks: given that someone in the risk set failed, what is the probability it was the subject who actually failed? This depends only on $\beta$ (through the relative hazards), not on $h_0(t)$. By conditioning on the event times, we eliminate the nuisance parameter $h_0(t)$.
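To make the risk-set construction concrete, here is a minimal sketch that evaluates the log partial likelihood for a given coefficient vector. It assumes no tied event times; the function name and the four-subject dataset are hypothetical, and a real fit would maximize this function numerically.

```python
import math

def cox_log_partial_likelihood(beta, X, times, events):
    """Log partial likelihood, assuming no tied event times.

    beta   : list of coefficients
    X      : list of covariate vectors x_i (one list per subject)
    times  : observed times Y_i
    events : indicators Delta_i (1 = event, 0 = censored)
    """
    # Linear predictor beta^T x_i for every subject
    eta = [sum(b * x for b, x in zip(beta, xi)) for xi in X]
    loglik = 0.0
    for j, (tj, ej) in enumerate(zip(times, events)):
        if ej != 1:
            continue  # censored subjects contribute only through risk sets
        # Sum of relative hazards over the risk set {i : Y_i >= t_j}
        denom = sum(math.exp(eta[i]) for i, ti in enumerate(times) if ti >= tj)
        loglik += eta[j] - math.log(denom)
    return loglik

# Hypothetical data: one binary covariate, four subjects, one censored
X = [[1.0], [0.0], [1.0], [0.0]]
times = [2, 4, 5, 7]
events = [1, 1, 0, 1]

for b in (-1.0, 0.0, 1.0):
    print(b, cox_log_partial_likelihood([b], X, times, events))
```

Note that no baseline hazard appears anywhere in the function: only the covariates, the ordering of the observed times, and the event indicators enter.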

Proof Sketch

Condition the full likelihood on the observed event times and the number of events at each time. The baseline hazard $h_0(t)$ cancels in the conditional probability because it appears in both numerator and denominator. The resulting expression is Cox's partial likelihood. Consistency and asymptotic normality follow from general theory of estimating equations (the partial likelihood score is an unbiased estimating equation).

Why It Matters

The Cox model is the most widely used regression model in survival analysis because it avoids specifying the baseline hazard. You can estimate covariate effects $\beta$ without making any parametric assumption about how the baseline risk changes over time. This flexibility is a large part of why it is the default choice in clinical trials and is also common in industrial reliability.

Failure Mode

The proportional hazards assumption requires that hazard ratios between groups are constant over time. If treatment A is better early but worse late, the Cox model gives a misleading average effect. Always check this assumption with Schoenfeld residuals or log-log survival plots.
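The idea behind the log-log plot is that under proportional hazards, $\log(-\log S(t \mid x))$ for two groups differs by a constant equal to the log hazard ratio. The sketch below illustrates this with known survival curves rather than estimates; the baseline rate, hazard ratio, and Weibull shapes are all hypothetical. In practice the curves would be replaced by per-group Kaplan-Meier estimates.

```python
import math

def loglog(s):
    """Complementary log-log transform of a survival probability."""
    return math.log(-math.log(s))

beta = math.log(2.0)          # hypothetical log hazard ratio
grid = [0.5, 1.0, 2.0, 4.0]   # time points at which to compare the curves

# Case 1: proportional hazards, S1(t) = S0(t) ** exp(beta)
def S0(t):
    return math.exp(-0.3 * t)

def S1(t):
    return S0(t) ** math.exp(beta)

ph_gaps = [loglog(S1(t)) - loglog(S0(t)) for t in grid]

# Case 2: non-proportional hazards, Weibull curves with different shapes
def W0(t):
    return math.exp(-((0.3 * t) ** 1.0))

def W1(t):
    return math.exp(-((0.3 * t) ** 2.0))

nonph_gaps = [loglog(W1(t)) - loglog(W0(t)) for t in grid]

print("PH gaps:    ", [round(g, 3) for g in ph_gaps])
print("non-PH gaps:", [round(g, 3) for g in nonph_gaps])
```

In the first case the gap stays at `beta` for every time point; in the second it drifts and changes sign, which corresponds to the situation where one group does better early and worse late.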

Canonical Examples

Example

Clinical trial with censoring

A drug trial enrolls 5 patients. Observed times (death or censoring) are: patient 1: death at month 3, patient 2: censored at month 5 (dropped out), patient 3: death at month 7, patient 4: death at month 7, patient 5: censored at month 12 (study ended).

The Kaplan-Meier estimate: at $t = 3$, all 5 patients are at risk and 1 dies, so $\hat{S}(3) = 1 - 1/5 = 0.8$. At $t = 7$, 3 subjects remain at risk (patients 3, 4, 5), and 2 die, so $\hat{S}(7) = 0.8 \times (1 - 2/3) \approx 0.267$. Patient 2's censoring at month 5 reduces the risk set but does not create a step in the survival curve.

Common Confusions

Watch Out

Censored observations are not missing data

Censored observations carry information: the event did not happen before time $C$. Dropping censored observations wastes data and biases the analysis (you would systematically underestimate survival times). The Kaplan-Meier and Cox methods use all observations.
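The size of the bias can be seen on the five-patient trial from the Canonical Examples section. The sketch below compares two naive estimates of survival past month 7 with the Kaplan-Meier estimate; the helper name `empirical_survival` is hypothetical.

```python
# The five-patient example: (observed month, event indicator)
data = [(3, 1), (5, 0), (7, 1), (7, 1), (12, 0)]

def empirical_survival(times, t):
    """Fraction of the given times strictly greater than t (ignores censoring)."""
    return sum(1 for y in times if y > t) / len(times)

t = 7
# Naive 1: drop censored subjects entirely
dropped = empirical_survival([y for y, d in data if d == 1], t)
# Naive 2: pretend every censoring time was an event time
as_events = empirical_survival([y for y, _ in data], t)

# Kaplan-Meier: product over distinct event times up to t
km = 1.0
for tj in sorted({y for y, d in data if d == 1 and y <= t}):
    n_j = sum(1 for y, _ in data if y >= tj)
    d_j = sum(1 for y, d in data if y == tj and d == 1)
    km *= 1.0 - d_j / n_j

print(f"drop censored:      {dropped:.3f}")
print(f"censored as events: {as_events:.3f}")
print(f"Kaplan-Meier:       {km:.3f}")
```

Both naive estimates fall below the Kaplan-Meier value on this dataset, consistent with the downward bias described above.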

Watch Out

The Cox model does not model the survival function directly

The Cox model models the hazard, not the survival function. To get survival predictions, you also need an estimate of the baseline hazard, typically the cumulative baseline hazard $H_0(t) = \int_0^t h_0(u) \, du$ via the Breslow estimator, which gives $S(t \mid x) = \exp(-H_0(t) \exp(\beta^T x))$. The partial likelihood only gives you $\beta$, which tells you how covariates affect relative risk.
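A minimal sketch of the second step: given a coefficient vector that has already been fitted, the Breslow estimator accumulates $d_j / \sum_{i \in \mathcal{R}_j} \exp(\hat{\beta}^T x_i)$ over event times, and survival predictions follow from $S(t \mid x) = \exp(-H_0(t) \exp(\beta^T x))$. The function names, data, and the value of `beta_hat` are hypothetical.

```python
import math

def breslow_cumulative_hazard(beta, X, times, events):
    """Breslow estimate of the cumulative baseline hazard at each event time.

    Returns a list of (t_j, H0_hat(t_j)). beta is taken as given (already fitted).
    """
    risk = [math.exp(sum(b * x for b, x in zip(beta, xi))) for xi in X]
    H0, out = 0.0, []
    for tj in sorted({t for t, e in zip(times, events) if e == 1}):
        d_j = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        denom = sum(r for r, t in zip(risk, times) if t >= tj)  # over the risk set
        H0 += d_j / denom
        out.append((tj, H0))
    return out

def predict_survival(beta, x, H0_t):
    """S(t | x) = exp(-H_0(t) * exp(beta^T x))."""
    return math.exp(-H0_t * math.exp(sum(b * xi for b, xi in zip(beta, x))))

# Hypothetical data and a hypothetical fitted coefficient
X = [[1.0], [0.0], [1.0], [0.0]]
times = [2, 4, 5, 7]
events = [1, 1, 0, 1]
beta_hat = [0.5]

baseline = breslow_cumulative_hazard(beta_hat, X, times, events)
for tj, H0_t in baseline:
    s1 = predict_survival(beta_hat, [1.0], H0_t)
    s0 = predict_survival(beta_hat, [0.0], H0_t)
    print(f"t={tj}  S(t|x=1)={s1:.3f}  S(t|x=0)={s0:.3f}")
```

With all coefficients set to zero the Breslow estimate reduces to the Nelson-Aalen estimator, which is a useful sanity check.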

Summary

  • Censoring means you observe the event for some subjects but not all; ignoring it introduces bias
  • The Kaplan-Meier estimator handles censoring by reducing the risk set at censoring times without counting them as events
  • The hazard $h(t)$ is the instantaneous event rate given survival; it relates to survival via $S(t) = \exp(-\int_0^t h(u) \, du)$
  • The Cox model $h(t \mid x) = h_0(t) \exp(\beta^T x)$ avoids specifying the baseline hazard using partial likelihood
  • Always check the proportional hazards assumption when using the Cox model

Exercises

ExerciseCore

Problem

Five subjects have observed times $Y = (2, 5^+, 6, 8^+, 9)$ where $^+$ denotes censoring. Compute the Kaplan-Meier survival estimate at each observed event time.

ExerciseAdvanced

Problem

In the Cox model, show that the partial likelihood does not depend on the baseline hazard $h_0(t)$. Start from the conditional probability that subject $(j)$ fails at time $t_j$ given that exactly one failure occurs at $t_j$ from the risk set $\mathcal{R}_j$.

ExerciseResearch

Problem

The proportional hazards assumption is restrictive. Describe two situations where it fails and what alternative models you might use instead.

References

Canonical:

  • Cox, Regression Models and Life-Tables (1972), JRSS Series B
  • Kalbfleisch & Prentice, The Statistical Analysis of Failure Time Data (2nd ed., 2002)

Current:

  • Kleinbaum & Klein, Survival Analysis (3rd ed., 2012)

  • Kvamme et al., Time-to-Event Prediction with Neural Networks and Cox Regression (JMLR 2019)

  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Last reviewed: April 2026
