Statistical Foundations
Survival Analysis
Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.
Why This Matters
Many important questions in ML and beyond are about when something happens, not just whether it happens. When will a customer churn? How long until a machine fails? When will a patient relapse?
The twist: you often do not observe the event for everyone. A clinical trial ends before all patients have relapsed. A customer is still active at the end of your observation window. This is censoring, and it makes standard regression inapplicable. Survival analysis is the framework for handling time-to-event data with censoring, and it appears throughout ML in churn prediction, reliability engineering, and clinical AI.
Mental Model
Imagine tracking 100 lightbulbs, recording when each burns out. After one year, 70 have burned out but 30 are still working. You cannot simply ignore those 30 (that biases your failure time estimate downward) or pretend they failed at day 365 (also biased). You need to use the partial information: those 30 survived at least 365 days.
Survival analysis makes principled use of this partial information. The Kaplan-Meier estimator and the Cox model both handle censoring correctly.
Formal Setup and Notation
Let $T$ be a non-negative random variable representing the true event time. Let $C$ be a non-negative random variable representing the censoring time. We observe $(Y, \delta)$, where $Y = \min(T, C)$ is the observed time and $\delta = \mathbf{1}\{T \le C\}$ is the event indicator ($\delta = 1$ if the event was observed, $\delta = 0$ if censored).
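This observation scheme is easy to simulate. The sketch below (a minimal illustration, with arbitrarily chosen exponential rates) generates event and censoring times and records only $(Y, \delta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# True event times T ~ Exponential(rate 0.5), censoring times C ~ Exponential(rate 0.2).
# These rates are illustrative assumptions, not from the text.
T = rng.exponential(scale=1 / 0.5, size=n)   # mean 2.0
C = rng.exponential(scale=1 / 0.2, size=n)   # mean 5.0

Y = np.minimum(T, C)          # observed time Y = min(T, C)
delta = (T <= C).astype(int)  # 1 = event observed, 0 = censored

print(f"observed events: {delta.sum()} / {n}")
print(f"censoring rate:  {1 - delta.mean():.2f}")
```

For independent exponentials the expected censoring fraction is rate$_C$/(rate$_T$ + rate$_C$) $\approx 0.29$ here, so roughly a third of the subjects carry only the partial information "$T > Y$".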
Survival Function
The survival function is the probability of surviving beyond time $t$:

$$S(t) = P(T > t) = 1 - F(t)$$

where $F$ is the cumulative distribution function of $T$. The survival function is non-increasing with $S(0) = 1$.
Hazard Function
The hazard function (or hazard rate) is the instantaneous rate of the event at time $t$, given survival up to $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$$

where $f$ is the density of $T$. The hazard can increase, decrease, or be constant over time.
Cumulative Hazard
The cumulative hazard is:

$$H(t) = \int_0^t h(u)\, du$$

This connects the hazard to the survival function via $S(t) = \exp(-H(t))$.
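The identity $S(t) = \exp(-H(t))$ can be checked numerically. The sketch below (an illustration using a Weibull hazard with arbitrarily chosen shape and scale) integrates the hazard by the trapezoid rule and compares $\exp(-H(t))$ with the closed-form survival function:

```python
import numpy as np

# Weibull hazard with shape k and scale lam: h(t) = (k/lam) * (t/lam)**(k-1).
# Its cumulative hazard is H(t) = (t/lam)**k, so S(t) = exp(-(t/lam)**k).
k, lam = 1.5, 2.0
t = np.linspace(1e-6, 5.0, 2001)

h = (k / lam) * (t / lam) ** (k - 1)

# Trapezoid-rule approximation of H(t) = integral of h from 0 to t.
H_numeric = np.concatenate([[0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(t))])
S_from_hazard = np.exp(-H_numeric)
S_closed_form = np.exp(-(t / lam) ** k)

print(np.max(np.abs(S_from_hazard - S_closed_form)))  # small discretization error
```

With shape $k > 1$ this hazard is increasing (wear-out); $k < 1$ gives a decreasing hazard and $k = 1$ the constant-hazard exponential case.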
Core Definitions
Right censoring is the most common type: you know the subject survived at least until the observed time $Y$, but you do not know the true event time $T$. This happens when the study ends, the subject drops out, or they experience a competing event.
The key assumption for most survival methods is non-informative censoring: the censoring mechanism is independent of the event process, conditional on covariates. If sicker patients drop out more, censoring is informative and standard methods can be biased.
Kaplan-Meier Estimator
Let $t_1 < t_2 < \cdots < t_k$ be the distinct observed event times. At each $t_j$, let $d_j$ be the number of events and $n_j$ the number of subjects at risk (alive and uncensored just before $t_j$). The Kaplan-Meier estimator of the survival function is:

$$\hat{S}(t) = \prod_{j : t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$$
This is a step function that drops at each observed event time.
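The product formula translates directly into code. A minimal sketch (plain numpy, using the convention that subjects censored at $t_j$ are still in the risk set at $t_j$; the toy data are invented for illustration):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate. Returns (event_times, survival_probs).

    times:  observed times Y_i
    events: 1 if the event was observed, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)

    event_times = np.unique(times[events == 1])     # distinct event times t_j
    surv, s = [], 1.0
    for tj in event_times:
        n_j = np.sum(times >= tj)                   # at risk just before t_j
        d_j = np.sum((times == tj) & (events == 1)) # events at t_j
        s *= 1.0 - d_j / n_j
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: events at 2, 4, 6; censored at 4 and 8.
t, s = kaplan_meier([2, 4, 4, 6, 8], [1, 1, 0, 1, 0])
print({float(a): round(float(b), 3) for a, b in zip(t, s)})
# -> {2.0: 0.8, 4.0: 0.6, 6.0: 0.3}
```

Note how the censored observation at time 4 shrinks the risk set for the event at time 6 (only 2 subjects remain) without itself producing a drop.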
Cox Proportional Hazards Model
The Cox model specifies the hazard for a subject with covariates $x$ as:

$$h(t \mid x) = h_0(t) \exp(\beta^\top x)$$

where $h_0(t)$ is an unspecified baseline hazard and $\beta$ are regression coefficients. The model is semiparametric: $h_0(t)$ is left completely unspecified, and inference on $\beta$ uses the partial likelihood.
Main Theorems
Consistency of the Kaplan-Meier Estimator
Statement
Under non-informative right censoring with i.i.d. observations, the Kaplan-Meier estimator is uniformly consistent: for any $\tau$ such that $P(Y > \tau) > 0$,

$$\sup_{t \in [0, \tau]} \left| \hat{S}(t) - S(t) \right| \xrightarrow{\text{a.s.}} 0.$$

Furthermore, $\sqrt{n}(\hat{S}(t) - S(t))$ converges to a Gaussian process, and the pointwise variance is given by Greenwood's formula:

$$\widehat{\mathrm{Var}}\big(\hat{S}(t)\big) = \hat{S}(t)^2 \sum_{j : t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$$
Intuition
Each factor $1 - d_j/n_j$ in the Kaplan-Meier product estimates the conditional probability of surviving past $t_j$ given survival up to just before $t_j$. By the law of large numbers, each factor converges to the true conditional survival probability. The product of consistent factors is consistent.
Proof Sketch
Write $\hat{S}(t) = \prod_{j : t_j \le t} (1 - d_j/n_j)$ and take logs: $\log \hat{S}(t) = \sum_{j : t_j \le t} \log(1 - d_j/n_j)$. Each term is approximately $-d_j/n_j$, which estimates the discrete hazard increment. By a Glivenko-Cantelli argument applied to the Nelson-Aalen estimator $\hat{H}(t) = \sum_{j : t_j \le t} d_j/n_j$, the cumulative hazard estimate converges uniformly. Then $\hat{S}(t) \approx \exp(-\hat{H}(t))$ converges uniformly via the continuous mapping theorem.
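The approximation $\log(1 - x) \approx -x$ in the sketch says that the Kaplan-Meier estimate and $\exp(-\hat{H}(t))$ with the Nelson-Aalen estimator should nearly coincide while the risk sets $n_j$ are large. A quick numeric check (simulated exponential data; the rates are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
T = rng.exponential(2.0, n)   # true event times
C = rng.exponential(4.0, n)   # censoring times
Y, delta = np.minimum(T, C), (T <= C).astype(int)

event_times = np.unique(Y[delta == 1])
km, na = 1.0, 0.0
for tj in event_times[:200]:  # early event times, where n_j is still large
    n_j = np.sum(Y >= tj)
    d_j = np.sum((Y == tj) & (delta == 1))
    km *= 1.0 - d_j / n_j     # Kaplan-Meier factor
    na += d_j / n_j           # Nelson-Aalen increment

print(km, np.exp(-na))        # nearly identical while risk sets are large
```

The two estimates diverge only in the right tail, where $d_j/n_j$ is no longer small and the first-order log approximation degrades.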
Why It Matters
The Kaplan-Meier estimator is the nonparametric workhorse of survival analysis. This theorem guarantees it works: given enough data, it accurately recovers the true survival curve even when some observations are censored.
Failure Mode
The estimator becomes unreliable in the right tail where few subjects remain at risk ($n_j$ is small). Confidence intervals widen dramatically. The estimator is undefined beyond the largest observed time if that observation is censored.
Cox Partial Likelihood
Statement
Under the Cox model $h(t \mid x) = h_0(t) \exp(\beta^\top x)$, the regression coefficients can be estimated by maximizing the partial likelihood:

$$L(\beta) = \prod_{j=1}^{k} \frac{\exp(\beta^\top x_{(j)})}{\sum_{i \in R(t_j)} \exp(\beta^\top x_i)}$$

where $x_{(j)}$ is the covariate vector of the subject who failed at $t_j$ and $R(t_j)$ is the risk set at time $t_j$. The maximum partial likelihood estimator $\hat{\beta}$ is consistent and asymptotically normal.
Intuition
At each event time, the partial likelihood asks: given that someone in the risk set failed, what is the probability it was the subject who actually failed? This depends only on $\beta$ (through the relative hazards), not on $h_0(t)$. By conditioning on the event times, we eliminate the nuisance parameter $h_0$.
Proof Sketch
Condition the full likelihood on the observed event times and the number of events at each time. The baseline hazard $h_0(t_j)$ cancels in the conditional probability because it appears in both numerator and denominator. The resulting expression is Cox's partial likelihood. Consistency and asymptotic normality follow from the general theory of estimating equations (the partial likelihood score is an unbiased estimating equation).
Why It Matters
The Cox model is the most widely used regression model in survival analysis because it avoids specifying the baseline hazard. You can estimate covariate effects without making any parametric assumption about how the baseline risk changes over time. This flexibility is why it dominates clinical trials and industrial reliability.
Failure Mode
The proportional hazards assumption requires that hazard ratios between groups are constant over time. If treatment A is better early but worse late, the Cox model gives a misleading average effect. Always check this assumption with Schoenfeld residuals or log-log survival plots.
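The log-log diagnostic rests on a simple identity: under proportional hazards with hazard ratio $\mathrm{HR}$, $S_1(t) = S_0(t)^{\mathrm{HR}}$, so $\log(-\log S_1(t)) - \log(-\log S_0(t)) = \log \mathrm{HR}$ is constant in $t$. A quick numeric illustration (Weibull baseline with arbitrarily chosen parameters and $\mathrm{HR} = 2$, all invented for this example):

```python
import numpy as np

# Under proportional hazards, S1(t) = S0(t)**HR, hence
# log(-log S1(t)) - log(-log S0(t)) = log(HR), constant in t.
t = np.linspace(0.1, 5.0, 50)
S0 = np.exp(-(t / 2.0) ** 1.5)   # baseline group (Weibull survival)
S1 = S0 ** 2.0                   # proportional hazards with HR = 2

gap = np.log(-np.log(S1)) - np.log(-np.log(S0))
print(gap.min(), gap.max())      # both approximately log(2) ~ 0.693
```

With real data one plots the estimated $\log(-\log \hat{S})$ curves per group: roughly parallel curves support proportional hazards, while crossing or converging curves indicate a time-varying effect.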
Canonical Examples
Clinical trial with censoring
A drug trial enrolls 5 patients. Event times (death or end of study) are: patient 1: death at month 3, patient 2: censored at month 5 (dropped out), patient 3: death at month 7, patient 4: death at month 7, patient 5: censored at month 12 (study ended).
The Kaplan-Meier estimate: at $t = 3$, $\hat{S}(3) = 1 - 1/5 = 0.8$. At $t = 7$, 3 subjects remain at risk (patients 3, 4, 5), and 2 die, so $\hat{S}(7) = 0.8 \times (1 - 2/3) \approx 0.267$. Patient 2's censoring at month 5 reduces the risk set but does not create a step in the survival curve.
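The arithmetic in this example can be checked with a few lines of plain Python:

```python
# Hand-check of the worked example: deaths at months 3, 7, 7;
# censored at months 5 and 12.
times = [3, 5, 7, 7, 12]
events = [1, 0, 1, 1, 0]

s = 1.0
for tj in sorted({t for t, e in zip(times, events) if e == 1}):
    n_j = sum(1 for y in times if y >= tj)                        # at risk
    d_j = sum(1 for y, e in zip(times, events) if y == tj and e == 1)
    s *= 1 - d_j / n_j
    print(f"month {tj}: n_j={n_j}, d_j={d_j}, S_hat={s:.3f}")
# -> month 3: n_j=5, d_j=1, S_hat=0.800
# -> month 7: n_j=3, d_j=2, S_hat=0.267
```

Patient 2 never contributes a $d_j$, but their exit between months 3 and 7 is exactly why $n_j$ drops from 5 to 3.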
Common Confusions
Censored observations are not missing data
Censored observations carry information: the event did not happen before the observed time $Y$. Dropping censored observations wastes data and biases the analysis (you would systematically underestimate survival times). The Kaplan-Meier and Cox methods use all observations.
The Cox model does not model the survival function directly
The Cox model models the hazard, not the survival function. To get survival predictions, you need to estimate $h_0(t)$ separately (e.g., via the Breslow estimator). The partial likelihood only gives you $\hat{\beta}$, which tells you how covariates affect relative risk.
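The Breslow step from $\hat{\beta}$ to survival predictions is short: the baseline cumulative hazard is estimated as $\hat{H}_0(t) = \sum_{t_j \le t} d_j / \sum_{i \in R(t_j)} \exp(\hat{\beta}^\top x_i)$, and then $\hat{S}(t \mid x) = \exp(-\hat{H}_0(t))^{\exp(\hat{\beta}^\top x)}$. A minimal sketch (plain numpy; the toy data and the coefficient value are illustrative assumptions, not fitted output):

```python
import numpy as np

def breslow_cumhaz(beta, times, events, X, t):
    """Breslow estimate of the baseline cumulative hazard H0(t):
    sum over event times t_j <= t of d_j / sum_{i in R(t_j)} exp(beta^T x_i)."""
    eta = X @ np.atleast_1d(beta)
    H0 = 0.0
    for tj in np.unique(times[events == 1]):
        if tj > t:
            break
        d_j = np.sum((times == tj) & (events == 1))
        denom = np.sum(np.exp(eta[times >= tj]))   # risk set at t_j
        H0 += d_j / denom
    return H0

times = np.array([3.0, 5.0, 7.0, 9.0, 12.0])
events = np.array([1, 0, 1, 1, 0])
X = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])
beta = np.array([1.0])                 # stand-in for a fitted coefficient

H0 = breslow_cumhaz(beta, times, events, X, t=10.0)
S0 = np.exp(-H0)                       # baseline survival at t = 10
S_x = S0 ** np.exp(1.0)                # survival for a subject with x = 1
print(S0, S_x)
```

As a sanity check, setting $\beta = 0$ reduces the estimator to the Nelson-Aalen cumulative hazard, since every subject then has relative hazard 1.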
Summary
- Censoring means you observe the event for some subjects but not all; ignoring it introduces bias
- The Kaplan-Meier estimator handles censoring by reducing the risk set at censoring times without counting them as events
- The hazard is the instantaneous event rate given survival; it relates to survival via $S(t) = \exp(-H(t))$
- The Cox model avoids specifying the baseline hazard using partial likelihood
- Always check the proportional hazards assumption when using the Cox model
Exercises
Problem
Five subjects have observed times $y_1, \dots, y_5$, some of them censored. Compute the Kaplan-Meier survival estimate at each observed event time.
Problem
In the Cox model, show that the partial likelihood does not depend on the baseline hazard $h_0(t)$. Start from the conditional probability that subject $i$ fails at time $t_j$, given that exactly one failure occurs at $t_j$ from the risk set $R(t_j)$.
Problem
The proportional hazards assumption is restrictive. Describe two situations where it fails and what alternative models you might use instead.
References
Canonical:
- Cox, Regression Models and Life-Tables (1972), JRSS Series B
- Kalbfleisch & Prentice, The Statistical Analysis of Failure Time Data (2nd ed., 2002)
Current:
- Kleinbaum & Klein, Survival Analysis (3rd ed., 2012)
- Kvamme et al., Time-to-Event Prediction with Neural Networks and Cox Regression (JMLR 2019)
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
Natural continuations from survival analysis:
- Hypothesis testing for ML: log-rank test and other comparisons of survival curves
- Causal inference basics: treatment effects in observational survival studies
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)