Statistical Foundations
Survival Analysis
Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.
Why This Matters
Many important questions in ML and beyond are about when something happens, not just whether it happens. When will a customer churn? How long until a machine fails? When will a patient relapse?
The twist: you often do not observe the event for everyone. A clinical trial ends before all patients have relapsed. A customer is still active at the end of your observation window. This is censoring, and it makes standard regression inapplicable. Survival analysis is the framework for handling time-to-event data with censoring, and it appears throughout ML in churn prediction, reliability engineering, and clinical AI.
Mental Model
Imagine tracking 100 lightbulbs, recording when each burns out. After one year, 70 have burned out but 30 are still working. You cannot simply ignore those 30 (that biases your failure time estimate downward) or pretend they failed at day 365 (also biased). You need to use the partial information: those 30 survived at least 365 days.
Survival analysis makes principled use of this partial information. The Kaplan-Meier estimator and the Cox model both handle censoring correctly.
Formal Setup and Notation
Let $T$ be a non-negative random variable representing the true event time. Let $C$ be a non-negative random variable representing the censoring time. We observe $(Y, \delta)$, where $Y = \min(T, C)$ is the observed time and $\delta = \mathbf{1}\{T \le C\}$ is the event indicator ($\delta = 1$ if the event was observed, $\delta = 0$ if censored).
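This observation scheme is easy to simulate. The sketch below (a minimal illustration, with arbitrarily chosen exponential rates) generates event and censoring times and records only $(Y, \delta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# True event times T ~ Exponential(rate 0.5), censoring times C ~ Exponential(rate 0.2).
# These rates are illustrative assumptions, not from the text.
T = rng.exponential(scale=1 / 0.5, size=n)   # mean 2.0
C = rng.exponential(scale=1 / 0.2, size=n)   # mean 5.0

Y = np.minimum(T, C)          # observed time Y = min(T, C)
delta = (T <= C).astype(int)  # 1 = event observed, 0 = censored

print(f"observed events: {delta.sum()} / {n}")
print(f"censoring rate:  {1 - delta.mean():.2f}")
```

For independent exponentials the expected censoring fraction is rate$_C$/(rate$_T$ + rate$_C$) $\approx 0.29$ here, so roughly a third of the subjects carry only the partial information "$T > Y$".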
Survival Function
The survival function is the probability of surviving beyond time $t$:

$$S(t) = P(T > t) = 1 - F(t)$$

where $F$ is the cumulative distribution function of $T$. The survival function is non-increasing with $S(0) = 1$.
Hazard Function
The hazard function (or hazard rate) is the instantaneous rate of the event at time $t$, given survival up to $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$$

where $f$ is the density of $T$. The hazard can increase, decrease, or be constant over time.
Cumulative Hazard
The cumulative hazard is:

$$H(t) = \int_0^t h(u)\, du$$

This connects the hazard to the survival function via $S(t) = \exp(-H(t))$.
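The identity $S(t) = \exp(-H(t))$ can be checked numerically. The sketch below (an illustration using a Weibull hazard with arbitrarily chosen shape and scale) integrates the hazard by the trapezoid rule and compares $\exp(-H(t))$ with the closed-form survival function:

```python
import numpy as np

# Weibull hazard with shape k and scale lam: h(t) = (k/lam) * (t/lam)**(k-1).
# Its cumulative hazard is H(t) = (t/lam)**k, so S(t) = exp(-(t/lam)**k).
k, lam = 1.5, 2.0
t = np.linspace(1e-6, 5.0, 2001)

h = (k / lam) * (t / lam) ** (k - 1)

# Trapezoid-rule approximation of H(t) = integral of h from 0 to t.
H_numeric = np.concatenate([[0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(t))])
S_from_hazard = np.exp(-H_numeric)
S_closed_form = np.exp(-(t / lam) ** k)

print(np.max(np.abs(S_from_hazard - S_closed_form)))  # small discretization error
```

With shape $k > 1$ this hazard is increasing (wear-out); $k < 1$ gives a decreasing hazard and $k = 1$ the constant-hazard exponential case.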
Core Definitions
Right censoring is the most common type: you know the subject survived at least until the observed time $Y$, but you do not know the true event time $T$. This happens when the study ends, the subject drops out, or they experience a competing event.
The key assumption for most survival methods is non-informative censoring: the censoring mechanism is independent of the event process, conditional on covariates. If sicker patients drop out more, censoring is informative and standard methods can be biased.
Kaplan-Meier Estimator
Let $t_1 < t_2 < \cdots < t_k$ be the distinct observed event times. At each $t_j$, let $d_j$ be the number of events and $n_j$ the number of subjects at risk (alive and uncensored just before $t_j$). The Kaplan-Meier estimator of the survival function is:

$$\hat{S}(t) = \prod_{j : t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$$
This is a step function that drops at each observed event time.
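The product formula translates directly into code. A minimal sketch (plain numpy, using the convention that subjects censored at $t_j$ are still in the risk set at $t_j$; the toy data are invented for illustration):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate. Returns (event_times, survival_probs).

    times:  observed times Y_i
    events: 1 if the event was observed, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)

    event_times = np.unique(times[events == 1])     # distinct event times t_j
    surv, s = [], 1.0
    for tj in event_times:
        n_j = np.sum(times >= tj)                   # at risk just before t_j
        d_j = np.sum((times == tj) & (events == 1)) # events at t_j
        s *= 1.0 - d_j / n_j
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: events at 2, 4, 6; censored at 4 and 8.
t, s = kaplan_meier([2, 4, 4, 6, 8], [1, 1, 0, 1, 0])
print({float(a): round(float(b), 3) for a, b in zip(t, s)})
# -> {2.0: 0.8, 4.0: 0.6, 6.0: 0.3}
```

Note how the censored observation at time 4 shrinks the risk set for the event at time 6 (only 2 subjects remain) without itself producing a drop.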
Cox Proportional Hazards Model
The Cox model specifies the hazard for a subject with covariates $x$ as:

$$h(t \mid x) = h_0(t) \exp(\beta^\top x)$$

where $h_0(t)$ is an unspecified baseline hazard and $\beta$ are regression coefficients. The model is semiparametric: $h_0(t)$ is left completely unspecified, and inference on $\beta$ uses the partial likelihood.
Main Theorems
Consistency of the Kaplan-Meier Estimator
Statement
Under non-informative right censoring with i.i.d. observations, the Kaplan-Meier estimator is uniformly consistent: for any $\tau$ such that $P(Y > \tau) > 0$,

$$\sup_{t \in [0, \tau]} \left| \hat{S}(t) - S(t) \right| \xrightarrow{\text{a.s.}} 0.$$

Furthermore, $\sqrt{n}(\hat{S}(t) - S(t))$ converges to a Gaussian process, and the pointwise variance is given by Greenwood's formula:

$$\widehat{\mathrm{Var}}\big(\hat{S}(t)\big) = \hat{S}(t)^2 \sum_{j : t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$$
Intuition
Each factor $1 - d_j/n_j$ in the Kaplan-Meier product estimates the conditional probability of surviving past $t_j$ given survival up to just before $t_j$. By the law of large numbers, each factor converges to the true conditional survival probability. The product of consistent factors is consistent.
Proof Sketch
Write $\hat{S}(t) = \prod_{j : t_j \le t} (1 - d_j/n_j)$ and take logs: $\log \hat{S}(t) = \sum_{j : t_j \le t} \log(1 - d_j/n_j)$. Each term is approximately $-d_j/n_j$, which estimates the discrete hazard increment. By a Glivenko-Cantelli argument applied to the Nelson-Aalen estimator $\hat{H}(t) = \sum_{j : t_j \le t} d_j/n_j$, the cumulative hazard estimate converges uniformly. Then $\hat{S}(t) \approx \exp(-\hat{H}(t))$ converges uniformly via the continuous mapping theorem.
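The approximation $\log(1 - x) \approx -x$ in the sketch says that the Kaplan-Meier estimate and $\exp(-\hat{H}(t))$ with the Nelson-Aalen estimator should nearly coincide while the risk sets $n_j$ are large. A quick numeric check (simulated exponential data; the rates are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
T = rng.exponential(2.0, n)   # true event times
C = rng.exponential(4.0, n)   # censoring times
Y, delta = np.minimum(T, C), (T <= C).astype(int)

event_times = np.unique(Y[delta == 1])
km, na = 1.0, 0.0
for tj in event_times[:200]:  # early event times, where n_j is still large
    n_j = np.sum(Y >= tj)
    d_j = np.sum((Y == tj) & (delta == 1))
    km *= 1.0 - d_j / n_j     # Kaplan-Meier factor
    na += d_j / n_j           # Nelson-Aalen increment

print(km, np.exp(-na))        # nearly identical while risk sets are large
```

The two estimates diverge only in the right tail, where $d_j/n_j$ is no longer small and the first-order log approximation degrades.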
Why It Matters
The Kaplan-Meier estimator is the nonparametric workhorse of survival analysis. This theorem guarantees it works: given enough data, it accurately recovers the true survival curve even when some observations are censored.
Failure Mode
The estimator becomes unreliable in the right tail where few subjects remain at risk ($n_j$ is small). Confidence intervals widen dramatically. The estimator is undefined beyond the largest observed time if that observation is censored.
Cox Partial Likelihood
Statement
Under the Cox model $h(t \mid x) = h_0(t) \exp(\beta^\top x)$, the regression coefficients can be estimated by maximizing the partial likelihood:

$$L(\beta) = \prod_{j=1}^{k} \frac{\exp(\beta^\top x_{(j)})}{\sum_{i \in R(t_j)} \exp(\beta^\top x_i)}$$

where $x_{(j)}$ is the covariate vector of the subject who failed at $t_j$ and $R(t_j)$ is the risk set at time $t_j$. The maximum partial likelihood estimator $\hat{\beta}$ is consistent and asymptotically normal.
Intuition
At each event time, the partial likelihood asks: given that someone in the risk set failed, what is the probability it was the subject who actually failed? This depends only on $\beta$ (through the relative hazards), not on $h_0(t)$. By conditioning on the event times, we eliminate the nuisance parameter $h_0$.
Proof Sketch
Condition the full likelihood on the observed event times and the number of events at each time. The baseline hazard $h_0(t_j)$ cancels in the conditional probability because it appears in both numerator and denominator. The resulting expression is Cox's partial likelihood. Consistency and asymptotic normality follow from the general theory of estimating equations (the partial likelihood score is an unbiased estimating equation).
Why It Matters
The Cox model is the most widely used regression model in survival analysis because it avoids specifying the baseline hazard. You can estimate covariate effects without making any parametric assumption about how the baseline risk changes over time. This flexibility is why it dominates clinical trials and industrial reliability.
Failure Mode
The proportional hazards assumption requires that hazard ratios between groups are constant over time. If treatment A is better early but worse late, the Cox model gives a misleading average effect. Always check this assumption with Schoenfeld residuals or log-log survival plots.
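The log-log diagnostic rests on a simple identity: under proportional hazards with hazard ratio $\mathrm{HR}$, $S_1(t) = S_0(t)^{\mathrm{HR}}$, so $\log(-\log S_1(t)) - \log(-\log S_0(t)) = \log \mathrm{HR}$ is constant in $t$. A quick numeric illustration (Weibull baseline with arbitrarily chosen parameters and $\mathrm{HR} = 2$, all invented for this example):

```python
import numpy as np

# Under proportional hazards, S1(t) = S0(t)**HR, hence
# log(-log S1(t)) - log(-log S0(t)) = log(HR), constant in t.
t = np.linspace(0.1, 5.0, 50)
S0 = np.exp(-(t / 2.0) ** 1.5)   # baseline group (Weibull survival)
S1 = S0 ** 2.0                   # proportional hazards with HR = 2

gap = np.log(-np.log(S1)) - np.log(-np.log(S0))
print(gap.min(), gap.max())      # both approximately log(2) ~ 0.693
```

With real data one plots the estimated $\log(-\log \hat{S})$ curves per group: roughly parallel curves support proportional hazards, while crossing or converging curves indicate a time-varying effect.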
Canonical Examples
Clinical trial with censoring
A drug trial enrolls 5 patients. Event times (death or end of study) are: patient 1: death at month 3, patient 2: censored at month 5 (dropped out), patient 3: death at month 7, patient 4: death at month 7, patient 5: censored at month 12 (study ended).
The Kaplan-Meier estimate: at $t = 3$, $\hat{S}(3) = 1 - 1/5 = 0.8$. At $t = 7$, 3 subjects remain at risk (patients 3, 4, 5), and 2 die, so $\hat{S}(7) = 0.8 \times (1 - 2/3) \approx 0.267$. Patient 2's censoring at month 5 reduces the risk set but does not create a step in the survival curve.
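The arithmetic in this example can be checked with a few lines of plain Python:

```python
# Hand-check of the worked example: deaths at months 3, 7, 7;
# censored at months 5 and 12.
times = [3, 5, 7, 7, 12]
events = [1, 0, 1, 1, 0]

s = 1.0
for tj in sorted({t for t, e in zip(times, events) if e == 1}):
    n_j = sum(1 for y in times if y >= tj)                        # at risk
    d_j = sum(1 for y, e in zip(times, events) if y == tj and e == 1)
    s *= 1 - d_j / n_j
    print(f"month {tj}: n_j={n_j}, d_j={d_j}, S_hat={s:.3f}")
# -> month 3: n_j=5, d_j=1, S_hat=0.800
# -> month 7: n_j=3, d_j=2, S_hat=0.267
```

Patient 2 never contributes a $d_j$, but their exit between months 3 and 7 is exactly why $n_j$ drops from 5 to 3.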
Common Confusions
Censored observations are not missing data
Censored observations carry information: the event did not happen before the observed time $Y$. Dropping censored observations wastes data and biases the analysis (you would systematically underestimate survival times). The Kaplan-Meier and Cox methods use all observations.
The Cox model does not model the survival function directly
The Cox model models the hazard, not the survival function. To get survival predictions, you need to estimate $h_0(t)$ separately (e.g., via the Breslow estimator). The partial likelihood only gives you $\hat{\beta}$, which tells you how covariates affect relative risk.
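The Breslow step from $\hat{\beta}$ to survival predictions is short: the baseline cumulative hazard is estimated as $\hat{H}_0(t) = \sum_{t_j \le t} d_j / \sum_{i \in R(t_j)} \exp(\hat{\beta}^\top x_i)$, and then $\hat{S}(t \mid x) = \exp(-\hat{H}_0(t))^{\exp(\hat{\beta}^\top x)}$. A minimal sketch (plain numpy; the toy data and the coefficient value are illustrative assumptions, not fitted output):

```python
import numpy as np

def breslow_cumhaz(beta, times, events, X, t):
    """Breslow estimate of the baseline cumulative hazard H0(t):
    sum over event times t_j <= t of d_j / sum_{i in R(t_j)} exp(beta^T x_i)."""
    eta = X @ np.atleast_1d(beta)
    H0 = 0.0
    for tj in np.unique(times[events == 1]):
        if tj > t:
            break
        d_j = np.sum((times == tj) & (events == 1))
        denom = np.sum(np.exp(eta[times >= tj]))   # risk set at t_j
        H0 += d_j / denom
    return H0

times = np.array([3.0, 5.0, 7.0, 9.0, 12.0])
events = np.array([1, 0, 1, 1, 0])
X = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])
beta = np.array([1.0])                 # stand-in for a fitted coefficient

H0 = breslow_cumhaz(beta, times, events, X, t=10.0)
S0 = np.exp(-H0)                       # baseline survival at t = 10
S_x = S0 ** np.exp(1.0)                # survival for a subject with x = 1
print(S0, S_x)
```

As a sanity check, setting $\beta = 0$ reduces the estimator to the Nelson-Aalen cumulative hazard, since every subject then has relative hazard 1.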
Summary
- Censoring means you observe the event for some subjects but not all; ignoring it introduces bias
- The Kaplan-Meier estimator handles censoring by reducing the risk set at censoring times without counting them as events
- The hazard is the instantaneous event rate given survival; it relates to survival via $S(t) = \exp(-H(t))$
- The Cox model avoids specifying the baseline hazard using partial likelihood
- Always check the proportional hazards assumption when using the Cox model
Exercises
Problem
Five subjects have observed times $y_1, \dots, y_5$, some of them censored. Compute the Kaplan-Meier survival estimate at each observed event time.
Problem
In the Cox model, show that the partial likelihood does not depend on the baseline hazard $h_0(t)$. Start from the conditional probability that subject $i$ fails at time $t_j$, given that exactly one failure occurs at $t_j$ from the risk set $R(t_j)$.
Problem
The proportional hazards assumption is restrictive. Describe two situations where it fails and what alternative models you might use instead.
References
Canonical:
- Cox, Regression Models and Life-Tables (1972), JRSS Series B
- Kalbfleisch & Prentice, The Statistical Analysis of Failure Time Data (2nd ed., 2002)
Current:
- Kleinbaum & Klein, Survival Analysis (3rd ed., 2012)
- Kvamme et al., Time-to-Event Prediction with Neural Networks and Cox Regression (JMLR 2019)
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
Natural continuations from survival analysis:
- Hypothesis testing for ML: log-rank test and other comparisons of survival curves
- Causal inference basics: treatment effects in observational survival studies
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)