Statistical Foundations
Nonresponse and Missing Data
The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.
Why This Matters
Real data always has missing values. Patients miss clinic visits. Survey respondents skip questions. Sensors malfunction. Features in ML training sets have gaps. The question is never "is there missing data?" but "what is the mechanism that created the missingness, and what are the consequences?"
Naive handling (deleting incomplete cases, filling in means) is almost always wrong. Complete case analysis throws away data and can bias results. Mean imputation destroys variance and correlations. The correct approach depends on the missingness mechanism, and getting this wrong can silently corrupt your analysis.
Mental Model
Think of missing data as a censoring process. You have a complete dataset that would exist if everything were observed. A missingness mechanism then masks some values. The question is: does the mask depend on the values it hides?
If the mask is random (MCAR), you lose efficiency but not validity. If the mask depends on observed data (MAR), you can correct for it using observed information. If the mask depends on the hidden values themselves (MNAR), no purely statistical fix works without additional assumptions.
Missingness Mechanisms
Missing Completely at Random (MCAR)
Data is MCAR if the probability of a value being missing is unrelated to any variable, observed or unobserved:
$$P(R = 1 \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R = 1)$$
where $R$ is the response indicator (1 = observed, 0 = missing), $Y_{\text{obs}}$ is the observed data, and $Y_{\text{mis}}$ is the missing data.
Example: a lab instrument randomly fails 5% of the time, independent of the measurement value. MCAR is the strongest and rarest assumption. Under MCAR, complete case analysis is unbiased (but inefficient).
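A quick simulation makes this concrete (the 5% failure rate, true mean, and seed are illustrative): under MCAR, the complete-case mean is still centered on the truth, just computed from fewer observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y = rng.normal(5.0, 2.0, n)          # true mean is 5.0
miss = rng.random(n) < 0.05          # instrument fails 5% of the time,
                                     # independent of y: MCAR

cc_mean = y[~miss].mean()            # complete-case mean: unbiased, just noisier
print(round(cc_mean, 2))
```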
Missing at Random (MAR)
Data is MAR if the probability of missingness depends on observed data but not on the missing values themselves:
$$P(R = 1 \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R = 1 \mid Y_{\text{obs}})$$
Example: in a health survey, older people are less likely to respond to the income question, but among people of the same age, the probability of responding does not depend on income. MAR is the key assumption for multiple imputation and IPW to work.
MAR is untestable from the observed data alone (you would need to observe the missing values to check).
Missing Not at Random (MNAR)
Data is MNAR if the probability of missingness depends on the missing value itself, even after conditioning on all observed data:
$$P(R = 1 \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(R = 1 \mid Y_{\text{obs}})$$
Example: people with high incomes are less likely to report their income, and this relationship persists even after controlling for age, education, and other observed variables. Under MNAR, all standard methods are biased. You need a model for the missingness mechanism (a selection model) or external data to correct for it.
Consequences of Naive Approaches
Complete case analysis (listwise deletion): drop all observations with any missing value. Under MCAR, this is unbiased but wasteful. Under MAR or MNAR, it is biased because the complete cases are not representative of the full sample.
Mean imputation: replace missing values with the variable mean. This preserves the mean but underestimates the variance, distorts correlations, and narrows confidence intervals. It is almost never appropriate.
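A short simulation (with illustrative parameters and seed) shows the distortion: masking half of $y$ completely at random and filling in the observed mean roughly halves the variance and attenuates the correlation with $x$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(0, 1, n)
y = x + rng.normal(0, 1, n)               # Var(y) = 2, corr(x, y) ~ 0.71
miss = rng.random(n) < 0.5                # MCAR mask on half of y

y_imp = y.copy()
y_imp[miss] = y[~miss].mean()             # mean imputation

var_true, var_imp = y.var(), y_imp.var()          # variance shrinks sharply
r_true = np.corrcoef(x, y)[0, 1]
r_imp = np.corrcoef(x, y_imp)[0, 1]               # correlation attenuated
```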
Last observation carried forward (LOCF): in longitudinal data, fill missing values with the last observed value. This assumes no change, which is a strong and usually false assumption. Common in clinical trials but widely criticized.
Multiple Imputation
Multiple Imputation (Rubin)
Multiple imputation creates $M$ completed datasets by drawing independent imputations from the posterior predictive distribution of the missing data given the observed data. Each dataset is analyzed separately, and the results are combined using Rubin's rules.
The procedure:
- Specify an imputation model $P(Y_{\text{mis}} \mid Y_{\text{obs}})$
- Draw $M$ independent values $Y_{\text{mis}}^{(1)}, \dots, Y_{\text{mis}}^{(M)}$ from this model
- Create $M$ completed datasets $(Y_{\text{obs}}, Y_{\text{mis}}^{(m)})$ for $m = 1, \dots, M$
- Analyze each using the standard analysis method, obtaining estimates $\hat\theta_m$ and variance estimates $\hat U_m$
- Combine using Rubin's rules
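The steps above can be sketched for the simplest case: one incomplete variable imputed from a normal linear model, with the imputation-model parameters drawn from their approximate posterior so the imputations are "proper." This is a minimal illustration with simulated data and an illustrative analysis (the mean of $y$), not the algorithm of a production package such as mice.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(0, 1, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)       # true E[y] = 1.0
miss = rng.random(n) < 1 / (1 + np.exp(-x))   # MAR: P(miss) rises with x
y[miss] = np.nan

cc_mean = y[~miss].mean()                     # complete-case mean: biased low

M = 20
estimates = []
for _ in range(M):
    xo, yo = x[~miss], y[~miss]
    X = np.column_stack([np.ones(xo.size), xo])
    beta_hat, *_ = np.linalg.lstsq(X, yo, rcond=None)
    resid = yo - X @ beta_hat
    # draw sigma^2 and beta from their approximate posterior (proper imputation)
    sigma2 = resid @ resid / rng.chisquare(xo.size - 2)
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X.T @ X))
    # impute the missing y from the posterior predictive distribution
    Xm = np.column_stack([np.ones(miss.sum()), x[miss]])
    y_imp = y.copy()
    y_imp[miss] = Xm @ beta + rng.normal(0, np.sqrt(sigma2), miss.sum())
    estimates.append(y_imp.mean())            # analyze each completed dataset

theta = np.mean(estimates)                    # pooled estimate (Rubin's rules)
```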
Main Theorems
Rubin's Combining Rules for Multiple Imputation
Statement
Given $M$ multiply imputed datasets with complete-data estimates $\hat\theta_1, \dots, \hat\theta_M$ and variance estimates $\hat U_1, \dots, \hat U_M$, the combined estimate is:
$$\bar\theta = \frac{1}{M} \sum_{m=1}^{M} \hat\theta_m$$
The total variance is:
$$T = \bar U + \left(1 + \frac{1}{M}\right) B$$
where $\bar U = \frac{1}{M} \sum_{m=1}^{M} \hat U_m$ is the within-imputation variance and $B = \frac{1}{M-1} \sum_{m=1}^{M} (\hat\theta_m - \bar\theta)^2$ is the between-imputation variance.
Inference uses a $t$-distribution with degrees of freedom
$$\nu = (M - 1)\left(1 + \frac{\bar U}{(1 + 1/M)\,B}\right)^2.$$
Intuition
The within-imputation variance $\bar U$ captures the uncertainty you would have even if there were no missing data. The between-imputation variance $B$ captures the additional uncertainty due to not knowing the missing values. The factor $(1 + 1/M)$ corrects for using a finite number of imputations. As $M \to \infty$, this correction vanishes, and a moderate $M$ (a few dozen imputations) is sufficient for most applications.
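The combining rules reduce to a few lines of arithmetic. The numbers below are made up for illustration: five hypothetical complete-data estimates with a common within-imputation standard error.

```python
import numpy as np

# hypothetical results from M = 5 imputed datasets
theta = np.array([1.4, 1.7, 1.5, 1.6, 1.3])   # complete-data point estimates
se = np.array([0.3, 0.3, 0.3, 0.3, 0.3])      # within-imputation standard errors

M = len(theta)
theta_bar = theta.mean()                      # combined estimate
W = (se ** 2).mean()                          # within-imputation variance
B = theta.var(ddof=1)                         # between-imputation variance
T = W + (1 + 1 / M) * B                       # total variance
df = (M - 1) * (1 + W / ((1 + 1 / M) * B)) ** 2   # Rubin's degrees of freedom
```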
Proof Sketch
The point estimate $\bar\theta$ is the posterior mean of $\theta$ under the Bayesian framework (averaged over the posterior of the missing data). The total variance is derived from the law of total variance: $\operatorname{Var}(\theta \mid Y_{\text{obs}}) = E\left[\operatorname{Var}(\theta \mid Y_{\text{obs}}, Y_{\text{mis}})\right] + \operatorname{Var}\left(E[\theta \mid Y_{\text{obs}}, Y_{\text{mis}}]\right)$. The first term is estimated by $\bar U$ and the second by $B$. The correction $B/M$ accounts for simulation variance from using finitely many imputations.
Why It Matters
Rubin's rules are the standard method for combining multiply imputed analyses. They are used by every major statistical software package (R's mice, Stata's mi, SAS PROC MI). The key insight is that the between-imputation variance honestly reflects the uncertainty due to missing data, which single imputation methods suppress.
Failure Mode
If the MAR assumption fails (MNAR), the imputations are drawn from the wrong distribution and the combined estimate is biased. If the imputation model is misspecified (e.g., it omits important predictors of missingness), the imputations are inaccurate. If the imputation model and the analysis model are "uncongenial" (the imputation model does not include all variables used in the analysis), the results can be biased.
Inverse Probability Weighting
Inverse Probability Weighting (IPW)
IPW assigns each complete case a weight equal to the inverse of its probability of being observed. If unit $i$ has response probability $\pi_i = P(R_i = 1 \mid X_i)$, the IPW estimator of the population mean is:
$$\hat\mu_{\text{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \frac{R_i Y_i}{\pi_i}$$
Under MAR and correct specification of the response probability model, IPW is consistent. This is the same principle as the Horvitz-Thompson estimator in survey sampling, applied to nonresponse.
IPW has a practical problem: if $\pi_i$ is close to zero for some units, those units get extremely large weights, inflating the variance. Weight trimming (capping weights at some maximum) reduces variance at the cost of introducing bias.
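A minimal sketch with simulated data: here the true response probabilities are used (in practice $\pi_i$ would be estimated, e.g. by logistic regression of $R$ on $X$), and a simple weight cap stands in for trimming. The parameters, cap, and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(0, 1, n)                      # always observed
y = 2.0 + x + rng.normal(0, 1, n)            # target variable; true mean is 2.0
pi = 1 / (1 + np.exp(x))                     # P(observed | x): decreases in x (MAR)
r = rng.random(n) < pi                       # response indicator

naive = y[r].mean()                          # complete-case mean: biased low

w = 1 / pi[r]                                # inverse-probability weights
w = np.minimum(w, 50)                        # weight trimming: cap extreme weights
ipw = np.sum(w * y[r]) / np.sum(w)           # normalized (Hajek-style) IPW estimate
```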
The EM Algorithm for Missing Data
The EM (Expectation-Maximization) algorithm iterates between two steps to find maximum likelihood estimates with missing data:
E-step: compute the expected complete-data log-likelihood, where the expectation is over the missing data given the observed data and current parameter estimates.
M-step: maximize this expected log-likelihood to update the parameter estimates.
EM converges to a local maximum of the observed-data likelihood. It produces point estimates but not standard errors directly (you need the Louis formula or bootstrap for standard errors).
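As a concrete illustration of the two steps, here is EM for a bivariate normal with $x$ fully observed and $y$ missing at random, a classic textbook case: the E-step fills in the conditional moments of the missing $y$ given $x$, and the M-step re-estimates the normal parameters from the expected sufficient statistics. The simulated data and MAR mechanism are for the sketch only, and (as noted above) the loop yields point estimates, not standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
x = rng.normal(0, 1, n)                        # fully observed
y = 1.0 + 0.8 * x + rng.normal(0, 0.6, n)      # true mean 1.0, slope 0.8
obs = rng.random(n) > 1 / (1 + np.exp(-x))     # MAR: P(miss) rises with x
y_obs = np.where(obs, y, np.nan)

# initialize parameters from the complete cases
mu_x, mu_y = x.mean(), y_obs[obs].mean()
sxx = x.var()
syy = y_obs[obs].var()
sxy = np.cov(x[obs], y_obs[obs], bias=True)[0, 1]

for _ in range(200):
    # E-step: expected y and y^2 for the missing cases, given x
    beta = sxy / sxx
    cmean = mu_y + beta * (x - mu_x)           # E[y | x]
    cvar = syy - beta * sxy                    # Var[y | x]
    Ey = np.where(obs, y_obs, cmean)
    Ey2 = np.where(obs, y_obs ** 2, cmean ** 2 + cvar)
    # M-step: re-estimate the bivariate normal parameters
    mu_x, mu_y = x.mean(), Ey.mean()
    sxx = (x ** 2).mean() - mu_x ** 2
    syy = Ey2.mean() - mu_y ** 2
    sxy = (x * Ey).mean() - mu_x * mu_y
```

The complete-case mean of $y$ is biased low here (high-$x$ units, which have high $y$, go missing more often); EM recovers the true mean and slope because it uses the observed $x$ for every unit.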
Common Confusions
MAR does not mean the missingness is random
The name is misleading. MAR means the missingness is random conditional on observed data. It can be strongly related to observed variables. For example, if men are twice as likely to skip a depression question as women, the data is MAR if the probability of skipping depends only on gender (observed) and not on the depression score itself (the missing value).
You cannot test MAR vs MNAR from the observed data
The difference between MAR and MNAR depends on the relationship between missingness and the unobserved values. By definition, you cannot observe this relationship. You can test MCAR vs not-MCAR (Little's test), but you cannot statistically distinguish MAR from MNAR. The choice between them requires subject-matter knowledge.
Multiple imputation is not about filling in the right values
The individual imputed values are not meant to be accurate predictions of the true missing values. They are random draws from the predictive distribution. The point is to capture the uncertainty about the missing values, not to guess them correctly. Any single imputed dataset is "wrong," but the ensemble of datasets correctly represents the uncertainty.
Summary
- MCAR: missingness is independent of everything. Rare in practice.
- MAR: missingness depends on observed data. The standard working assumption.
- MNAR: missingness depends on the missing value itself. Requires a model for the missingness mechanism.
- Complete case analysis is biased under MAR and MNAR.
- Multiple imputation: create $M$ completed datasets, analyze each, combine with Rubin's rules.
- IPW: weight complete cases by the inverse probability of being observed.
- The between-imputation variance captures uncertainty due to missing data.
- MAR vs MNAR is untestable from observed data; the choice requires domain knowledge.
Exercises
Problem
You multiply impute a dataset $M = 5$ times. The five estimates of a regression coefficient are: 2.1, 2.4, 1.9, 2.3, 2.0. The five within-imputation standard errors are: 0.5, 0.5, 0.5, 0.5, 0.5. Compute the combined estimate, within-imputation variance, between-imputation variance, and total variance using Rubin's rules.
Problem
A clinical trial measures blood pressure at baseline and at 6 months. 20% of patients drop out before the 6-month measurement. You suspect sicker patients (those with higher blood pressure) are more likely to drop out. Is this MAR or MNAR? What changes if you have the baseline blood pressure for all patients?
References
Canonical:
- Rubin, Multiple Imputation for Nonresponse in Surveys (1987), Chapters 1-4
- Little & Rubin, Statistical Analysis with Missing Data (2019), Chapters 1-4, 10-12
Current:
- van Buuren, Flexible Imputation of Missing Data (2018), Chapters 1-5
- Carpenter & Kenward, Multiple Imputation and its Application (2013), Chapters 1-3
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
- Survey sampling methods: how nonresponse interacts with survey design
- Longitudinal surveys and panel data: attrition as a specific form of nonresponse
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)