
Statistical Foundations

Nonresponse and Missing Data

The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.


Why This Matters

Real data always has missing values. Patients miss clinic visits. Survey respondents skip questions. Sensors malfunction. Features in ML training sets have gaps. The question is never "is there missing data?" but "what is the mechanism that created the missingness, and what are the consequences?"

Naive handling (deleting incomplete cases, filling in means) is almost always wrong. Complete case analysis throws away data and can bias results. Mean imputation destroys variance and correlations. The correct approach depends on the missingness mechanism, and getting this wrong can silently corrupt your analysis.

Mental Model

Think of missing data as a censoring process. You have a complete dataset that would exist if everything were observed. A missingness mechanism then masks some values. The question is: does the mask depend on the values it hides?

If the mask is random (MCAR), you lose efficiency but not validity. If the mask depends on observed data (MAR), you can correct for it using observed information. If the mask depends on the hidden values themselves (MNAR), no purely statistical fix works without additional assumptions.

Missingness Mechanisms

Definition

Missing Completely at Random (MCAR)

Data is MCAR if the probability of a value being missing is unrelated to any variable, observed or unobserved:

$$\Pr(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}) = \Pr(R = 0)$$

where $R$ is the response indicator (1 = observed, 0 = missing), $Y_{\text{obs}}$ is the observed data, and $Y_{\text{mis}}$ is the missing data.

Example: a lab instrument randomly fails 5% of the time, independent of the measurement value. MCAR is the strongest and rarest assumption. Under MCAR, complete case analysis is unbiased (but inefficient).

Definition

Missing at Random (MAR)

Data is MAR if the probability of missingness depends on observed data but not on the missing values themselves:

$$\Pr(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}) = \Pr(R = 0 \mid Y_{\text{obs}})$$

Example: in a health survey, older people are less likely to respond to the income question, but among people of the same age, the probability of responding does not depend on income. MAR is the key assumption for multiple imputation and IPW to work.

MAR is untestable from the observed data alone (you would need to observe the missing values to check).

Definition

Missing Not at Random (MNAR)

Data is MNAR if the probability of missingness depends on the missing value itself, even after conditioning on all observed data:

$$\Pr(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}}$$

Example: people with high incomes are less likely to report their income, and this relationship persists even after controlling for age, education, and other observed variables. Under MNAR, all standard methods are biased. You need a model for the missingness mechanism (a selection model) or external data to correct for it.
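The effect of each mechanism on a naive estimator is easy to see by simulation. The sketch below (all variable names, distributions, and missingness rates are illustrative) generates one outcome, masks it under MCAR, MAR, and MNAR, and compares the complete-case mean to the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.normal(size=n)                      # fully observed covariate
y = rng.normal(loc=50, scale=10, size=n) + 5 * age   # outcome depends on age

def complete_case_mean(y, missing_mask):
    """Mean of y among the observed (non-missing) cases."""
    return y[~missing_mask].mean()

# MCAR: missingness is a coin flip, unrelated to anything
mcar = rng.random(n) < 0.3

# MAR: higher age -> more likely missing (depends only on observed age)
mar = rng.random(n) < 1 / (1 + np.exp(-age))

# MNAR: higher y itself -> more likely missing
mnar = rng.random(n) < 1 / (1 + np.exp(-(y - 50) / 10))

print(f"true mean               : {y.mean():.2f}")
print(f"MCAR complete-case mean : {complete_case_mean(y, mcar):.2f}")  # ~unbiased
print(f"MAR  complete-case mean : {complete_case_mean(y, mar):.2f}")   # biased low
print(f"MNAR complete-case mean : {complete_case_mean(y, mnar):.2f}")  # biased low
```

Only MCAR leaves the complete-case mean unbiased; under MAR and MNAR the observed cases systematically under-represent high-$y$ units.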

Consequences of Naive Approaches

Complete case analysis (listwise deletion): drop all observations with any missing value. Under MCAR, this is unbiased but wasteful. Under MAR or MNAR, it is biased because the complete cases are not representative of the full sample.

Mean imputation: replace missing values with the variable mean. This preserves the mean but underestimates the variance, distorts correlations, and narrows confidence intervals. It is almost never appropriate.

Last observation carried forward (LOCF): in longitudinal data, fill missing values with the last observed value. This assumes no change, which is a strong and usually false assumption. Common in clinical trials but widely criticized.
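The damage done by mean imputation is quantitative, not just qualitative: it shrinks the variance by roughly the missing fraction and attenuates correlations. A minimal illustration (toy data, even with MCAR missingness where complete-case analysis would be fine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)   # corr(x, y) ~ 0.8

# make 40% of y missing completely at random
miss = rng.random(n) < 0.4
y_imp = y.copy()
y_imp[miss] = y[~miss].mean()                 # mean imputation

print(f"variance of y, complete data  : {y.var():.3f}")
print(f"variance after mean imputation: {y_imp.var():.3f}")   # shrunk by ~40%
print(f"corr(x, y) complete data      : {np.corrcoef(x, y)[0, 1]:.3f}")
print(f"corr(x, y) after imputation   : {np.corrcoef(x, y_imp)[0, 1]:.3f}")  # attenuated
```

Every missing value is replaced by the same constant, so the imputed variable has too little spread, and any downstream standard error computed from it is too small.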

Multiple Imputation

Definition

Multiple Imputation (Rubin)

Multiple imputation creates $M$ complete datasets by drawing $M$ independent imputations from the posterior predictive distribution of the missing data given the observed data. Each dataset is analyzed separately, and the results are combined using Rubin's rules.

The procedure:

  1. Specify an imputation model $p(Y_{\text{mis}} \mid Y_{\text{obs}}, \phi)$
  2. Draw $M$ values $Y_{\text{mis}}^{(1)}, \ldots, Y_{\text{mis}}^{(M)}$ from this model
  3. Create $M$ completed datasets: $D^{(m)} = (Y_{\text{obs}}, Y_{\text{mis}}^{(m)})$
  4. Analyze each $D^{(m)}$ using the standard analysis method, obtaining estimates $\hat{\theta}^{(m)}$ and variance estimates $\hat{V}^{(m)}$
  5. Combine using Rubin's rules
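Steps 1-3 are the part that distinguishes "proper" multiple imputation from ad hoc fill-in: each imputation must reflect parameter uncertainty as well as residual noise. A minimal sketch for a normal linear imputation model (all names and the toy data are illustrative, not any package's API):

```python
import numpy as np

rng = np.random.default_rng(2)

def impute_once(x, y, miss, rng):
    """One proper imputation under a normal linear model y ~ a + b*x.

    Draws the residual variance and the coefficients from their posterior
    before drawing the missing y's, so each imputation reflects parameter
    uncertainty as well as residual noise."""
    xo, yo = x[~miss], y[~miss]
    X = np.column_stack([np.ones_like(xo), xo])
    beta_hat, *_ = np.linalg.lstsq(X, yo, rcond=None)
    resid = yo - X @ beta_hat
    dof = len(yo) - 2
    sigma2 = resid @ resid / rng.chisquare(dof)        # scaled inverse-chi^2 draw
    cov = sigma2 * np.linalg.inv(X.T @ X)
    beta = rng.multivariate_normal(beta_hat, cov)      # coefficient draw
    y_imp = y.copy()
    y_imp[miss] = (beta[0] + beta[1] * x[miss]
                   + rng.normal(scale=np.sqrt(sigma2), size=miss.sum()))
    return y_imp

# toy data with MAR missingness in y (depends only on observed x)
n = 2_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 1 / (1 + np.exp(-x))   # more missing when x is large

M = 20
completed = [impute_once(x, y, miss, rng) for _ in range(M)]
estimates = [d.mean() for d in completed]      # analysis step: the mean of y
print(f"complete-case mean: {y[~miss].mean():.3f}  (biased low under MAR)")
print(f"MI pooled mean    : {np.mean(estimates):.3f}")
print(f"true sample mean  : {y.mean():.3f}")
```

Because the imputation model conditions on the observed $x$ that drives the missingness, the pooled MI estimate recovers the full-sample mean that complete-case analysis misses.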

Main Theorems

Theorem

Rubin's Combining Rules for Multiple Imputation

Statement

Given $M$ multiply imputed datasets with complete-data estimates $\hat{\theta}^{(1)}, \ldots, \hat{\theta}^{(M)}$ and variance estimates $\hat{V}^{(1)}, \ldots, \hat{V}^{(M)}$, the combined estimate is:

$$\bar{\theta} = \frac{1}{M}\sum_{m=1}^{M} \hat{\theta}^{(m)}$$

The total variance is:

$$T = \bar{V} + \left(1 + \frac{1}{M}\right)B$$

where $\bar{V} = \frac{1}{M}\sum_{m=1}^{M} \hat{V}^{(m)}$ is the within-imputation variance and $B = \frac{1}{M-1}\sum_{m=1}^{M}(\hat{\theta}^{(m)} - \bar{\theta})^2$ is the between-imputation variance.

Inference uses a $t$-distribution with degrees of freedom $\nu = (M-1)\left(1 + \frac{\bar{V}}{(1+1/M)B}\right)^2$.

Intuition

The within-imputation variance $\bar{V}$ captures the uncertainty you would have even if there were no missing data. The between-imputation variance $B$ captures the additional uncertainty due to not knowing the missing values. The factor $(1 + 1/M)$ corrects for using a finite number of imputations. As $M \to \infty$, this factor vanishes, but $M = 5$ to $20$ is sufficient for most applications.

Proof Sketch

The point estimate $\bar{\theta}$ is the posterior mean of $\theta$ under the Bayesian framework (averaged over the posterior of the missing data). The total variance $T$ is derived from the law of total variance: $\text{Var}(\theta \mid Y_{\text{obs}}) = \mathbb{E}[\text{Var}(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}) \mid Y_{\text{obs}}] + \text{Var}(\mathbb{E}[\theta \mid Y_{\text{obs}}, Y_{\text{mis}}] \mid Y_{\text{obs}})$. The first term is estimated by $\bar{V}$ and the second by $B$. The $(1+1/M)$ correction accounts for simulation variance.
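The combining rules are a few lines of arithmetic. A small helper (the numbers at the bottom are made up purely for illustration):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine M complete-data estimates and variances via Rubin's rules.

    Returns the pooled estimate, the total variance T, and the
    degrees of freedom for t-based inference."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = len(est)
    theta_bar = est.mean()                     # pooled point estimate
    W = var.mean()                             # within-imputation variance
    B = est.var(ddof=1)                        # between-imputation variance
    T = W + (1 + 1 / M) * B                    # total variance
    nu = (M - 1) * (1 + W / ((1 + 1 / M) * B)) ** 2
    return theta_bar, T, nu

# illustrative numbers: M = 5 estimates, each with standard error 0.2
est = [1.2, 1.5, 1.3, 1.1, 1.4]
se = [0.2] * 5
theta, T, nu = pool_rubin(est, np.square(se))   # variances, not SEs, go in
print(f"pooled estimate = {theta:.3f}, total variance = {T:.4f}, df = {nu:.1f}")
```

Note that the function takes variances, not standard errors; a pooled standard error is $\sqrt{T}$.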

Why It Matters

Rubin's rules are the standard method for combining multiply imputed analyses. They are used by every major statistical software package (R's mice, Stata's mi, SAS PROC MI). The key insight is that the between-imputation variance honestly reflects the uncertainty due to missing data, which single imputation methods suppress.

Failure Mode

If the MAR assumption fails (MNAR), the imputations are drawn from the wrong distribution and the combined estimate is biased. If the imputation model is misspecified (e.g., it omits important predictors of missingness), the imputations are inaccurate. If the imputation model and the analysis model are "uncongenial" (the imputation model does not include all variables used in the analysis), the results can be biased.

Inverse Probability Weighting

Definition

Inverse Probability Weighting (IPW)

IPW assigns each complete case a weight equal to the inverse of its probability of being observed. If unit $i$ has response probability $\pi_i = \Pr(R_i = 1 \mid Y_{\text{obs}})$, the IPW estimator of the population mean is:

$$\hat{\mu}_{\text{IPW}} = \frac{\sum_{i: R_i=1} y_i / \hat{\pi}_i}{\sum_{i: R_i=1} 1/\hat{\pi}_i}$$

Under MAR and correct specification of the response probability model, IPW is consistent. This is the same principle as the Horvitz-Thompson estimator in survey sampling, applied to nonresponse.

IPW has a practical problem: if $\hat{\pi}_i$ is close to zero for some units, those units get extremely large weights, inflating the variance. Weight trimming (capping weights at some maximum) reduces variance at the cost of introducing bias.
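A minimal sketch of the estimator above, including an optional weight cap (the toy data and true response probabilities are illustrative; in practice $\hat{\pi}_i$ would come from a fitted model such as a logistic regression):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)                       # fully observed covariate
y = 2.0 + 1.5 * x + rng.normal(size=n)       # outcome, missing for some units
pi = 1 / (1 + np.exp(-(0.5 + x)))            # response probability (MAR: depends on x only)
R = rng.random(n) < pi                       # response indicator (True = observed)

def ipw_mean(y, R, pi_hat, cap=None):
    """IPW estimator of E[y], normalized by the sum of weights.

    If `cap` is given, weights are trimmed at that maximum."""
    w = 1.0 / pi_hat[R]
    if cap is not None:
        w = np.minimum(w, cap)
    return np.sum(w * y[R]) / np.sum(w)

print(f"true mean          : {y.mean():.3f}")
print(f"complete-case mean : {y[R].mean():.3f}")   # biased up: high-x units respond more
print(f"IPW mean           : {ipw_mean(y, R, pi):.3f}")
```

Because responders over-represent high-$x$ (hence high-$y$) units, the complete-case mean is biased upward; weighting each responder by $1/\pi_i$ restores the balance.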

The EM Algorithm for Missing Data

The EM (Expectation-Maximization) algorithm iterates between two steps to find maximum likelihood estimates with missing data:

E-step: compute the expected complete-data log-likelihood, where the expectation is over the missing data given the observed data and current parameter estimates.

M-step: maximize this expected log-likelihood to update the parameter estimates.

EM converges to a local maximum of the observed-data likelihood. It produces point estimates but not standard errors directly (you need the Louis formula or bootstrap for standard errors).
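The two steps can be made concrete for a bivariate normal with the second coordinate partly missing: the E-step fills in the expected sufficient statistics of the missing $y_2$'s via the regression of $y_2$ on $y_1$, and the M-step plugs them into the complete-data MLE. A sketch under that model (all parameter values and the missingness rule are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
# true bivariate normal: means (1, 2), unit variances, corr 0.7
data = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.7], [0.7, 1.0]], size=n)
y1, y2 = data[:, 0], data[:, 1].copy()
miss = rng.random(n) < 1 / (1 + np.exp(-y1))   # y2 is MAR given observed y1
y2[miss] = np.nan

mu = np.array([0.0, 0.0])                       # initial guesses
Sigma = np.eye(2)
for _ in range(200):
    # E-step: expected sufficient statistics for the missing y2's
    beta = Sigma[0, 1] / Sigma[0, 0]            # slope of the y2-on-y1 regression
    cond_var = Sigma[1, 1] - beta * Sigma[0, 1] # residual variance of that regression
    e_y2 = np.where(miss, mu[1] + beta * (y1 - mu[0]), y2)
    e_y2sq = np.where(miss, e_y2**2 + cond_var, y2**2)
    # M-step: plug expected statistics into the complete-data MLE
    mu = np.array([y1.mean(), e_y2.mean()])
    s11 = np.mean(y1**2) - mu[0]**2
    s12 = np.mean(y1 * e_y2) - mu[0] * mu[1]
    s22 = np.mean(e_y2sq) - mu[1]**2
    Sigma = np.array([[s11, s12], [s12, s22]])

print(f"estimated means: {mu.round(2)}")        # close to (1, 2) despite MAR dropout
print(f"estimated corr : {Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1]):.2f}")
```

Note that the E-step imputes $e\_y2sq$ with the conditional variance added, not just the squared conditional mean; dropping that term is a classic EM bug that underestimates $\Sigma_{22}$.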

Common Confusions

Watch Out

MAR does not mean the missingness is random

The name is misleading. MAR means the missingness is random conditional on observed data. It can be strongly related to observed variables. For example, if men are twice as likely to skip a depression question as women, the data is MAR if the probability of skipping depends only on gender (observed) and not on the depression score itself (the missing value).

Watch Out

You cannot test MAR vs MNAR from the observed data

The difference between MAR and MNAR depends on the relationship between missingness and the unobserved values. By definition, you cannot observe this relationship. You can test MCAR vs not-MCAR (Little's test), but you cannot statistically distinguish MAR from MNAR. The choice between them requires subject-matter knowledge.

Watch Out

Multiple imputation is not about filling in the right values

The individual imputed values are not meant to be accurate predictions of the true missing values. They are random draws from the predictive distribution. The point is to capture the uncertainty about the missing values, not to guess them correctly. Any single imputed dataset is "wrong," but the ensemble of $M$ datasets correctly represents the uncertainty.

Summary

  • MCAR: missingness is independent of everything. Rare in practice.
  • MAR: missingness depends on observed data. The standard working assumption.
  • MNAR: missingness depends on the missing value itself. Requires a model for the missingness mechanism.
  • Complete case analysis is biased under MAR and MNAR.
  • Multiple imputation: create $M$ datasets, analyze each, combine with Rubin's rules.
  • IPW: weight complete cases by inverse probability of being observed.
  • The between-imputation variance $B$ captures uncertainty due to missing data.
  • MAR vs MNAR is untestable from observed data; it requires domain knowledge.

Exercises

ExerciseCore

Problem

You multiply impute a dataset $M = 5$ times. The five estimates of a regression coefficient are: 2.1, 2.4, 1.9, 2.3, 2.0. The five within-imputation standard errors are: 0.5, 0.5, 0.5, 0.5, 0.5. Compute the combined estimate, within-imputation variance, between-imputation variance, and total variance using Rubin's rules.

ExerciseAdvanced

Problem

A clinical trial measures blood pressure at baseline and at 6 months. 20% of patients drop out before the 6-month measurement. You suspect sicker patients (those with higher blood pressure) are more likely to drop out. Is this MAR or MNAR? What changes if you have the baseline blood pressure for all patients?

References

Canonical:

  • Rubin, Multiple Imputation for Nonresponse in Surveys (1987), Chapters 1-4
  • Little & Rubin, Statistical Analysis with Missing Data (2019), Chapters 1-4, 10-12

Current:

  • van Buuren, Flexible Imputation of Missing Data (2018), Chapters 1-5
  • Carpenter & Kenward, Multiple Imputation and its Application (2013), Chapters 1-3
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Last reviewed: April 2026
