
Statistical Foundations

Nonresponse and Missing Data

The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.


Why This Matters

Real data always has missing values. Patients miss clinic visits. Survey respondents skip questions. Sensors malfunction. Features in ML training sets have gaps. The question is never "is there missing data?" but "what is the mechanism that created the missingness, and what are the consequences?"

Naive handling (deleting incomplete cases, filling in means) is almost always wrong. Complete case analysis throws away data and can bias results. Mean imputation destroys variance and correlations. The correct approach depends on the missingness mechanism, and getting this wrong can silently corrupt your analysis.

Mental Model

Think of missing data as a censoring process. You have a complete dataset that would exist if everything were observed. A missingness mechanism then masks some values. The question is: does the mask depend on the values it hides?

If the mask is random (MCAR), you lose efficiency but not validity. If the mask depends on observed data (MAR), you can correct for it using observed information. If the mask depends on the hidden values themselves (MNAR), no purely statistical fix works without additional assumptions.

Missingness Mechanisms

Definition

Missing Completely at Random (MCAR)

Data is MCAR if the probability of a value being missing is unrelated to any variable, observed or unobserved:

$$\Pr(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}) = \Pr(R = 0)$$

where $R$ is the response indicator (1 = observed, 0 = missing), $Y_{\text{obs}}$ is the observed data, and $Y_{\text{mis}}$ is the missing data.

Example: a lab instrument randomly fails 5% of the time, independent of the measurement value. MCAR is the strongest and rarest assumption. Under MCAR, complete case analysis is unbiased (but inefficient).

Definition

Missing at Random (MAR)

Data is MAR if the probability of missingness depends on observed data but not on the missing values themselves:

$$\Pr(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}) = \Pr(R = 0 \mid Y_{\text{obs}})$$

Example: in a health survey, older people are less likely to respond to the income question, but among people of the same age, the probability of responding does not depend on income. MAR is the key assumption for multiple imputation and IPW to work.

MAR is untestable from the observed data alone (you would need to observe the missing values to check).

Definition

Missing Not at Random (MNAR)

Data is MNAR if the probability of missingness depends on the missing value itself, even after conditioning on all observed data:

$$\Pr(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}}$$

Example: people with high incomes are less likely to report their income, and this relationship persists even after controlling for age, education, and other observed variables. Under MNAR, all standard methods are biased. You need a model for the missingness mechanism (a selection model) or external data to correct for it.
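The effect of each mechanism on a naive estimator is easy to see by simulation. The sketch below (all variable names, distributions, and missingness rates are illustrative) generates one outcome, masks it under MCAR, MAR, and MNAR, and compares the complete-case mean to the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.normal(size=n)                      # fully observed covariate
y = rng.normal(loc=50, scale=10, size=n) + 5 * age   # outcome depends on age

def complete_case_mean(y, missing_mask):
    """Mean of y among the observed (non-missing) cases."""
    return y[~missing_mask].mean()

# MCAR: missingness is a coin flip, unrelated to anything
mcar = rng.random(n) < 0.3

# MAR: higher age -> more likely missing (depends only on observed age)
mar = rng.random(n) < 1 / (1 + np.exp(-age))

# MNAR: higher y itself -> more likely missing
mnar = rng.random(n) < 1 / (1 + np.exp(-(y - 50) / 10))

print(f"true mean               : {y.mean():.2f}")
print(f"MCAR complete-case mean : {complete_case_mean(y, mcar):.2f}")  # ~unbiased
print(f"MAR  complete-case mean : {complete_case_mean(y, mar):.2f}")   # biased low
print(f"MNAR complete-case mean : {complete_case_mean(y, mnar):.2f}")  # biased low
```

Only MCAR leaves the complete-case mean unbiased; under MAR and MNAR the observed cases systematically under-represent high-$y$ units.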

Consequences of Naive Approaches

Complete case analysis (listwise deletion): drop all observations with any missing value. Under MCAR, this is unbiased but wasteful. Under MAR or MNAR, it is biased because the complete cases are not representative of the full sample.

Mean imputation: replace missing values with the variable mean. This preserves the mean but underestimates the variance, distorts correlations, and narrows confidence intervals. It is almost never appropriate.

Last observation carried forward (LOCF): in longitudinal data, fill missing values with the last observed value. This assumes no change, which is a strong and usually false assumption. Common in clinical trials but widely criticized.
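The damage done by mean imputation is quantitative, not just qualitative: it shrinks the variance by roughly the missing fraction and attenuates correlations. A minimal illustration (toy data, even with MCAR missingness where complete-case analysis would be fine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)   # corr(x, y) ~ 0.8

# make 40% of y missing completely at random
miss = rng.random(n) < 0.4
y_imp = y.copy()
y_imp[miss] = y[~miss].mean()                 # mean imputation

print(f"variance of y, complete data  : {y.var():.3f}")
print(f"variance after mean imputation: {y_imp.var():.3f}")   # shrunk by ~40%
print(f"corr(x, y) complete data      : {np.corrcoef(x, y)[0, 1]:.3f}")
print(f"corr(x, y) after imputation   : {np.corrcoef(x, y_imp)[0, 1]:.3f}")  # attenuated
```

Every missing value is replaced by the same constant, so the imputed variable has too little spread, and any downstream standard error computed from it is too small.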

Multiple Imputation

Definition

Multiple Imputation (Rubin)

Multiple imputation creates $M$ complete datasets by drawing $M$ independent imputations from the posterior predictive distribution of the missing data given the observed data. Each dataset is analyzed separately, and the results are combined using Rubin's rules.

The procedure:

  1. Specify an imputation model $p(Y_{\text{mis}} \mid Y_{\text{obs}}, \phi)$
  2. Draw $M$ values $Y_{\text{mis}}^{(1)}, \ldots, Y_{\text{mis}}^{(M)}$ from this model
  3. Create $M$ completed datasets: $D^{(m)} = (Y_{\text{obs}}, Y_{\text{mis}}^{(m)})$
  4. Analyze each $D^{(m)}$ using the standard analysis method, obtaining estimates $\hat{\theta}^{(m)}$ and variance estimates $\hat{V}^{(m)}$
  5. Combine using Rubin's rules
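Steps 1-3 are the part that distinguishes "proper" multiple imputation from ad hoc fill-in: each imputation must reflect parameter uncertainty as well as residual noise. A minimal sketch for a normal linear imputation model (all names and the toy data are illustrative, not any package's API):

```python
import numpy as np

rng = np.random.default_rng(2)

def impute_once(x, y, miss, rng):
    """One proper imputation under a normal linear model y ~ a + b*x.

    Draws the residual variance and the coefficients from their posterior
    before drawing the missing y's, so each imputation reflects parameter
    uncertainty as well as residual noise."""
    xo, yo = x[~miss], y[~miss]
    X = np.column_stack([np.ones_like(xo), xo])
    beta_hat, *_ = np.linalg.lstsq(X, yo, rcond=None)
    resid = yo - X @ beta_hat
    dof = len(yo) - 2
    sigma2 = resid @ resid / rng.chisquare(dof)        # scaled inverse-chi^2 draw
    cov = sigma2 * np.linalg.inv(X.T @ X)
    beta = rng.multivariate_normal(beta_hat, cov)      # coefficient draw
    y_imp = y.copy()
    y_imp[miss] = (beta[0] + beta[1] * x[miss]
                   + rng.normal(scale=np.sqrt(sigma2), size=miss.sum()))
    return y_imp

# toy data with MAR missingness in y (depends only on observed x)
n = 2_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 1 / (1 + np.exp(-x))   # more missing when x is large

M = 20
completed = [impute_once(x, y, miss, rng) for _ in range(M)]
estimates = [d.mean() for d in completed]      # analysis step: the mean of y
print(f"complete-case mean: {y[~miss].mean():.3f}  (biased low under MAR)")
print(f"MI pooled mean    : {np.mean(estimates):.3f}")
print(f"true sample mean  : {y.mean():.3f}")
```

Because the imputation model conditions on the observed $x$ that drives the missingness, the pooled MI estimate recovers the full-sample mean that complete-case analysis misses.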

Main Theorems

Theorem

Rubin's Combining Rules for Multiple Imputation

Statement

Given $M$ multiply imputed datasets with complete-data estimates $\hat{\theta}^{(1)}, \ldots, \hat{\theta}^{(M)}$ and variance estimates $\hat{V}^{(1)}, \ldots, \hat{V}^{(M)}$, the combined estimate is:

$$\bar{\theta} = \frac{1}{M}\sum_{m=1}^{M} \hat{\theta}^{(m)}$$

The total variance is:

$$T = \bar{V} + \left(1 + \frac{1}{M}\right)B$$

where $\bar{V} = \frac{1}{M}\sum_{m=1}^{M} \hat{V}^{(m)}$ is the within-imputation variance and $B = \frac{1}{M-1}\sum_{m=1}^{M}(\hat{\theta}^{(m)} - \bar{\theta})^2$ is the between-imputation variance.

Inference uses a $t$-distribution with degrees of freedom $\nu = (M-1)\left(1 + \frac{\bar{V}}{(1+1/M)B}\right)^2$.

Intuition

The within-imputation variance $\bar{V}$ captures the uncertainty you would have even if there were no missing data. The between-imputation variance $B$ captures the additional uncertainty due to not knowing the missing values. The factor $(1 + 1/M)$ corrects for using a finite number of imputations. As $M \to \infty$, this factor vanishes, but $M = 5$ to $20$ is sufficient for most applications.

Proof Sketch

The point estimate $\bar{\theta}$ is the posterior mean of $\theta$ under the Bayesian framework (averaged over the posterior of the missing data). The total variance $T$ is derived from the law of total variance: $\text{Var}(\theta \mid Y_{\text{obs}}) = \mathbb{E}[\text{Var}(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}) \mid Y_{\text{obs}}] + \text{Var}(\mathbb{E}[\theta \mid Y_{\text{obs}}, Y_{\text{mis}}] \mid Y_{\text{obs}})$. The first term is estimated by $\bar{V}$ and the second by $B$. The $(1+1/M)$ correction accounts for simulation variance.
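The combining rules are a few lines of arithmetic. A small helper (the numbers at the bottom are made up purely for illustration):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine M complete-data estimates and variances via Rubin's rules.

    Returns the pooled estimate, the total variance T, and the
    degrees of freedom for t-based inference."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = len(est)
    theta_bar = est.mean()                     # pooled point estimate
    W = var.mean()                             # within-imputation variance
    B = est.var(ddof=1)                        # between-imputation variance
    T = W + (1 + 1 / M) * B                    # total variance
    nu = (M - 1) * (1 + W / ((1 + 1 / M) * B)) ** 2
    return theta_bar, T, nu

# illustrative numbers: M = 5 estimates, each with standard error 0.2
est = [1.2, 1.5, 1.3, 1.1, 1.4]
se = [0.2] * 5
theta, T, nu = pool_rubin(est, np.square(se))   # variances, not SEs, go in
print(f"pooled estimate = {theta:.3f}, total variance = {T:.4f}, df = {nu:.1f}")
```

Note that the function takes variances, not standard errors; a pooled standard error is $\sqrt{T}$.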

Why It Matters

Rubin's rules are the standard method for combining multiply imputed analyses. They are used by every major statistical software package (R's mice, Stata's mi, SAS PROC MI). The key insight is that the between-imputation variance honestly reflects the uncertainty due to missing data, which single imputation methods suppress.

Failure Mode

If the MAR assumption fails (MNAR), the imputations are drawn from the wrong distribution and the combined estimate is biased. If the imputation model is misspecified (e.g., it omits important predictors of missingness), the imputations are inaccurate. If the imputation model and the analysis model are "uncongenial" (the imputation model does not include all variables used in the analysis), the results can be biased.

Inverse Probability Weighting

Definition

Inverse Probability Weighting (IPW)

IPW assigns each complete case a weight equal to the inverse of its probability of being observed. If unit $i$ has response probability $\pi_i = \Pr(R_i = 1 \mid Y_{\text{obs}})$, the IPW estimator of the population mean is:

$$\hat{\mu}_{\text{IPW}} = \frac{\sum_{i: R_i=1} y_i / \hat{\pi}_i}{\sum_{i: R_i=1} 1/\hat{\pi}_i}$$

Under MAR and correct specification of the response probability model, IPW is consistent. This is the same principle as the Horvitz-Thompson estimator in survey sampling, applied to nonresponse.

IPW has a practical problem: if $\hat{\pi}_i$ is close to zero for some units, those units get extremely large weights, inflating the variance. Weight trimming (capping weights at some maximum) reduces variance at the cost of introducing bias.
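A minimal sketch of the estimator above, including an optional weight cap (the toy data and true response probabilities are illustrative; in practice $\hat{\pi}_i$ would come from a fitted model such as a logistic regression):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)                       # fully observed covariate
y = 2.0 + 1.5 * x + rng.normal(size=n)       # outcome, missing for some units
pi = 1 / (1 + np.exp(-(0.5 + x)))            # response probability (MAR: depends on x only)
R = rng.random(n) < pi                       # response indicator (True = observed)

def ipw_mean(y, R, pi_hat, cap=None):
    """IPW estimator of E[y], normalized by the sum of weights.

    If `cap` is given, weights are trimmed at that maximum."""
    w = 1.0 / pi_hat[R]
    if cap is not None:
        w = np.minimum(w, cap)
    return np.sum(w * y[R]) / np.sum(w)

print(f"true mean          : {y.mean():.3f}")
print(f"complete-case mean : {y[R].mean():.3f}")   # biased up: high-x units respond more
print(f"IPW mean           : {ipw_mean(y, R, pi):.3f}")
```

Because responders over-represent high-$x$ (hence high-$y$) units, the complete-case mean is biased upward; weighting each responder by $1/\pi_i$ restores the balance.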

The EM Algorithm for Missing Data

The EM (Expectation-Maximization) algorithm iterates between two steps to find maximum likelihood estimates with missing data:

E-step: compute the expected complete-data log-likelihood, where the expectation is over the missing data given the observed data and current parameter estimates.

M-step: maximize this expected log-likelihood to update the parameter estimates.

EM converges to a local maximum of the observed-data likelihood. It produces point estimates but not standard errors directly (you need the Louis formula or bootstrap for standard errors).
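The two steps can be made concrete for a bivariate normal with the second coordinate partly missing: the E-step fills in the expected sufficient statistics of the missing $y_2$'s via the regression of $y_2$ on $y_1$, and the M-step plugs them into the complete-data MLE. A sketch under that model (all parameter values and the missingness rule are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
# true bivariate normal: means (1, 2), unit variances, corr 0.7
data = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.7], [0.7, 1.0]], size=n)
y1, y2 = data[:, 0], data[:, 1].copy()
miss = rng.random(n) < 1 / (1 + np.exp(-y1))   # y2 is MAR given observed y1
y2[miss] = np.nan

mu = np.array([0.0, 0.0])                       # initial guesses
Sigma = np.eye(2)
for _ in range(200):
    # E-step: expected sufficient statistics for the missing y2's
    beta = Sigma[0, 1] / Sigma[0, 0]            # slope of the y2-on-y1 regression
    cond_var = Sigma[1, 1] - beta * Sigma[0, 1] # residual variance of that regression
    e_y2 = np.where(miss, mu[1] + beta * (y1 - mu[0]), y2)
    e_y2sq = np.where(miss, e_y2**2 + cond_var, y2**2)
    # M-step: plug expected statistics into the complete-data MLE
    mu = np.array([y1.mean(), e_y2.mean()])
    s11 = np.mean(y1**2) - mu[0]**2
    s12 = np.mean(y1 * e_y2) - mu[0] * mu[1]
    s22 = np.mean(e_y2sq) - mu[1]**2
    Sigma = np.array([[s11, s12], [s12, s22]])

print(f"estimated means: {mu.round(2)}")        # close to (1, 2) despite MAR dropout
print(f"estimated corr : {Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1]):.2f}")
```

Note that the E-step imputes $e\_y2sq$ with the conditional variance added, not just the squared conditional mean; dropping that term is a classic EM bug that underestimates $\Sigma_{22}$.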

Common Confusions

Watch Out

MAR does not mean the missingness is random

The name is misleading. MAR means the missingness is random conditional on observed data. It can be strongly related to observed variables. For example, if men are twice as likely to skip a depression question as women, the data is MAR if the probability of skipping depends only on gender (observed) and not on the depression score itself (the missing value).

Watch Out

You cannot test MAR vs MNAR from the observed data

The difference between MAR and MNAR depends on the relationship between missingness and the unobserved values. By definition, you cannot observe this relationship. You can test MCAR vs not-MCAR (Little's test), but you cannot statistically distinguish MAR from MNAR. The choice between them requires subject-matter knowledge.

Watch Out

Multiple imputation is not about filling in the right values

The individual imputed values are not meant to be accurate predictions of the true missing values. They are random draws from the predictive distribution. The point is to capture the uncertainty about the missing values, not to guess them correctly. Any single imputed dataset is "wrong," but the ensemble of $M$ datasets correctly represents the uncertainty.

Summary

  • MCAR: missingness is independent of everything. Rare in practice.
  • MAR: missingness depends on observed data. The standard working assumption.
  • MNAR: missingness depends on the missing value itself. Requires a model for the missingness mechanism.
  • Complete case analysis is biased under MAR and MNAR.
  • Multiple imputation: create $M$ datasets, analyze each, combine with Rubin's rules.
  • IPW: weight complete cases by inverse probability of being observed.
  • The between-imputation variance $B$ captures uncertainty due to missing data.
  • MAR vs MNAR is untestable from observed data; it requires domain knowledge.

Exercises

ExerciseCore

Problem

You multiply impute a dataset $M = 5$ times. The five estimates of a regression coefficient are: 2.1, 2.4, 1.9, 2.3, 2.0. The five within-imputation standard errors are: 0.5, 0.5, 0.5, 0.5, 0.5. Compute the combined estimate, within-imputation variance, between-imputation variance, and total variance using Rubin's rules.

ExerciseAdvanced

Problem

A clinical trial measures blood pressure at baseline and at 6 months. 20% of patients drop out before the 6-month measurement. You suspect sicker patients (those with higher blood pressure) are more likely to drop out. Is this MAR or MNAR? What changes if you have the baseline blood pressure for all patients?

References

Canonical:

  • Rubin, Multiple Imputation for Nonresponse in Surveys (1987), Chapters 1-4
  • Little & Rubin, Statistical Analysis with Missing Data (2019), Chapters 1-4, 10-12

Current:

  • van Buuren, Flexible Imputation of Missing Data (2018), Chapters 1-5
  • Carpenter & Kenward, Multiple Imputation and its Application (2013), Chapters 1-3
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Last reviewed: April 2026
