Methodology
Causal Inference Basics
Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.
Why This Matters
Machine learning excels at prediction: given features $X$, predict outcome $Y$. But many important questions are causal: if we change $T$, what happens to $Y$? Will this drug reduce mortality? Will showing a different ad increase clicks? Will a policy intervention reduce crime?
Prediction and causation are structurally different. A linear regression model can predict hospital mortality perfectly using "is on a ventilator" as a feature, but putting healthy people on ventilators does not help. Confounders, variables that cause both treatment and outcome, are the central obstacle. Causal inference provides the framework for identifying when and how you can estimate causal effects from data, and it matters for A/B testing, fairness, treatment effect estimation, and counterfactual reasoning in ML.
Mental Model
Imagine you want to know if a new drug works. Ideally, you would give each patient both the drug and a placebo, then compare outcomes. But each patient can only live one life. This is the fundamental problem of causal inference: you never observe both potential outcomes for the same unit.
Randomized experiments solve this by making treated and control groups comparable on average. Without randomization, you must argue carefully that you have accounted for all confounders. Observational causal inference is the art of making that argument rigorous.
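The confounding story above can be made concrete with a small simulation (all variable names and numbers here are invented for illustration). A beneficial drug looks harmful in the naive comparison because sicker patients are treated more often; stratifying on the confounder recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Binary confounder: "severity" raises both treatment probability and mortality.
severity = rng.binomial(1, 0.5, n)
treat = rng.binomial(1, np.where(severity == 1, 0.8, 0.2))
# True treatment effect is -0.10 (the drug cuts mortality by 10 points).
p_death = 0.3 + 0.4 * severity - 0.10 * treat
death = rng.binomial(1, p_death)

# Naive comparison: biased, because treated units are sicker on average.
naive = death[treat == 1].mean() - death[treat == 0].mean()

# Stratify on the confounder, then average strata by their population share.
strata = [death[(treat == 1) & (severity == s)].mean()
          - death[(treat == 0) & (severity == s)].mean()
          for s in (0, 1)]
adjusted = np.average(strata, weights=[(severity == s).mean() for s in (0, 1)])

print(f"naive: {naive:+.3f}, adjusted: {adjusted:+.3f}")
```

The naive estimate comes out positive (drug looks harmful); the stratified estimate is close to the true -0.10.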
Formal Setup and Notation
Consider $n$ units (e.g., patients) indexed by $i = 1, \dots, n$. Each unit has a binary treatment $T_i \in \{0, 1\}$, covariates $X_i$, and an observed outcome $Y_i$.
Potential Outcomes
For each unit $i$, the potential outcomes are:
- $Y_i(1)$: the outcome if unit $i$ receives treatment ($T_i = 1$)
- $Y_i(0)$: the outcome if unit $i$ receives control ($T_i = 0$)

The observed outcome is $Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)$. The individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$, which is never observed because we only see one potential outcome per unit.
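A minimal numeric illustration of this consistency relation (pure simulation; in real data the two potential outcome arrays are never both visible):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# In a simulation we can generate BOTH potential outcomes per unit...
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 2.0                      # individual effect tau_i = 2 for every unit
t = rng.binomial(1, 0.5, n)

# ...but an analyst only ever sees Y_i = T_i * Y_i(1) + (1 - T_i) * Y_i(0).
y_obs = t * y1 + (1 - t) * y0

print(y_obs)
```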
Average Treatment Effect (ATE)
The average treatment effect is:

$$\tau_{\text{ATE}} = \mathbb{E}[Y(1) - Y(0)]$$

This is the expected causal effect of treatment across the population.
Average Treatment Effect on the Treated (ATT)
The ATT conditions on the treated group:

$$\tau_{\text{ATT}} = \mathbb{E}[Y(1) - Y(0) \mid T = 1]$$
This answers: for those who actually received treatment, what was the average effect?
Confounder
A confounder is a variable that causally affects both the treatment assignment $T$ and the outcome $Y$. When confounders are present, the naive comparison $\mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0]$ does not equal the ATE because treated and control groups differ systematically.
Core Definitions
The identification problem: to estimate the ATE from observational data, we need assumptions that link the observed data distribution to the causal quantity. The two standard assumptions are:
Ignorability (also called unconfoundedness or conditional independence):

$$(Y(0), Y(1)) \perp\!\!\!\perp T \mid X$$

Given covariates $X$, treatment assignment is independent of the potential outcomes. This means there are no unmeasured confounders.
Overlap (positivity):

$$0 < P(T = 1 \mid X = x) < 1 \quad \text{for all } x$$

Every unit has a nonzero chance of receiving either treatment or control. Without overlap, you cannot compare treated and control units with similar covariates.
Propensity Score
The propensity score is the conditional probability of treatment given covariates:

$$e(x) = P(T = 1 \mid X = x)$$

The propensity score is a scalar summary of the covariates that is sufficient for confounding adjustment.
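In practice $e(x)$ is unknown and must be estimated; logistic regression is the classic choice. The sketch below fits one by gradient ascent in plain NumPy as a minimal stand-in for a library routine such as scikit-learn's `LogisticRegression` (the data-generating numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
true_e = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # true propensity e(x)
t = rng.binomial(1, true_e)

# Fit e(x) = sigmoid(a + b*x) by gradient ascent on the mean log-likelihood.
a, b = 0.0, 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(a + b * x)))
    a += 0.5 * np.mean(t - p)          # gradient w.r.t. intercept
    b += 0.5 * np.mean((t - p) * x)    # gradient w.r.t. slope

print(a, b)  # should approach the true coefficients (0.5, 1.5)
```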
Main Theorems
Identification of the ATE Under Ignorability
Statement
Under ignorability and overlap, the average treatment effect is identified from observed data:

$$\tau_{\text{ATE}} = \mathbb{E}_X\big[\mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X]\big]$$

Equivalently, using inverse propensity weighting (IPW):

$$\tau_{\text{ATE}} = \mathbb{E}\left[\frac{TY}{e(X)} - \frac{(1 - T)Y}{1 - e(X)}\right]$$

Both expressions equal $\mathbb{E}[Y(1) - Y(0)]$.
Intuition
Ignorability says that within groups of units with the same $X$, treatment is effectively randomized. So the difference in outcomes within each $X$-stratum is a valid causal estimate. Averaging over strata gives the ATE. The IPW formula reweights each observation to create a pseudo-population in which treatment is independent of $X$, mimicking a randomized experiment.
Proof Sketch
For the regression form: by ignorability, $\mathbb{E}[Y \mid T = 1, X] = \mathbb{E}[Y(1) \mid T = 1, X] = \mathbb{E}[Y(1) \mid X]$. Similarly, $\mathbb{E}[Y \mid T = 0, X] = \mathbb{E}[Y(0) \mid X]$. So the conditional difference is $\mathbb{E}[Y(1) - Y(0) \mid X]$, and taking the outer expectation over $X$ gives the ATE, $\mathbb{E}[Y(1) - Y(0)]$.
For IPW: $\mathbb{E}\left[\frac{TY}{e(X)} \,\middle|\, X\right] = \frac{\mathbb{E}[T\,Y(1) \mid X]}{e(X)}$. By ignorability, $\mathbb{E}[T\,Y(1) \mid X] = \mathbb{E}[T \mid X]\,\mathbb{E}[Y(1) \mid X] = e(X)\,\mathbb{E}[Y(1) \mid X]$. Dividing by $e(X)$ and taking the outer expectation gives $\mathbb{E}[Y(1)]$. The same argument with $\frac{(1 - T)Y}{1 - e(X)}$ gives $\mathbb{E}[Y(0)]$.
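The IPW identity translates directly into an estimator. In this sketch the propensity is known by construction (all numbers are invented; in practice $e(X)$ would be estimated):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))            # known propensity score e(X)
t = rng.binomial(1, e)
# Outcome model: Y = X + 2*T + noise, so the true ATE is 2.
y = x + 2 * t + rng.normal(size=n)

# IPW: weight treated units by 1/e(X) and controls by 1/(1 - e(X)).
ate_ipw = np.mean(t * y / e - (1 - t) * y / (1 - e))

# Naive difference in means, biased upward because high-x units are treated more.
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"IPW: {ate_ipw:.3f}, naive: {naive:.3f}")
```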
Why It Matters
This is the foundational identification result of causal inference. It says exactly when and how you can estimate causal effects from observational data. Without ignorability, no amount of statistical sophistication gives you a causal effect. The theorem also motivates practical methods: outcome regression, IPW, and their doubly-robust combination.
Failure Mode
The theorem relies entirely on the ignorability assumption, which is untestable. If there are unmeasured confounders (e.g., disease severity that is not recorded in the data), the estimate is biased. Sensitivity analysis can assess how large an unmeasured confounder would need to be to change your conclusions.
Rosenbaum-Rubin Propensity Score Theorem
Statement
If $(Y(0), Y(1)) \perp\!\!\!\perp T \mid X$, then:

$$(Y(0), Y(1)) \perp\!\!\!\perp T \mid e(X)$$

That is, adjusting for the scalar propensity score $e(X)$ is sufficient for confounding control, even when $X$ is high-dimensional.
Intuition
The propensity score is a "summary score" that captures everything about $X$ that matters for treatment assignment. Two units with the same propensity score have the same probability of treatment, so comparing them is like a randomized experiment, even if their covariates differ in other ways.
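This balancing property can be checked empirically: a covariate that is imbalanced overall becomes balanced once you condition on a narrow band of the propensity score (toy simulation, all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
e = 1 / (1 + np.exp(-(x1 + x2)))     # propensity depends on both covariates
t = rng.binomial(1, e)

# Overall, treated units have systematically higher x1 (confounded comparison).
overall_gap = x1[t == 1].mean() - x1[t == 0].mean()

# Within a narrow propensity stratum, x1 is balanced between the groups,
# even though x1 itself still varies inside the stratum.
s = (e > 0.60) & (e < 0.65)
stratum_gap = x1[s & (t == 1)].mean() - x1[s & (t == 0)].mean()

print(f"overall: {overall_gap:.3f}, within stratum: {stratum_gap:.3f}")
```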
Proof Sketch
We need to show . By the law of iterated expectations:
By ignorability, . Since is a function of that we are conditioning on, this equals almost surely.
Why It Matters
The propensity score reduces a high-dimensional adjustment problem to a one-dimensional one. Instead of matching or stratifying on all covariates (which is impractical in high dimensions due to the curse of dimensionality), you can match, stratify, or weight by the single propensity score. This is why propensity score methods are ubiquitous in observational studies.
Failure Mode
The propensity score does not solve the fundamental problem of unmeasured confounders: if ignorability fails given $X$, it also fails given $e(X)$. Also, propensity score matching discards unmatched units, which can change the target estimand. Extreme propensity scores near 0 or 1 (lack of overlap) cause high variance in IPW estimates.
Canonical Examples
Observational study with confounding
You want to estimate the effect of a job training program on earnings. People who enroll are different from those who do not: they may be more motivated or have different baseline skills. Motivation is a confounder (it affects both enrollment and earnings).
A naive comparison of enrollees vs. non-enrollees gives a biased estimate. With measured covariates (age, education, prior earnings), you can use propensity score matching: match each enrollee with a non-enrollee who had similar probability of enrolling. The difference in outcomes within matched pairs estimates the ATT, assuming all confounders are measured.
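A minimal sketch of the matching step, under strong simplifications: everything is simulated, a single covariate `x` stands in for the measured confounders, and the propensity score is taken as known rather than estimated:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x = rng.normal(size=n)                          # standardized confounder
e = 1 / (1 + np.exp(-1.2 * x))                  # enrollment probability rises with x
t = rng.binomial(1, e)
y = 10 + 3 * x + 1.5 * t + rng.normal(size=n)   # true effect of training: +1.5

# 1-nearest-neighbor matching on the propensity score, with replacement:
# for each enrollee, find the non-enrollee with the closest score.
treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
order = np.argsort(e[control])
pos = np.clip(np.searchsorted(e[control][order], e[treated]), 1, len(control) - 1)
left, right = control[order[pos - 1]], control[order[pos]]
match = np.where(np.abs(e[left] - e[treated]) <= np.abs(e[right] - e[treated]),
                 left, right)

att = np.mean(y[treated] - y[match])
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"matched ATT: {att:.2f}, naive: {naive:.2f}")
```

The matched estimate is close to the true 1.5, while the naive difference is inflated by the confounder.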
Common Confusions
A/B tests are randomized experiments, not observational studies
When you run an A/B test, you randomly assign users to treatment and control. This guarantees ignorability by design: no confounders can bias the comparison. You do not need propensity scores or instrumental variables. The simple difference in means is an unbiased estimate of the ATE. The machinery of observational causal inference is for when you cannot randomize.
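Under randomization the analysis really is this simple: a difference in means plus a standard error (the click-through rates below are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
clicks_a = rng.binomial(1, 0.10, n)   # control: 10% click-through rate
clicks_b = rng.binomial(1, 0.12, n)   # treatment lifts it to 12%

# Unbiased ATE estimate and a 95% confidence half-width.
ate_hat = clicks_b.mean() - clicks_a.mean()
se = np.sqrt(clicks_a.var() / n + clicks_b.var() / n)
print(f"lift: {ate_hat:.4f} +/- {1.96 * se:.4f}")
```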
Prediction models do not give causal effects
A model that predicts $Y$ well from $T$ and $X$ tells you about associations. The coefficient on $T$ in a regression is not the causal effect of $T$ unless the regression satisfies ignorability. This distinction between predictive and causal models is critical in applied settings. A doctor who treats sicker patients will have worse patient outcomes, but this does not mean the doctor is harmful: the association between treatment and outcome is confounded by severity.
Doubly robust is not doubly safe
The doubly robust estimator combines outcome regression and IPW, and it is consistent if either model is correctly specified (not necessarily both). But it is not robust to unmeasured confounders. The "double robustness" is about modeling flexibility, not about the causal assumptions.
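A sketch of the augmented IPW (AIPW) form of the doubly robust estimator, with a deliberately misspecified outcome model (it ignores $X$ entirely). The estimator stays on target because the propensity model is correct, which illustrates the protection against modeling error; nothing here protects against unmeasured confounding. All numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                 # correct propensity model
t = rng.binomial(1, e)
y = x + 2 * t + rng.normal(size=n)       # true ATE = 2

# Misspecified outcome model: constant predictions that ignore X.
mu1, mu0 = y[t == 1].mean(), y[t == 0].mean()

# AIPW: outcome-model prediction plus an IPW correction of its residuals.
aipw = (np.mean(mu1 + t * (y - mu1) / e)
        - np.mean(mu0 + (1 - t) * (y - mu0) / (1 - e)))
print(f"AIPW: {aipw:.3f}")
```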
Summary
- The fundamental problem: you only observe one potential outcome per unit, so individual causal effects are never directly observable
- The ATE is the target estimand; the naive difference in means equals it only when treatment is independent of the potential outcomes (e.g., under randomization)
- Ignorability (no unmeasured confounders) is untestable and is the critical assumption for all observational methods
- Propensity scores (often estimated via logistic regression) reduce high-dimensional confounding adjustment to a scalar
- RCTs guarantee ignorability by design; observational methods try to approximate this with measured covariates
- Instrumental variables and difference-in-differences work when ignorability fails, under different assumptions
Exercises
Problem
In an observational study, older people are more likely to receive treatment and also have worse outcomes. Is age a confounder? Draw the causal graph and explain why the naive treatment-outcome comparison is biased.
Problem
Prove that the IPW estimator is unbiased for the ATE under ignorability and overlap when the propensity score is known.
Problem
Instrumental variables (IV) estimation does not require ignorability. Instead, it uses an instrument $Z$ that affects treatment $T$ but has no direct effect on the outcome $Y$. State the IV assumptions precisely and derive the Wald estimator for the case where both $Z$ and $T$ are binary.
References
Canonical:
- Rubin, Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies (1974)
- Rosenbaum & Rubin, The Central Role of the Propensity Score in Observational Studies (1983)
Current:
- Imbens & Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015)
- Chernozhukov et al., Double/Debiased Machine Learning for Treatment and Structural Parameters (Econometrica, 2018)
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
Causal inference connects to many ML topics:
- Treatment effect estimation with machine learning (causal forests, CATE)
- Fairness as a causal concept
- Counterfactual reasoning in model explanations
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)