
Methodology

Causal Inference Basics

Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.

Advanced · Tier 3 · Stable · ~55 min

Why This Matters

Machine learning excels at prediction: given features $X$, predict outcome $Y$. But many important questions are causal: if we change $X$, what happens to $Y$? Will this drug reduce mortality? Will showing a different ad increase clicks? Will a policy intervention reduce crime?

Prediction and causation are structurally different. A linear regression model can predict hospital mortality well using "is on a ventilator" as a feature, but putting healthy people on ventilators does not help them. Confounders, variables that cause both treatment and outcome, are the central obstacle. Causal inference provides the framework for identifying when and how you can estimate causal effects from data, and it matters for A/B testing, fairness, treatment effect estimation, and counterfactual reasoning in ML.
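The ventilator story can be made concrete with a small simulation. All numbers below are hypothetical, and the sketch assumes NumPy is available: severity drives both treatment and mortality, so the naive comparison makes a helpful treatment look harmful, while stratifying on the confounder recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounded setup: severity raises both the chance of
# treatment and the mortality risk. The treatment actually HELPS
# (it lowers mortality risk by 0.1 in every stratum).
severity = rng.binomial(1, 0.3, size=n)        # confounder
p_treat = np.where(severity == 1, 0.8, 0.1)    # sicker -> far more likely treated
w = rng.binomial(1, p_treat)
base_risk = np.where(severity == 1, 0.5, 0.1)  # mortality risk without treatment
y = rng.binomial(1, np.clip(base_risk - 0.1 * w, 0, 1))

naive = y[w == 1].mean() - y[w == 0].mean()
print(f"naive difference:  {naive:+.3f}")      # positive: treatment "looks" harmful
for s in (0, 1):
    m = severity == s
    diff = y[(w == 1) & m].mean() - y[(w == 0) & m].mean()
    print(f"stratum S={s} diff: {diff:+.3f}")  # close to the true -0.1 in each stratum
```

The naive difference is badly biased upward because treated patients are disproportionately severe; within each severity stratum the comparison is fair.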

Mental Model

Imagine you want to know if a new drug works. Ideally, you would give each patient both the drug and a placebo, then compare outcomes. But each patient can only live one life. This is the fundamental problem of causal inference: you never observe both potential outcomes for the same unit.

Randomized experiments solve this by making treated and control groups comparable on average. Without randomization, you must argue carefully that you have accounted for all confounders. Observational causal inference is the art of making that argument rigorous.

Formal Setup and Notation

Consider $n$ units (e.g., patients) indexed by $i$. Each unit has a binary treatment $W_i \in \{0, 1\}$, covariates $X_i$, and an observed outcome $Y_i$.

Definition

Potential Outcomes

For each unit $i$, the potential outcomes are:

  • $Y_i(1)$: the outcome if unit $i$ receives treatment ($W_i = 1$)
  • $Y_i(0)$: the outcome if unit $i$ receives control ($W_i = 0$)

The observed outcome is $Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)$. The individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$, which is never observed because we only see one potential outcome per unit.
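The observation rule can be illustrated in a few lines. In a simulation (unlike reality) we can generate both potential outcomes and then hide one per unit; the numbers are hypothetical and the sketch assumes NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Generate BOTH potential outcomes (possible only in simulation), then
# reveal just one per unit via Y_i = W_i * Y_i(1) + (1 - W_i) * Y_i(0).
y0 = rng.normal(0.0, 1.0, size=n)
y1 = y0 + 2.0                       # every unit's true effect tau_i = 2
w = rng.binomial(1, 0.5, size=n)
y_obs = w * y1 + (1 - w) * y0

for i in range(n):
    hidden = y0[i] if w[i] == 1 else y1[i]
    print(f"unit {i}: W={w[i]}  observed Y={y_obs[i]:+.2f}  "
          f"(unobservable counterfactual: {hidden:+.2f})")
```

Each printed row shows one realized outcome and one counterfactual that, outside a simulation, no dataset can ever contain.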

Definition

Average Treatment Effect (ATE)

The average treatment effect is:

\tau = \mathbb{E}[Y_i(1) - Y_i(0)]

This is the expected causal effect of treatment across the population.

Definition

Average Treatment Effect on the Treated (ATT)

The ATT conditions on the treated group:

\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid W_i = 1]

This answers: for those who actually received treatment, what was the average effect?

Definition

Confounder

A confounder is a variable $X$ that causally affects both the treatment assignment $W$ and the outcome $Y$. When confounders are present, the naive comparison $\mathbb{E}[Y \mid W=1] - \mathbb{E}[Y \mid W=0]$ does not equal the ATE because treated and control groups differ systematically.

Core Definitions

The identification problem: to estimate the ATE from observational data, we need assumptions that link the observed data distribution to the causal quantity. The two standard assumptions are:

Ignorability (also called unconfoundedness or conditional independence):

(Y(0), Y(1)) \perp W \mid X

Given covariates $X$, treatment assignment is independent of potential outcomes. This means there are no unmeasured confounders.

Overlap (positivity):

0 < P(W = 1 \mid X = x) < 1 \quad \text{for all } x

Every unit has a nonzero chance of receiving either treatment or control. Without overlap, you cannot compare treated and control units with similar covariates.

Definition

Propensity Score

The propensity score is the conditional probability of treatment given covariates:

e(x) = P(W = 1 \mid X = x)

The propensity score is a scalar summary of the covariates that is sufficient for confounding adjustment.
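With a discrete covariate, the propensity score can be estimated by simple group frequencies, which makes the definition concrete. A minimal sketch with hypothetical numbers, assuming NumPy (for continuous covariates one would instead fit a model such as logistic regression):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Covariate with 3 levels; e(x) = P(W=1 | X=x) is just the treatment
# frequency within each level.
x = rng.integers(0, 3, size=n)
true_e = np.array([0.2, 0.5, 0.8])[x]    # true propensity per level
w = rng.binomial(1, true_e)

e_hat = np.array([w[x == k].mean() for k in range(3)])
print("estimated e(x) by level:", np.round(e_hat, 3))  # close to [0.2, 0.5, 0.8]
```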

Main Theorems

Theorem

Identification of the ATE Under Ignorability

Statement

Under ignorability and overlap, the average treatment effect is identified from observed data:

\tau = \mathbb{E}\left[\mathbb{E}[Y \mid X, W=1] - \mathbb{E}[Y \mid X, W=0]\right]

Equivalently, using inverse propensity weighting (IPW):

\tau = \mathbb{E}\left[\frac{WY}{e(X)} - \frac{(1-W)Y}{1-e(X)}\right]

Both expressions equal $\mathbb{E}[Y(1) - Y(0)]$.

Intuition

Ignorability says that within groups of units with the same $X$, treatment is effectively randomized. So the difference in outcomes within each $X$-stratum is a valid causal estimate. Averaging over strata gives the ATE. The IPW formula reweights each observation to create a pseudo-population where treatment is independent of $X$, mimicking a randomized experiment.
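Both identification formulas can be checked numerically. The sketch below uses hypothetical simulated data with a single binary confounder and a known propensity score, assuming NumPy; both estimators land near the true ATE of 1.0 while the naive difference does not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Binary confounder x raises both treatment probability and the outcome.
x = rng.binomial(1, 0.5, size=n)
e = np.where(x == 1, 0.7, 0.3)                    # known e(X) = P(W=1 | X)
w = rng.binomial(1, e)
y = 2.0 * x + 1.0 * w + rng.normal(0, 1, size=n)  # true ATE = 1.0

naive = y[w == 1].mean() - y[w == 0].mean()       # confounded, biased upward

# Regression form: average within-stratum differences over the X distribution.
strata = [y[(x == k) & (w == 1)].mean() - y[(x == k) & (w == 0)].mean()
          for k in (0, 1)]
p1 = x.mean()
tau_reg = (1 - p1) * strata[0] + p1 * strata[1]

# IPW form: reweight by the known propensity score.
tau_ipw = np.mean(w * y / e - (1 - w) * y / (1 - e))

print(f"naive: {naive:.3f}  regression: {tau_reg:.3f}  IPW: {tau_ipw:.3f}")
```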

Proof Sketch

For the regression form: since the observed $Y$ equals $Y(1)$ whenever $W = 1$ (consistency) and by ignorability, $\mathbb{E}[Y \mid X, W=1] = \mathbb{E}[Y(1) \mid X, W=1] = \mathbb{E}[Y(1) \mid X]$. Similarly for $W = 0$. So the conditional difference is $\mathbb{E}[Y(1) - Y(0) \mid X]$, and taking the outer expectation gives $\tau$.

For IPW: $\mathbb{E}[WY/e(X)] = \mathbb{E}[\mathbb{E}[WY/e(X) \mid X]]$. By ignorability, $\mathbb{E}[WY \mid X] = \mathbb{E}[W \mid X]\,\mathbb{E}[Y(1) \mid X] = e(X)\,\mathbb{E}[Y(1) \mid X]$. Dividing by $e(X)$ and taking the outer expectation gives $\mathbb{E}[Y(1)]$. The same argument with $(1-W)Y/(1-e(X))$ gives $\mathbb{E}[Y(0)]$.

Why It Matters

This is the foundational identification result of causal inference. It says exactly when and how you can estimate causal effects from observational data. Without ignorability, no amount of statistical sophistication gives you a causal effect. The theorem also motivates practical methods: outcome regression, IPW, and their doubly-robust combination.

Failure Mode

The theorem relies entirely on the ignorability assumption, which is untestable. If there are unmeasured confounders (e.g., disease severity that is not recorded in the data), the estimate is biased. Sensitivity analysis can assess how large an unmeasured confounder would need to be to change your conclusions.

Theorem

Rosenbaum-Rubin Propensity Score Theorem

Statement

If $(Y(0), Y(1)) \perp W \mid X$, then:

(Y(0), Y(1)) \perp W \mid e(X)

That is, adjusting for the scalar propensity score $e(X)$ is sufficient for confounding control, even when $X$ is high-dimensional.

Intuition

The propensity score is a "summary score" that captures everything about $X$ that matters for treatment assignment. Two units with the same propensity score have the same probability of treatment, so comparing them is like a randomized experiment, even if their covariates $X$ differ in other ways.

Proof Sketch

We need to show $P(W=1 \mid Y(0), Y(1), e(X)) = P(W=1 \mid e(X))$. By the law of iterated expectations:

P(W=1 \mid Y(0), Y(1), e(X)) = \mathbb{E}[P(W=1 \mid Y(0), Y(1), X) \mid Y(0), Y(1), e(X)]

By ignorability, $P(W=1 \mid Y(0), Y(1), X) = P(W=1 \mid X) = e(X)$. Since $e(X)$ is part of the conditioning set, the outer expectation equals $e(X)$ almost surely; the same argument shows $P(W=1 \mid e(X)) = e(X)$, so the two sides agree.

Why It Matters

The propensity score reduces a high-dimensional adjustment problem to a one-dimensional one. Instead of matching or stratifying on all covariates $X$ (which is impractical in high dimensions due to the curse of dimensionality), you can match, stratify, or weight by the single propensity score. This is why propensity score methods are ubiquitous in observational studies.

Failure Mode

The propensity score does not solve the fundamental problem of unmeasured confounders. If ignorability fails given $X$, it also fails given $e(X)$. Also, propensity score matching discards unmatched units, which can change the target estimand. Extreme propensity scores near 0 or 1 (lack of overlap) cause high variance in IPW estimates.

Canonical Examples

Example

Observational study with confounding

You want to estimate the effect of a job training program on earnings. People who enroll are different from those who do not: they may be more motivated or have different baseline skills. Motivation is a confounder (it affects both enrollment and earnings).

A naive comparison of enrollees vs. non-enrollees gives a biased estimate. With measured covariates (age, education, prior earnings), you can use propensity score matching: match each enrollee with a non-enrollee who had similar probability of enrolling. The difference in outcomes within matched pairs estimates the ATT, assuming all confounders are measured.
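A minimal sketch of 1-nearest-neighbor propensity matching (with replacement) on simulated data of this flavor, assuming NumPy. All numbers are hypothetical, and for simplicity the true propensity score is used; a real analysis would estimate it from covariates:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# "Motivation" x confounds enrollment: it raises both the enrollment
# probability and earnings. The true effect on the treated is 2.0.
x = rng.normal(0, 1, size=n)
e = 1 / (1 + np.exp(-x))                        # propensity score
w = rng.binomial(1, e)
y = 3.0 * x + 2.0 * w + rng.normal(0, 1, size=n)

treated = np.where(w == 1)[0]
control = np.where(w == 0)[0]
ec = e[control]

# Match each enrollee to the control with the closest propensity score.
order = np.argsort(ec)
pos = np.clip(np.searchsorted(ec[order], e[treated]), 1, len(control) - 1)
left, right = order[pos - 1], order[pos]
nearest = np.where(np.abs(ec[left] - e[treated]) <= np.abs(ec[right] - e[treated]),
                   control[left], control[right])

att_hat = np.mean(y[treated] - y[nearest])
print(f"matched ATT estimate: {att_hat:.3f}")   # near the true 2.0
```

The naive difference here would be inflated by motivation; matching on the propensity score compares enrollees only to similar non-enrollees.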

Common Confusions

Watch Out

A/B tests are randomized experiments, not observational studies

When you run an A/B test, you randomly assign users to treatment and control. This guarantees ignorability by design: no confounders can bias the comparison. You do not need propensity scores or instrumental variables. The simple difference in means is an unbiased estimate of the ATE. The machinery of observational causal inference is for when you cannot randomize.
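A quick simulated check of this point, with hypothetical click-through numbers and assuming NumPy: user engagement strongly affects clicks, but because assignment is a coin flip it cannot confound, and the plain difference in means recovers the true lift:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Engagement affects clicks but NOT assignment, so it is not a confounder.
engagement = rng.normal(0, 1, size=n)
w = rng.binomial(1, 0.5, size=n)                # randomized assignment
p_click = np.clip(0.10 + 0.02 * w + 0.03 * engagement, 0, 1)  # true lift 0.02
y = rng.binomial(1, p_click)

lift = y[w == 1].mean() - y[w == 0].mean()
se = np.sqrt(y[w == 1].var() / (w == 1).sum() + y[w == 0].var() / (w == 0).sum())
print(f"estimated lift: {lift:.4f} ± {1.96 * se:.4f}")  # near 0.02
```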

Watch Out

Prediction models do not give causal effects

A model that predicts $Y$ well from $X$ and $W$ tells you about associations. The coefficient on $W$ in a regression is not the causal effect of $W$ unless the regression satisfies ignorability. This distinction between predictive and causal models is critical in applied settings. A doctor who treats sicker patients will have worse patient outcomes, but this does not mean the doctor is harmful. The association between treatment and outcome is confounded by severity.

Watch Out

Doubly robust is not doubly safe

The doubly robust estimator combines outcome regression and IPW, and it is consistent if either model is correctly specified (not necessarily both). But it is not robust to unmeasured confounders. The "double robustness" is about modeling flexibility, not about the causal assumptions.
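The one-model robustness can be demonstrated numerically. In the sketch below (hypothetical simulated data, assuming NumPy), the outcome model is deliberately misspecified as a pair of constants, yet the AIPW estimator still recovers the true ATE of 1.0 because the propensity model is correct:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

x = rng.normal(0, 1, size=n)
e_true = 1 / (1 + np.exp(-x))                   # correct propensity model
w = rng.binomial(1, e_true)
y = x + x**2 + 1.0 * w + rng.normal(0, 1, size=n)  # nonlinear in x, ATE = 1.0
e = np.clip(e_true, 0.01, 0.99)                 # trim extreme weights (overlap)

# Deliberately BAD outcome models: plain group means, ignoring x entirely.
mu1 = np.full(n, y[w == 1].mean())
mu0 = np.full(n, y[w == 0].mean())
naive = mu1[0] - mu0[0]                         # biased (confounded by x)

# AIPW: outcome-model difference plus propensity-weighted residual corrections.
aipw = np.mean(mu1 - mu0
               + w * (y - mu1) / e
               - (1 - w) * (y - mu0) / (1 - e))
print(f"misspecified outcome model: {naive:.3f}   AIPW: {aipw:.3f}")
```

The residual-correction terms repair the bias of the bad outcome model whenever the propensity score is right (and vice versa), but nothing repairs an unmeasured confounder.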

Summary

  • The fundamental problem: you only observe one potential outcome per unit, so individual causal effects are never directly observable
  • ATE $= \mathbb{E}[Y(1) - Y(0)]$ is the target; the naive difference $\mathbb{E}[Y \mid W=1] - \mathbb{E}[Y \mid W=0]$ equals it only when treatment is unconfounded, e.g., under randomization or after adjusting for all confounders
  • Ignorability (no unmeasured confounders) is untestable and is the critical assumption for all observational methods
  • Propensity scores (often estimated via logistic regression) reduce high-dimensional confounding adjustment to a scalar
  • RCTs guarantee ignorability by design; observational methods try to approximate this with measured covariates
  • Instrumental variables and difference-in-differences work when ignorability fails, under different assumptions

Exercises

ExerciseCore

Problem

In an observational study, older people are more likely to receive treatment and also have worse outcomes. Is age a confounder? Draw the causal graph and explain why the naive treatment-outcome comparison is biased.

ExerciseAdvanced

Problem

Prove that the IPW estimator $\hat{\tau} = \frac{1}{n}\sum_{i=1}^n \left[\frac{W_i Y_i}{e(X_i)} - \frac{(1-W_i) Y_i}{1-e(X_i)}\right]$ is unbiased for the ATE under ignorability and overlap when the propensity score is known.

ExerciseResearch

Problem

Instrumental variables (IV) estimation does not require ignorability. Instead, it uses an instrument $Z$ that affects treatment $W$ but has no direct effect on outcome $Y$. State the IV assumptions precisely and derive the Wald estimator for the case where both $Z$ and $W$ are binary.

References

Canonical:

  • Rubin, Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies (1974)
  • Rosenbaum & Rubin, The Central Role of the Propensity Score in Observational Studies (1983)

Current:

  • Imbens & Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015)

  • Chernozhukov et al., Double/Debiased Machine Learning for Treatment and Structural Parameters (Econometrica, 2018)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

Causal inference connects to many ML topics:

  • Treatment effect estimation with machine learning (causal forests, CATE)
  • Fairness as a causal concept
  • Counterfactual reasoning in model explanations

Last reviewed: April 2026
