Methodology
Causal Inference and the Ladder of Causation
Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.
Why This Matters
A prediction model tells you P(Y | X = x): given observed features, what outcome is likely. A causal model tells you P(Y | do(X = x)): if you intervene to set X to a particular value, what happens to Y. These are different quantities. Knowing that umbrellas correlate with rain does not mean distributing umbrellas will cause rain.
Pearl's framework provides the formal machinery for distinguishing association from causation. It defines three levels of causal reasoning, shows that each level requires strictly more information than the one below, and gives algorithms (do-calculus) for computing causal and counterfactual quantities from a combination of data and assumptions encoded in a directed acyclic graph (DAG).
For ML practitioners, this matters directly. Feature importance scores measure association, not causation. Attention weights show what the model looks at, not what causes the output. Confounders in training data produce spurious correlations that fail under distribution shift. Causal reasoning is the tool for understanding when a model's learned associations will generalize and when they will break.
The Ladder of Causation
Pearl organizes causal reasoning into three levels, each requiring strictly stronger assumptions than the previous.
Level 1: Association (Seeing)
Association answers questions of the form: given that I observe X = x, what is the probability of Y = y? Formally: P(y | x).
This is the domain of standard statistics and machine learning. It requires only observational data. Regression, classification, and density estimation all operate at this level. Association captures correlation, conditional probability, and prediction, but cannot distinguish causes from effects.
Level 2: Intervention (Doing)
Intervention answers questions of the form: if I actively set X to x (regardless of what X would have been), what is the probability of Y = y? Formally: P(y | do(x)).
This is the domain of experiments and policy decisions. The do(·) operator distinguishes intervention from observation. P(y | do(x)) may differ from P(y | x) whenever there exist confounders that affect both X and Y. Answering interventional questions from observational data requires causal assumptions, typically encoded as a DAG.
Level 3: Counterfactual (Imagining)
Counterfactual answers questions of the form: given that I observed X = x and Y = y, what would Y have been if X had been x′ instead? Formally: P(Y_{x′} = y′ | X = x, Y = y).
This is the domain of attribution, regret, and individual-level reasoning. Counterfactuals require the full structural causal model, not just the DAG. They condition on what actually happened and ask about what would have happened under a different intervention. This level is strictly more informative than Level 2: two SCMs can agree on all interventional distributions but disagree on counterfactuals.
The hierarchy is strict: Level 1 information cannot answer Level 2 questions (without additional assumptions), and Level 2 information cannot answer Level 3 questions (without the full SCM). Each level requires a stronger model of the data-generating process.
Structural Causal Models
Structural Causal Model (SCM)
A structural causal model consists of:
- U: a set of exogenous (background) variables, determined outside the model
- V: a set of endogenous variables, determined inside the model
- F: a set of structural equations, where each equation V_i = f_i(pa_i, U_i) expresses V_i as a function of its parents pa_i and an exogenous noise term U_i
- P(U): a probability distribution over the exogenous variables
The structural equations are asymmetric: V_i = f_i(pa_i, U_i) means the parents cause V_i, not that V_i causes its parents. This asymmetry distinguishes structural equations from statistical regression equations.
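The structure of an SCM maps directly onto code: each structural equation is an assignment from parents and noise to one variable. A minimal sketch (the linear functional forms and coefficients here are invented for illustration, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    """Sample a 3-variable SCM with DAG Z -> X, Z -> Y, X -> Y.

    Each line is an assignment (a mechanism), not a symmetric equation:
    nothing downstream of a variable appears on its right-hand side.
    """
    u_z, u_x, u_y = rng.normal(size=(3, n))   # exogenous variables U
    z = u_z                                   # Z := f_Z(U_Z)
    x = 0.8 * z + u_x                         # X := f_X(Z, U_X)
    y = 1.5 * x + 0.7 * z + u_y               # Y := f_Y(X, Z, U_Y)
    return z, x, y

z, x, y = sample_scm(200_000)
# Regressing Y on X alone recovers a biased slope (~1.84, not the structural
# coefficient 1.5) because of the backdoor path X <- Z -> Y.
naive_slope = np.cov(x, y)[0, 1] / np.var(x)
print("naive slope:", naive_slope)
```

The regression slope conflates the direct mechanism (1.5) with the confounded association through Z, previewing why structural and regression equations must be kept distinct.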
Causal DAG
The causal DAG associated with an SCM has one node for each endogenous variable and a directed edge from V_j to V_i whenever V_j appears in the structural equation for V_i. The DAG encodes the qualitative causal structure: which variables directly cause which others. The DAG does not encode the functional forms or noise distributions.
Drug, Recovery, and Age
Consider three variables: age (A), drug treatment (D), and recovery (R). Structural equations:
A = f_A(U_A),  D = f_D(A, U_D),  R = f_R(A, D, U_R)
Age causally affects both treatment assignment (doctors prescribe differently for older patients) and recovery (older patients recover more slowly). The DAG has edges A → D, A → R, and D → R. Age is a confounder for the effect of D on R.
The observational distribution P(R | D) conflates the drug's causal effect with the confounding through age. The interventional distribution P(R | do(D)) isolates the drug's causal effect by conceptually randomizing treatment, breaking the A → D arrow.
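The gap between P(R | D) and P(R | do(D)) can be made concrete by simulation. In the sketch below the functional forms and coefficients are invented (the drug raises recovery probability by 0.2 for everyone; older patients are both more likely to be treated and less likely to recover); intervening is implemented by overriding the equation for D:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_d=None):
    """Sample (D, R) from the SCM; do_d=0/1 replaces the equation for D."""
    age = rng.uniform(20, 80, n)                        # A = U_A
    if do_d is None:
        d = (rng.random(n) < age / 100).astype(int)     # A -> D: older, more treated
    else:
        d = np.full(n, do_d)                            # intervention do(D = do_d)
    p_r = 0.3 + 0.004 * (80 - age) + 0.2 * d            # A -> R, D -> R
    r = (rng.random(n) < p_r).astype(int)
    return d, r

d_obs, r_obs = simulate()
obs_diff = r_obs[d_obs == 1].mean() - r_obs[d_obs == 0].mean()

_, r1 = simulate(do_d=1)
_, r0 = simulate(do_d=0)
do_diff = r1.mean() - r0.mean()

print("observational difference:", obs_diff)   # ~0.15: attenuated by confounding
print("interventional difference:", do_diff)   # ~0.20: the true causal effect
```

Because the treated group skews older (and older patients recover less), the observational contrast understates the drug's benefit; randomizing D removes the age imbalance.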
The do-Operator and Truncated Factorization
The do-Operator
The do-operator do(X = x) represents an external intervention that sets variable X to value x, overriding the structural equation for X. In the modified SCM M_x, the equation for X is replaced by the constant assignment X = x, and all other equations remain unchanged.
The interventional distribution is defined as:
P(y | do(x)) := P_{M_x}(y)
where P_{M_x} is the distribution induced by the modified model M_x.
Truncated Factorization
In a causal DAG with variables X_1, …, X_n, the observational distribution factorizes as:
P(x_1, …, x_n) = ∏_i P(x_i | pa_i)
Under intervention do(X_j = x*), the interventional distribution is obtained by deleting the factor for X_j and substituting X_j = x*:
P(x_1, …, x_n | do(X_j = x*)) = ∏_{i ≠ j} P(x_i | pa_i), evaluated at x_j = x* (and zero for x_j ≠ x*)
This is the truncated factorization formula. It formalizes the idea that intervention breaks the causal mechanism that normally determines while leaving all other mechanisms intact.
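For a small discrete model the truncated factorization can be computed directly. The sketch below uses an invented binary DAG Z → X, Z → Y, X → Y with illustrative conditional probability tables; the interventional distribution simply drops the factor P(x | z):

```python
# Binary DAG Z -> X, Z -> Y, X -> Y (illustrative probability tables)
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # p_x_given_z[z][x]
p_y1_given_xz = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.7}  # keys (x, z)

def p_y1_obs(x):
    """P(Y=1 | X=x): condition on the full observational joint."""
    num = den = 0.0
    for z in (0, 1):
        joint = p_z[z] * p_x_given_z[z][x]        # P(z) P(x|z)
        num += joint * p_y1_given_xz[(x, z)]
        den += joint
    return num / den

def p_y1_do(x):
    """P(Y=1 | do(X=x)): truncated factorization deletes the P(x|z) factor."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in (0, 1))

print("P(Y=1 | X=1)     =", p_y1_obs(1))   # ~0.64: inflated by confounding
print("P(Y=1 | do(X=1)) =", p_y1_do(1))    # ~0.58
```

Observation weights the strata of Z by P(z | x), whereas intervention weights them by P(z); the two disagree exactly because Z influences X.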
d-Separation
d-Separation
In a DAG G, a path p between nodes X and Y is blocked by a set of nodes Z if it contains:
- A chain i → m → j or a fork i ← m → j where m ∈ Z, or
- A collider i → m ← j where m ∉ Z and no descendant of m is in Z.
X and Y are d-separated by Z in G (written (X ⊥ Y | Z)_G) if every path between X and Y is blocked by Z.
Under the causal Markov condition, d-separation in the DAG implies conditional independence in the distribution: (X ⊥ Y | Z)_G implies X ⊥⊥ Y | Z. The faithfulness assumption adds the converse: every conditional independence in the distribution corresponds to a d-separation in the DAG.
d-Separation is the graphical criterion that determines which conditional independence relations hold in the observational distribution generated by an SCM. It is the tool for determining whether a set of covariates is sufficient to block confounding paths.
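Collider behavior, the least intuitive of the three cases, is easy to verify numerically. In this invented example X and Y are independent causes of a collider C; selecting on C (a form of conditioning) induces a spurious negative association between them:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)                  # independent cause of C
y = rng.normal(size=n)                  # independent cause of C
c = x + y + 0.5 * rng.normal(size=n)    # collider: X -> C <- Y

marginal = np.corrcoef(x, y)[0, 1]      # ~0: X and Y are d-separated by {}
stratum = c > 1.0                       # conditioning on the collider
conditional = np.corrcoef(x[stratum], y[stratum])[0, 1]  # clearly negative

print("corr(X, Y):        ", marginal)
print("corr(X, Y | C > 1):", conditional)
```

Within the stratum C > 1, a large X "explains away" the need for a large Y, producing the negative correlation that d-separation predicts when the collider enters the conditioning set.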
The Backdoor Criterion
Backdoor Criterion and Adjustment Formula
Statement
A set of variables Z satisfies the backdoor criterion relative to an ordered pair (X, Y) in a DAG G if:
- No node in Z is a descendant of X.
- Z blocks every path between X and Y that contains an arrow into X (a "backdoor path").
If Z satisfies the backdoor criterion, then the causal effect of X on Y is identifiable and given by the adjustment formula:
P(y | do(x)) = Σ_z P(y | x, z) P(z)
For continuous variables, the sum becomes an integral.
Intuition
Backdoor paths are non-causal paths from X to Y that flow through common causes (confounders). These paths create spurious associations between X and Y in the observational data. The backdoor criterion identifies sets of variables that, when conditioned on, block all these spurious paths without blocking any causal paths. The adjustment formula computes the interventional distribution by stratifying on Z: within each stratum Z = z, the remaining association between X and Y is causal.
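Estimated from finite data, the adjustment formula becomes stratify-then-average. A sketch on simulated binary data with an invented confounder Z (the true effect of X on P(Y = 1) is +0.2 by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
z = (rng.random(n) < 0.5).astype(int)                          # confounder Z
x = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)   # Z -> X
y = (rng.random(n) < 0.2 + 0.2 * x + 0.4 * z).astype(int)      # X -> Y, Z -> Y

naive = y[x == 1].mean() - y[x == 0].mean()    # associational contrast, confounded

# Adjustment formula: contrast within each stratum of Z, weighted by P(z)
adjusted = sum(
    (y[(x == 1) & (z == v)].mean() - y[(x == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)
print("naive difference:   ", naive)       # ~0.44: badly biased
print("adjusted difference:", adjusted)    # ~0.20: recovers the causal effect
```

The naive contrast mixes the drug-like effect of X with the fact that high-Z units are both more often treated and more often positive; stratifying on Z removes exactly that backdoor association.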
Proof Sketch
By the truncated factorization formula:
P(y | do(x)) = Σ_{v ∖ {x, y}} ∏_{i: V_i ≠ X} P(v_i | pa_i)
Partition the variables into Z (the adjustment set) and the rest. Condition (1) ensures that conditioning on Z does not block any causal path from X to Y (no descendants of X in Z). Condition (2) ensures that conditioning on Z blocks all non-causal (backdoor) paths. Under these conditions, the marginalization over the non-Z variables can be factored, and the result simplifies to the adjustment formula. The key step uses d-separation: after conditioning on Z, X is d-separated from all non-descendants that are not on a causal path, so the truncated product reduces to P(y | x, z) weighted by P(z).
Why It Matters
The backdoor criterion provides an actionable test: given a proposed DAG, check whether a set of measured covariates blocks all backdoor paths. If yes, you can estimate the causal effect from observational data using standard regression or stratification. If no such set exists among the measured variables, the causal effect is not identifiable by backdoor adjustment (though it may still be identifiable by other methods such as the front-door criterion or instrumental variables).
Failure Mode
The backdoor criterion assumes the DAG is correct. If the true causal structure differs from the assumed DAG (a missing edge, a reversed arrow, an unmeasured confounder), the adjustment formula gives a biased estimate of the causal effect. The DAG itself is an assumption, not something that can be fully verified from data. Domain knowledge is required to specify it.
Conditioning on a descendant of X (violating condition 1) introduces post-treatment bias. Conditioning on a collider opens a non-causal path and introduces collider bias. Both are common errors in applied work.
The Front-Door Criterion
Front-Door Criterion
A set of variables Z satisfies the front-door criterion relative to (X, Y) if:
- Z intercepts all directed paths from X to Y.
- There is no unblocked backdoor path from X to Z.
- All backdoor paths from Z to Y are blocked by X.
If the front-door criterion is satisfied, the causal effect is identifiable by the front-door formula:
P(y | do(x)) = Σ_z P(z | x) Σ_{x′} P(y | x′, z) P(x′)
The front-door criterion is useful when there is an unmeasured confounder between X and Y, but the causal effect is mediated entirely through a measurable intermediate variable Z.
Smoking, Tar, and Cancer
Consider the classic example: smoking (S) causes tar deposits (T), and tar causes cancer (C). There may be an unmeasured genetic factor (U) that causes both smoking tendency and cancer risk. The DAG has edges S → T → C and U → S, U → C.
The backdoor criterion fails for the pair (S, C) because U is unmeasured. But the front-door criterion is satisfied by Z = {T}: tar intercepts all directed paths from smoking to cancer, there is no unblocked backdoor path from smoking to tar (the confounding arrow U → S points into S, not into T, and the path S ← U → C ← T is blocked by the collider at C), and all backdoor paths from tar to cancer are blocked by S. The front-door formula identifies the causal effect of smoking on cancer even with the unmeasured confounder.
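The front-door formula can be checked exactly on a small discrete version of this example. The probability tables below are invented; U is used only to build the joint and to compute the ground truth, while the front-door estimator sees only the observational distribution over (S, T, C):

```python
from itertools import product

# Discrete model: U -> S, S -> T, T -> C, U -> C (illustrative probabilities)
def p_u(u):       return 0.5
def p_s(s, u):    p1 = 0.8 if u else 0.2; return p1 if s else 1 - p1
def p_t(t, s):    p1 = 0.9 if s else 0.1; return p1 if t else 1 - p1
def p_c(c, t, u): p1 = 0.1 + 0.5 * t + 0.3 * u; return p1 if c else 1 - p1

joint = {(u, s, t, c): p_u(u) * p_s(s, u) * p_t(t, s) * p_c(c, t, u)
         for u, s, t, c in product((0, 1), repeat=4)}

def p_obs(**fixed):
    """Observational probability: U is hidden, so marginalize it out."""
    return sum(pr for (u, s, t, c), pr in joint.items()
               if all(dict(s=s, t=t, c=c)[k] == v for k, v in fixed.items()))

def front_door(s_val):
    """P(C=1 | do(S=s)) = sum_t P(t|s) * sum_s' P(s') P(C=1 | s', t)."""
    total = 0.0
    for t in (0, 1):
        p_t_given_s = p_obs(s=s_val, t=t) / p_obs(s=s_val)
        inner = sum(p_obs(s=sp) * p_obs(s=sp, t=t, c=1) / p_obs(s=sp, t=t)
                    for sp in (0, 1))
        total += p_t_given_s * inner
    return total

def truth(s_val):
    """Ground truth P(C=1 | do(S=s)) from the full SCM, using the hidden U."""
    return sum(p_u(u) * p_t(t, s_val) * p_c(1, t, u)
               for u, t in product((0, 1), repeat=2))

print("front-door:", front_door(1), " truth:", truth(1))
```

Despite never seeing U, the front-door estimator reproduces the interventional quantity exactly, which is the whole point of the criterion.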
Simpson's Paradox as a Causal Phenomenon
Simpson's paradox occurs when a statistical trend that appears in several groups reverses when the groups are combined (or vice versa). The "paradox" is not a statistical error. It is a signal that the data involve confounding and that the correct analysis depends on the causal structure.
Simpson's Paradox in Treatment Data
A drug appears effective overall: recovery rate is higher in the treated group (65%) than the untreated group (35%). But within each gender subgroup, the drug appears harmful: treated men recover less often than untreated men (70% vs. 80%), and treated women recover less often than untreated women (20% vs. 30%). The reversal is possible because treatment is correlated with gender: most treated patients are men (the high-recovery group), and most untreated patients are women.
The resolution depends on the causal DAG. If gender is a confounder (it affects both treatment assignment and recovery), then the stratified analysis is correct and the drug is harmful. If gender is a mediator (treatment affects recovery partly through gender-related mechanisms), then the aggregated analysis is correct. Simpson's paradox shows that statistical tables alone cannot answer causal questions. You need the DAG.
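The reversal is reproducible from a small contingency table. The counts below are invented so that the subgroup rates are 70% vs. 80% for men and 20% vs. 30% for women while treatment is strongly correlated with gender:

```python
# (recovered, total) counts; treatment is correlated with gender
data = {
    ("treated",   "men"):   (63, 90),
    ("treated",   "women"): (2, 10),
    ("untreated", "men"):   (8, 10),
    ("untreated", "women"): (27, 90),
}

def rate(group, gender=None):
    """Recovery rate for a treatment group, optionally within one gender."""
    recovered = total = 0
    for (g, s), (r, n) in data.items():
        if g == group and (gender is None or s == gender):
            recovered += r
            total += n
    return recovered / total

print("overall: treated", rate("treated"), "untreated", rate("untreated"))
for gender in ("men", "women"):
    print(gender, ": treated", rate("treated", gender),
          "untreated", rate("untreated", gender))   # reversed within each subgroup
```

The aggregate comparison favors the drug only because the treated column is dominated by men, the subgroup that recovers more often with or without treatment.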
What Pearl's Framework Does NOT Do
Pearl's framework provides the language and calculus for answering causal questions given a causal model (DAG or SCM). It does not solve the following problems:
- Model specification: the framework does not tell you what the correct DAG is. That requires domain knowledge, prior experiments, or causal discovery algorithms (which have strong assumptions of their own).
- Causal discovery from data alone: while constraint-based algorithms (PC, FCI) and score-based algorithms (GES) can learn DAG structure from data under strong assumptions (faithfulness, causal sufficiency), these assumptions frequently fail in practice. Data alone cannot distinguish between Markov-equivalent DAGs.
- Finite-sample estimation: the identification formulas (adjustment, front-door) tell you what to estimate, not how well you can estimate it. Estimation efficiency, confidence intervals, and sensitivity analysis require separate statistical tools.
- Unmeasured confounders: if the causal effect is not identifiable from the observed variables (no valid adjustment set, no front-door path, no instrument), Pearl's framework tells you the problem is unsolvable with the given data. It does not manufacture a solution.
Connections to ML
Feature importance is not causal. SHAP values, permutation importance, and gradient-based saliency measure how much a feature contributes to the model's prediction. They do not measure how much changing that feature in the real world would change the outcome. A model that uses "hospital ID" to predict mortality is not telling you that the hospital causes death.
Attention weights are not explanations. Attention weights show where the model allocates computation. They do not indicate causal relationships between input tokens and the output. Two models can have identical predictions with different attention patterns.
Confounding in observational ML. Models trained on observational data learn associations, including spurious ones created by confounders. A model trained to predict recidivism from criminal records inherits confounding from the criminal justice system (e.g., over-policing of certain areas creates more arrest records, not more crime). Distribution shift often breaks exactly those associations that were confounded.
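A sketch of this failure mode: a linear model trained on data where a spurious feature x2 tracks a hidden confounder u performs well in-distribution, then degrades sharply when the u → x2 mechanism changes. All structural details here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n, confounding):
    u = rng.normal(size=n)                            # unobserved confounder
    x1 = rng.normal(size=n)                           # true cause of y
    x2 = confounding * u + 0.1 * rng.normal(size=n)   # spurious feature: tied to y only via u
    y = x1 + u + 0.1 * rng.normal(size=n)
    return np.column_stack([x1, x2]), y

X_tr, y_tr = make_data(100_000, confounding=1.0)
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
print("learned weights:", w)                          # the model leans heavily on x2

X_te, y_te = make_data(100_000, confounding=0.0)      # shift: the u -> x2 link breaks
mse_train = np.mean((X_tr @ w - y_tr) ** 2)
mse_shift = np.mean((X_te @ w - y_te) ** 2)
print("train MSE:", mse_train, " shifted MSE:", mse_shift)
```

The model's reliance on x2 is optimal for prediction under the training distribution (Level 1), but x2 carries no causal information about y, so the shift that severs the confounded association destroys the learned predictor.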
Causal fairness. A prediction is fair in the causal sense if it does not depend on protected attributes through prohibited causal pathways. This requires specifying the causal DAG and defining fairness as a constraint on the causal effect of the protected attribute, not merely on the statistical association.
Common Confusions
Pearl does not claim all questions are causal
Pearl does not claim all questions are causal. His hierarchy shows that some questions require causal assumptions that no amount of data can substitute for. The point is not "use DAGs everywhere" but "know which questions your data can and cannot answer." A prediction task (Level 1) does not need causal reasoning. An intervention question (Level 2) does. Conflating the two is the error.
DAGs do not prove causation from data
A DAG is a set of causal assumptions, not a conclusion derived from data. Drawing arrows in a DAG does not make the causal claims true. The DAG must be justified by domain knowledge, prior experiments, or explicit argument. The framework's value is in making these assumptions explicit and testable (via d-separation implications), not in eliminating the need for them.
Correlation is not causation, but neither is regression
Adding control variables to a regression does not automatically yield causal estimates. If you control for a collider, you introduce bias. If you control for a mediator, you block part of the causal effect. The choice of what to control for must be guided by the causal DAG, not by statistical criteria like p-values or model fit. "Adjusting for everything" is not a valid causal strategy.
The Rubin and Pearl frameworks are not in opposition
The potential outcomes framework (Rubin) and the structural causal model framework (Pearl) address the same problems with different notation. Potential outcomes correspond to counterfactual values in an SCM. The backdoor criterion gives conditions under which the Rubin-style ignorability assumption holds. The frameworks are complementary, not competing. Pearl's DAGs make assumptions visible; Rubin's potential outcomes make estimands precise.
Exercises
Problem
Consider a DAG with three variables and edges Z → X, Z → Y, and X → Y (so Z is a common cause of X and Y). Does the empty set satisfy the backdoor criterion for the effect of X on Y? Does {Z} satisfy it? Write out the adjustment formula for the case where {Z} satisfies the criterion.
Problem
In the DAG with X → M → Y and X → Y (direct and indirect effects), does {M} satisfy the backdoor criterion for the total effect of X on Y? Explain why or why not.
Problem
Prove that the truncated factorization formula is equivalent to the adjustment formula when the backdoor criterion is satisfied. Start from the truncated factorization for P(y | do(x)) in a DAG with variables X, Y, Z and edges Z → X, Z → Y, X → Y.
Problem
Consider an ML model trained to predict Y from X on observational data. The true causal DAG has an unmeasured confounder U with U → X and U → Y, plus the causal path X → Y. The model achieves low prediction error on the training distribution. Under what distribution shifts will the model fail, and why? Relate your answer to the distinction between P(Y | X) and P(Y | do(X)).
References
Canonical:
- Pearl, Causality: Models, Reasoning, and Inference (2nd ed., 2009), Chapters 1-4, 7
- Pearl, Glymour, and Jewell, Causal Inference in Statistics: A Primer (2016), Chapters 1-4
Technical foundations:
- Pearl, "Causal Diagrams for Empirical Research," Biometrika 82(4), 1995
- Tian and Pearl, "On the Identification of Causal Effects," UAI 2002
Connections to potential outcomes:
- Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015), Chapters 1-3
- Richardson and Robins, "Single World Intervention Graphs," CSSS Working Paper 128, 2013
ML applications:
- Peters, Janzing, and Schölkopf, Elements of Causal Inference (2017), Chapters 1-4, 6
- Schölkopf et al., "Toward Causal Representation Learning," Proceedings of the IEEE 109(5), 2021
Next Topics
Natural extensions from Pearl's causal framework:
- Decision theory foundations: the formal framework for choosing actions under uncertainty, which causal reasoning supports
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in R^n (Layer 0A)