
Methodology

Causal Inference and the Ladder of Causation

Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.


Why This Matters

A prediction model tells you P(Y \mid X): given observed features, what outcome is likely. A causal model tells you P(Y \mid do(X)): if you intervene to set X to a particular value, what happens to Y. These are different quantities. Knowing that umbrellas correlate with rain does not mean distributing umbrellas will cause rain.

Pearl's framework provides the formal machinery for distinguishing association from causation. It defines three levels of causal reasoning, shows that each level requires strictly more information than the one below, and gives algorithms (do-calculus) for computing causal and counterfactual quantities from a combination of data and assumptions encoded in a directed acyclic graph (DAG).

For ML practitioners, this matters directly. Feature importance scores measure association, not causation. Attention weights show what the model looks at, not what causes the output. Confounders in training data produce spurious correlations that fail under distribution shift. Causal reasoning is the tool for understanding when a model's learned associations will generalize and when they will break.
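The gap between P(Y \mid X) and P(Y \mid do(X)) can be made concrete in a few lines of simulation. This sketch (all coefficients are illustrative assumptions) builds a confounded linear SCM and compares the observational regression slope with the slope under a simulated intervention:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Confounded SCM: Z -> X, Z -> Y, X -> Y (true causal effect of X on Y is 1.0)
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)
Y = 1.0 * X + 3.0 * Z + rng.normal(size=n)

# Observational slope: regressing Y on X alone mixes in the Z -> Y path
obs_slope = np.cov(X, Y)[0, 1] / np.var(X)

# Interventional data: do(X = x) cuts the Z -> X arrow by drawing X exogenously
X_do = rng.normal(size=n)
Y_do = 1.0 * X_do + 3.0 * Z + rng.normal(size=n)
do_slope = np.cov(X_do, Y_do)[0, 1] / np.var(X_do)

print(f"observational slope: {obs_slope:.2f}")   # ~2.2, biased by confounding
print(f"interventional slope: {do_slope:.2f}")   # ~1.0, the true causal effect
```

The observational slope more than doubles the true effect because X and Y share the common cause Z.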

[Figure: Observational P(Y | X) vs. interventional P(Y | do(X)). Z confounds X and Y; do(X) removes the confounding by cutting the arrow into X. Backdoor criterion: condition on Z to block the confounding path X \leftarrow Z \to Y.]

The Ladder of Causation

Pearl organizes causal reasoning into three levels, each requiring strictly stronger assumptions than the previous.

Definition

Level 1: Association (Seeing)

Association answers questions of the form: given that I observe X = x, what is the probability of Y = y?

P(Y = y \mid X = x)

This is the domain of standard statistics and machine learning. It requires only observational data. Regression, classification, and density estimation all operate at this level. Association captures correlation, conditional probability, and prediction, but cannot distinguish causes from effects.

Definition

Level 2: Intervention (Doing)

Intervention answers questions of the form: if I actively set X to x (regardless of what X would have been), what is the probability of Y = y?

P(Y = y \mid do(X = x))

This is the domain of experiments and policy decisions. The do(\cdot) operator distinguishes intervention from observation. P(Y \mid X = x) may differ from P(Y \mid do(X = x)) whenever there exist confounders that affect both X and Y. Answering interventional questions from observational data requires causal assumptions, typically encoded as a DAG.

Definition

Level 3: Counterfactual (Imagining)

Counterfactual answers questions of the form: given that I observed X = x' and Y = y', what would Y have been if X had been x instead?

P(Y_x = y \mid X = x', Y = y')

This is the domain of attribution, regret, and individual-level reasoning. Counterfactuals require the full structural causal model, not just the DAG. They condition on what actually happened and ask about what would have happened under a different intervention. This level is strictly more informative than Level 2: two SCMs can agree on all interventional distributions but disagree on counterfactuals.

The hierarchy is strict: Level 1 information cannot answer Level 2 questions (without additional assumptions), and Level 2 information cannot answer Level 3 questions (without the full SCM). Each level requires a stronger model of the data-generating process.

Structural Causal Models

Definition

Structural Causal Model (SCM)

A structural causal model M consists of:

  • U: a set of exogenous (background) variables, determined outside the model
  • V = \{V_1, \ldots, V_n\}: a set of endogenous variables, determined inside the model
  • F = \{f_1, \ldots, f_n\}: a set of structural equations, where each V_i = f_i(\text{pa}(V_i), U_i) expresses V_i as a function of its parents and an exogenous noise term U_i
  • P(U): a probability distribution over the exogenous variables

The structural equations are asymmetric: V_i = f_i(\text{pa}(V_i), U_i) means the parents cause V_i, not that V_i causes its parents. This asymmetry distinguishes structural equations from statistical regression equations.
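One way to internalize this definition is to write an SCM as code: each structural equation becomes a function of its parents plus a noise term, and an intervention is literally replacing one of those functions. A minimal sketch, with made-up functional forms:

```python
import numpy as np

rng = np.random.default_rng(1)

# A minimal SCM: each endogenous variable is a function of its parents plus an
# exogenous noise term. The functional forms and coefficients are illustrative.
def sample_scm(n, do_x=None):
    u_z = rng.normal(size=n)
    u_x = rng.normal(size=n)
    u_y = rng.normal(size=n)
    z = u_z                                  # Z = U_Z
    if do_x is None:
        x = 0.8 * z + u_x                    # X = f_X(Z, U_X)
    else:
        x = np.full(n, float(do_x))          # intervention replaces f_X
    y = 1.5 * x - z + u_y                    # Y = f_Y(X, Z, U_Y)
    return z, x, y

_, _, y0 = sample_scm(100_000, do_x=0.0)
_, _, y1 = sample_scm(100_000, do_x=1.0)
ate = y1.mean() - y0.mean()                  # E[Y|do(X=1)] - E[Y|do(X=0)]
print(f"average treatment effect: {ate:.2f}")  # ~1.5 by construction
```

The asymmetry of the equations is visible here: replacing f_X leaves f_Y untouched, whereas a symmetric regression equation would have no notion of "replacing one mechanism."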

Definition

Causal DAG

The causal DAG G associated with an SCM M has one node for each endogenous variable V_i and a directed edge from V_j to V_i whenever V_j appears in the structural equation for V_i. The DAG encodes the qualitative causal structure: which variables directly cause which others. The DAG does not encode the functional forms or noise distributions.

Example

Drug, Recovery, and Age

Consider three variables: age (A), drug treatment (D), and recovery (R). Structural equations:

A = U_A, \quad D = f_D(A, U_D), \quad R = f_R(A, D, U_R)

Age causally affects both treatment assignment (doctors prescribe differently for older patients) and recovery (older patients recover more slowly). The DAG has edges A \to D, A \to R, and D \to R. Age is a confounder for the effect of D on R.

The observational distribution P(R \mid D) conflates the drug's causal effect with the confounding through age. The interventional distribution P(R \mid do(D)) isolates the drug's causal effect by conceptually randomizing treatment, breaking the A \to D arrow.
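A quick simulation of this example (with assumed probabilities) makes the gap concrete. Older patients are treated more often but recover less, so the observational contrast nearly hides the drug's real benefit:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Binary version of the A -> D, A -> R, D -> R example (probabilities assumed).
age_old = rng.random(n) < 0.5                       # A: old vs young
# Doctors treat old patients more often: P(D=1|old)=0.8, P(D=1|young)=0.2
drug = rng.random(n) < np.where(age_old, 0.8, 0.2)
# Drug raises recovery probability by 0.2; being young raises it by 0.3
rec = rng.random(n) < (0.3 + 0.2 * drug + 0.3 * (~age_old))

# Observational contrast: confounded because treated patients skew old
obs = rec[drug].mean() - rec[~drug].mean()

# do(D): rerandomize treatment, keeping every other mechanism fixed
drug_do = rng.random(n) < 0.5
rec_do = rng.random(n) < (0.3 + 0.2 * drug_do + 0.3 * (~age_old))
causal = rec_do[drug_do].mean() - rec_do[~drug_do].mean()

print(f"P(R|D=1) - P(R|D=0)         = {obs:.3f}")     # ~0.02, biased downward
print(f"P(R|do(D=1)) - P(R|do(D=0)) = {causal:.3f}")  # ~0.20, the true effect
```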

The do-Operator and Truncated Factorization

Definition

The do-Operator

The do-operator do(X = x) represents an external intervention that sets variable X to value x, overriding the structural equation for X. In the modified SCM M_x, the equation for X is replaced by X = x, and all other equations remain unchanged.

The interventional distribution is defined as:

P(Y = y \mid do(X = x)) = P_{M_x}(Y = y)

where P_{M_x} is the distribution induced by the modified model.

Definition

Truncated Factorization

In a causal DAG with variables V_1, \ldots, V_n, the observational distribution factorizes as:

P(v_1, \ldots, v_n) = \prod_{i=1}^n P(v_i \mid \text{pa}(v_i))

Under intervention do(X = x), the interventional distribution is obtained by deleting the factor for X and substituting X = x:

P(v_1, \ldots, v_n \mid do(X = x)) = \prod_{i: V_i \neq X} P(v_i \mid \text{pa}(v_i)) \bigg|_{X=x}

This is the truncated factorization formula. It formalizes the idea that intervention breaks the causal mechanism that normally determines X while leaving all other mechanisms intact.
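For small discrete models the truncated factorization can be computed exactly. The sketch below uses hand-picked conditional tables for the DAG Z \to X, Z \to Y, X \to Y, and contrasts the truncated product (delete the factor for X) with ordinary conditioning (keep it, which reweights Z):

```python
# Truncated factorization on the three-node DAG Z -> X, Z -> Y, X -> Y,
# with binary variables and hand-picked (illustrative) conditional tables.
p_z = {0: 0.6, 1: 0.4}                               # P(Z=z)
p_x_given_z = {0: 0.2, 1: 0.7}                       # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5,            # P(Y=1 | X=x, Z=z)
                (1, 0): 0.4, (1, 1): 0.8}

def p_y1_do_x(x):
    # Delete the P(x|z) factor, fix X=x, sum out Z:
    #   P(y | do(x)) = sum_z P(y | x, z) P(z)
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in (0, 1))

def p_y1_given_x(x):
    # Ordinary conditioning keeps the P(x|z) factor, so Z gets reweighted:
    #   P(y | x) = sum_z P(y | x, z) P(z | x)
    px = lambda z: p_x_given_z[z] if x == 1 else 1 - p_x_given_z[z]
    norm = sum(px(z) * p_z[z] for z in (0, 1))
    return sum(p_y_given_xz[(x, z)] * px(z) * p_z[z] for z in (0, 1)) / norm

print(f"P(Y=1 | do(X=1)) = {p_y1_do_x(1):.3f}")      # 0.560
print(f"P(Y=1 | X=1)     = {p_y1_given_x(1):.3f}")   # 0.680
```

The conditional probability is higher because observing X = 1 is evidence that Z = 1, and Z independently raises Y; the intervention removes that evidential channel.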

d-Separation

Definition

d-Separation

In a DAG G, a path between nodes X and Y is blocked by a set of nodes Z if it contains:

  1. A chain A \to B \to C or fork A \leftarrow B \to C where B \in Z, or
  2. A collider A \to B \leftarrow C where B \notin Z and no descendant of B is in Z.

X and Y are d-separated by Z in G (written X \perp_G Y \mid Z) if every path between X and Y is blocked by Z.

By the causal Markov condition, d-separation in the DAG implies conditional independence in the distribution: X \perp_G Y \mid Z \implies X \perp Y \mid Z. The converse, that every conditional independence corresponds to a d-separation, is the separate (and stronger) faithfulness assumption.

d-Separation is the graphical criterion that determines which conditional independence relations hold in the observational distribution generated by an SCM. It is the tool for determining whether a set of covariates is sufficient to block confounding paths.
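The collider rule (rule 2) is the least intuitive part of d-separation, and it is easy to check numerically. In this sketch (structure and coefficients assumed), A and C are independent causes of B; they are uncorrelated until we condition on their common effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Collider structure A -> B <- C: A and C are independent causes of B.
A = rng.normal(size=n)
C = rng.normal(size=n)
B = A + C + 0.1 * rng.normal(size=n)

# Marginally, A and C are d-separated (the collider B blocks the path).
r_marginal = np.corrcoef(A, C)[0, 1]

# Conditioning on the collider opens the path: within a slice of B,
# knowing A tells you about C (if B is small and A is large, C must be small).
mask = np.abs(B) < 0.1
r_given_b = np.corrcoef(A[mask], C[mask])[0, 1]

print(f"corr(A, C)         = {r_marginal:.3f}")  # ~0
print(f"corr(A, C | B ~ 0) = {r_given_b:.3f}")   # strongly negative
```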

The Backdoor Criterion

Theorem

Backdoor Criterion and Adjustment Formula

Statement

A set of variables Z satisfies the backdoor criterion relative to an ordered pair (X, Y) in a DAG G if:

  1. No node in Z is a descendant of X.
  2. Z blocks every path between X and Y that contains an arrow into X (a "backdoor path").

If Z satisfies the backdoor criterion, then the causal effect of X on Y is identifiable and given by the adjustment formula:

P(Y = y \mid do(X = x)) = \sum_z P(Y = y \mid X = x, Z = z) \, P(Z = z)

For continuous variables, the sum becomes an integral.

Intuition

Backdoor paths are non-causal paths from X to Y that flow through common causes (confounders). These paths create spurious associations between X and Y in the observational data. The backdoor criterion identifies sets of variables Z that, when conditioned on, block all these spurious paths without blocking any causal paths. The adjustment formula computes the interventional distribution by stratifying on Z: within each stratum, the remaining association between X and Y is causal.
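The adjustment formula is directly estimable from observational samples. This sketch (probabilities assumed) draws from a confounded binary SCM Z \to X, Z \to Y, X \to Y with a true effect of 0.20, then compares the naive contrast with the backdoor-adjusted one:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Confounded binary SCM (probabilities are illustrative assumptions):
# Z raises both the chance of treatment and the chance of a good outcome.
Z = rng.random(n) < 0.5
X = rng.random(n) < np.where(Z, 0.8, 0.2)
Y = rng.random(n) < (0.3 + 0.2 * X + 0.3 * Z)

# Naive observational contrast: inflated by confounding through Z
naive = Y[X].mean() - Y[~X].mean()

# Backdoor adjustment: P(y|do(x)) = sum_z P(y|x,z) P(z)
def p_y_do(x):
    total = 0.0
    for z in (False, True):
        stratum = (X == x) & (Z == z)
        total += Y[stratum].mean() * (Z == z).mean()
    return total

adjusted = p_y_do(True) - p_y_do(False)
print(f"naive:    {naive:.3f}")     # ~0.38, biased upward
print(f"adjusted: {adjusted:.3f}")  # ~0.20, the true effect
```

Note that the adjustment weights each stratum by P(Z = z), not by P(Z = z \mid X = x); using the latter would just reproduce the naive estimate.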

Proof Sketch

By the truncated factorization formula:

P(y \mid do(x)) = \sum_{v \setminus \{x, y\}} \prod_{i: V_i \neq X} P(v_i \mid \text{pa}(v_i))\bigg|_{X=x}

Partition the variables into Z (the adjustment set) and the rest. Condition (1) ensures that conditioning on Z does not block any causal path from X to Y (no descendants of X in Z). Condition (2) ensures that conditioning on Z blocks all non-causal (backdoor) paths. Under these conditions, the marginalization over the non-Z variables can be factored, and the result simplifies to the adjustment formula. The key step uses d-separation: after conditioning on Z, X is d-separated from all non-descendants that are not on a causal path, so the truncated product reduces to P(Y \mid X, Z) weighted by P(Z).

Why It Matters

The backdoor criterion provides an actionable test: given a proposed DAG, check whether a set of measured covariates blocks all backdoor paths. If yes, you can estimate the causal effect from observational data using standard regression or stratification. If no such set exists among the measured variables, the causal effect is not identifiable by backdoor adjustment (though it may still be identifiable by other methods such as the front-door criterion or instrumental variables).

Failure Mode

The backdoor criterion assumes the DAG is correct. If the true causal structure differs from the assumed DAG (a missing edge, a reversed arrow, an unmeasured confounder), the adjustment formula gives a biased estimate of the causal effect. The DAG itself is an assumption, not something that can be fully verified from data. Domain knowledge is required to specify it.

Conditioning on a descendant of X (violating condition 1) introduces post-treatment bias. Conditioning on a collider opens a non-causal path and introduces collider bias. Both are common errors in applied work.

The Front-Door Criterion

Definition

Front-Door Criterion

A set of variables M satisfies the front-door criterion relative to (X, Y) if:

  1. M intercepts all directed paths from X to Y.
  2. There is no unblocked backdoor path from X to M.
  3. All backdoor paths from M to Y are blocked by X.

If the front-door criterion is satisfied:

P(Y = y \mid do(X = x)) = \sum_m P(M = m \mid X = x) \sum_{x'} P(Y = y \mid X = x', M = m) \, P(X = x')

The front-door criterion is useful when there is an unmeasured confounder between X and Y, but the causal effect is mediated entirely through a measurable intermediate variable M.

Example

Smoking, Tar, and Cancer

Consider the classic example: smoking (X) causes tar deposits (M), and tar causes cancer (Y). There may be an unmeasured genetic factor (U) that causes both smoking tendency and cancer risk. The DAG has edges X \to M \to Y and U \to X, U \to Y.

The backdoor criterion fails for the pair (X, Y) because U is unmeasured. But the front-door criterion is satisfied by M = \text{tar}: tar intercepts all directed paths from smoking to cancer, there is no backdoor path from smoking to tar (the U \to X arrow is into X, not into M), and all backdoor paths from tar to cancer are blocked by X. The front-door formula identifies the causal effect of smoking on cancer even with the unmeasured confounder.
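The front-door formula can be checked by simulation: estimate it using only the observed (X, M, Y), then compare against the effect obtained by actually intervening inside the simulator, where the confounder U is visible. All probabilities below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

def simulate(do_x=None):
    U = rng.random(n) < 0.5                        # unmeasured confounder
    if do_x is None:
        X = rng.random(n) < np.where(U, 0.8, 0.2)  # U -> X
    else:
        X = np.full(n, do_x)
    M = rng.random(n) < np.where(X, 0.9, 0.1)      # X -> M (no U -> M)
    Y = rng.random(n) < (0.2 + 0.4 * M + 0.3 * U)  # M -> Y, U -> Y, no direct X -> Y
    return X, M, Y

X, M, Y = simulate()

# Front-door estimate from observed (X, M, Y) only:
# P(y|do(x)) = sum_m P(m|x) sum_x' P(y|x',m) P(x')
def p_y_do(x):
    total = 0.0
    for m in (False, True):
        p_m_given_x = M[X == x].mean() if m else 1 - M[X == x].mean()
        inner = sum(Y[(X == xp) & (M == m)].mean() * (X == xp).mean()
                    for xp in (False, True))
        total += p_m_given_x * inner
    return total

frontdoor = p_y_do(True) - p_y_do(False)

# Ground truth from actually intervening in the simulator
_, _, Y1 = simulate(do_x=True)
_, _, Y0 = simulate(do_x=False)
truth = Y1.mean() - Y0.mean()

print(f"front-door estimate: {frontdoor:.3f}")  # ~0.32
print(f"true effect:         {truth:.3f}")      # ~0.32
```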

Simpson's Paradox as a Causal Phenomenon

Simpson's paradox occurs when a statistical trend that appears in several groups reverses when the groups are combined (or vice versa). The "paradox" is not a statistical error. It is a signal that the data involve confounding and that the correct analysis depends on the causal structure.

Example

Simpson's Paradox in Treatment Data

A drug appears effective overall: the recovery rate is higher in the treated group (65%) than in the untreated group (35%). But within each gender subgroup, the drug appears harmful: treated men recover less often than untreated men (70% vs. 80%), and treated women recover less often than untreated women (20% vs. 30%). The reversal is possible because the groups are imbalanced: most treated patients are men, who recover often regardless of treatment, and most untreated patients are women.

The resolution depends on the causal DAG. If gender is a confounder (it affects both treatment assignment and recovery), then the stratified analysis is correct and the drug is harmful. If the stratifying variable were instead a mediator (something the treatment affects, such as post-treatment blood pressure, rather than gender, which treatment cannot affect), then the aggregated analysis would be correct. Simpson's paradox shows that statistical tables alone cannot answer causal questions. You need the DAG.
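Concrete counts producing this reversal are easy to construct once the cohort composition is skewed. The sketch below assumes a hypothetical study in which 90% of treated patients are men and 90% of untreated patients are women:

```python
# Counts chosen (illustratively) so the subgroup and aggregate trends reverse:
# 90% of treated patients are men, 90% of untreated patients are women.
counts = {
    # (gender, treated): (recovered, total)
    ("men",   True):  (63, 90),   # 70% recovery
    ("men",   False): ( 8, 10),   # 80%
    ("women", True):  ( 2, 10),   # 20%
    ("women", False): (27, 90),   # 30%
}

def rate(treated, gender=None):
    rec = tot = 0
    for (g, t), (r, n) in counts.items():
        if t == treated and (gender is None or g == gender):
            rec, tot = rec + r, tot + n
    return rec / tot

# Within each subgroup the drug looks harmful...
assert rate(True, "men") < rate(False, "men")        # 0.70 < 0.80
assert rate(True, "women") < rate(False, "women")    # 0.20 < 0.30
# ...but in aggregate it looks helpful.
assert rate(True) > rate(False)                      # 0.65 > 0.35
print(f"aggregate: treated {rate(True):.0%} vs untreated {rate(False):.0%}")
```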

What Pearl's Framework Does NOT Do

Pearl's framework provides the language and calculus for answering causal questions given a causal model (DAG or SCM). It does not solve the following problems:

  1. Model specification: the framework does not tell you what the correct DAG is. That requires domain knowledge, prior experiments, or causal discovery algorithms (which have strong assumptions of their own).

  2. Causal discovery from data alone: while constraint-based algorithms (PC, FCI) and score-based algorithms (GES) can learn DAG structure from data under strong assumptions (faithfulness, causal sufficiency), these assumptions frequently fail in practice. Data alone cannot distinguish between Markov-equivalent DAGs.

  3. Finite-sample estimation: the identification formulas (adjustment, front-door) tell you what to estimate, not how well you can estimate it. Estimation efficiency, confidence intervals, and sensitivity analysis require separate statistical tools.

  4. Unmeasured confounders: if the causal effect is not identifiable from the observed variables (no valid adjustment set, no front-door path, no instrument), Pearl's framework tells you the problem is unsolvable with the given data. It does not manufacture a solution.

Connections to ML

Feature importance is not causal. SHAP values, permutation importance, and gradient-based saliency measure how much a feature contributes to the model's prediction. They do not measure how much changing that feature in the real world would change the outcome. A model that uses "hospital ID" to predict mortality is not telling you that the hospital causes death.

Attention weights are not explanations. Attention weights show where the model allocates computation. They do not indicate causal relationships between input tokens and the output. Two models can have identical predictions with different attention patterns.

Confounding in observational ML. Models trained on observational data learn associations, including spurious ones created by confounders. A model trained to predict recidivism from criminal records inherits confounding from the criminal justice system (e.g., over-policing of certain areas creates more arrest records, not more crime). Distribution shift often breaks exactly those associations that were confounded.
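A small simulation (coefficients assumed) shows the failure mode: a regression that leans on a purely confounded feature predicts well in-distribution and degrades as soon as the shift cuts the confounding arrow:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Training distribution: confounder U drives both the feature X and label Y;
# X has no causal effect on Y at all (coefficients are assumptions).
def sample(shift=False):
    U = rng.normal(size=n)
    X = (0.0 if shift else 1.0) * U + rng.normal(size=n)  # shift cuts U -> X
    Y = 2.0 * U + rng.normal(size=n)
    return X, Y

X_tr, Y_tr = sample()
w = np.cov(X_tr, Y_tr)[0, 1] / np.var(X_tr)   # least-squares slope, ~1.0

def mse(X, Y):
    return np.mean((Y - w * X) ** 2)

err_train = mse(X_tr, Y_tr)
X_sh, Y_sh = sample(shift=True)
err_shift = mse(X_sh, Y_sh)
print(f"train MSE: {err_train:.2f}")   # ~3.0: the spurious association helps
print(f"shift MSE: {err_shift:.2f}")   # ~6.0: the association broke
```

The model was never wrong about P(Y \mid X) on the training distribution; it was wrong to treat that association as stable under intervention on the X-generating mechanism.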

Causal fairness. A prediction is fair in the causal sense if it does not depend on protected attributes through prohibited causal pathways. This requires specifying the causal DAG and defining fairness as a constraint on the causal effect of the protected attribute, not merely on the statistical association.

Common Confusions

Watch Out

Pearl does not claim all questions are causal

Pearl does not claim all questions are causal. His hierarchy shows that some questions require causal assumptions that no amount of data can substitute for. The point is not "use DAGs everywhere" but "know which questions your data can and cannot answer." A prediction task (Level 1) does not need causal reasoning. An intervention question (Level 2) does. Conflating the two is the error.

Watch Out

DAGs do not prove causation from data

A DAG is a set of causal assumptions, not a conclusion derived from data. Drawing arrows in a DAG does not make the causal claims true. The DAG must be justified by domain knowledge, prior experiments, or explicit argument. The framework's value is in making these assumptions explicit and testable (via d-separation implications), not in eliminating the need for them.

Watch Out

Correlation is not causation, but neither is regression

Adding control variables to a regression does not automatically yield causal estimates. If you control for a collider, you introduce bias. If you control for a mediator, you block part of the causal effect. The choice of what to control for must be guided by the causal DAG, not by statistical criteria like p-values or model fit. "Adjusting for everything" is not a valid causal strategy.
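Collider control can be demonstrated with ordinary least squares. In this sketch (structure and coefficients assumed), adding the collider C as a "control" variable destroys an otherwise correct estimate:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# True model: X -> Y with effect 1.0; C is a collider, X -> C <- Y.
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)
C = X + Y + rng.normal(size=n)

def ols(features, target):
    # Least-squares fit with an intercept; returns [intercept, coefs...]
    A = np.column_stack([np.ones(n)] + features)
    return np.linalg.lstsq(A, target, rcond=None)[0]

coef_good = ols([X], Y)[1]        # ~1.0: the correct causal effect
coef_bad = ols([X, C], Y)[1]      # ~0.0 here: collider control wipes it out

print(f"Y ~ X     : coefficient on X = {coef_good:.2f}")
print(f"Y ~ X + C : coefficient on X = {coef_bad:.2f}")
```

Adding C improves the regression's fit (it absorbs part of Y's noise), which is exactly why statistical criteria like R^2 cannot tell you which controls are safe.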

Watch Out

The Rubin and Pearl frameworks are not in opposition

The potential outcomes framework (Rubin) and the structural causal model framework (Pearl) address the same problems with different notation. Potential outcomes Y(x) correspond to counterfactual values in an SCM. The backdoor criterion gives conditions under which the Rubin-style ignorability assumption holds. The frameworks are complementary, not competing. Pearl's DAGs make assumptions visible; Rubin's potential outcomes make estimands precise.

Exercises

ExerciseCore

Problem

Consider a DAG with three variables: X \to Y and Z \to X, Z \to Y (so Z is a common cause of X and Y). Does the empty set satisfy the backdoor criterion for the effect of X on Y? Does \{Z\} satisfy it? Write out the adjustment formula for the case where \{Z\} satisfies the criterion.

ExerciseCore

Problem

In the DAG X \to M \to Y with X \to Y (direct and indirect effects), does \{M\} satisfy the backdoor criterion for the total effect of X on Y? Explain why or why not.

ExerciseAdvanced

Problem

Prove that the truncated factorization formula is equivalent to the adjustment formula when the backdoor criterion is satisfied. Start from the truncated factorization for P(y \mid do(x)) in a DAG with variables \{X, Y, Z\} and edges Z \to X, Z \to Y, X \to Y.

ExerciseResearch

Problem

Consider an ML model trained to predict Y from (X, Z) on observational data. The true causal DAG has an unmeasured confounder U with U \to X and U \to Y, plus the paths X \to Y and Z \to Y (with Z unconfounded). The model achieves low prediction error on the training distribution. Under what distribution shifts will the model fail, and why? Relate your answer to the distinction between P(Y \mid X) and P(Y \mid do(X)).

References

Canonical:

  • Pearl, Causality: Models, Reasoning, and Inference (2nd ed., 2009), Chapters 1-4, 7
  • Pearl, Glymour, and Jewell, Causal Inference in Statistics: A Primer (2016), Chapters 1-4

Technical foundations:

  • Pearl, "Causal Diagrams for Empirical Research," Biometrika 82(4), 1995
  • Tian and Pearl, "On the Identification of Causal Effects," UAI 2002

Connections to potential outcomes:

  • Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015), Chapters 1-3
  • Richardson and Robins, "Single World Intervention Graphs," CSSS Working Paper 128, 2013

ML applications:

  • Peters, Janzing, and Scholkopf, Elements of Causal Inference (2017), Chapters 1-4, 6
  • Scholkopf et al., "Toward Causal Representation Learning," Proceedings of the IEEE 109(5), 2021


Last reviewed: April 2026
