
Methodology

Causal Inference and the Ladder of Causation

Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.


Why This Matters

A prediction model tells you P(Y \mid X): given observed features, what outcome is likely. A causal model tells you P(Y \mid do(X)): if you intervene to set X to a particular value, what happens to Y. These are different quantities. Knowing that umbrellas correlate with rain does not mean distributing umbrellas will cause rain.

Pearl's framework provides the formal machinery for distinguishing association from causation. It defines three levels of causal reasoning, shows that each level requires strictly more information than the one below, and gives algorithms (do-calculus) for computing causal and counterfactual quantities from a combination of data and assumptions encoded in a directed acyclic graph (DAG).

For ML practitioners, this matters directly. Feature importance scores measure association, not causation. Attention weights show what the model looks at, not what causes the output. Confounders in training data produce spurious correlations that fail under distribution shift. Causal reasoning is the tool for understanding when a model's learned associations will generalize and when they will break.
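The gap between P(Y \mid X) and P(Y \mid do(X)) can be made concrete in a few lines of simulation. This sketch (all coefficients are illustrative assumptions) builds a confounded linear SCM and compares the observational regression slope with the slope under a simulated intervention:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Confounded SCM: Z -> X, Z -> Y, X -> Y (true causal effect of X on Y is 1.0)
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)
Y = 1.0 * X + 3.0 * Z + rng.normal(size=n)

# Observational slope: regressing Y on X alone mixes in the Z -> Y path
obs_slope = np.cov(X, Y)[0, 1] / np.var(X)

# Interventional data: do(X = x) cuts the Z -> X arrow by drawing X exogenously
X_do = rng.normal(size=n)
Y_do = 1.0 * X_do + 3.0 * Z + rng.normal(size=n)
do_slope = np.cov(X_do, Y_do)[0, 1] / np.var(X_do)

print(f"observational slope: {obs_slope:.2f}")   # ~2.2, biased by confounding
print(f"interventional slope: {do_slope:.2f}")   # ~1.0, the true causal effect
```

The observational slope more than doubles the true effect because X and Y share the common cause Z.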

[Figure: Observational P(Y | X) vs. interventional P(Y | do(X)). Z confounds X and Y; do(X) removes the confounding by cutting the arrow into X. Backdoor criterion: condition on Z to block the confounding path X \leftarrow Z \to Y.]

The Ladder of Causation

Pearl organizes causal reasoning into three levels, each requiring strictly stronger assumptions than the previous.

Definition

Level 1: Association (Seeing)

Association answers questions of the form: given that I observe X = x, what is the probability of Y = y?

P(Y = y \mid X = x)

This is the domain of standard statistics and machine learning. It requires only observational data. Regression, classification, and density estimation all operate at this level. Association captures correlation, conditional probability, and prediction, but cannot distinguish causes from effects.

Definition

Level 2: Intervention (Doing)

Intervention answers questions of the form: if I actively set X to x (regardless of what X would have been), what is the probability of Y = y?

P(Y = y \mid do(X = x))

This is the domain of experiments and policy decisions. The do(\cdot) operator distinguishes intervention from observation. P(Y \mid X = x) may differ from P(Y \mid do(X = x)) whenever there exist confounders that affect both X and Y. Answering interventional questions from observational data requires causal assumptions, typically encoded as a DAG.

Definition

Level 3: Counterfactual (Imagining)

Counterfactual answers questions of the form: given that I observed X = x' and Y = y', what would Y have been if X had been x instead?

P(Y_x = y \mid X = x', Y = y')

This is the domain of attribution, regret, and individual-level reasoning. Counterfactuals require the full structural causal model, not just the DAG. They condition on what actually happened and ask about what would have happened under a different intervention. This level is strictly more informative than Level 2: two SCMs can agree on all interventional distributions but disagree on counterfactuals.

The hierarchy is strict: Level 1 information cannot answer Level 2 questions (without additional assumptions), and Level 2 information cannot answer Level 3 questions (without the full SCM). Each level requires a stronger model of the data-generating process.

Structural Causal Models

Definition

Structural Causal Model (SCM)

A structural causal model M consists of:

  • U: a set of exogenous (background) variables, determined outside the model
  • V = \{V_1, \ldots, V_n\}: a set of endogenous variables, determined inside the model
  • F = \{f_1, \ldots, f_n\}: a set of structural equations, where each V_i = f_i(\text{pa}(V_i), U_i) expresses V_i as a function of its parents and an exogenous noise term U_i
  • P(U): a probability distribution over the exogenous variables

The structural equations are asymmetric: V_i = f_i(\text{pa}(V_i), U_i) means the parents cause V_i, not that V_i causes its parents. This asymmetry distinguishes structural equations from statistical regression equations.
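One way to internalize this definition is to write an SCM as code: each structural equation becomes a function of its parents plus a noise term, and an intervention is literally replacing one of those functions. A minimal sketch, with made-up functional forms:

```python
import numpy as np

rng = np.random.default_rng(1)

# A minimal SCM: each endogenous variable is a function of its parents plus an
# exogenous noise term. The functional forms and coefficients are illustrative.
def sample_scm(n, do_x=None):
    u_z = rng.normal(size=n)
    u_x = rng.normal(size=n)
    u_y = rng.normal(size=n)
    z = u_z                                  # Z = U_Z
    if do_x is None:
        x = 0.8 * z + u_x                    # X = f_X(Z, U_X)
    else:
        x = np.full(n, float(do_x))          # intervention replaces f_X
    y = 1.5 * x - z + u_y                    # Y = f_Y(X, Z, U_Y)
    return z, x, y

_, _, y0 = sample_scm(100_000, do_x=0.0)
_, _, y1 = sample_scm(100_000, do_x=1.0)
ate = y1.mean() - y0.mean()                  # E[Y|do(X=1)] - E[Y|do(X=0)]
print(f"average treatment effect: {ate:.2f}")  # ~1.5 by construction
```

The asymmetry of the equations is visible here: replacing f_X leaves f_Y untouched, whereas a symmetric regression equation would have no notion of "replacing one mechanism."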

Definition

Causal DAG

The causal DAG G associated with an SCM M has one node for each endogenous variable V_i and a directed edge from V_j to V_i whenever V_j appears in the structural equation for V_i. The DAG encodes the qualitative causal structure: which variables directly cause which others. The DAG does not encode the functional forms or noise distributions.

Example

Drug, Recovery, and Age

Consider three variables: age (A), drug treatment (D), and recovery (R). Structural equations:

A = U_A, \quad D = f_D(A, U_D), \quad R = f_R(A, D, U_R)

Age causally affects both treatment assignment (doctors prescribe differently for older patients) and recovery (older patients recover more slowly). The DAG has edges A \to D, A \to R, and D \to R. Age is a confounder for the effect of D on R.

The observational distribution P(R \mid D) conflates the drug's causal effect with the confounding through age. The interventional distribution P(R \mid do(D)) isolates the drug's causal effect by conceptually randomizing treatment, breaking the A \to D arrow.
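A quick simulation of this example (with assumed probabilities) makes the gap concrete. Older patients are treated more often but recover less, so the observational contrast nearly hides the drug's real benefit:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Binary version of the A -> D, A -> R, D -> R example (probabilities assumed).
age_old = rng.random(n) < 0.5                       # A: old vs young
# Doctors treat old patients more often: P(D=1|old)=0.8, P(D=1|young)=0.2
drug = rng.random(n) < np.where(age_old, 0.8, 0.2)
# Drug raises recovery probability by 0.2; being young raises it by 0.3
rec = rng.random(n) < (0.3 + 0.2 * drug + 0.3 * (~age_old))

# Observational contrast: confounded because treated patients skew old
obs = rec[drug].mean() - rec[~drug].mean()

# do(D): rerandomize treatment, keeping every other mechanism fixed
drug_do = rng.random(n) < 0.5
rec_do = rng.random(n) < (0.3 + 0.2 * drug_do + 0.3 * (~age_old))
causal = rec_do[drug_do].mean() - rec_do[~drug_do].mean()

print(f"P(R|D=1) - P(R|D=0)         = {obs:.3f}")     # ~0.02, biased downward
print(f"P(R|do(D=1)) - P(R|do(D=0)) = {causal:.3f}")  # ~0.20, the true effect
```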

The do-Operator and Truncated Factorization

Definition

The do-Operator

The do-operator do(X = x) represents an external intervention that sets variable X to value x, overriding the structural equation for X. In the modified SCM M_x, the equation for X is replaced by X = x, and all other equations remain unchanged.

The interventional distribution is defined as:

P(Y = y \mid do(X = x)) = P_{M_x}(Y = y)

where P_{M_x} is the distribution induced by the modified model.

Definition

Truncated Factorization

In a causal DAG with variables V_1, \ldots, V_n, the observational distribution factorizes as:

P(v_1, \ldots, v_n) = \prod_{i=1}^n P(v_i \mid \text{pa}(v_i))

Under intervention do(X = x), the interventional distribution is obtained by deleting the factor for X and substituting X = x:

P(v_1, \ldots, v_n \mid do(X = x)) = \prod_{i: V_i \neq X} P(v_i \mid \text{pa}(v_i)) \bigg|_{X=x}

This is the truncated factorization formula. It formalizes the idea that intervention breaks the causal mechanism that normally determines X while leaving all other mechanisms intact.
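For small discrete models the truncated factorization can be computed exactly. The sketch below uses hand-picked conditional tables for the DAG Z \to X, Z \to Y, X \to Y, and contrasts the truncated product (delete the factor for X) with ordinary conditioning (keep it, which reweights Z):

```python
# Truncated factorization on the three-node DAG Z -> X, Z -> Y, X -> Y,
# with binary variables and hand-picked (illustrative) conditional tables.
p_z = {0: 0.6, 1: 0.4}                               # P(Z=z)
p_x_given_z = {0: 0.2, 1: 0.7}                       # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5,            # P(Y=1 | X=x, Z=z)
                (1, 0): 0.4, (1, 1): 0.8}

def p_y1_do_x(x):
    # Delete the P(x|z) factor, fix X=x, sum out Z:
    #   P(y | do(x)) = sum_z P(y | x, z) P(z)
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in (0, 1))

def p_y1_given_x(x):
    # Ordinary conditioning keeps the P(x|z) factor, so Z gets reweighted:
    #   P(y | x) = sum_z P(y | x, z) P(z | x)
    px = lambda z: p_x_given_z[z] if x == 1 else 1 - p_x_given_z[z]
    norm = sum(px(z) * p_z[z] for z in (0, 1))
    return sum(p_y_given_xz[(x, z)] * px(z) * p_z[z] for z in (0, 1)) / norm

print(f"P(Y=1 | do(X=1)) = {p_y1_do_x(1):.3f}")      # 0.560
print(f"P(Y=1 | X=1)     = {p_y1_given_x(1):.3f}")   # 0.680
```

The conditional probability is higher because observing X = 1 is evidence that Z = 1, and Z independently raises Y; the intervention removes that evidential channel.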

d-Separation

Definition

d-Separation

In a DAG G, a path between nodes X and Y is blocked by a set of nodes Z if it contains:

  1. A chain A \to B \to C or fork A \leftarrow B \to C where B \in Z, or
  2. A collider A \to B \leftarrow C where B \notin Z and no descendant of B is in Z.

X and Y are d-separated by Z in G (written X \perp_G Y \mid Z) if every path between X and Y is blocked by Z.

By the causal Markov condition, d-separation in the DAG implies conditional independence in the distribution: X \perp_G Y \mid Z \implies X \perp Y \mid Z. The converse, that every conditional independence corresponds to a d-separation, is the separate (and stronger) faithfulness assumption.

d-Separation is the graphical criterion that determines which conditional independence relations hold in the observational distribution generated by an SCM. It is the tool for determining whether a set of covariates is sufficient to block confounding paths.
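The collider rule (rule 2) is the least intuitive part of d-separation, and it is easy to check numerically. In this sketch (structure and coefficients assumed), A and C are independent causes of B; they are uncorrelated until we condition on their common effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Collider structure A -> B <- C: A and C are independent causes of B.
A = rng.normal(size=n)
C = rng.normal(size=n)
B = A + C + 0.1 * rng.normal(size=n)

# Marginally, A and C are d-separated (the collider B blocks the path).
r_marginal = np.corrcoef(A, C)[0, 1]

# Conditioning on the collider opens the path: within a slice of B,
# knowing A tells you about C (if B is small and A is large, C must be small).
mask = np.abs(B) < 0.1
r_given_b = np.corrcoef(A[mask], C[mask])[0, 1]

print(f"corr(A, C)         = {r_marginal:.3f}")  # ~0
print(f"corr(A, C | B ~ 0) = {r_given_b:.3f}")   # strongly negative
```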

The Backdoor Criterion

Theorem

Backdoor Criterion and Adjustment Formula

Statement

A set of variables Z satisfies the backdoor criterion relative to an ordered pair (X, Y) in a DAG G if:

  1. No node in Z is a descendant of X.
  2. Z blocks every path between X and Y that contains an arrow into X (a "backdoor path").

If Z satisfies the backdoor criterion, then the causal effect of X on Y is identifiable and given by the adjustment formula:

P(Y = y \mid do(X = x)) = \sum_z P(Y = y \mid X = x, Z = z) \, P(Z = z)

For continuous variables, the sum becomes an integral.

Intuition

Backdoor paths are non-causal paths from X to Y that flow through common causes (confounders). These paths create spurious associations between X and Y in the observational data. The backdoor criterion identifies sets of variables Z that, when conditioned on, block all these spurious paths without blocking any causal paths. The adjustment formula computes the interventional distribution by stratifying on Z: within each stratum, the remaining association between X and Y is causal.
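The adjustment formula is directly estimable from observational samples. This sketch (probabilities assumed) draws from a confounded binary SCM Z \to X, Z \to Y, X \to Y with a true effect of 0.20, then compares the naive contrast with the backdoor-adjusted one:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Confounded binary SCM (probabilities are illustrative assumptions):
# Z raises both the chance of treatment and the chance of a good outcome.
Z = rng.random(n) < 0.5
X = rng.random(n) < np.where(Z, 0.8, 0.2)
Y = rng.random(n) < (0.3 + 0.2 * X + 0.3 * Z)

# Naive observational contrast: inflated by confounding through Z
naive = Y[X].mean() - Y[~X].mean()

# Backdoor adjustment: P(y|do(x)) = sum_z P(y|x,z) P(z)
def p_y_do(x):
    total = 0.0
    for z in (False, True):
        stratum = (X == x) & (Z == z)
        total += Y[stratum].mean() * (Z == z).mean()
    return total

adjusted = p_y_do(True) - p_y_do(False)
print(f"naive:    {naive:.3f}")     # ~0.38, biased upward
print(f"adjusted: {adjusted:.3f}")  # ~0.20, the true effect
```

Note that the adjustment weights each stratum by P(Z = z), not by P(Z = z \mid X = x); using the latter would just reproduce the naive estimate.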

Proof Sketch

By the truncated factorization formula:

P(y \mid do(x)) = \sum_{v \setminus \{x, y\}} \prod_{i: V_i \neq X} P(v_i \mid \text{pa}(v_i))\bigg|_{X=x}

Partition the variables into Z (the adjustment set) and the rest. Condition (1) ensures that conditioning on Z does not block any causal path from X to Y (no descendants of X in Z). Condition (2) ensures that conditioning on Z blocks all non-causal (backdoor) paths. Under these conditions, the marginalization over the non-Z variables can be factored, and the result simplifies to the adjustment formula. The key step uses d-separation: after conditioning on Z, X is d-separated from all non-descendants that are not on a causal path, so the truncated product reduces to P(Y \mid X, Z) weighted by P(Z).

Why It Matters

The backdoor criterion provides an actionable test: given a proposed DAG, check whether a set of measured covariates blocks all backdoor paths. If yes, you can estimate the causal effect from observational data using standard regression or stratification. If no such set exists among the measured variables, the causal effect is not identifiable by backdoor adjustment (though it may still be identifiable by other methods such as the front-door criterion or instrumental variables).

Failure Mode

The backdoor criterion assumes the DAG is correct. If the true causal structure differs from the assumed DAG (a missing edge, a reversed arrow, an unmeasured confounder), the adjustment formula gives a biased estimate of the causal effect. The DAG itself is an assumption, not something that can be fully verified from data. Domain knowledge is required to specify it.

Conditioning on a descendant of X (violating condition 1) introduces post-treatment bias. Conditioning on a collider opens a non-causal path and introduces collider bias. Both are common errors in applied work.

The Front-Door Criterion

Definition

Front-Door Criterion

A set of variables M satisfies the front-door criterion relative to (X, Y) if:

  1. M intercepts all directed paths from X to Y.
  2. There is no unblocked backdoor path from X to M.
  3. All backdoor paths from M to Y are blocked by X.

If the front-door criterion is satisfied:

P(Y = y \mid do(X = x)) = \sum_m P(M = m \mid X = x) \sum_{x'} P(Y = y \mid X = x', M = m) \, P(X = x')

The front-door criterion is useful when there is an unmeasured confounder between X and Y, but the causal effect is mediated entirely through a measurable intermediate variable M.

Example

Smoking, Tar, and Cancer

Consider the classic example: smoking (X) causes tar deposits (M), and tar causes cancer (Y). There may be an unmeasured genetic factor (U) that causes both smoking tendency and cancer risk. The DAG has edges X \to M \to Y and U \to X, U \to Y.

The backdoor criterion fails for the pair (X, Y) because U is unmeasured. But the front-door criterion is satisfied by M = \text{tar}: tar intercepts all directed paths from smoking to cancer, there is no backdoor path from smoking to tar (the U \to X arrow is into X, not into M), and all backdoor paths from tar to cancer are blocked by X. The front-door formula identifies the causal effect of smoking on cancer even with the unmeasured confounder.
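The front-door formula can be checked by simulation: estimate it using only the observed (X, M, Y), then compare against the effect obtained by actually intervening inside the simulator, where the confounder U is visible. All probabilities below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

def simulate(do_x=None):
    U = rng.random(n) < 0.5                        # unmeasured confounder
    if do_x is None:
        X = rng.random(n) < np.where(U, 0.8, 0.2)  # U -> X
    else:
        X = np.full(n, do_x)
    M = rng.random(n) < np.where(X, 0.9, 0.1)      # X -> M (no U -> M)
    Y = rng.random(n) < (0.2 + 0.4 * M + 0.3 * U)  # M -> Y, U -> Y, no direct X -> Y
    return X, M, Y

X, M, Y = simulate()

# Front-door estimate from observed (X, M, Y) only:
# P(y|do(x)) = sum_m P(m|x) sum_x' P(y|x',m) P(x')
def p_y_do(x):
    total = 0.0
    for m in (False, True):
        p_m_given_x = M[X == x].mean() if m else 1 - M[X == x].mean()
        inner = sum(Y[(X == xp) & (M == m)].mean() * (X == xp).mean()
                    for xp in (False, True))
        total += p_m_given_x * inner
    return total

frontdoor = p_y_do(True) - p_y_do(False)

# Ground truth from actually intervening in the simulator
_, _, Y1 = simulate(do_x=True)
_, _, Y0 = simulate(do_x=False)
truth = Y1.mean() - Y0.mean()

print(f"front-door estimate: {frontdoor:.3f}")  # ~0.32
print(f"true effect:         {truth:.3f}")      # ~0.32
```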

Simpson's Paradox as a Causal Phenomenon

Simpson's paradox occurs when a statistical trend that appears in several groups reverses when the groups are combined (or vice versa). The "paradox" is not a statistical error. It is a signal that the data involve confounding and that the correct analysis depends on the causal structure.

Example

Simpson's Paradox in Treatment Data

A drug appears effective overall: the recovery rate is higher in the treated group (65%) than in the untreated group (35%). But within each gender subgroup, the drug appears harmful: treated men recover less often than untreated men (70% vs. 80%), and treated women recover less often than untreated women (20% vs. 30%). The reversal is possible because the groups are imbalanced: most treated patients are men, who recover often regardless of treatment, and most untreated patients are women.

The resolution depends on the causal DAG. If gender is a confounder (it affects both treatment assignment and recovery), then the stratified analysis is correct and the drug is harmful. If the stratifying variable were instead a mediator (something the treatment affects, such as post-treatment blood pressure, rather than gender, which treatment cannot affect), then the aggregated analysis would be correct. Simpson's paradox shows that statistical tables alone cannot answer causal questions. You need the DAG.
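Concrete counts producing this reversal are easy to construct once the cohort composition is skewed. The sketch below assumes a hypothetical study in which 90% of treated patients are men and 90% of untreated patients are women:

```python
# Counts chosen (illustratively) so the subgroup and aggregate trends reverse:
# 90% of treated patients are men, 90% of untreated patients are women.
counts = {
    # (gender, treated): (recovered, total)
    ("men",   True):  (63, 90),   # 70% recovery
    ("men",   False): ( 8, 10),   # 80%
    ("women", True):  ( 2, 10),   # 20%
    ("women", False): (27, 90),   # 30%
}

def rate(treated, gender=None):
    rec = tot = 0
    for (g, t), (r, n) in counts.items():
        if t == treated and (gender is None or g == gender):
            rec, tot = rec + r, tot + n
    return rec / tot

# Within each subgroup the drug looks harmful...
assert rate(True, "men") < rate(False, "men")        # 0.70 < 0.80
assert rate(True, "women") < rate(False, "women")    # 0.20 < 0.30
# ...but in aggregate it looks helpful.
assert rate(True) > rate(False)                      # 0.65 > 0.35
print(f"aggregate: treated {rate(True):.0%} vs untreated {rate(False):.0%}")
```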

What Pearl's Framework Does NOT Do

Pearl's framework provides the language and calculus for answering causal questions given a causal model (DAG or SCM). It does not solve the following problems:

  1. Model specification: the framework does not tell you what the correct DAG is. That requires domain knowledge, prior experiments, or causal discovery algorithms (which have strong assumptions of their own).

  2. Causal discovery from data alone: while constraint-based algorithms (PC, FCI) and score-based algorithms (GES) can learn DAG structure from data under strong assumptions (faithfulness, causal sufficiency), these assumptions frequently fail in practice. Data alone cannot distinguish between Markov-equivalent DAGs.

  3. Finite-sample estimation: the identification formulas (adjustment, front-door) tell you what to estimate, not how well you can estimate it. Estimation efficiency, confidence intervals, and sensitivity analysis require separate statistical tools.

  4. Unmeasured confounders: if the causal effect is not identifiable from the observed variables (no valid adjustment set, no front-door path, no instrument), Pearl's framework tells you the problem is unsolvable with the given data. It does not manufacture a solution.

Connections to ML

Feature importance is not causal. SHAP values, permutation importance, and gradient-based saliency measure how much a feature contributes to the model's prediction. They do not measure how much changing that feature in the real world would change the outcome. A model that uses "hospital ID" to predict mortality is not telling you that the hospital causes death.

Attention weights are not explanations. Attention weights show where the model allocates computation. They do not indicate causal relationships between input tokens and the output. Two models can have identical predictions with different attention patterns.

Confounding in observational ML. Models trained on observational data learn associations, including spurious ones created by confounders. A model trained to predict recidivism from criminal records inherits confounding from the criminal justice system (e.g., over-policing of certain areas creates more arrest records, not more crime). Distribution shift often breaks exactly those associations that were confounded.
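A small simulation (coefficients assumed) shows the failure mode: a regression that leans on a purely confounded feature predicts well in-distribution and degrades as soon as the shift cuts the confounding arrow:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Training distribution: confounder U drives both the feature X and label Y;
# X has no causal effect on Y at all (coefficients are assumptions).
def sample(shift=False):
    U = rng.normal(size=n)
    X = (0.0 if shift else 1.0) * U + rng.normal(size=n)  # shift cuts U -> X
    Y = 2.0 * U + rng.normal(size=n)
    return X, Y

X_tr, Y_tr = sample()
w = np.cov(X_tr, Y_tr)[0, 1] / np.var(X_tr)   # least-squares slope, ~1.0

def mse(X, Y):
    return np.mean((Y - w * X) ** 2)

err_train = mse(X_tr, Y_tr)
X_sh, Y_sh = sample(shift=True)
err_shift = mse(X_sh, Y_sh)
print(f"train MSE: {err_train:.2f}")   # ~3.0: the spurious association helps
print(f"shift MSE: {err_shift:.2f}")   # ~6.0: the association broke
```

The model was never wrong about P(Y \mid X) on the training distribution; it was wrong to treat that association as stable under intervention on the X-generating mechanism.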

Causal fairness. A prediction is fair in the causal sense if it does not depend on protected attributes through prohibited causal pathways. This requires specifying the causal DAG and defining fairness as a constraint on the causal effect of the protected attribute, not merely on the statistical association.

Common Confusions

Watch Out

Pearl does not claim all questions are causal

Pearl does not claim all questions are causal. His hierarchy shows that some questions require causal assumptions that no amount of data can substitute for. The point is not "use DAGs everywhere" but "know which questions your data can and cannot answer." A prediction task (Level 1) does not need causal reasoning. An intervention question (Level 2) does. Conflating the two is the error.

Watch Out

DAGs do not prove causation from data

A DAG is a set of causal assumptions, not a conclusion derived from data. Drawing arrows in a DAG does not make the causal claims true. The DAG must be justified by domain knowledge, prior experiments, or explicit argument. The framework's value is in making these assumptions explicit and testable (via d-separation implications), not in eliminating the need for them.

Watch Out

Correlation is not causation, but neither is regression

Adding control variables to a regression does not automatically yield causal estimates. If you control for a collider, you introduce bias. If you control for a mediator, you block part of the causal effect. The choice of what to control for must be guided by the causal DAG, not by statistical criteria like p-values or model fit. "Adjusting for everything" is not a valid causal strategy.
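Collider control can be demonstrated with ordinary least squares. In this sketch (structure and coefficients assumed), adding the collider C as a "control" variable destroys an otherwise correct estimate:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# True model: X -> Y with effect 1.0; C is a collider, X -> C <- Y.
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)
C = X + Y + rng.normal(size=n)

def ols(features, target):
    # Least-squares fit with an intercept; returns [intercept, coefs...]
    A = np.column_stack([np.ones(n)] + features)
    return np.linalg.lstsq(A, target, rcond=None)[0]

coef_good = ols([X], Y)[1]        # ~1.0: the correct causal effect
coef_bad = ols([X, C], Y)[1]      # ~0.0 here: collider control wipes it out

print(f"Y ~ X     : coefficient on X = {coef_good:.2f}")
print(f"Y ~ X + C : coefficient on X = {coef_bad:.2f}")
```

Adding C improves the regression's fit (it absorbs part of Y's noise), which is exactly why statistical criteria like R^2 cannot tell you which controls are safe.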

Watch Out

The Rubin and Pearl frameworks are not in opposition

The potential outcomes framework (Rubin) and the structural causal model framework (Pearl) address the same problems with different notation. Potential outcomes Y(x) correspond to counterfactual values in an SCM. The backdoor criterion gives conditions under which the Rubin-style ignorability assumption holds. The frameworks are complementary, not competing. Pearl's DAGs make assumptions visible; Rubin's potential outcomes make estimands precise.

Exercises

ExerciseCore

Problem

Consider a DAG with three variables: X \to Y and Z \to X, Z \to Y (so Z is a common cause of X and Y). Does the empty set satisfy the backdoor criterion for the effect of X on Y? Does \{Z\} satisfy it? Write out the adjustment formula for the case where \{Z\} satisfies the criterion.

ExerciseCore

Problem

In the DAG X \to M \to Y with X \to Y (direct and indirect effects), does \{M\} satisfy the backdoor criterion for the total effect of X on Y? Explain why or why not.

ExerciseAdvanced

Problem

Prove that the truncated factorization formula is equivalent to the adjustment formula when the backdoor criterion is satisfied. Start from the truncated factorization for P(y \mid do(x)) in a DAG with variables \{X, Y, Z\} and edges Z \to X, Z \to Y, X \to Y.

ExerciseResearch

Problem

Consider an ML model trained to predict Y from (X, Z) on observational data. The true causal DAG has an unmeasured confounder U with U \to X and U \to Y, plus the paths X \to Y and Z \to Y (with Z unconfounded). The model achieves low prediction error on the training distribution. Under what distribution shifts will the model fail, and why? Relate your answer to the distinction between P(Y \mid X) and P(Y \mid do(X)).

References

Canonical:

  • Pearl, Causality: Models, Reasoning, and Inference (2nd ed., 2009), Chapters 1-4, 7
  • Pearl, Glymour, and Jewell, Causal Inference in Statistics: A Primer (2016), Chapters 1-4

Technical foundations:

  • Pearl, "Causal Diagrams for Empirical Research," Biometrika 82(4), 1995
  • Tian and Pearl, "On the Identification of Causal Effects," UAI 2002

Connections to potential outcomes:

  • Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015), Chapters 1-3
  • Richardson and Robins, "Single World Intervention Graphs," CSSS Working Paper 128, 2013

ML applications:

  • Peters, Janzing, and Scholkopf, Elements of Causal Inference (2017), Chapters 1-4, 6
  • Scholkopf et al., "Toward Causal Representation Learning," Proceedings of the IEEE 109(5), 2021


Last reviewed: April 2026
