

Causal Inference for Policy Evaluation

Quasi-experimental methods for recovering policy effects without randomization. Difference-in-differences identifies the average treatment effect on the treated under parallel trends; regression discontinuity identifies a local average treatment effect under continuity at the cutoff; instrumental variables identifies a local average treatment effect for compliers under monotonicity (Imbens-Angrist 1994). Synthetic control and double/debiased ML extend these designs to single-unit and high-dimensional settings.


Why This Matters

Most policy questions cannot be answered by a randomized trial. A minimum-wage change, a school reform, a tariff, a tax credit: the intervention is allocated by legislatures, geography, or eligibility cutoffs, and the analyst sees one realization. The credibility revolution (Angrist and Pischke 2010) reorganized empirical economics around research designs that recover causal effects under transparent, testable assumptions rather than around structural models that require many more.

Three identification results carry most of the load: difference-in-differences (DiD) under parallel trends, regression discontinuity (RDD) under continuity at the cutoff, and instrumental variables (IV) under exogeneity and monotonicity. Each is a theorem of the form "under assumption A, the estimand $\theta$ equals the causal quantity of interest." Knowing the theorem makes the assumption visible; a Stata xtreg does not.

This page states the three identification theorems precisely, sketches the proofs, and shows the failure modes that show up in real policy work. It also covers synthetic control and double/debiased ML, which extend the classical designs to single-unit and high-dimensional regimes.

Setup: potential outcomes

Use the Neyman-Rubin potential-outcomes framework. For each unit $i$ and treatment status $d \in \{0,1\}$, let $Y_i(d)$ denote the potential outcome under treatment $d$. Only one is observed: $Y_i = D_i Y_i(1) + (1-D_i) Y_i(0)$. The fundamental quantities are:

  • Average treatment effect (ATE): $\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$.
  • Average treatment effect on the treated (ATT): $\tau_{\mathrm{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]$.
  • Local average treatment effect (LATE): the average effect on a specific subpopulation of compliers, defined below.

Without randomization, none of these is identified by the observed joint distribution of $(Y_i, D_i)$ alone. Each design adds a structural assumption that closes the gap.
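The identification problem is easy to see numerically. The sketch below (all numbers invented for illustration) builds a population with a constant individual effect of $2.0$ but lets units select into treatment on $Y_i(0)$; the naive treated-minus-control comparison is then badly biased for the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Constant individual treatment effect of +2.0, so ATE = ATT = 2.0.
y0 = rng.normal(0.0, 1.0, n)       # potential outcome without treatment
y1 = y0 + 2.0                      # potential outcome with treatment

# Selection: units with high Y(0) are more likely to take treatment.
d = (y0 + rng.normal(0.0, 1.0, n) > 0).astype(int)
y = d * y1 + (1 - d) * y0          # only one potential outcome is observed

ate = (y1 - y0).mean()             # 2.0 by construction
naive = y[d == 1].mean() - y[d == 0].mean()
print(f"ATE = {ate:.2f}, naive treated-control gap = {naive:.2f}")
```

The naive gap exceeds the true effect by the selection term $\mathbb{E}[Y_i(0) \mid D_i=1] - \mathbb{E}[Y_i(0) \mid D_i=0]$, which is exactly what each design below must eliminate.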

Difference-in-differences

Setting: panel data with two periods $t \in \{0,1\}$ and a treatment-group indicator $G_i \in \{0,1\}$; the treated group receives treatment only in $t=1$. Observed outcomes are $Y_{i,t}$ for $i$ in either group. Define $Y_{i,t}(d)$ as the potential outcome at time $t$ under treatment status $d$.

Theorem

DiD identification of the ATT under parallel trends. Suppose SUTVA holds, the control group is untreated in both periods, there is no anticipation (so $Y_{i,0} = Y_{i,0}(0)$ for both groups), and parallel trends holds:

$$\mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=1] = \mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=0].$$

Then

$$\tau_{\mathrm{ATT}} = \big(\mathbb{E}[Y_{i,1} \mid G_i=1] - \mathbb{E}[Y_{i,0} \mid G_i=1]\big) - \big(\mathbb{E}[Y_{i,1} \mid G_i=0] - \mathbb{E}[Y_{i,0} \mid G_i=0]\big).$$

Intuition

Compute the change in average outcome for the treated group. Subtract the change in average outcome for the control group. The first difference removes time-invariant confounders specific to the treated group; the second difference removes period-specific shocks shared by both groups. What remains, under parallel trends, is the causal effect of treatment.

Proof Sketch

Decompose the treated-group change:

$$\mathbb{E}[Y_{i,1} \mid G_i=1] - \mathbb{E}[Y_{i,0} \mid G_i=1] = \mathbb{E}[Y_{i,1}(1) - Y_{i,0}(0) \mid G_i=1].$$

Add and subtract $\mathbb{E}[Y_{i,1}(0) \mid G_i=1]$:

$$= \underbrace{\mathbb{E}[Y_{i,1}(1) - Y_{i,1}(0) \mid G_i=1]}_{=\,\tau_{\mathrm{ATT}}} + \mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=1].$$

The second term is the counterfactual trend for the treated group, which is unobservable. Parallel trends replaces it with the observed control-group trend:

$$\mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=1] = \mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=0],$$

and SUTVA + control-not-treated identifies the right-hand side as $\mathbb{E}[Y_{i,1} \mid G_i=0] - \mathbb{E}[Y_{i,0} \mid G_i=0]$. Substituting and rearranging gives the DiD identity.
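A numerical check of the identity, with a made-up DGP that satisfies parallel trends by construction (different group levels, a shared time trend, and a treatment effect of 2.7 for the treated group in period 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # units per group

# Illustrative DGP obeying parallel trends: groups differ in baseline level,
# share a common time trend, and treatment adds tau_att for group 1 in period 1.
tau_att = 2.7
level = {0: 5.0, 1: 8.0}   # different baseline levels are allowed
trend = 1.5                # common period shock

means = {}
for g in (0, 1):
    y_pre = level[g] + rng.normal(0, 1, n)                       # period 0
    y_post = level[g] + trend + tau_att * g + rng.normal(0, 1, n)  # period 1
    means[g] = (y_pre.mean(), y_post.mean())

did = (means[1][1] - means[1][0]) - (means[0][1] - means[0][0])
print(f"DiD estimate: {did:.2f}  (true ATT = {tau_att})")
```

Because only the trends must match, the 3-point level gap between the groups does not bias the estimate.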

Why It Matters

DiD's strength is that the parallel-trends assumption is partially testable: with three or more pre-treatment periods you can plot pre-trends and check visually whether they were parallel. A pre-trends test that fails is sufficient to reject the design; a pre-trends test that passes is necessary but not sufficient (the trends could diverge in the post-period for reasons unrelated to past behavior). Card and Krueger (1994) is the canonical application: compare New Jersey fast-food employment to eastern Pennsylvania after a NJ minimum-wage increase.

Failure Mode

Heterogeneous treatment timing. When units adopt treatment at different times, the standard two-way fixed-effects (TWFE) estimator $y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it}$ does not in general identify a positively weighted average of unit-level effects. Goodman-Bacon (2021) and de Chaisemartin and D'Haultfoeuille (2020) show that TWFE implicitly uses already-treated units as controls for newly-treated units, contaminating the estimand. Use Callaway-Sant'Anna (2021) or stacked DiD instead.

Anticipation effects. If treated units adjust behavior before the policy takes effect (e.g., firms front-load hiring before a tax rise), $Y_{i,0}$ for treated units already reflects treatment, breaking no-anticipation. Diagnostic: an event study should show flat coefficients in pre-periods.

Compositional changes. If the population in the treated cell changes between periods (in-migration, attrition), the group $G_i=1$ no longer contains the same units, and the difference includes selection. Diagnostic: check balance of pre-treatment covariates by period.

Regression discontinuity

Setting: treatment $D_i = \mathbf{1}\{R_i \geq c\}$ for a known cutoff $c$ on a running variable $R_i$. Outcome $Y_i = Y_i(D_i)$. The "sharp" RDD assumes deterministic assignment; the "fuzzy" RDD allows imperfect compliance and is essentially IV with the cutoff as instrument.

Theorem

RDD identification of the LATE at the cutoff (Hahn-Todd-Van der Klaauw 2001). Suppose $\mathbb{E}[Y_i(0) \mid R_i = r]$ and $\mathbb{E}[Y_i(1) \mid R_i = r]$ are continuous in $r$ at $r = c$. Then in the sharp design

$$\tau(c) := \mathbb{E}[Y_i(1) - Y_i(0) \mid R_i = c] = \lim_{r \downarrow c} \mathbb{E}[Y_i \mid R_i = r] - \lim_{r \uparrow c} \mathbb{E}[Y_i \mid R_i = r].$$

Intuition

Units just above and just below the cutoff are nearly identical on every covariate (observed and unobserved), because crossing the cutoff was effectively a coin flip for borderline units. Treatment status is the only thing that systematically differs. Any jump in $\mathbb{E}[Y \mid R=r]$ at $r=c$ must therefore be the causal effect of treatment, evaluated at $R=c$.

Proof Sketch

By continuity,

$$\lim_{r \uparrow c} \mathbb{E}[Y_i \mid R_i = r] = \mathbb{E}[Y_i(0) \mid R_i = c], \qquad \lim_{r \downarrow c} \mathbb{E}[Y_i \mid R_i = r] = \mathbb{E}[Y_i(1) \mid R_i = c].$$

The first equality holds because for $r < c$ all units are untreated, so $Y_i = Y_i(0)$, and continuity allows the limit. The second holds symmetrically for $r > c$. Subtracting:

$$\lim_{r \downarrow c} \mathbb{E}[Y_i \mid R_i = r] - \lim_{r \uparrow c} \mathbb{E}[Y_i \mid R_i = r] = \mathbb{E}[Y_i(1) - Y_i(0) \mid R_i = c] = \tau(c).$$

The estimand is a conditional ATE at $R=c$, not the unconditional ATE. Without further assumptions, you cannot extrapolate $\tau(c)$ to units far from the cutoff.
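A minimal sketch of sharp-RDD estimation under these assumptions: simulated data, a hand-picked bandwidth, and a separate linear fit on each side of the cutoff whose intercepts estimate the two one-sided limits (a real analysis would use an MSE-optimal bandwidth and bias-corrected inference):

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, h = 200_000, 0.0, 0.5   # sample size, cutoff, illustrative bandwidth

r = rng.uniform(-1, 1, n)                     # running variable
d = (r >= c).astype(int)                      # sharp assignment at the cutoff
tau = 0.4                                     # true effect at the cutoff
y = 1.0 + 0.8 * r + tau * d + rng.normal(0, 0.3, n)

def side_intercept(rs, ys):
    # OLS of y on (1, r - c): the intercept estimates E[Y | R = c] from one side
    X = np.column_stack([np.ones_like(rs), rs - c])
    beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return beta[0]

below = (r < c) & (r > c - h)
above = (r >= c) & (r < c + h)
tau_hat = side_intercept(r[above], y[above]) - side_intercept(r[below], y[below])
print(f"RDD estimate at the cutoff: {tau_hat:.3f}")
```

The jump in fitted intercepts across the cutoff is the estimate of $\tau(c)$; nothing here speaks to effects away from $R = c$.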

Why It Matters

RDD is the closest thing to a randomized experiment in non-experimental policy data. The continuity assumption is mild and partially testable (McCrary 2008 density test for manipulation; covariate-balance plots for unobserved selection). Estimation in practice uses local linear or polynomial regression on each side of the cutoff with an MSE-optimal bandwidth (Calonico-Cattaneo-Titiunik 2014). Imbens and Lemieux (2008) is the canonical practitioner survey.

Failure Mode

Manipulation around the cutoff. If units can precisely control $R_i$ (e.g., test scores when teachers grade their own students; income when applying for a means-tested benefit), the density $f_R$ has a jump at $c$ and units just above the cutoff differ systematically from units just below. The McCrary 2008 test detects this; a significant density discontinuity invalidates the RDD.

Compound treatments at the cutoff. If multiple programs share the same eligibility cutoff (a means-tested credit triggers eligibility for two unrelated programs), the RDD identifies the joint effect of all of them, not the policy of interest. Read the institutional details before estimating.

Bandwidth dependence. Estimates can be sensitive to bandwidth choice. Calonico-Cattaneo-Titiunik 2014 give bias-corrected confidence intervals robust to MSE-optimal bandwidth selection; report the optimal bandwidth and run sensitivity to halving and doubling it.

Functional-form artefacts. Global high-order polynomial fits (e.g., quartic on each side) introduce edge effects that masquerade as discontinuities. Gelman and Imbens (2019) show why local linear is preferred over global polynomial.

Instrumental variables and the LATE

When $D_i$ is endogenous (correlated with the unobserved potential outcomes $Y_i(d)$), an instrument $Z_i$ that affects $D_i$ but affects $Y_i$ only through $D_i$ identifies a causal effect on a specific subpopulation. The Imbens-Angrist (1994) LATE theorem makes this precise.

Let $D_i(z)$ denote the potential treatment status under instrument value $z \in \{0,1\}$. Each unit has a type:

  • Always-takers: $D_i(0) = D_i(1) = 1$.
  • Never-takers: $D_i(0) = D_i(1) = 0$.
  • Compliers: $D_i(0) = 0,\ D_i(1) = 1$.
  • Defiers: $D_i(0) = 1,\ D_i(1) = 0$.

Theorem

Local Average Treatment Effect (Imbens-Angrist 1994). Suppose (i) exogeneity and exclusion: $(Y_i(0), Y_i(1), D_i(0), D_i(1)) \perp Z_i$; (ii) relevance: $\mathbb{E}[D_i \mid Z_i=1] \neq \mathbb{E}[D_i \mid Z_i=0]$; (iii) monotonicity: $D_i(1) \geq D_i(0)$ for all $i$. Then the Wald estimand

$$\frac{\mathbb{E}[Y_i \mid Z_i=1] - \mathbb{E}[Y_i \mid Z_i=0]}{\mathbb{E}[D_i \mid Z_i=1] - \mathbb{E}[D_i \mid Z_i=0]} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i(1) > D_i(0)],$$

the average treatment effect for compliers.

Intuition

The instrument is a randomized nudge: units with $Z_i = 1$ are pushed toward treatment relative to $Z_i = 0$. The numerator of the Wald ratio is the reduced-form effect of the nudge on outcomes (an ITT-like quantity). The denominator is the first-stage effect of the nudge on actual treatment uptake. The ratio rescales: per unit of induced treatment, what is the outcome change? Under monotonicity, the only units the nudge moves are compliers, so the ratio is the average effect on compliers.

Proof Sketch

By exogeneity and exclusion, $\mathbb{E}[Y_i \mid Z_i = z] = \mathbb{E}[Y_i(D_i(z))]$. Decompose by type:

$$\mathbb{E}[Y_i \mid Z_i=1] - \mathbb{E}[Y_i \mid Z_i=0] = \mathbb{E}[Y_i(D_i(1)) - Y_i(D_i(0))].$$

  • For always-takers, $D_i(1) = D_i(0) = 1$, so the difference is $0$.
  • For never-takers, $D_i(1) = D_i(0) = 0$, so the difference is $0$.
  • For compliers, $D_i(1) = 1,\ D_i(0) = 0$, so the difference is $Y_i(1) - Y_i(0)$.
  • For defiers, $D_i(1) = 0,\ D_i(0) = 1$, so the difference is $-(Y_i(1) - Y_i(0))$.

Monotonicity rules out defiers. The numerator becomes

$$\Pr(\mathrm{complier}) \cdot \mathbb{E}[Y_i(1) - Y_i(0) \mid \mathrm{complier}].$$

The denominator $\mathbb{E}[D_i \mid Z_i = 1] - \mathbb{E}[D_i \mid Z_i = 0]$ is exactly $\Pr(\mathrm{complier})$ by the same argument (always-takers and never-takers contribute zero, defiers are excluded). The ratio gives $\mathbb{E}[Y_i(1) - Y_i(0) \mid \mathrm{complier}]$.
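The type-by-type bookkeeping can be verified by brute force. In this invented population (20% always-takers with effect $+3$, 30% never-takers with effect $-1$, 50% compliers with effect $+1$, no defiers), the Wald ratio recovers the complier effect, not the ATE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Types: 20% always-takers, 30% never-takers, 50% compliers (monotonicity holds).
u = rng.uniform(0, 1, n)
always = u < 0.2
never = u >= 0.7

z = rng.integers(0, 2, n)                       # randomized binary instrument
d = np.where(always, 1, np.where(never, 0, z))  # only compliers follow the nudge

# Heterogeneous effects by type; the complier effect is +1.0.
effect = np.where(always, 3.0, np.where(never, -1.0, 1.0))
y = rng.normal(0, 1, n) + d * effect

ate = effect.mean()                             # about 0.8 in this population
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(f"Wald = {wald:.2f}, ATE = {ate:.2f}")
```

The always-takers' large effect never enters the Wald ratio, because the instrument does not move their treatment status.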

Why It Matters

LATE is the cleanest identification result for endogenous treatments and is the workhorse of natural-experiment economics. Angrist (1990) used the Vietnam draft lottery as an instrument for veteran status; Angrist and Krueger (1991) used quarter-of-birth as an instrument for years of schooling. The result also clarifies a previously fuzzy claim: IV does not identify the ATE; it identifies a complier-specific effect whose policy relevance depends on who the compliers are. A policy that targets always-takers gains nothing from a LATE estimated off compliers.

Failure Mode

Weak instruments. If $\mathrm{Cov}(Z_i, D_i)$ is small, the denominator is near zero and tiny violations of exclusion are amplified into large bias. Stock-Yogo (2005) tabulate first-stage F-statistic thresholds (rule of thumb: $F > 10$ for one instrument; tighter cutoffs from Lee-McCrary-Moreira-Porter 2022 with weak-IV-robust inference).
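A small Monte Carlo (synthetic data, invented parameters) illustrating the amplification: the same confounded DGP is estimated with a strong first stage ($\pi = 1$) and a weak one ($\pi = 0.05$), and the weak-IV sampling distribution is wildly dispersed:

```python
import numpy as np

rng = np.random.default_rng(6)

def iv_estimates(pi, n=500, reps=2000):
    # Just-identified IV: ratio of sample covariances. pi is the first-stage slope.
    ests = np.empty(reps)
    for rep in range(reps):
        z = rng.normal(0, 1, n)                 # instrument
        u = rng.normal(0, 1, n)                 # unobserved confounder
        d = pi * z + u + rng.normal(0, 1, n)    # endogenous treatment
        y = d + u + rng.normal(0, 1, n)         # true effect = 1
        ests[rep] = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
    return ests

strong = iv_estimates(pi=1.0)
weak = iv_estimates(pi=0.05)
print(f"spread (std) strong: {strong.std():.2f}, weak: {weak.std():.2f}")
```

With a near-zero first stage the denominator frequently changes sign across samples, giving the ratio Cauchy-like tails; this is the mechanical face of the weak-instrument problem.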

Exclusion violations. $Z_i$ may affect $Y_i$ through paths other than $D_i$. The quarter-of-birth instrument (Angrist-Krueger 1991) is challenged on the grounds that birth season correlates with maternal characteristics that independently affect earnings.

Monotonicity violations. "Defiers" sound exotic but are common when the instrument is preference-based (a free product offer might raise uptake among most consumers but reduce it among prestige-sensitive ones). De Chaisemartin (2017) develops weaker "compliers-defiers" identification.

Heterogeneous treatment effects. Multiple-instrument 2SLS does not average LATEs in a policy-relevant way; it gives an overidentification-weighted combination of instrument-specific LATEs, often with negative weights (Mogstad-Torgovitsky-Walters 2021). Use marginal-treatment-effect (MTE) frameworks (Heckman-Vytlacil 2005) when this matters.

Beyond DiD/RDD/IV: synthetic control and double ML

Synthetic control (Abadie-Diamond-Hainmueller 2010). For a single treated unit (a country, a region), construct a weighted average of untreated units whose pre-treatment outcomes match the treated unit, then compare post-treatment trajectories. The weights $w_j \geq 0$, $\sum_j w_j = 1$ are chosen to minimize pre-period outcome distance. Inference is permutation-based: re-run the procedure with each donor as a placebo treated unit and compare the treated-unit gap to the placebo distribution. The canonical application estimates the effect of California's 1988 tobacco-control program (Proposition 99): the synthetic California predicts what cigarette consumption would have been absent the law.
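A toy sketch of the weight-fitting step under the stated constraints, solved by projected gradient descent onto the simplex (donor data and true weights are invented; real analyses also match on covariates and use dedicated packages):

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    shift = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + shift, 0.0)

rng = np.random.default_rng(4)
T0, J = 20, 4                         # pre-treatment periods, donor units

# Toy donor pool; the treated unit's pre-period path is an exact convex
# combination of donors 0 and 1, so the true weights are recoverable.
donors = rng.normal(0.0, 1.0, (T0, J))
w_true = np.array([0.6, 0.4, 0.0, 0.0])
treated_pre = donors @ w_true

w = np.full(J, 1.0 / J)               # start from uniform weights
for _ in range(5000):                 # projected gradient descent
    grad = donors.T @ (donors @ w - treated_pre)
    w = project_simplex(w - 0.01 * grad)
print(np.round(w, 3))                 # roughly [0.6, 0.4, 0.0, 0.0]
```

The simplex constraint is what makes the counterfactual an interpolation of donors rather than an extrapolation, and it is why the weights are typically sparse.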

Double/debiased ML (Chernozhukov-Chetverikov-Demirer-Duflo-Hansen-Newey-Robins 2018). When you have many controls $X_i$ and want to estimate a low-dimensional treatment effect $\theta$ from a partially linear model $Y_i = \theta D_i + g(X_i) + \varepsilon_i$, lasso/forest/NN estimates of $g$ carry first-order regularization bias that does not vanish at the $\sqrt{n}$ rate. Double ML uses Neyman-orthogonal moment conditions plus cross-fitting (estimate $g$ on one fold, plug into the moments on the other) to recover a $\sqrt{n}$-consistent, asymptotically normal $\hat\theta$:

$$\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N(0, \sigma^2),$$

even when $\hat g$ converges only at rate $n^{-1/4}$. The same recipe extends to DiD, IV, and partially linear quantile models. The crucial caveat: identification still comes from the research design. Double ML buys robustness to nuisance specification, not identification.
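The cross-fitting recipe is short enough to sketch. In this simulated partially linear model, a polynomial regression stands in for the flexible nuisance learner (lasso, forests, or neural nets in practice), and the residual-on-residual regression recovers $\theta$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, theta = 20_000, 1.5

x = rng.uniform(-2, 2, n)
g = np.sin(2 * x) + x**2             # nonlinear nuisance in the outcome
m = 0.5 * np.cos(x)                  # treatment depends on x => confounding
d = m + rng.normal(0, 1, n)
y = theta * d + g + rng.normal(0, 1, n)

def fit_predict(x_tr, t_tr, x_te, deg=8):
    # stand-in "flexible" learner: degree-8 polynomial least squares
    beta, *_ = np.linalg.lstsq(np.vander(x_tr, deg + 1), t_tr, rcond=None)
    return np.vander(x_te, deg + 1) @ beta

folds = np.array_split(rng.permutation(n), 2)
num = den = 0.0
for k in (0, 1):                     # cross-fitting: fit on one fold, predict the other
    tr, te = folds[1 - k], folds[k]
    d_res = d[te] - fit_predict(x[tr], d[tr], x[te])   # residualize treatment
    y_res = y[te] - fit_predict(x[tr], y[tr], x[te])   # residualize outcome
    num += d_res @ y_res
    den += d_res @ d_res
theta_hat = num / den
print(f"DML estimate: {theta_hat:.3f}  (true theta = {theta})")
```

The orthogonal (residualized) moment makes first-order errors in either nuisance fit cancel; a naive plug-in regression of $Y$ on $D$ and a regularized $\hat g$ would not have this property.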

Common Confusions

Watch Out

Parallel trends is not parallel levels

DiD does not require treated and control units to have the same pre-period level of the outcome. It requires their counterfactual trends in the post-period to be the same. Mechanical convergence or divergence that predates treatment violates the assumption even when levels match at baseline, and matching levels tells you nothing about whether trends would have continued in parallel.

Watch Out

LATE is not the ATE, the ATT, or the policy effect

The Wald estimand identifies the average effect on compliers under the chosen instrument. Always-takers and never-takers contribute zero; their effects are not identified at all by IV with this instrument. A different instrument induces a different complier subpopulation and identifies a different LATE. "The IV estimate is 0.3" only means something once you know who the compliers are.

Watch Out

Double ML does not create identification

Replacing OLS controls with a gradient-boosted residualizer does not turn an observational study into a causal one. If the treatment-assignment mechanism is not captured by the controls, the estimate is biased regardless of how flexible the nuisance model is. Double ML buys robustness to nuisance specification, not identification.

Watch Out

RDD identifies a local effect

The RDD estimand is the conditional ATE at $R = c$. Extrapolating to units far from the cutoff requires extra structure (a parametric model of $\tau(R)$, or marginal-treatment-effect machinery). A policy that helps people just under the eligibility cutoff might harm people far below it, and the RDD will not detect that.

Watch Out

An insignificant pre-trends test does not save you

A common mistake is to run a pre-trends F-test, fail to reject, and treat that as evidence that parallel trends holds. Pre-trends tests have low power against the kinds of small, persistent divergences that bias DiD estimates (Roth 2022 makes this quantitative). An insignificant pre-trends test is necessary but never sufficient.

Worked Example: minimum wage as DiD

Card and Krueger (1994) compared full-time-equivalent (FTE) employment at fast-food restaurants in New Jersey and eastern Pennsylvania before and after NJ's April 1992 minimum-wage increase from \$4.25 to \$5.05 per hour.

| Group | FTE Feb 1992 | FTE Nov 1992 | $\Delta$ |
|---|---|---|---|
| New Jersey (treated) | 20.4 | 21.0 | +0.6 |
| Eastern PA (control) | 23.3 | 21.2 | −2.1 |

DiD estimate: $\hat\tau_{\mathrm{ATT}} = (+0.6) - (-2.1) = +2.7$ FTE per restaurant. Interpreted causally under parallel trends: the minimum-wage increase raised rather than lowered NJ employment, contradicting the textbook supply-demand prediction. The result was contested empirically (Neumark-Wascher 2000 reanalyzed payroll data and found a negative effect) and methodologically (sensitivity to the choice of control area), and the broader minimum-wage literature has converged on small employment effects with substantial heterogeneity.
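The 2×2 arithmetic, spelled out (numbers from the table above):

```python
# Card-Krueger 2x2 DiD from the table above.
fte = {
    ("nj", "feb"): 20.4, ("nj", "nov"): 21.0,   # treated
    ("pa", "feb"): 23.3, ("pa", "nov"): 21.2,   # control
}
d_nj = fte[("nj", "nov")] - fte[("nj", "feb")]  # change for treated
d_pa = fte[("pa", "nov")] - fte[("pa", "feb")]  # change for control
did = d_nj - d_pa
print(f"dNJ = {d_nj:+.1f}, dPA = {d_pa:+.1f}, DiD = {did:+.1f}")  # DiD = +2.7
```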

The methodological lesson is independent of the substantive conclusion: the design made its identifying assumption explicit. Critics could attack parallel trends, propose alternative controls, or question SUTVA (cross-border employment shifts), and the debate stayed about identification rather than functional form.

Exercise (Core)

Problem

You estimate a DiD on the minimum-wage data and get $\hat\tau = +2.7$ FTE per restaurant. A reviewer points out that NJ and eastern PA had different pre-period employment trends in 1989-1991: NJ was rising at $+0.5$ FTE/year while eastern PA was flat. Re-derive the DiD identification proof under the assumption that the pre-trend gap of $+0.5$ FTE/year would have continued in 1992 absent treatment, and adjust the DiD estimate accordingly. Then state precisely what assumption the adjustment requires beyond plain parallel trends.

Exercise (Advanced)

Problem

A school district admits students into a gifted-and-talented program if and only if their entrance test score $R_i$ is at least $c = 130$. Five years later, you observe high-school GPA $Y_i$. You estimate an RDD with local linear regression on each side of $c$, an MSE-optimal bandwidth, and find $\hat\tau(130) = 0.4$ GPA points.

(a) State precisely what causal quantity $\hat\tau(130)$ estimates. (b) A colleague claims this means "expanding the program to students at $R = 120$ would raise their GPA by 0.4". Why is this wrong? (c) The McCrary density test on $R$ shows a small but statistically significant jump at $c = 130$. What threats does this raise, and what should you check?

Exercise (Advanced)

Problem

Show by example that monotonicity in the LATE theorem is essential, not a technicality. Construct a population with always-takers, never-takers, compliers, and a small number of defiers, with explicit potential outcomes, such that the Wald estimand recovers a number that is not the average treatment effect on any subpopulation of interest (not the ATE, not the ATT, not the LATE on compliers).

Then explain why de Chaisemartin's (2017) "compliers-defiers" framework can still extract policy-relevant information from such a population, and what extra assumption it needs.


Last reviewed: April 18, 2026
