

Causal Inference for Policy Evaluation

Quasi-experimental methods for recovering policy effects without randomization. Difference-in-differences identifies the average treatment effect on the treated under parallel trends; regression discontinuity identifies a local average treatment effect under continuity at the cutoff; instrumental variables identifies a local average treatment effect for compliers under monotonicity (Imbens-Angrist 1994). Synthetic control and double/debiased ML extend these designs to single-unit and high-dimensional settings.


Why This Matters

Most policy questions cannot be answered by a randomized trial. A minimum-wage change, a school reform, a tariff, a tax credit: the intervention is allocated by legislatures, geography, or eligibility cutoffs, and the analyst sees one realization. The credibility revolution (Angrist and Pischke 2010) reorganized empirical economics around research designs that recover causal effects under transparent, testable assumptions rather than around structural models that require many more.

Three identification results carry most of the load: difference-in-differences (DiD) under parallel trends, regression discontinuity (RDD) under continuity at the cutoff, and instrumental variables (IV) under exogeneity and monotonicity. Each is a theorem of the form "under assumption A, the estimand $\theta$ equals the causal quantity of interest." Knowing the theorem makes the assumption visible; a Stata xtreg does not.

This page states the three identification theorems precisely, sketches the proofs, and shows the failure modes that show up in real policy work. It also covers synthetic control and double/debiased ML, which extend the classical designs to single-unit and high-dimensional regimes.

Setup: potential outcomes

Use the Neyman-Rubin potential-outcomes framework. For each unit $i$ and treatment status $d \in \{0,1\}$, let $Y_i(d)$ denote the potential outcome under treatment $d$. Only one is observed: $Y_i = D_i Y_i(1) + (1-D_i) Y_i(0)$. The fundamental quantities are:

  • Average treatment effect (ATE): $\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$.
  • Average treatment effect on the treated (ATT): $\tau_{\mathrm{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]$.
  • Local average treatment effect (LATE): the average effect on a specific subpopulation of compliers, defined below.

Without randomization, none of these is identified by the observed joint distribution of $(Y_i, D_i)$ alone. Each design adds a structural assumption that closes the gap.
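The identification problem is easy to see numerically. The sketch below (all numbers invented for illustration) builds a population with a constant individual effect of $2.0$ but lets units select into treatment on $Y_i(0)$; the naive treated-minus-control comparison is then badly biased for the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Constant individual treatment effect of +2.0, so ATE = ATT = 2.0.
y0 = rng.normal(0.0, 1.0, n)       # potential outcome without treatment
y1 = y0 + 2.0                      # potential outcome with treatment

# Selection: units with high Y(0) are more likely to take treatment.
d = (y0 + rng.normal(0.0, 1.0, n) > 0).astype(int)
y = d * y1 + (1 - d) * y0          # only one potential outcome is observed

ate = (y1 - y0).mean()             # 2.0 by construction
naive = y[d == 1].mean() - y[d == 0].mean()
print(f"ATE = {ate:.2f}, naive treated-control gap = {naive:.2f}")
```

The naive gap exceeds the true effect by the selection term $\mathbb{E}[Y_i(0) \mid D_i=1] - \mathbb{E}[Y_i(0) \mid D_i=0]$, which is exactly what each design below must eliminate.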

Difference-in-differences

Setting: panel data with two periods $t \in \{0,1\}$ and a treatment-group indicator $G_i \in \{0,1\}$; the treated group receives treatment only in $t=1$. Observed outcomes are $Y_{i,t}$ for $i$ in either group. Define $Y_{i,t}(d)$ as the potential outcome at time $t$ under treatment status $d$.

Theorem

DiD identification of the ATT under parallel trends. Suppose SUTVA holds, the control group is untreated in both periods, there is no anticipation (so $Y_{i,0} = Y_{i,0}(0)$ for both groups), and parallel trends holds:

$$\mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=1] = \mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=0].$$

Then

$$\tau_{\mathrm{ATT}} = \big(\mathbb{E}[Y_{i,1} \mid G_i=1] - \mathbb{E}[Y_{i,0} \mid G_i=1]\big) - \big(\mathbb{E}[Y_{i,1} \mid G_i=0] - \mathbb{E}[Y_{i,0} \mid G_i=0]\big).$$

Intuition

Compute the change in average outcome for the treated group. Subtract the change in average outcome for the control group. The first difference removes time-invariant confounders specific to the treated group; the second difference removes period-specific shocks shared by both groups. What remains, under parallel trends, is the causal effect of treatment.

Proof Sketch

Decompose the treated-group change:

$$\mathbb{E}[Y_{i,1} \mid G_i=1] - \mathbb{E}[Y_{i,0} \mid G_i=1] = \mathbb{E}[Y_{i,1}(1) - Y_{i,0}(0) \mid G_i=1].$$

Add and subtract $\mathbb{E}[Y_{i,1}(0) \mid G_i=1]$:

$$= \underbrace{\mathbb{E}[Y_{i,1}(1) - Y_{i,1}(0) \mid G_i=1]}_{=\,\tau_{\mathrm{ATT}}} + \mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=1].$$

The second term is the counterfactual trend for the treated group, which is unobservable. Parallel trends replaces it with the observed control-group trend:

$$\mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=1] = \mathbb{E}[Y_{i,1}(0) - Y_{i,0}(0) \mid G_i=0],$$

and SUTVA + control-not-treated identifies the right-hand side as $\mathbb{E}[Y_{i,1} \mid G_i=0] - \mathbb{E}[Y_{i,0} \mid G_i=0]$. Substituting and rearranging gives the DiD identity.
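A numerical check of the identity, with a made-up DGP that satisfies parallel trends by construction (different group levels, a shared time trend, and a treatment effect of 2.7 for the treated group in period 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # units per group

# Illustrative DGP obeying parallel trends: groups differ in baseline level,
# share a common time trend, and treatment adds tau_att for group 1 in period 1.
tau_att = 2.7
level = {0: 5.0, 1: 8.0}   # different baseline levels are allowed
trend = 1.5                # common period shock

means = {}
for g in (0, 1):
    y_pre = level[g] + rng.normal(0, 1, n)                       # period 0
    y_post = level[g] + trend + tau_att * g + rng.normal(0, 1, n)  # period 1
    means[g] = (y_pre.mean(), y_post.mean())

did = (means[1][1] - means[1][0]) - (means[0][1] - means[0][0])
print(f"DiD estimate: {did:.2f}  (true ATT = {tau_att})")
```

Because only the trends must match, the 3-point level gap between the groups does not bias the estimate.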

Why It Matters

DiD's strength is that the parallel-trends assumption is partially testable: with three or more pre-treatment periods you can plot pre-trends and check visually whether they were parallel. A pre-trends test that fails is sufficient to reject the design; a pre-trends test that passes is necessary but not sufficient (the trends could diverge in the post-period for reasons unrelated to past behavior). Card and Krueger (1994) is the canonical application: compare New Jersey fast-food employment to eastern Pennsylvania after a NJ minimum-wage increase.

Failure Mode

Heterogeneous treatment timing. When units adopt treatment at different times, the standard two-way fixed-effects (TWFE) estimator $y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it}$ does not in general identify a positively weighted average of unit-level effects. Goodman-Bacon (2021) and de Chaisemartin and D'Haultfoeuille (2020) show that TWFE implicitly uses already-treated units as controls for newly-treated units, contaminating the estimand. Use Callaway-Sant'Anna (2021) or stacked DiD instead.

Anticipation effects. If treated units adjust behavior before the policy takes effect (e.g., firms front-load hiring before a tax rise), $Y_{i,0}$ for treated units already reflects treatment, breaking no-anticipation. Diagnostic: an event study should show flat coefficients in pre-periods.

Compositional changes. If the population in the treated cell changes between periods (in-migration, attrition), the group $G_i=1$ no longer contains the same units, and the difference includes selection. Diagnostic: check balance of pre-treatment covariates by period.

Regression discontinuity

Setting: treatment $D_i = \mathbf{1}\{R_i \geq c\}$ for a known cutoff $c$ on a running variable $R_i$. Outcome $Y_i = Y_i(D_i)$. The "sharp" RDD assumes deterministic assignment; the "fuzzy" RDD allows imperfect compliance and is essentially IV with the cutoff as instrument.

Theorem

RDD identification of the LATE at the cutoff (Hahn-Todd-Van der Klaauw 2001). Suppose $\mathbb{E}[Y_i(0) \mid R_i = r]$ and $\mathbb{E}[Y_i(1) \mid R_i = r]$ are continuous in $r$ at $r = c$. Then in the sharp design

$$\tau(c) := \mathbb{E}[Y_i(1) - Y_i(0) \mid R_i = c] = \lim_{r \downarrow c} \mathbb{E}[Y_i \mid R_i = r] - \lim_{r \uparrow c} \mathbb{E}[Y_i \mid R_i = r].$$

Intuition

Units just above and just below the cutoff are nearly identical on every covariate (observed and unobserved), because crossing the cutoff was effectively a coin flip for borderline units. Treatment status is the only thing that systematically differs. Any jump in $\mathbb{E}[Y \mid R=r]$ at $r=c$ must therefore be the causal effect of treatment, evaluated at $R=c$.

Proof Sketch

By continuity,

$$\lim_{r \uparrow c} \mathbb{E}[Y_i \mid R_i = r] = \mathbb{E}[Y_i(0) \mid R_i = c], \qquad \lim_{r \downarrow c} \mathbb{E}[Y_i \mid R_i = r] = \mathbb{E}[Y_i(1) \mid R_i = c].$$

The first equality holds because for $r < c$ all units are untreated, so $Y_i = Y_i(0)$, and continuity allows the limit. The second holds symmetrically for $r > c$. Subtracting:

$$\lim_{r \downarrow c} \mathbb{E}[Y_i \mid R_i = r] - \lim_{r \uparrow c} \mathbb{E}[Y_i \mid R_i = r] = \mathbb{E}[Y_i(1) - Y_i(0) \mid R_i = c] = \tau(c).$$

The estimand is a conditional ATE at $R=c$, not the unconditional ATE. Without further assumptions, you cannot extrapolate $\tau(c)$ to units far from the cutoff.
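A minimal sketch of sharp-RDD estimation under these assumptions: simulated data, a hand-picked bandwidth, and a separate linear fit on each side of the cutoff whose intercepts estimate the two one-sided limits (a real analysis would use an MSE-optimal bandwidth and bias-corrected inference):

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, h = 200_000, 0.0, 0.5   # sample size, cutoff, illustrative bandwidth

r = rng.uniform(-1, 1, n)                     # running variable
d = (r >= c).astype(int)                      # sharp assignment at the cutoff
tau = 0.4                                     # true effect at the cutoff
y = 1.0 + 0.8 * r + tau * d + rng.normal(0, 0.3, n)

def side_intercept(rs, ys):
    # OLS of y on (1, r - c): the intercept estimates E[Y | R = c] from one side
    X = np.column_stack([np.ones_like(rs), rs - c])
    beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return beta[0]

below = (r < c) & (r > c - h)
above = (r >= c) & (r < c + h)
tau_hat = side_intercept(r[above], y[above]) - side_intercept(r[below], y[below])
print(f"RDD estimate at the cutoff: {tau_hat:.3f}")
```

The jump in fitted intercepts across the cutoff is the estimate of $\tau(c)$; nothing here speaks to effects away from $R = c$.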

Why It Matters

RDD is the closest thing to a randomized experiment in non-experimental policy data. The continuity assumption is mild and partially testable (McCrary 2008 density test for manipulation; covariate-balance plots for unobserved selection). Estimation in practice uses local linear or polynomial regression on each side of the cutoff with an MSE-optimal bandwidth (Calonico-Cattaneo-Titiunik 2014). Imbens and Lemieux (2008) is the canonical practitioner survey.

Failure Mode

Manipulation around the cutoff. If units can precisely control $R_i$ (e.g., test scores when teachers grade their own students; income when applying for a means-tested benefit), the density $f_R$ has a jump at $c$ and units just above the cutoff differ systematically from units just below. The McCrary 2008 test detects this; a significant density discontinuity invalidates the RDD.

Compound treatments at the cutoff. If multiple programs share the same eligibility cutoff (a means-tested credit triggers eligibility for two unrelated programs), the RDD identifies the joint effect of all of them, not the policy of interest. Read the institutional details before estimating.

Bandwidth dependence. Estimates can be sensitive to bandwidth choice. Calonico-Cattaneo-Titiunik 2014 give bias-corrected confidence intervals robust to MSE-optimal bandwidth selection; report the optimal bandwidth and run sensitivity to halving and doubling it.

Functional-form artefacts. Global high-order polynomial fits (e.g., quartic on each side) introduce edge effects that masquerade as discontinuities. Gelman and Imbens (2019) show why local linear is preferred over global polynomial.

Instrumental variables and the LATE

When $D_i$ is endogenous (correlated with the unobserved potential outcomes $Y_i(d)$), an instrument $Z_i$ that affects $D_i$ but affects $Y_i$ only through $D_i$ identifies a causal effect on a specific subpopulation. The Imbens-Angrist (1994) LATE theorem makes this precise.

Let $D_i(z)$ denote the potential treatment status under instrument value $z \in \{0,1\}$. Each unit has a type:

  • Always-takers: $D_i(0) = D_i(1) = 1$.
  • Never-takers: $D_i(0) = D_i(1) = 0$.
  • Compliers: $D_i(0) = 0,\ D_i(1) = 1$.
  • Defiers: $D_i(0) = 1,\ D_i(1) = 0$.

Theorem

Local Average Treatment Effect (Imbens-Angrist 1994). Suppose (i) exogeneity and exclusion: $(Y_i(0), Y_i(1), D_i(0), D_i(1)) \perp Z_i$; (ii) relevance: $\mathbb{E}[D_i \mid Z_i=1] \neq \mathbb{E}[D_i \mid Z_i=0]$; (iii) monotonicity: $D_i(1) \geq D_i(0)$ for all $i$. Then the Wald estimand

$$\frac{\mathbb{E}[Y_i \mid Z_i=1] - \mathbb{E}[Y_i \mid Z_i=0]}{\mathbb{E}[D_i \mid Z_i=1] - \mathbb{E}[D_i \mid Z_i=0]} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i(1) > D_i(0)],$$

the average treatment effect for compliers.

Intuition

The instrument is a randomized nudge: units with $Z_i = 1$ are pushed toward treatment relative to $Z_i = 0$. The numerator of the Wald ratio is the reduced-form effect of the nudge on outcomes (an ITT-like quantity). The denominator is the first-stage effect of the nudge on actual treatment uptake. The ratio rescales: per unit of induced treatment, what is the outcome change? Under monotonicity, the only units the nudge moves are compliers, so the ratio is the average effect on compliers.

Proof Sketch

By exogeneity and exclusion, $\mathbb{E}[Y_i \mid Z_i = z] = \mathbb{E}[Y_i(D_i(z))]$. Decompose by type:

$$\mathbb{E}[Y_i \mid Z_i=1] - \mathbb{E}[Y_i \mid Z_i=0] = \mathbb{E}[Y_i(D_i(1)) - Y_i(D_i(0))].$$

  • For always-takers, $D_i(1) = D_i(0) = 1$, so the difference is $0$.
  • For never-takers, $D_i(1) = D_i(0) = 0$, so the difference is $0$.
  • For compliers, $D_i(1) = 1,\ D_i(0) = 0$, so the difference is $Y_i(1) - Y_i(0)$.
  • For defiers, $D_i(1) = 0,\ D_i(0) = 1$, so the difference is $-(Y_i(1) - Y_i(0))$.

Monotonicity rules out defiers. The numerator becomes

$$\Pr(\mathrm{complier}) \cdot \mathbb{E}[Y_i(1) - Y_i(0) \mid \mathrm{complier}].$$

The denominator $\mathbb{E}[D_i \mid Z_i = 1] - \mathbb{E}[D_i \mid Z_i = 0]$ is exactly $\Pr(\mathrm{complier})$ by the same argument (always-takers and never-takers contribute zero, defiers are excluded). The ratio gives $\mathbb{E}[Y_i(1) - Y_i(0) \mid \mathrm{complier}]$.
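The type-by-type bookkeeping can be verified by brute force. In this invented population (20% always-takers with effect $+3$, 30% never-takers with effect $-1$, 50% compliers with effect $+1$, no defiers), the Wald ratio recovers the complier effect, not the ATE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Types: 20% always-takers, 30% never-takers, 50% compliers (monotonicity holds).
u = rng.uniform(0, 1, n)
always = u < 0.2
never = u >= 0.7

z = rng.integers(0, 2, n)                       # randomized binary instrument
d = np.where(always, 1, np.where(never, 0, z))  # only compliers follow the nudge

# Heterogeneous effects by type; the complier effect is +1.0.
effect = np.where(always, 3.0, np.where(never, -1.0, 1.0))
y = rng.normal(0, 1, n) + d * effect

ate = effect.mean()                             # about 0.8 in this population
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(f"Wald = {wald:.2f}, ATE = {ate:.2f}")
```

The always-takers' large effect never enters the Wald ratio, because the instrument does not move their treatment status.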

Why It Matters

LATE is the cleanest identification result for endogenous treatments and is the workhorse of natural-experiment economics. Angrist (1990) used the Vietnam draft lottery as an instrument for veteran status; Angrist and Krueger (1991) used quarter-of-birth as an instrument for years of schooling. The result also clarifies a previously fuzzy claim: IV does not identify the ATE; it identifies a complier-specific effect whose policy relevance depends on who the compliers are. A policy that targets always-takers gains nothing from a LATE estimated off compliers.

Failure Mode

Weak instruments. If $\mathrm{Cov}(Z_i, D_i)$ is small, the denominator is near zero and tiny violations of exclusion are amplified into large bias. Stock-Yogo (2005) tabulate first-stage F-statistic thresholds (rule of thumb: $F > 10$ for one instrument; tighter cutoffs from Lee-McCrary-Moreira-Porter 2022 with weak-IV-robust inference).
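A small Monte Carlo (synthetic data, invented parameters) illustrating the amplification: the same confounded DGP is estimated with a strong first stage ($\pi = 1$) and a weak one ($\pi = 0.05$), and the weak-IV sampling distribution is wildly dispersed:

```python
import numpy as np

rng = np.random.default_rng(6)

def iv_estimates(pi, n=500, reps=2000):
    # Just-identified IV: ratio of sample covariances. pi is the first-stage slope.
    ests = np.empty(reps)
    for rep in range(reps):
        z = rng.normal(0, 1, n)                 # instrument
        u = rng.normal(0, 1, n)                 # unobserved confounder
        d = pi * z + u + rng.normal(0, 1, n)    # endogenous treatment
        y = d + u + rng.normal(0, 1, n)         # true effect = 1
        ests[rep] = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
    return ests

strong = iv_estimates(pi=1.0)
weak = iv_estimates(pi=0.05)
print(f"spread (std) strong: {strong.std():.2f}, weak: {weak.std():.2f}")
```

With a near-zero first stage the denominator frequently changes sign across samples, giving the ratio Cauchy-like tails; this is the mechanical face of the weak-instrument problem.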

Exclusion violations. $Z_i$ may affect $Y_i$ through paths other than $D_i$. The quarter-of-birth instrument (Angrist-Krueger 1991) is challenged on the grounds that birth season correlates with maternal characteristics that independently affect earnings.

Monotonicity violations. "Defiers" sound exotic but are common when the instrument is preference-based (a free product offer might raise uptake among most consumers but reduce it among prestige-sensitive ones). De Chaisemartin (2017) develops weaker "compliers-defiers" identification.

Heterogeneous treatment effects. Multiple-instrument 2SLS does not average LATEs in a policy-relevant way; it gives an overidentification-weighted combination of instrument-specific LATEs, often with negative weights (Mogstad-Torgovitsky-Walters 2021). Use marginal-treatment-effect (MTE) frameworks (Heckman-Vytlacil 2005) when this matters.

Beyond DiD/RDD/IV: synthetic control and double ML

Synthetic control (Abadie-Diamond-Hainmueller 2010). For a single treated unit (a country, a region), construct a weighted average of untreated units whose pre-treatment outcomes match the treated unit, then compare post-treatment trajectories. The weights $w_j \geq 0$, $\sum_j w_j = 1$ are chosen to minimize pre-period outcome distance. Inference is permutation-based: re-run the procedure with each donor as a placebo treated unit and compare the treated-unit gap to the placebo distribution. The canonical application estimates the effect of California's 1988 tobacco-control program (Proposition 99): the synthetic California predicts what cigarette consumption would have been absent the law.
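A toy sketch of the weight-fitting step under the stated constraints, solved by projected gradient descent onto the simplex (donor data and true weights are invented; real analyses also match on covariates and use dedicated packages):

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    shift = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + shift, 0.0)

rng = np.random.default_rng(4)
T0, J = 20, 4                         # pre-treatment periods, donor units

# Toy donor pool; the treated unit's pre-period path is an exact convex
# combination of donors 0 and 1, so the true weights are recoverable.
donors = rng.normal(0.0, 1.0, (T0, J))
w_true = np.array([0.6, 0.4, 0.0, 0.0])
treated_pre = donors @ w_true

w = np.full(J, 1.0 / J)               # start from uniform weights
for _ in range(5000):                 # projected gradient descent
    grad = donors.T @ (donors @ w - treated_pre)
    w = project_simplex(w - 0.01 * grad)
print(np.round(w, 3))                 # roughly [0.6, 0.4, 0.0, 0.0]
```

The simplex constraint is what makes the counterfactual an interpolation of donors rather than an extrapolation, and it is why the weights are typically sparse.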

Double/debiased ML (Chernozhukov-Chetverikov-Demirer-Duflo-Hansen-Newey-Robins 2018). When you have many controls $X_i$ and want to estimate a low-dimensional treatment effect $\theta$ from a partially linear model $Y_i = \theta D_i + g(X_i) + \varepsilon_i$, lasso/forest/NN estimates of $g$ carry first-order regularization bias that does not vanish at the $\sqrt{n}$ rate. Double ML uses Neyman-orthogonal moment conditions plus cross-fitting (estimate $g$ on one fold, plug into the moments on the other) to recover a $\sqrt{n}$-consistent, asymptotically normal $\hat\theta$:

$$\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N(0, \sigma^2),$$

even when $\hat g$ converges only at rate $n^{-1/4}$. The same recipe extends to DiD, IV, and partially linear quantile models. The crucial caveat: identification still comes from the research design. Double ML buys robustness to nuisance specification, not identification.
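The cross-fitting recipe is short enough to sketch. In this simulated partially linear model, a polynomial regression stands in for the flexible nuisance learner (lasso, forests, or neural nets in practice), and the residual-on-residual regression recovers $\theta$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, theta = 20_000, 1.5

x = rng.uniform(-2, 2, n)
g = np.sin(2 * x) + x**2             # nonlinear nuisance in the outcome
m = 0.5 * np.cos(x)                  # treatment depends on x => confounding
d = m + rng.normal(0, 1, n)
y = theta * d + g + rng.normal(0, 1, n)

def fit_predict(x_tr, t_tr, x_te, deg=8):
    # stand-in "flexible" learner: degree-8 polynomial least squares
    beta, *_ = np.linalg.lstsq(np.vander(x_tr, deg + 1), t_tr, rcond=None)
    return np.vander(x_te, deg + 1) @ beta

folds = np.array_split(rng.permutation(n), 2)
num = den = 0.0
for k in (0, 1):                     # cross-fitting: fit on one fold, predict the other
    tr, te = folds[1 - k], folds[k]
    d_res = d[te] - fit_predict(x[tr], d[tr], x[te])   # residualize treatment
    y_res = y[te] - fit_predict(x[tr], y[tr], x[te])   # residualize outcome
    num += d_res @ y_res
    den += d_res @ d_res
theta_hat = num / den
print(f"DML estimate: {theta_hat:.3f}  (true theta = {theta})")
```

The orthogonal (residualized) moment makes first-order errors in either nuisance fit cancel; a naive plug-in regression of $Y$ on $D$ and a regularized $\hat g$ would not have this property.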

Common Confusions

Watch Out

Parallel trends is not parallel levels

DiD does not require treated and control units to have the same pre-period level of the outcome. It requires their counterfactual trends in the post-period to be the same. Mechanical convergence or divergence that predates treatment violates the assumption even when levels match at baseline, and matching levels tells you nothing about whether trends would have continued in parallel.

Watch Out

LATE is not the ATE, the ATT, or the policy effect

The Wald estimand identifies the average effect on compliers under the chosen instrument. Always-takers and never-takers contribute zero; their effects are not identified at all by IV with this instrument. A different instrument induces a different complier subpopulation and identifies a different LATE. "The IV estimate is 0.3" only means something once you know who the compliers are.

Watch Out

Double ML does not create identification

Replacing OLS controls with a gradient-boosted residualizer does not turn an observational study into a causal one. If the treatment-assignment mechanism is not captured by the controls, the estimate is biased regardless of how flexible the nuisance model is. Double ML buys robustness to nuisance specification, not identification.

Watch Out

RDD identifies a local effect

The RDD estimand is the conditional ATE at $R = c$. Extrapolating to units far from the cutoff requires extra structure (a parametric model of $\tau(R)$, or marginal-treatment-effect machinery). A policy that helps people just under the eligibility cutoff might harm people far below it, and the RDD will not detect that.

Watch Out

An insignificant pre-trends test does not save you

A common mistake is to run a pre-trends F-test, fail to reject, and treat that as evidence that parallel trends holds. Pre-trends tests have low power against the kinds of small, persistent divergences that bias DiD estimates (Roth 2022 makes this quantitative). An insignificant pre-trends test is necessary but never sufficient.

Worked Example: minimum wage as DiD

Card and Krueger (1994) compared full-time-equivalent (FTE) employment at fast-food restaurants in New Jersey and eastern Pennsylvania before and after NJ's April 1992 minimum-wage increase from \$4.25 to \$5.05 per hour.

| Group | FTE Feb 1992 | FTE Nov 1992 | $\Delta$ |
|---|---|---|---|
| New Jersey (treated) | 20.4 | 21.0 | +0.6 |
| Eastern PA (control) | 23.3 | 21.2 | −2.1 |

DiD estimate: $\hat\tau_{\mathrm{ATT}} = (+0.6) - (-2.1) = +2.7$ FTE per restaurant. Interpreted causally under parallel trends: the minimum-wage increase raised rather than lowered NJ employment, contradicting the textbook supply-demand prediction. The result was contested empirically (Neumark-Wascher 2000 reanalyzed payroll data and found a negative effect) and methodologically (sensitivity to the choice of control area), and the broader minimum-wage literature has converged on small employment effects with substantial heterogeneity.
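The 2×2 arithmetic, spelled out (numbers from the table above):

```python
# Card-Krueger 2x2 DiD from the table above.
fte = {
    ("nj", "feb"): 20.4, ("nj", "nov"): 21.0,   # treated
    ("pa", "feb"): 23.3, ("pa", "nov"): 21.2,   # control
}
d_nj = fte[("nj", "nov")] - fte[("nj", "feb")]  # change for treated
d_pa = fte[("pa", "nov")] - fte[("pa", "feb")]  # change for control
did = d_nj - d_pa
print(f"dNJ = {d_nj:+.1f}, dPA = {d_pa:+.1f}, DiD = {did:+.1f}")  # DiD = +2.7
```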

The methodological lesson is independent of the substantive conclusion: the design made its identifying assumption explicit. Critics could attack parallel trends, propose alternative controls, or question SUTVA (cross-border employment shifts), and the debate stayed about identification rather than functional form.

Exercise (Core)

Problem

You estimate a DiD on the minimum-wage data and get $\hat\tau = +2.7$ FTE per restaurant. A reviewer points out that NJ and eastern PA had different pre-period employment trends in 1989-1991: NJ was rising at $+0.5$ FTE/year while eastern PA was flat. Re-derive the DiD identification proof under the assumption that the pre-trend gap of $+0.5$ FTE/year would have continued in 1992 absent treatment, and adjust the DiD estimate accordingly. Then state precisely what assumption the adjustment requires beyond plain parallel trends.

Exercise (Advanced)

Problem

A school district admits students into a gifted-and-talented program if and only if their entrance test score $R_i$ is at least $c = 130$. Five years later, you observe high-school GPA $Y_i$. You estimate an RDD with local linear regression on each side of $c$, an MSE-optimal bandwidth, and find $\hat\tau(130) = 0.4$ GPA points.

(a) State precisely what causal quantity $\hat\tau(130)$ estimates. (b) A colleague claims this means "expanding the program to students at $R = 120$ would raise their GPA by 0.4". Why is this wrong? (c) The McCrary density test on $R$ shows a small but statistically significant jump at $c = 130$. What threats does this raise, and what should you check?

Exercise (Advanced)

Problem

Show by example that monotonicity in the LATE theorem is essential, not a technicality. Construct a population with always-takers, never-takers, compliers, and a small number of defiers, with explicit potential outcomes, such that the Wald estimand recovers a number that is not the average treatment effect on any subpopulation of interest (not the ATE, not the ATT, not the LATE on compliers).

Then explain why de Chaisemartin's (2017) "compliers-defiers" framework can still extract policy-relevant information from such a population, and what extra assumption it needs.


Last reviewed: April 18, 2026
