Applied ML
Causal Inference for Policy Evaluation
Quasi-experimental methods for recovering policy effects without randomization. Difference-in-differences identifies the average treatment effect on the treated under parallel trends; regression discontinuity identifies a local average treatment effect under continuity at the cutoff; instrumental variables identifies a local average treatment effect for compliers under monotonicity (Imbens-Angrist 1994). Synthetic control and double/debiased ML extend these designs to single-unit and high-dimensional settings.
Why This Matters
Most policy questions cannot be answered by a randomized trial. A minimum-wage change, a school reform, a tariff, a tax credit: the intervention is allocated by legislatures, geography, or eligibility cutoffs, and the analyst sees one realization. The credibility revolution (Angrist and Pischke 2010) reorganized empirical economics around research designs that recover causal effects under transparent, testable assumptions rather than around structural models that require many more.
Three identification results carry most of the load: difference-in-differences (DiD) under parallel trends, regression discontinuity (RDD) under continuity at the cutoff, and instrumental variables (IV) under exogeneity and monotonicity. Each is a theorem of the form "under assumption A, the estimand equals the causal quantity of interest." Knowing the theorem makes the assumption visible; a Stata xtreg does not.
This page states the three identification theorems precisely, sketches the proofs, and shows the failure modes that show up in real policy work. It also covers synthetic control and double/debiased ML, which extend the classical designs to single-unit and high-dimensional regimes.
Setup: potential outcomes
Use the Neyman-Rubin potential-outcomes framework. For each unit $i$ and treatment status $d \in \{0,1\}$, let $Y_i(d)$ denote the potential outcome under treatment $d$. Only one is observed: $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$. The fundamental quantities are:
- Average treatment effect (ATE): $\tau_{\text{ATE}} = \mathbb{E}[Y(1) - Y(0)]$.
- Average treatment effect on the treated (ATT): $\tau_{\text{ATT}} = \mathbb{E}[Y(1) - Y(0) \mid D = 1]$.
- Local average treatment effect (LATE): the ATE on a specific subpopulation of compliers, defined below.
Without randomization, none of these is identified by the observed joint distribution of $(Y, D)$ alone. Each design adds a structural assumption that closes the gap.
Difference-in-differences
Setting: panel data with two periods $t \in \{0, 1\}$ and a treatment group that receives treatment only in $t = 1$. Observed outcomes $Y_{it}$ for $t \in \{0, 1\}$ in either group. Define $Y_{it}(d)$ as the potential outcome at time $t$ under treatment status $d$.
DiD identification of the ATT under parallel trends
$$\tau_{\text{ATT}} = \big(\mathbb{E}[Y_{i1} \mid D=1] - \mathbb{E}[Y_{i0} \mid D=1]\big) - \big(\mathbb{E}[Y_{i1} \mid D=0] - \mathbb{E}[Y_{i0} \mid D=0]\big).$$
Intuition
Compute the change in average outcome for the treated group. Subtract the change in average outcome for the control group. The first difference removes time-invariant confounders specific to the treated group; the second difference removes period-specific shocks shared by both groups. What remains, under parallel trends, is the causal effect of treatment.
Proof Sketch
Decompose the treated-group change:
$$\mathbb{E}[Y_{i1} - Y_{i0} \mid D=1] = \mathbb{E}[Y_{i1}(1) - Y_{i0}(0) \mid D=1].$$
Add and subtract $\mathbb{E}[Y_{i1}(0) \mid D=1]$:
$$\mathbb{E}[Y_{i1}(1) - Y_{i0}(0) \mid D=1] = \underbrace{\mathbb{E}[Y_{i1}(1) - Y_{i1}(0) \mid D=1]}_{\tau_{\text{ATT}}} + \mathbb{E}[Y_{i1}(0) - Y_{i0}(0) \mid D=1].$$
The second term is the counterfactual trend for the treated group, which is unobservable. Parallel trends replaces it with the observed control-group trend:
$$\mathbb{E}[Y_{i1}(0) - Y_{i0}(0) \mid D=1] = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0) \mid D=0],$$
and SUTVA + control-not-treated identifies the right-hand side as $\mathbb{E}[Y_{i1} - Y_{i0} \mid D=0]$. Substituting and rearranging gives the DiD identity.
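The identity can be checked numerically. A minimal sketch (simulated two-period panel in which parallel trends holds by construction; the group levels, common trend, and ATT of 2.0 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # units per group (illustrative)
att = 2.0    # true average effect on the treated

# Treated group has a higher baseline LEVEL (allowed by DiD);
# both groups share a common time trend of +1.0.
y0_treat = 5.0 + rng.normal(0, 1, n)                 # pre-period, D=1
y1_treat = 5.0 + 1.0 + att + rng.normal(0, 1, n)     # post-period, treated
y0_ctrl = 2.0 + rng.normal(0, 1, n)                  # pre-period, D=0
y1_ctrl = 2.0 + 1.0 + rng.normal(0, 1, n)            # post-period, untreated

did = (y1_treat.mean() - y0_treat.mean()) - (y1_ctrl.mean() - y0_ctrl.mean())
print(f"DiD estimate: {did:.3f}  (true ATT = {att})")
```

The level gap between groups (5.0 vs. 2.0) drops out of the first difference, and the common +1.0 trend drops out of the second, leaving the ATT up to sampling noise.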
Why It Matters
DiD's strength is that the parallel-trends assumption is partially testable: with three or more pre-treatment periods you can plot pre-trends and check visually whether they were parallel. A pre-trends test that fails is sufficient to reject the design; a pre-trends test that passes is necessary but not sufficient (the trends could diverge in the post-period for reasons unrelated to past behavior). Card and Krueger (1994) is the canonical application: compare New Jersey fast-food employment to eastern Pennsylvania after a NJ minimum-wage increase.
Failure Mode
Heterogeneous treatment timing. When units adopt treatment at different times, the standard two-way fixed-effects (TWFE) estimator does not identify a positively weighted average of unit-level effects. Goodman-Bacon (2021) and de Chaisemartin and D'Haultfoeuille (2020) show TWFE implicitly uses already-treated units as controls for newly-treated units, contaminating the estimand. Use Callaway-Sant'Anna (2021) or stacked DiD instead.
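The contamination is easy to reproduce by hand. A minimal sketch (two units, four periods, no noise; the adoption dates and linearly growing effects are illustrative assumptions):

```python
import numpy as np

# Unit E adopts at t=1, unit L adopts at t=3; the effect grows with time
# since adoption. Baseline outcome is 0 everywhere, so every treated
# cell has a strictly positive effect.
periods = np.arange(4)
adopt = {"E": 1, "L": 3}
d = np.array([[1.0 if t >= adopt[u] else 0.0 for t in periods] for u in ("E", "L")])
effect = np.array([[t - adopt[u] + 1 if t >= adopt[u] else 0 for t in periods]
                   for u in ("E", "L")], dtype=float)
y = effect.copy()  # no unit or time shocks: outcome is pure treatment effect

true_att = effect[d == 1].mean()  # average effect over treated cells: 1.75

# Two-way fixed effects via double demeaning (exact for balanced panels).
def demean(a):
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

y_dd, d_dd = demean(y), demean(d)
twfe = (y_dd * d_dd).sum() / (d_dd ** 2).sum()
print(f"TWFE: {twfe:.2f}, average effect on treated cells: {true_att:.2f}")
# TWFE gives 0.50 even though every cell-level effect is at least 1.0:
# the already-treated early adopter serves as a "control" for the late one.
```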
Anticipation effects. If treated units adjust behavior before the policy takes effect (e.g., firms front-load hiring before a tax rise), $Y_{i0}$ for treated units already reflects treatment, breaking no-anticipation. Diagnostic: an event study should show flat coefficients in pre-periods.
Compositional changes. If the population in the treated cell changes between periods (in-migration, attrition), the same group is no longer the same units, and the difference includes selection. Diagnostic: balance pre-treatment covariates by period.
Regression discontinuity
Setting: treatment $D_i = \mathbf{1}\{X_i \geq c\}$ for a known cutoff $c$ on a running variable $X_i$. Outcome $Y_i$. The "sharp" RDD assumes deterministic assignment; the "fuzzy" RDD allows imperfect compliance and is essentially IV with the cutoff as instrument.
RDD identification of the LATE at the cutoff (Hahn-Todd-Van der Klaauw 2001)
Intuition
Units just above and just below the cutoff are nearly identical on every covariate (observed and unobserved), because crossing the cutoff was effectively a coin flip for borderline units. The treatment status is the only thing that systematically differs. Any jump in $\mathbb{E}[Y \mid X = x]$ at $x = c$ must be the causal effect of treatment, evaluated at $X = c$.
Proof Sketch
By continuity,
$$\lim_{x \uparrow c} \mathbb{E}[Y \mid X = x] = \lim_{x \uparrow c} \mathbb{E}[Y(0) \mid X = x] = \mathbb{E}[Y(0) \mid X = c].$$
The first equality holds because for $x < c$ all units are untreated, so $Y = Y(0)$, and continuity of $x \mapsto \mathbb{E}[Y(0) \mid X = x]$ allows the limit. The analogous statement holds symmetrically for $x \geq c$, where $Y = Y(1)$. Subtracting:
$$\lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x] = \mathbb{E}[Y(1) - Y(0) \mid X = c].$$
The estimand is a conditional ATE at $X = c$, not the unconditional ATE. Without further assumptions, you cannot extrapolate to units far from the cutoff.
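A minimal estimation sketch (simulated sharp RDD; the linear conditional mean, noise level, true jump of 2.0, and hand-picked bandwidth are illustrative assumptions — in practice use an MSE-optimal bandwidth):

```python
import numpy as np

rng = np.random.default_rng(0)
n, cutoff, tau, h = 20_000, 0.0, 2.0, 0.5

x = rng.uniform(-1, 1, n)                      # running variable
d = (x >= cutoff).astype(float)                # sharp assignment rule
y = 1.0 * x + tau * d + rng.normal(0, 0.3, n)  # true jump of 2.0 at the cutoff

# Local linear fit on each side within bandwidth h; the fitted intercepts
# are the boundary estimates of E[Y | X = c-] and E[Y | X = c+].
left = (x < cutoff) & (x > cutoff - h)
right = (x >= cutoff) & (x < cutoff + h)
b_left = np.polyfit(x[left] - cutoff, y[left], 1)    # [slope, intercept]
b_right = np.polyfit(x[right] - cutoff, y[right], 1)
tau_hat = b_right[1] - b_left[1]               # difference in intercepts at c
print(f"RDD estimate: {tau_hat:.3f}  (true jump = {tau})")
```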
Why It Matters
RDD is the closest thing to a randomized experiment in non-experimental policy data. The continuity assumption is mild and partially testable (McCrary 2008 density test for manipulation; covariate-balance plots for unobserved selection). Estimation in practice uses local linear or polynomial regression on each side of the cutoff with an MSE-optimal bandwidth (Calonico-Cattaneo-Titiunik 2014). Imbens and Lemieux (2008) is the canonical practitioner survey.
Failure Mode
Manipulation around the cutoff. If units can precisely control $X_i$ (e.g., test scores when teachers grade their own students; income when applying for a means-tested benefit), the density of $X$ has a jump at $c$ and units just above the cutoff differ systematically from units just below. The McCrary 2008 test detects this; a significant density discontinuity invalidates the RDD.
Compound treatments at the cutoff. If multiple programs share the same eligibility cutoff (a means-tested credit triggers eligibility for two unrelated programs), the RDD identifies the joint effect of all of them, not the policy of interest. Read the institutional details before estimating.
Bandwidth dependence. Estimates can be sensitive to bandwidth choice. Calonico-Cattaneo-Titiunik 2014 give bias-corrected confidence intervals robust to MSE-optimal bandwidth selection; report the optimal bandwidth and run sensitivity to halving and doubling it.
Functional-form artefacts. Global high-order polynomial fits (e.g., quartic on each side) introduce edge effects that masquerade as discontinuities. Gelman and Imbens (2019) show why local linear is preferred over global polynomial.
Instrumental variables and the LATE
When $D$ is endogenous (correlated with unobserved confounders), an instrument $Z$ that affects $D$ but affects $Y$ only through $D$ identifies a causal effect on a specific subpopulation. The Imbens-Angrist (1994) LATE theorem makes this precise.
Let $D_i(z)$ denote the potential treatment status under instrument value $z \in \{0, 1\}$. Each unit has a type:
- Always-takers: $D_i(0) = D_i(1) = 1$.
- Never-takers: $D_i(0) = D_i(1) = 0$.
- Compliers: $D_i(0) = 0$, $D_i(1) = 1$.
- Defiers: $D_i(0) = 1$, $D_i(1) = 0$.
Local Average Treatment Effect (Imbens-Angrist 1994)
$$\frac{\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]}{\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0]} = \mathbb{E}[Y(1) - Y(0) \mid \text{complier}].$$
Intuition
The instrument is a randomized nudge: units with are pushed toward treatment relative to . The numerator of the Wald ratio is the reduced-form effect of the nudge on outcomes (an ITT-like quantity). The denominator is the first-stage effect of the nudge on actual treatment uptake. The ratio rescales: per unit of induced treatment, what is the outcome change? Under monotonicity, the only units the nudge moves are compliers, so the ratio is the average effect on compliers.
Proof Sketch
By exogeneity and exclusion, $\mathbb{E}[Y \mid Z = z] = \mathbb{E}[Y(D(z))]$, so the numerator equals $\mathbb{E}[Y(D(1)) - Y(D(0))]$. Decompose by type:
- For always-takers, $D(1) = D(0) = 1$, so the difference is $0$.
- For never-takers, $D(1) = D(0) = 0$, so the difference is $0$.
- For compliers, $Y(D(1)) - Y(D(0)) = Y(1) - Y(0)$.
- For defiers, $Y(D(1)) - Y(D(0)) = Y(0) - Y(1) = -(Y(1) - Y(0))$.
Monotonicity rules out defiers. The numerator becomes
$$\mathbb{E}[Y(1) - Y(0) \mid \text{complier}] \cdot \Pr(\text{complier}).$$
The denominator is exactly $\Pr(\text{complier})$ under the same arguments (always-takers and never-takers contribute zero, defiers excluded). The ratio gives the LATE.
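The type decomposition can be verified on a constructed population. A minimal sketch (the type shares and per-type effects below are arbitrary illustrative choices; expectations over the randomized instrument are computed exactly over the population, so the Wald ratio matches the complier effect to machine precision):

```python
import numpy as np

# Potential treatments D(0), D(1) by type: 200 always-takers,
# 300 never-takers, 500 compliers. Monotonicity holds: no defiers.
d0 = np.array([1] * 200 + [0] * 300 + [0] * 500)
d1 = np.array([1] * 200 + [0] * 300 + [1] * 500)

# Potential outcomes: effect 5 for always-takers, -1 for never-takers,
# 2 for compliers (all illustrative).
rng = np.random.default_rng(0)
y0 = rng.normal(0, 1, 1000)
y1 = y0 + np.array([5.0] * 200 + [-1.0] * 300 + [2.0] * 500)

# With Z randomized independently of types and potential outcomes,
# E[Y | Z=z] is the population mean of Y(D(z)).
ey_z1 = np.where(d1 == 1, y1, y0).mean()
ey_z0 = np.where(d0 == 1, y1, y0).mean()
wald = (ey_z1 - ey_z0) / (d1.mean() - d0.mean())
print(f"Wald: {wald:.3f}, complier effect: 2.0, ATE: {(y1 - y0).mean():.3f}")
```

The Wald ratio returns 2.0 (the complier effect), not the ATE of 1.7: always-takers and never-takers cancel out of the numerator exactly as in the proof.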
Why It Matters
LATE is the cleanest identification result for endogenous treatments and is the workhorse of natural-experiment economics. Angrist (1990) used the Vietnam draft lottery as an instrument for veteran status; Angrist and Krueger (1991) used quarter-of-birth as an instrument for years of schooling. The result also clarifies a previously fuzzy claim: IV does not identify the ATE; it identifies a complier-specific effect whose policy relevance depends on who the compliers are. A policy that targets always-takers gains nothing from a LATE estimated off compliers.
Failure Mode
Weak instruments. If the first stage $\mathbb{E}[D \mid Z=1] - \mathbb{E}[D \mid Z=0]$ is small, the denominator is near zero and tiny violations of exclusion are amplified into large bias. Stock-Yogo (2005) tabulate first-stage F-statistic thresholds (rule of thumb: $F > 10$ for one instrument; substantially tighter cutoffs from Lee-McCrary-Moreira-Porter 2022, with weak-IV-robust inference).
Exclusion violations. $Z$ may affect $Y$ through paths other than $D$. The quarter-of-birth instrument (Angrist-Krueger 1991) is challenged on the grounds that birth season correlates with maternal characteristics that independently affect earnings.
Monotonicity violations. "Defiers" sound exotic but are common when the instrument is preference-based (a free product offer might raise uptake among most consumers but reduce it among prestige-sensitive ones). De Chaisemartin (2017) develops weaker "compliers-defiers" identification.
Heterogeneous treatment effects. Multiple-instrument 2SLS does not average LATEs in a policy-relevant way; it gives an overidentification-weighted combination of instrument-specific LATEs, often with negative weights (Mogstad-Torgovitsky-Walters 2021). Use marginal-treatment-effect (MTE) frameworks (Heckman-Vytlacil 2005) when this matters.
Beyond DiD/RDD/IV: synthetic control and double ML
Synthetic control (Abadie-Diamond-Hainmueller 2010). For a single treated unit (a country, a region), construct a weighted average of untreated units whose pre-treatment outcomes match the treated unit, then compare post-treatment trajectories. The weights are chosen to minimize pre-period outcome distance. Inference is permutation-based: re-run the procedure with each donor as a placebo treated and compare the treated-unit gap to the placebo distribution. Used to estimate the effect of California's 1988 tobacco-control program (Proposition 99); the synthetic-California control predicts what California cigarette consumption would have been absent the law.
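A minimal sketch of the weighting step (two donors and a 1-D grid search over the convex weights; the series are constructed so the treated unit is exactly a 0.6/0.4 combination of the donors pre-treatment, with a post-treatment effect of 5 — all illustrative assumptions; real applications use many donors and constrained quadratic programming, plus permutation inference):

```python
import numpy as np

# Donor outcome paths over 6 periods; treatment hits the treated unit at t=4.
donor1 = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
donor2 = np.array([20.0, 19.0, 18.0, 17.0, 16.0, 15.0])
effect = 5.0
treated = 0.6 * donor1 + 0.4 * donor2  # exact convex combination pre-treatment
treated[4:] += effect                  # policy effect in the post-period

pre, post = slice(0, 4), slice(4, 6)
# Grid search for the convex weight w on donor1 minimizing pre-period MSE.
grid = np.linspace(0, 1, 1001)
mse = [((treated[pre] - (w * donor1[pre] + (1 - w) * donor2[pre])) ** 2).mean()
       for w in grid]
w = grid[int(np.argmin(mse))]
synth = w * donor1 + (1 - w) * donor2
gap = (treated[post] - synth[post]).mean()
print(f"weight on donor1: {w:.3f}, post-period gap: {gap:.2f}")
```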
Double/debiased ML (Chernozhukov-Chetverikov-Demirer-Duflo-Hansen-Newey-Robins 2018). When you have many controls $X$ and want to estimate a low-dimensional treatment effect from a partially linear model $Y = \theta D + g(X) + \varepsilon$, naive lasso/forest/NN estimates of $g$ have first-order bias that does not vanish at rate $\sqrt{n}$. Double ML uses Neyman-orthogonal moment conditions plus cross-fitting (estimate nuisances on one fold, plug into moments on the other) to recover a $\sqrt{n}$-consistent, asymptotically normal $\hat\theta$:
$$\hat\theta = \frac{\sum_i \big(D_i - \hat{m}(X_i)\big)\big(Y_i - \hat{\ell}(X_i)\big)}{\sum_i \big(D_i - \hat{m}(X_i)\big)^2}, \qquad \hat{m}(x) \approx \mathbb{E}[D \mid X = x], \quad \hat{\ell}(x) \approx \mathbb{E}[Y \mid X = x],$$
even when the nuisance estimates converge only at rate $n^{-1/4}$. The same recipe extends to DiD, IV, and partially linear quantile models. The crucial caveat: identification still comes from the research design. Double ML buys robustness to nuisance specification, not identification.
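The recipe is short enough to sketch end to end. A minimal version (the cubic-polynomial nuisance learner, the data-generating process with $\theta = 1$, and two folds are all illustrative assumptions; real applications swap in lasso/forests and more folds):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 4000, 1.0
x = rng.uniform(-2, 2, n)
d = 0.5 * x + rng.normal(0, 1, n)              # treatment depends on x
y = theta * d + x ** 2 + rng.normal(0, 1, n)   # g(x) = x^2

def fit_predict(x_tr, t_tr, x_te):
    """Nuisance learner: cubic polynomial regression (stand-in for lasso/forest)."""
    return np.polyval(np.polyfit(x_tr, t_tr, 3), x_te)

# Cross-fitting: estimate nuisances on one fold, residualize on the other.
idx = rng.permutation(n)
folds = [idx[: n // 2], idx[n // 2:]]
num = den = 0.0
for tr, te in [(folds[0], folds[1]), (folds[1], folds[0])]:
    d_res = d[te] - fit_predict(x[tr], d[tr], x[te])   # D - E[D|X]
    y_res = y[te] - fit_predict(x[tr], y[tr], x[te])   # Y - E[Y|X]
    num += (d_res * y_res).sum()
    den += (d_res ** 2).sum()
theta_hat = num / den
print(f"theta_hat: {theta_hat:.3f}  (true theta = {theta})")
```

Regressing the outcome residual on the treatment residual is the orthogonal (partialling-out) moment; errors in either nuisance fit enter the estimate only through their product, which is why slow nuisance rates are tolerable.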
Common Confusions
Parallel trends is not parallel levels
DiD does not require treated and control units to have the same pre-period level of the outcome. It requires their counterfactual trends in the post-period to be the same. Mechanical convergence or divergence that predates treatment violates the assumption even when levels match at baseline, and matching levels tells you nothing about whether trends would have continued in parallel.
LATE is not the ATE, the ATT, or the policy effect
The Wald estimand identifies the average effect on compliers under the chosen instrument. Always-takers and never-takers contribute zero; their effects are not identified at all by IV with this instrument. A different instrument induces a different complier subpopulation and identifies a different LATE. "The IV estimate is 0.3" only means something once you know who the compliers are.
Double ML does not create identification
Replacing OLS controls with a gradient-boosted residualizer does not turn an observational study into a causal one. If the treatment-assignment mechanism is not captured by the controls, the estimate is biased regardless of how flexible the nuisance model is. Double ML buys robustness to nuisance specification, not identification.
RDD identifies a local effect
The RDD estimand is the conditional ATE at $X = c$. Extrapolating to units far from the cutoff requires extra structure (a parametric model of $\mathbb{E}[Y(d) \mid X = x]$, or marginal-treatment-effect machinery). A popular policy that helps people just under the eligibility cutoff might harm people far below it, and the RDD will not detect that.
An insignificant pre-trends test does not save you
A common mistake is to run a pre-trends F-test, fail to reject, and treat that as evidence that parallel trends holds. Pre-trends tests have low power against the kinds of small, persistent divergences that bias DiD estimates (Roth 2022 makes this quantitative). An insignificant pre-trends test is necessary but never sufficient.
Worked Example: minimum wage as DiD
Card and Krueger (1994) compared full-time-equivalent (FTE) employment at fast-food restaurants in New Jersey and eastern Pennsylvania before and after NJ's April 1992 minimum-wage increase from $4.25 to $5.05 per hour.
| Group | FTE Feb 1992 | FTE Nov 1992 | Change |
|---|---|---|---|
| New Jersey (treated) | 20.4 | 21.0 | +0.6 |
| Eastern PA (control) | 23.3 | 21.2 | -2.1 |
DiD estimate: $(21.0 - 20.4) - (21.2 - 23.3) = +2.7$ FTE per restaurant. Interpreted causally under parallel trends: the minimum-wage increase raised rather than lowered NJ employment, contradicting the textbook supply-demand prediction. The result was contested empirically (Neumark-Wascher 2000 reanalyzed payroll data and found a negative effect) and methodologically (sensitivity to the control area), and the broader minimum-wage literature has converged on small employment effects with substantial heterogeneity.
The methodological lesson is independent of the substantive conclusion: the design made its identifying assumption explicit. Critics could attack parallel trends, propose alternative controls, or question SUTVA (cross-border employment shifts), and the debate stayed about identification rather than functional form.
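The arithmetic of the 2x2 table, as a check (numbers taken from the table above):

```python
# Difference-in-differences from the Card-Krueger 2x2 table.
nj_feb, nj_nov = 20.4, 21.0    # New Jersey (treated)
pa_feb, pa_nov = 23.3, 21.2    # Eastern PA (control)

did = (nj_nov - nj_feb) - (pa_nov - pa_feb)
print(f"DiD: {did:+.1f} FTE per restaurant")  # prints "DiD: +2.7 FTE per restaurant"
```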
Problem
You estimate a DiD on the minimum-wage data and get $+2.7$ FTE per restaurant. A reviewer points out that NJ and eastern PA had different pre-period employment trends in 1989-1991: NJ was rising at $g$ FTE/year while eastern PA was flat. Re-derive the DiD identification proof under the assumption that the pre-trend gap of $g$ FTE/year would have continued in 1992 absent treatment, and adjust the DiD estimate accordingly. Then state precisely what assumption the adjustment requires beyond plain parallel trends.
Problem
A school district admits students into a gifted-and-talented program if and only if their entrance test score $X$ is at least $c$. Five years later, you observe high-school GPA $Y$. You estimate an RDD with local linear regression on each side of $c$, MSE-optimal bandwidth, and find $\hat\tau = +0.4$ GPA points.
(a) State precisely what causal quantity $\hat\tau$ estimates. (b) A colleague claims this means "expanding the program to students below $c$ would raise their GPA by 0.4". Why is this wrong? (c) The McCrary density test on $X$ shows a small but statistically significant jump at $c$. What threats does this raise, and what should you check?
Problem
Show by example that monotonicity in the LATE theorem is essential, not a technicality. Construct a population with always-takers, never-takers, compliers, and a small number of defiers, with explicit potential outcomes, such that the Wald estimand recovers a number that is not the average treatment effect on any subpopulation of interest (not the ATE, not the ATT, not the LATE on compliers).
Then explain why de Chaisemartin's (2017) "compliers-defiers" framework can still extract policy-relevant information from such a population, and what extra assumption it needs.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Causal Inference Basics (Layer 3)
- Hypothesis Testing for ML (Layer 2)
- Causal Inference and the Ladder of Causation (Layer 3)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in Rn (Layer 0A)