Statistical Foundations
Longitudinal Surveys and Panel Data
Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding.
Why This Matters
Cross-sectional data gives you a snapshot: differences between people at one point in time. Longitudinal data gives you a movie: changes within the same person over time. This distinction is critical for causal inference because cross-sectional differences confound within-person changes with between-person differences.
If you observe that people who exercise more earn more, is it because exercise increases earnings, or because healthier people (who exercise more) also tend to be better educated? Cross-sectional data cannot separate these explanations. Longitudinal data can, by tracking the same person over time and asking: when this person starts exercising more, do their earnings change?
Mental Model
You observe $N$ units (people, firms, countries) at $T$ time points. The data is $(y_{it}, x_{it})$ for unit $i$ and time $t$. Each unit has unobserved characteristics $\alpha_i$ (ability, motivation, genetics) that are constant over time but vary across units. The question is how to handle these unobserved unit-specific effects.
Core Definitions
Panel Data
Panel data (also called longitudinal data) consists of observations on the same set of $N$ units across $T$ time periods. A balanced panel has observations for all units at all time periods ($N \times T$ observations total). An unbalanced panel has some missing observations due to attrition, late entry, or intermittent nonresponse.
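Whether a panel is balanced can be checked mechanically. The sketch below assumes the data arrive as (unit, period) pairs; that layout and the function name `is_balanced` are illustrative choices, not part of any particular survey's format.

```python
from collections import defaultdict

def is_balanced(records):
    """Return True if every unit is observed in every period present in the data.

    `records` is an iterable of (unit_id, period) pairs (an assumed layout).
    """
    periods_by_unit = defaultdict(set)
    all_periods = set()
    for unit, period in records:
        periods_by_unit[unit].add(period)
        all_periods.add(period)
    return all(seen == all_periods for seen in periods_by_unit.values())

balanced = [(i, t) for i in range(3) for t in range(4)]  # 3 units x 4 periods = 12 rows
unbalanced = [r for r in balanced if r != (2, 3)]        # unit 2 misses the last period

print(is_balanced(balanced))    # True
print(is_balanced(unbalanced))  # False
```

Note that this check only sees units that appear at least once; a unit that never responds is invisible to it.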
Cross-Sectional vs. Longitudinal Design
A (repeated) cross-sectional design samples different units at each time point. It can track population-level changes but cannot identify individual-level changes. A longitudinal design follows the same units over time. It can separate within-unit change from between-unit differences.
Repeated cross-sections (like the Current Population Survey) sample different people each month. Panel surveys (like the PSID or NLSY) follow the same people for years or decades.
The Panel Data Model
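The standard linear panel data model is:
$$y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it}$$
where $y_{it}$ is the outcome for unit $i$ at time $t$, $x_{it}$ are observed time-varying covariates, $\alpha_i$ is the unobserved unit-specific effect, and $\varepsilon_{it}$ is the idiosyncratic error with $E[\varepsilon_{it} \mid x_{i1}, \dots, x_{iT}, \alpha_i] = 0$.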
The central question: is $\alpha_i$ correlated with $x_{it}$?
Fixed Effects
Fixed Effects Model
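The fixed effects (FE) model treats $\alpha_i$ as an arbitrary unit-specific constant that may be correlated with $x_{it}$. Estimation proceeds by removing $\alpha_i$ through the within transformation: subtract the unit mean from each variable.
$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$$
where $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$, and $\bar{x}_i$ and $\bar{\varepsilon}_i$ are defined analogously. This "demeans" the data, eliminating $\alpha_i$. OLS on the demeaned data gives the within estimator $\hat{\beta}_{FE}$.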
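The within transformation can be sketched on simulated data. The example below is a minimal illustration with one regressor, assuming a data-generating process in which the regressor is correlated with the unit effect by construction; the sample sizes, the true slope of 2.0, and the helper `slope` are all hypothetical choices made for the demonstration.

```python
import random

random.seed(0)
N, T, beta = 500, 5, 2.0  # hypothetical panel size and true slope

rows = []  # (unit, x, y), stored unit by unit
for i in range(N):
    alpha = random.gauss(0, 1)          # unobserved unit effect
    for t in range(T):
        x = alpha + random.gauss(0, 1)  # x is correlated with alpha by construction
        y = beta * x + alpha + random.gauss(0, 1)
        rows.append((i, x, y))

def slope(xs, ys):
    """OLS slope of ys on xs, with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Pooled OLS ignores the unit effect and picks up its correlation with x.
pooled = slope([r[1] for r in rows], [r[2] for r in rows])

# Within transformation: subtract each unit's own mean from x and y.
x_dm, y_dm = [], []
for i in range(N):
    unit = rows[i * T:(i + 1) * T]
    xbar = sum(r[1] for r in unit) / T
    ybar = sum(r[2] for r in unit) / T
    x_dm += [r[1] - xbar for r in unit]
    y_dm += [r[2] - ybar for r in unit]
within = slope(x_dm, y_dm)

print(f"pooled OLS: {pooled:.2f}, within (FE): {within:.2f}, true beta: {beta}")
```

Under this design the pooled slope should land well above 2.0 while the within slope should be close to it, because demeaning removes the unit effect that pooled OLS absorbs into the coefficient.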
Random Effects
Random Effects Model
The random effects (RE) model assumes $\alpha_i$ is a random variable uncorrelated with $x_{it}$: $\mathrm{Cov}(\alpha_i, x_{it}) = 0$. Under this assumption, the model is a linear model with a compound error $v_{it} = \alpha_i + \varepsilon_{it}$. GLS estimation exploits the error structure to produce an estimator that is more efficient than FE (it uses both within and between variation).
The RE estimator is a matrix-weighted average of the within (FE) and between estimators. It is more efficient than FE when the RE assumption holds, but inconsistent when it does not.
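One common way to make the weighted-average statement concrete is the quasi-demeaning form of RE GLS. The expression below is a sketch under the textbook assumptions of a balanced panel with homoskedastic, serially uncorrelated errors; the variance symbols $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$ (variances of the unit effect and the idiosyncratic error) and the weight $\theta$ are notation introduced here.

```latex
y_{it} - \theta \bar{y}_i
  = (x_{it} - \theta \bar{x}_i)'\beta
  + \bigl[(1 - \theta)\alpha_i + (\varepsilon_{it} - \theta \bar{\varepsilon}_i)\bigr],
\qquad
\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_\alpha^2}}
```

With $\theta = 0$ this is pooled OLS, and with $\theta = 1$ it is the within (FE) transformation. RE sits in between, moving toward FE as $T$ grows or as the unit effect accounts for more of the error variance.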
Main Theorems
Consistency of the Fixed Effects Estimator
Statement
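Under the panel model with strict exogeneity $E[\varepsilon_{it} \mid x_{i1}, \dots, x_{iT}, \alpha_i] = 0$ and the rank condition $\mathrm{rank}\, E\left[\sum_{t=1}^{T} \ddot{x}_{it}\ddot{x}_{it}'\right] = K$ (where $\ddot{x}_{it} = x_{it} - \bar{x}_i$ and $K$ is the number of covariates), the within estimator is consistent for $\beta$ as $N \to \infty$ with $T$ fixed:
$$\hat{\beta}_{FE} \xrightarrow{p} \beta$$
This holds regardless of whether $\alpha_i$ is correlated with $x_{it}$.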
Intuition
By subtracting unit means, the within transformation removes all time-invariant confounders (observed or unobserved). What remains is purely within-unit variation: how changes in $x_{it}$ for a given unit relate to changes in $y_{it}$ for that same unit. This eliminates selection bias due to time-invariant unobservables.
Proof Sketch
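After the within transformation, $\ddot{y}_{it} = \ddot{x}_{it}'\beta + \ddot{\varepsilon}_{it}$, where a double dot denotes deviation from the unit mean. Since $\alpha_i$ has been differenced out, OLS on the demeaned equation identifies $\beta$. By the law of large numbers (applied as $N \to \infty$), $\hat{\beta}_{FE} \xrightarrow{p} \beta$ because $E[\ddot{x}_{it}\ddot{\varepsilon}_{it}] = 0$ follows from strict exogeneity.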
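The step from the demeaned equation to consistency can be written out explicitly. The decomposition below is the standard OLS sampling-error expansion applied to demeaned data, with a double dot denoting deviation from the unit mean (for example $\ddot{x}_{it} = x_{it} - \bar{x}_i$).

```latex
\hat{\beta}_{FE}
  = \Bigl(\sum_{i=1}^{N}\sum_{t=1}^{T} \ddot{x}_{it}\ddot{x}_{it}'\Bigr)^{-1}
    \sum_{i=1}^{N}\sum_{t=1}^{T} \ddot{x}_{it}\ddot{y}_{it}
  = \beta
  + \Bigl(\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \ddot{x}_{it}\ddot{x}_{it}'\Bigr)^{-1}
    \Bigl(\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \ddot{x}_{it}\ddot{\varepsilon}_{it}\Bigr)
```

As $N \to \infty$, the first average converges to a matrix that is invertible under the rank condition, and the second converges to $\sum_t E[\ddot{x}_{it}\ddot{\varepsilon}_{it}] = 0$, so the second term vanishes in probability.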
Why It Matters
Fixed effects identification is one of the most powerful tools in applied economics and social science. It controls for all time-invariant confounders without needing to observe or measure them. This is why panel data is so valuable for causal inference: if the confounders are fixed characteristics of units, FE eliminates them.
Failure Mode
FE cannot identify the effect of time-invariant variables (gender, race, country of birth) because these are absorbed into $\alpha_i$. FE requires strict exogeneity, which fails with lagged dependent variables ($x_{it}$ includes $y_{i,t-1}$) or feedback effects. With small $T$, the incidental parameters problem biases nonlinear FE models that estimate the unit intercepts directly (for example logit and probit; the Poisson FE estimator is a known exception). FE is also inefficient if $\alpha_i$ is actually uncorrelated with $x_{it}$, in which case RE is better.
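The first failure mode is easy to see numerically. The toy panel below is hypothetical (three units, four periods, one time-invariant covariate `z`); it shows that demeaning leaves no variation in such a covariate.

```python
# Hypothetical toy panel: 3 units, 4 periods, one time-invariant covariate z_i.
T = 4
z_by_unit = {"a": 1.0, "b": 0.0, "c": 1.0}

def demean(values):
    """Subtract the unit mean from each of the unit's observations."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# z_i takes the same value in every period, so each unit's series is constant.
z_demeaned = {unit: demean([z] * T) for unit, z in z_by_unit.items()}
print(z_demeaned)

# Every demeaned value is zero: the regressor has no within-unit variation,
# so the within estimator has nothing to estimate its coefficient from.
total_within_variation = sum(v ** 2 for vals in z_demeaned.values() for v in vals)
print(total_within_variation)  # 0.0
```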
Difference-in-Differences
Difference-in-Differences (DiD)
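Difference-in-differences is a method for estimating causal effects from panel data with a treatment that affects some units but not others at a specific time. With two periods ($T = 2$) and two groups (treated, control):
$$\hat{\tau}_{DiD} = (\bar{y}_{\text{treated, post}} - \bar{y}_{\text{treated, pre}}) - (\bar{y}_{\text{control, post}} - \bar{y}_{\text{control, pre}})$$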
The first difference removes unit-specific time-invariant confounders. The second difference removes common time trends. The identifying assumption is parallel trends: absent treatment, the treated and control groups would have had the same time trend.
In a two-period, two-group setting, DiD is equivalent to a regression with unit fixed effects, a period dummy, and a treatment dummy. It generalizes to multiple periods and staggered treatment adoption, though recent research shows the generalization requires care (see de Chaisemartin & D'Haultfoeuille, 2020).
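The two-by-two calculation can be sketched in a few lines. The group means below are made-up numbers for illustration, and the function name `did` is a hypothetical helper. The second half shows the equivalent saturated regression with a group dummy, a period dummy, and their interaction, whose coefficients can be read directly off the four cell means.

```python
# Hypothetical group means (not from any real study).
means = {
    ("treated", "pre"): 20.0,
    ("treated", "post"): 27.0,
    ("control", "pre"): 15.0,
    ("control", "post"): 19.0,
}

def did(m):
    """Difference-in-differences from four group-by-period means."""
    change_treated = m[("treated", "post")] - m[("treated", "pre")]
    change_control = m[("control", "post")] - m[("control", "pre")]
    return change_treated - change_control

print(did(means))  # 3.0

# Equivalent saturated regression y = a + b*treated + c*post + d*treated*post.
# With four cells and four coefficients, the fit reproduces the cell means exactly.
a = means[("control", "pre")]                 # baseline level
b = means[("treated", "pre")] - a             # pre-existing group gap
c = means[("control", "post")] - a            # common time change
d = means[("treated", "post")] - a - b - c    # interaction coefficient
print(d)  # 3.0
```

The interaction coefficient equals the DiD estimate. The pre-existing gap `b` is differenced out rather than assumed to be zero, which is why level differences between groups do not by themselves violate parallel trends.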
Attrition
Attrition is the defining practical problem of longitudinal studies. People move, die, refuse to participate, or become unreachable. If attrition is related to the outcome, the remaining sample is not representative of the original sample.
Testing for attrition bias: compare baseline characteristics of stayers vs. leavers. If they differ, attrition is non-random. Corrections include inverse probability weighting (weight remaining observations by the inverse of their probability of staying) and multiple imputation.
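A minimal sketch of inverse probability weighting follows, under a strong assumption: attrition depends only on a baseline trait that is observed for everyone, so stayers and leavers within a baseline group have the same outcome distribution. The trait `low_income`, the retention rates, and the outcome distribution are all invented for the simulation.

```python
import random

random.seed(1)
n = 20000
people = []  # (low_income, y, stays)
for _ in range(n):
    low_income = random.random() < 0.5                       # baseline trait, observed for all
    y = random.gauss(10 if low_income else 20, 2)            # later-wave outcome
    stays = random.random() < (0.4 if low_income else 0.9)   # attrition depends on the trait
    people.append((low_income, y, stays))

# Full-sample mean: unavailable in a real study, computed here only as a benchmark.
full_mean = sum(p[1] for p in people) / n

# Naive estimate: average over whoever remained.
stayers = [p for p in people if p[2]]
naive_mean = sum(p[1] for p in stayers) / len(stayers)

# Estimate the probability of staying within each baseline group.
p_stay = {}
for g in (True, False):
    group = [p for p in people if p[0] == g]
    p_stay[g] = sum(p[2] for p in group) / len(group)

# Weight each stayer by the inverse of their estimated probability of staying.
weights = [1 / p_stay[p[0]] for p in stayers]
ipw_mean = sum(w * p[1] for w, p in zip(weights, stayers)) / sum(weights)

print(f"full: {full_mean:.2f}, naive: {naive_mean:.2f}, IPW: {ipw_mean:.2f}")
```

In this design the naive mean overstates the full-sample mean because the low-outcome group drops out more often, and the weighted mean moves back toward the benchmark. If attrition also depended on the outcome itself after conditioning on baseline traits, this correction would not be enough.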
Major Panel Surveys
- PSID (Panel Study of Income Dynamics): U.S. families, since 1968. The longest-running household panel survey in the world.
- NLSY (National Longitudinal Survey of Youth): two cohorts (1979, 1997) of U.S. youth tracked into adulthood.
- LISS (Longitudinal Internet Studies for the Social Sciences): Dutch probability-based internet panel.
- BHPS/Understanding Society: UK households. The BHPS is now part of Understanding Society, the UK Household Longitudinal Study.
- SOEP (German Socio-Economic Panel): German households since 1984.
Common Confusions
Fixed effects does not mean the effects are fixed
The name is confusing. "Fixed effects" means the unit-specific intercepts $\alpha_i$ are treated as fixed (non-random) parameters. It does not mean the regression coefficients are fixed or non-varying. The alternative, "random effects," treats $\alpha_i$ as draws from a distribution.
The Hausman test is not a test of whether to use FE or RE
The Hausman test checks whether the RE and FE estimates are statistically different. If they are, this suggests $\mathrm{Cov}(\alpha_i, x_{it}) \neq 0$ and RE is inconsistent. But a non-significant Hausman test does not prove $\mathrm{Cov}(\alpha_i, x_{it}) = 0$. It may just lack power. In practice, if you have reason to believe there are unobserved confounders correlated with regressors, use FE regardless of the Hausman test.
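For reference, the classical form of the statistic is sketched below. It assumes RE is fully efficient under the null (homoskedastic, serially uncorrelated errors); $\hat{V}_{FE}$ and $\hat{V}_{RE}$ denote the estimated covariance matrices of the two estimators, and $K$ is the number of time-varying coefficients being compared. These symbols are notation introduced here.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \bigl[\hat{V}_{FE} - \hat{V}_{RE}\bigr]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
  \;\xrightarrow{d}\; \chi^2_K
  \quad \text{under } H_0
```

A large $H$ means the two estimates differ by more than sampling noise would explain. A small $H$ can arise either because the RE assumption holds or because both estimators are too imprecise to tell apart, which is the power problem noted above.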
Panel data does not automatically solve endogeneity
FE controls for time-invariant confounders. It does not control for time-varying confounders. If an omitted variable changes over time and is correlated with $x_{it}$, FE does not eliminate the bias. Panel data helps, but it is not a cure-all for endogeneity.
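A small simulation illustrates the point. It assumes a hypothetical data-generating process with a true slope of 2.0, a time-invariant confounder `alpha`, and a time-varying confounder `w` that is left out of the regression; all of these are invented for the demonstration.

```python
import random

random.seed(2)
N, T, beta = 500, 5, 2.0  # hypothetical panel size and true slope

x_dm, y_dm = [], []
for i in range(N):
    alpha = random.gauss(0, 1)      # time-invariant confounder, removed by demeaning
    xs, ys = [], []
    for t in range(T):
        w = random.gauss(0, 1)      # time-varying confounder, omitted from the regression
        x = alpha + w + random.gauss(0, 1)
        y = beta * x + alpha + w + random.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    xbar, ybar = sum(xs) / T, sum(ys) / T
    x_dm += [v - xbar for v in xs]
    y_dm += [v - ybar for v in ys]

# Within estimator with one regressor. Demeaned data sum to zero within each
# unit, so no intercept is needed.
within = sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)
print(f"within (FE): {within:.2f}, true beta: {beta}")
```

Demeaning removes `alpha` but leaves the within-unit variation in `w`, so the within slope should sit noticeably above 2.0 in this design.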
Summary
- Panel data tracks the same units over time, enabling within-unit comparisons
- Fixed effects removes all time-invariant confounders by demeaning
- Random effects is more efficient but requires $\alpha_i$ uncorrelated with $x_{it}$
- Difference-in-differences uses parallel trends to identify causal effects
- Attrition is the major practical threat: dropouts are rarely random
- FE cannot identify effects of time-invariant variables
Exercises
Problem
You have a panel of 500 workers observed over 5 years. You regress log wages on years of education using OLS, FE, and RE. The OLS coefficient is 0.10, the RE coefficient is 0.08, and the FE coefficient is 0.04. Interpret the differences. Why is the FE estimate smallest?
Problem
A policy is implemented in state A in 2020 but not in state B. Average outcomes are: State A pre-2020: 50, State A post-2020: 58, State B pre-2020: 45, State B post-2020: 48. Compute the DiD estimate. State the parallel trends assumption in plain English. Give one reason it might fail.
References
Canonical:
- Wooldridge, Econometric Analysis of Cross Section and Panel Data (2010), Chapters 10-11
- Hsiao, Analysis of Panel Data (2014), Chapters 2-4
Current:
- Angrist & Pischke, Mostly Harmless Econometrics (2009), Chapter 5
- de Chaisemartin & D'Haultfoeuille, "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects" (2020), AER
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
- Small area estimation: borrowing strength across subpopulations
- Nonresponse and missing data: handling attrition formally
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in Rn (Layer 0A)