Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Statistical Foundations

Longitudinal Surveys and Panel Data

Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding.

Advanced · Tier 3 · Stable · ~50 min

Why This Matters

Cross-sectional data gives you a snapshot: differences between people at one point in time. Longitudinal data gives you a movie: changes within the same person over time. This distinction is critical for causal inference because a cross-sectional comparison cannot tell within-person change apart from pre-existing between-person differences.

If you observe that people who exercise more earn more, is it because exercise increases earnings, or because healthier people (who exercise more) also tend to be better educated? Cross-sectional data cannot separate these explanations. Longitudinal data can help, by tracking the same person over time and asking: when this person starts exercising more, do their earnings change?

Mental Model

You observe $N$ units (people, firms, countries) at $T$ time points. The data are $\{y_{it}, x_{it}\}$ for unit $i = 1, \ldots, N$ and time $t = 1, \ldots, T$. Each unit has unobserved characteristics $\alpha_i$ (ability, motivation, genetics) that are constant over time but vary across units. The question is how to handle these unobserved unit-specific effects.

Core Definitions

Definition

Panel Data

Panel data (also called longitudinal data) consists of observations on the same set of units across multiple time periods. A balanced panel has observations for all $N$ units at all $T$ time periods ($NT$ observations total). An unbalanced panel has some missing observations due to attrition, late entry, or intermittent nonresponse.

Definition

Cross-Sectional vs. Longitudinal Design

A cross-sectional design samples units at a single point in time. A repeated cross-sectional design draws a fresh sample of units at each time point: it can track population-level changes but cannot identify individual-level changes. A longitudinal design follows the same units over time. It can separate within-unit change from between-unit differences.

Repeated cross-sections (like the Current Population Survey) sample different people each month. Panel surveys (like the PSID or NLSY) follow the same people for years or decades.

The Panel Data Model

The standard linear panel data model is:

$$y_{it} = x_{it}^T \beta + \alpha_i + \epsilon_{it}$$

where $y_{it}$ is the outcome for unit $i$ at time $t$, $x_{it}$ are observed time-varying covariates, $\alpha_i$ is the unobserved unit-specific effect, and $\epsilon_{it}$ is the idiosyncratic error with $\mathbb{E}[\epsilon_{it} \mid x_{i1}, \ldots, x_{iT}, \alpha_i] = 0$.

The central question: is $\alpha_i$ correlated with $x_{it}$?

Fixed Effects

Definition

Fixed Effects Model

The fixed effects (FE) model treats $\alpha_i$ as an arbitrary unit-specific constant that may be correlated with $x_{it}$. Estimation proceeds by removing $\alpha_i$ through the within transformation: subtract the unit mean from each variable.

$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)^T \beta + (\epsilon_{it} - \bar{\epsilon}_i)$$

where $\bar{y}_i = \frac{1}{T}\sum_t y_{it}$. This "demeans" the data, eliminating $\alpha_i$. OLS on the demeaned data gives the within estimator $\hat{\beta}_{\text{FE}}$.
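The within transformation can be illustrated with a small simulation. This is a minimal sketch, not a production estimator: the data-generating process, the sample sizes, and the helper names `within_estimator` and `pooled_ols` are all assumptions made for illustration, and the regressor is a scalar to keep the algebra visible.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 500, 5, 2.0  # hypothetical panel dimensions and true slope

# Unit effect alpha_i, deliberately correlated with the regressor x_it.
alpha = rng.normal(size=N)
x = alpha[:, None] + rng.normal(size=(N, T))           # shape (N, T)
y = beta * x + alpha[:, None] + rng.normal(size=(N, T))

def within_estimator(x, y):
    """Within (FE) estimator for a scalar regressor, from (N, T) arrays."""
    x_dd = x - x.mean(axis=1, keepdims=True)   # subtract each unit's mean
    y_dd = y - y.mean(axis=1, keepdims=True)
    return (x_dd * y_dd).sum() / (x_dd ** 2).sum()

def pooled_ols(x, y):
    """Pooled OLS slope with a common intercept, ignoring alpha_i."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc ** 2).sum()

beta_fe = within_estimator(x, y)
beta_ols = pooled_ols(x, y)
print(f"FE: {beta_fe:.3f}, pooled OLS: {beta_ols:.3f}, true beta: {beta}")
```

In this setup the pooled OLS slope is pulled away from the true value because $\alpha_i$ enters both $x_{it}$ and $y_{it}$, while the demeaned regression is not affected by $\alpha_i$.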

Random Effects

Definition

Random Effects Model

The random effects (RE) model assumes $\alpha_i$ is a random variable uncorrelated with $x_{it}$: $\text{Cov}(\alpha_i, x_{it}) = 0$. Under this assumption, the model is a linear model with a compound error $\alpha_i + \epsilon_{it}$. GLS estimation exploits the error structure to produce an estimator that is more efficient than FE (it uses both within and between variation).

The RE estimator is a matrix-weighted average of the within (FE) and between estimators. It is more efficient than FE when the RE assumption holds, but inconsistent when it does not.
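One way to see the "weighted average" claim is the quasi-demeaning form of the RE (GLS) estimator. The sketch below assumes a balanced panel and introduces two symbols that do not appear in the text above: $\sigma_\alpha^2 = \text{Var}(\alpha_i)$ and $\sigma_\epsilon^2 = \text{Var}(\epsilon_{it})$.

```latex
\[
y_{it} - \theta \bar{y}_i
  = (x_{it} - \theta \bar{x}_i)^T \beta
  + (1 - \theta)\alpha_i
  + (\epsilon_{it} - \theta \bar{\epsilon}_i),
\qquad
\theta = 1 - \sqrt{\frac{\sigma_\epsilon^2}{\sigma_\epsilon^2 + T\sigma_\alpha^2}}
\]
```

OLS on the quasi-demeaned data gives the RE estimator. When $\theta = 0$ (no unit-level variance) it reduces to pooled OLS, and as $\theta \to 1$ (large $T$ or large $\sigma_\alpha^2$) it approaches the within estimator. In practice $\theta$ has to be estimated from the data.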

Main Theorems

Theorem

Consistency of the Fixed Effects Estimator

Statement

Under the panel model $y_{it} = x_{it}^T\beta + \alpha_i + \epsilon_{it}$ with strict exogeneity $\mathbb{E}[\epsilon_{it} \mid x_{i1}, \ldots, x_{iT}, \alpha_i] = 0$ and $\text{rank}\left(\sum_t \mathbb{E}[\ddot{x}_{it}\ddot{x}_{it}^T]\right) = k$ (where $\ddot{x}_{it} = x_{it} - \bar{x}_i$ and $k$ is the number of regressors), the within estimator is consistent for $\beta$ as $N \to \infty$ with $T$ fixed:

$$\hat{\beta}_{\text{FE}} = \left(\sum_{i=1}^N \sum_{t=1}^T \ddot{x}_{it}\ddot{x}_{it}^T\right)^{-1} \sum_{i=1}^N \sum_{t=1}^T \ddot{x}_{it}\ddot{y}_{it} \xrightarrow{p} \beta$$

This holds regardless of whether $\alpha_i$ is correlated with $x_{it}$.

Intuition

By subtracting unit means, the within transformation removes all time-invariant confounders (observed or unobserved). What remains is purely within-unit variation: how changes in $x_{it}$ for a given unit $i$ relate to changes in $y_{it}$ for that same unit. This eliminates selection bias due to time-invariant unobservables.

Proof Sketch

After the within transformation, $\ddot{y}_{it} = \ddot{x}_{it}^T\beta + \ddot{\epsilon}_{it}$. Since $\alpha_i$ has been removed by demeaning, OLS on the demeaned equation identifies $\beta$. By the law of large numbers (applied as $N \to \infty$), $\hat{\beta}_{\text{FE}} \xrightarrow{p} \beta$ because $\mathbb{E}[\ddot{x}_{it}\ddot{\epsilon}_{it}] = 0$ follows from strict exogeneity.
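The step from the demeaned equation to consistency can be written out as follows. This is a sketch of the standard argument in the notation of the theorem, not a full proof.

```latex
\[
\hat{\beta}_{\text{FE}} - \beta
  = \left(\frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \ddot{x}_{it}\ddot{x}_{it}^T\right)^{-1}
    \left(\frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \ddot{x}_{it}\ddot{\epsilon}_{it}\right)
\]
\[
\sum_{t=1}^T \ddot{x}_{it}\ddot{\epsilon}_{it}
  = \sum_{t=1}^T \ddot{x}_{it}\epsilon_{it}
  \quad \text{since } \sum_{t=1}^T \ddot{x}_{it} = 0
\]
```

Because $\ddot{x}_{it}$ contains $\bar{x}_i$, it depends on $x_{is}$ for every period $s$. The second factor therefore has mean zero only if $\epsilon_{it}$ is uncorrelated with the regressors in all periods, which is why strict exogeneity is needed and contemporaneous exogeneity alone is not enough. The rank condition ensures the first factor converges to an invertible matrix.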

Why It Matters

Fixed effects identification is one of the most powerful tools in applied economics and social science. It controls for all time-invariant confounders without needing to observe or measure them. This is why panel data is so valuable for causal inference: if the confounders are fixed characteristics of units, FE eliminates them.

Failure Mode

FE cannot identify the effect of time-invariant variables (gender, race, country of birth) because these are absorbed into $\alpha_i$. FE requires strict exogeneity, which fails with lagged dependent variables ($x_{it}$ includes $y_{i,t-1}$) or feedback effects. With small $T$, the incidental parameters problem biases nonlinear FE models such as logit and probit (the Poisson model is a well-known exception). FE is also inefficient if $\alpha_i$ is actually uncorrelated with $x_{it}$, in which case RE is better.

Difference-in-Differences

Definition

Difference-in-Differences (DiD)

Difference-in-differences is a method for estimating causal effects from panel data with a treatment that affects some units but not others at a specific time. With two periods ($t = 0, 1$) and two groups (treated, control):

$$\hat{\delta}_{\text{DiD}} = (\bar{y}_{1,\text{treated}} - \bar{y}_{0,\text{treated}}) - (\bar{y}_{1,\text{control}} - \bar{y}_{0,\text{control}})$$

The first difference removes unit-specific time-invariant confounders. The second difference removes common time trends. The identifying assumption is parallel trends: absent treatment, the treated and control groups would have had the same time trend.

In a two-period, two-group setting, DiD is equivalent to a fixed effects regression with unit and period effects and a treatment dummy. It generalizes to multiple periods and staggered treatment adoption, though recent research shows the generalization requires care (see de Chaisemartin & D'Haultfoeuille, 2020).
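The two-period, two-group case can be checked numerically. The sketch below simulates a balanced panel under parallel trends and computes the DiD estimate twice: from the four group-period means, and as the interaction coefficient in an OLS regression on group and period dummies. All numbers and the function names `did_from_means` and `did_from_regression` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000       # units per group (hypothetical)
delta = 1.5    # true treatment effect, assumed for the simulation

treated = np.repeat([0, 1], n)                        # group indicator per unit
unit_fx = rng.normal(size=2 * n) + 2.0 * treated      # treated group starts higher
y0 = unit_fx + rng.normal(size=2 * n)                 # period 0 outcomes
# Period 1: common trend of 0.7 for everyone, plus delta for the treated group.
y1 = unit_fx + 0.7 + delta * treated + rng.normal(size=2 * n)

def did_from_means(y0, y1, treated):
    """Difference of the two before/after changes in group means."""
    t = treated == 1
    return (y1[t].mean() - y0[t].mean()) - (y1[~t].mean() - y0[~t].mean())

def did_from_regression(y0, y1, treated):
    """OLS of y on [1, treated, post, treated*post]; returns the interaction."""
    y = np.concatenate([y0, y1])
    d = np.concatenate([treated, treated]).astype(float)
    post = np.concatenate([np.zeros_like(y0), np.ones_like(y1)])
    X = np.column_stack([np.ones_like(y), d, post, d * post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[3]

did_means = did_from_means(y0, y1, treated)
did_reg = did_from_regression(y0, y1, treated)
print(f"DiD from means: {did_means:.3f}, from regression: {did_reg:.3f}")
```

The two numbers agree up to floating-point error because the regression is saturated in group and period. The level gap between groups (2.0) and the common trend (0.7) both drop out of the estimate.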

Attrition

Attrition is the defining practical problem of longitudinal studies. People move, die, refuse to participate, or become unreachable. If attrition is related to the outcome, the remaining sample is not representative of the original sample.

Testing for attrition bias: compare baseline characteristics of stayers vs. leavers. If they differ, attrition is non-random. Corrections include inverse probability weighting (weight remaining observations by the inverse of their probability of staying) and multiple imputation.
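Inverse probability weighting can be sketched on simulated data. The example assumes that attrition depends only on an observed binary baseline covariate `z`, which is a strong assumption that real studies need to argue for. The stay probabilities and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
z = rng.integers(0, 2, size=n)             # observed baseline covariate (0 or 1)
y = 1.0 + 2.0 * z + rng.normal(size=n)     # wave-2 outcome for the full sample
p_stay = np.where(z == 1, 0.5, 0.9)        # assumed: attrition depends on z only
stay = rng.random(n) < p_stay              # True if the unit is still observed

# Estimate each unit's stay probability by the stay rate in its baseline cell.
p_hat = np.array([stay[z == k].mean() for k in (0, 1)])[z]

naive_mean = y[stay].mean()                # ignores attrition
weights = 1.0 / p_hat[stay]                # inverse probability of staying
ipw_mean = np.sum(weights * y[stay]) / np.sum(weights)
full_mean = y.mean()                       # unobservable in a real study

print(f"full: {full_mean:.3f}, naive: {naive_mean:.3f}, IPW: {ipw_mean:.3f}")
```

Units with `z == 1` have higher outcomes and drop out more often, so the stayers-only mean is too low. Reweighting restores the baseline mix of `z`. If attrition also depended on something unobserved, the weights would not remove the bias.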

Major Panel Surveys

  • PSID (Panel Study of Income Dynamics): U.S. families, since 1968. The longest-running household panel survey in the world.
  • NLSY (National Longitudinal Survey of Youth): two cohorts (1979, 1997) of U.S. youth tracked into adulthood.
  • LISS (Longitudinal Internet Studies for the Social Sciences): Dutch probability-based internet panel.
  • BHPS/Understanding Society: UK households. The BHPS (British Household Panel Survey) sample is now part of Understanding Society, the UK Household Longitudinal Study.
  • SOEP (German Socio-Economic Panel): German households since 1984.

Common Confusions

Watch Out

Fixed effects does not mean the effects are fixed

The name is confusing. "Fixed effects" means the unit-specific intercepts $\alpha_i$ are treated as fixed (non-random) parameters. It does not mean the regression coefficients $\beta$ are fixed or non-varying. The alternative, "random effects," treats $\alpha_i$ as draws from a distribution.

Watch Out

The Hausman test is not a test of whether to use FE or RE

The Hausman test checks whether the RE and FE estimates are statistically different. If they are, this suggests $\text{Cov}(\alpha_i, x_{it}) \neq 0$ and RE is inconsistent. But a non-significant Hausman test does not prove $\text{Cov}(\alpha_i, x_{it}) = 0$. It may just lack power. In practice, if you have reason to believe there are unobserved confounders correlated with regressors, use FE regardless of the Hausman test.
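For reference, the classical form of the statistic is sketched below. The symbols $\hat{V}_{\text{FE}}$ and $\hat{V}_{\text{RE}}$ (estimated covariance matrices of the two estimators) and $k$ (the number of time-varying regressors being compared) are notation introduced here, not taken from the text above.

```latex
\[
H = (\hat{\beta}_{\text{FE}} - \hat{\beta}_{\text{RE}})^T
    \left[\hat{V}_{\text{FE}} - \hat{V}_{\text{RE}}\right]^{-1}
    (\hat{\beta}_{\text{FE}} - \hat{\beta}_{\text{RE}})
  \xrightarrow{d} \chi^2_k
  \quad \text{under } H_0: \text{Cov}(\alpha_i, x_{it}) = 0
\]
```

This form relies on RE being fully efficient under the null, which in turn assumes homoskedastic, serially uncorrelated idiosyncratic errors. When that is doubtful, a robust variant of the test is the safer choice.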

Watch Out

Panel data does not automatically solve endogeneity

FE controls for time-invariant confounders. It does not control for time-varying confounders. If an omitted variable changes over time and is correlated with $x_{it}$, FE does not eliminate the bias. Panel data helps, but it is not a cure-all for endogeneity.
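A short simulation makes the point concrete. The sketch below adds a time-varying variable `w` that affects both the regressor and the outcome, then compares the within estimator with and without `w` as a control. The data-generating process and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, beta = 1000, 6, 1.0   # hypothetical panel dimensions and true slope

alpha = rng.normal(size=(N, 1))        # time-invariant unit effect
w = rng.normal(size=(N, T))            # time-varying confounder
x = alpha + w + rng.normal(size=(N, T))
y = beta * x + alpha + 1.0 * w + rng.normal(size=(N, T))

def demean(a):
    """Within transformation: subtract each unit's time average."""
    return a - a.mean(axis=1, keepdims=True)

xd, yd, wd = demean(x), demean(y), demean(w)

# FE with w omitted: alpha is removed, but w still biases the slope.
beta_fe_omit = (xd * yd).sum() / (xd ** 2).sum()

# FE with w included as a regressor (possible only because w is observed here).
X = np.column_stack([xd.ravel(), wd.ravel()])
coef, *_ = np.linalg.lstsq(X, yd.ravel(), rcond=None)
beta_fe_ctrl = coef[0]

print(f"FE omitting w: {beta_fe_omit:.3f}, FE controlling for w: {beta_fe_ctrl:.3f}")
```

Demeaning removes `alpha` in both regressions, but only the second one recovers a slope close to the true value, because `w` varies within units and survives the within transformation.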

Summary

  • Panel data tracks the same units over time, enabling within-unit comparisons
  • Fixed effects removes all time-invariant confounders by demeaning
  • Random effects is more efficient but requires $\alpha_i$ uncorrelated with $x_{it}$
  • Difference-in-differences uses parallel trends to identify causal effects
  • Attrition is the major practical threat: dropouts are rarely random
  • FE cannot identify effects of time-invariant variables

Exercises

Exercise (Core)

Problem

You have a panel of 500 workers observed over 5 years. You regress log wages on years of education using OLS, FE, and RE. The OLS coefficient is 0.10, the RE coefficient is 0.08, and the FE coefficient is 0.04. Interpret the differences. Why is the FE estimate smallest?

Exercise (Advanced)

Problem

A policy is implemented in state A in 2020 but not in state B. Average outcomes are: State A pre-2020: 50, State A post-2020: 58, State B pre-2020: 45, State B post-2020: 48. Compute the DiD estimate. State the parallel trends assumption in plain English. Give one reason it might fail.

References

Canonical:

  • Wooldridge, Econometric Analysis of Cross Section and Panel Data (2010), Chapters 10-11
  • Hsiao, Analysis of Panel Data (2014), Chapters 2-4

Current:

  • Angrist & Pischke, Mostly Harmless Econometrics (2009), Chapter 5
  • de Chaisemartin & D'Haultfoeuille, "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects" (2020), AER
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Last reviewed: April 2026
