Disclaimer: Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Methodology

Meta-Analysis

Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.

Core · Tier 2 · Stable · ~50 min

Why This Matters

ML papers report results on benchmarks. Different papers use different datasets, different hyperparameters, different compute budgets. When five papers compare method A to method B and get different effect sizes, how do you draw a conclusion? Meta-analysis provides the formal framework for combining evidence across studies.

The same problems that plague medical meta-analysis (publication bias, heterogeneity, p-hacking) plague ML. Papers that show improvement get published; papers that show no improvement do not. Understanding meta-analysis helps you read the ML literature critically.

Setup and Notation

Suppose we have $k$ studies, each estimating some effect size $\theta_i$ with estimate $\hat{\theta}_i$ and variance $\sigma_i^2$ (assumed known or well-estimated).

Definition

Effect Size

The effect size is the quantity each study estimates. In medical research, this might be a treatment effect. In ML, it is typically the performance difference between two methods (e.g., accuracy of method A minus accuracy of method B) on a particular benchmark.

Definition

Inverse-Variance Weight

The inverse-variance weight for study $i$ is $w_i = 1/\sigma_i^2$. Studies with smaller variance (more precise estimates) get more weight.

Fixed-Effect Model

The fixed-effect model assumes all studies estimate the same true effect $\theta$. Differences across studies are due to sampling variability only.

Theorem

Fixed-Effect Meta-Analytic Estimator

Statement

The optimal combined estimator under the fixed-effect model is:

$$\hat{\theta}_{\text{FE}} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}$$

with variance $\text{Var}(\hat{\theta}_{\text{FE}}) = 1 / \sum_{i=1}^{k} w_i$, where $w_i = 1/\sigma_i^2$.

Intuition

This is a weighted average of the study estimates, weighting each by its precision. More precise studies contribute more. If all studies had equal precision, it reduces to the simple average.
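The weighted average above takes a few lines of NumPy. This is a minimal sketch, not a library implementation; the function name and the three hypothetical studies are illustrative:

```python
import numpy as np

def fixed_effect(theta_hat, se):
    """Inverse-variance weighted (fixed-effect) combined estimate."""
    w = 1.0 / np.asarray(se, dtype=float) ** 2        # w_i = 1 / sigma_i^2
    theta_fe = np.sum(w * np.asarray(theta_hat, dtype=float)) / np.sum(w)
    se_fe = np.sqrt(1.0 / np.sum(w))                  # Var = 1 / sum(w_i)
    return theta_fe, se_fe

# Three hypothetical studies (estimates and standard errors)
theta_fe, se_fe = fixed_effect([0.42, 0.35, 0.58], [0.08, 0.12, 0.10])
```

The combined standard error is always smaller than that of the most precise single study, which is the formal sense in which pooling "always increases precision."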

Proof Sketch

This is the minimum-variance unbiased estimator for the common mean $\theta$ given independent estimates with known variances. It follows from the general theory of weighted least squares, or equivalently, from maximum likelihood under Gaussian sampling distributions.

Why It Matters

The fixed-effect estimator is the simplest way to combine results. Its variance decreases as $1/\sum_i w_i$, so adding more studies always increases precision. But the model is only valid when the true effect is the same across all studies.

Failure Mode

If the true effect varies across studies (heterogeneity), the fixed-effect model underestimates the uncertainty. The confidence interval is too narrow, and you get false precision. In ML, different datasets, architectures, and compute budgets almost certainly introduce heterogeneity.

Random-Effects Model

The random-effects model assumes each study has its own true effect $\theta_i$, drawn from a distribution with mean $\mu$ and between-study variance $\tau^2$:

$$\hat{\theta}_i \sim N(\theta_i, \sigma_i^2), \quad \theta_i \sim N(\mu, \tau^2)$$

The goal is to estimate $\mu$ (the average effect across the population of studies).

Theorem

DerSimonian-Laird Random-Effects Estimator

Statement

The random-effects combined estimator is:

$$\hat{\mu}_{\text{RE}} = \frac{\sum_{i=1}^{k} w_i^* \hat{\theta}_i}{\sum_{i=1}^{k} w_i^*}$$

where $w_i^* = 1/(\sigma_i^2 + \hat{\tau}^2)$ and $\hat{\tau}^2$ is estimated from the data.

Intuition

The weights now account for both within-study variance $\sigma_i^2$ and between-study variance $\tau^2$. When $\tau^2$ is large, all weights become more equal because between-study variability dominates within-study precision. When $\tau^2 = 0$, the random-effects model reduces to the fixed-effect model.

Proof Sketch

The marginal distribution of $\hat{\theta}_i$ is $N(\mu, \sigma_i^2 + \tau^2)$. Apply inverse-variance weighting with the total variance $\sigma_i^2 + \tau^2$. The DerSimonian-Laird method estimates $\tau^2$ using the method of moments from Cochran's Q statistic.
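The method-of-moments recipe fits in one short function. A minimal sketch of the DerSimonian-Laird procedure, assuming known standard errors; the truncation at zero is the standard convention:

```python
import numpy as np

def dersimonian_laird(theta_hat, se):
    """Random-effects combined estimate with DerSimonian-Laird tau^2."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    v = np.asarray(se, dtype=float) ** 2           # within-study variances
    w = 1.0 / v                                    # fixed-effect weights
    theta_fe = np.sum(w * theta_hat) / np.sum(w)
    q = np.sum(w * (theta_hat - theta_fe) ** 2)    # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(v) - 1)) / c)        # method of moments, truncated at 0
    w_star = 1.0 / (v + tau2)                      # random-effects weights
    mu_re = np.sum(w_star * theta_hat) / np.sum(w_star)
    se_re = np.sqrt(1.0 / np.sum(w_star))
    return mu_re, se_re, tau2
```

When the estimates are nearly identical, `tau2` is truncated to zero and the result coincides with the fixed-effect estimate, illustrating the failure mode discussed below.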

Why It Matters

The random-effects model is more realistic for ML. Different benchmarks, architectures, and experimental setups introduce genuine variability. The random-effects estimator gives wider (more honest) confidence intervals that account for this heterogeneity.

Failure Mode

The DerSimonian-Laird estimator can produce $\hat{\tau}^2 = 0$ even when true heterogeneity exists, especially with few studies. Alternative estimators (REML, Paule-Mandel) are more robust. Also, the random-effects model assumes normality of the study-level effects, which may not hold.

Measuring Heterogeneity

Definition

Cochran's Q Statistic

$$Q = \sum_{i=1}^{k} w_i (\hat{\theta}_i - \hat{\theta}_{\text{FE}})^2$$

Under the null of no heterogeneity, $Q \sim \chi^2_{k-1}$. A large $Q$ indicates the studies are not estimating the same effect.

Definition

I-squared Statistic

$$I^2 = \max\left(0, \frac{Q - (k-1)}{Q}\right) \times 100\%$$

$I^2$ estimates the percentage of total variability due to between-study heterogeneity rather than sampling error. Rough guidelines: $I^2 < 25\%$ is low, $25\%$ to $75\%$ moderate, and $> 75\%$ high.
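Both statistics follow directly from the study estimates. A minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def heterogeneity(theta_hat, se):
    """Cochran's Q and the I^2 statistic for a set of study estimates."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    w = 1.0 / np.asarray(se, dtype=float) ** 2
    theta_fe = np.sum(w * theta_hat) / np.sum(w)   # fixed-effect estimate
    q = np.sum(w * (theta_hat - theta_fe) ** 2)    # Cochran's Q
    k = len(theta_hat)
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    return q, i2
```

Identical estimates give $Q = 0$ and $I^2 = 0\%$; widely spread, precisely measured estimates push $I^2$ toward 100%.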

Publication Bias

Studies with statistically significant results are more likely to be published. This creates a biased sample of the literature.

Funnel plot: Plot effect size against study precision (or sample size). Under no bias, the plot should be symmetric around the combined estimate. Asymmetry suggests smaller, less precise studies with null results are missing.

Trim and fill: A method that imputes missing studies to make the funnel plot symmetric, then recomputes the combined estimate. Useful for sensitivity analysis, not definitive proof of bias.

Fisher's method for combining p-values: Given independent p-values $p_1, \ldots, p_k$:

$$-2\sum_{i=1}^{k} \ln(p_i) \sim \chi^2_{2k}$$

under the null that all effects are zero. This tests whether there is any signal, but does not estimate the effect size.
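A stdlib-only sketch of Fisher's method: since the degrees of freedom $2k$ are always even, the chi-square survival function has a closed form (a truncated Poisson sum), so no SciPy dependency is needed:

```python
import math

def fisher_combine(pvals):
    """Fisher's method: combine independent p-values into one global test."""
    k = len(pvals)
    stat = -2.0 * sum(math.log(p) for p in pvals)  # ~ chi^2 with 2k df under H0
    # chi^2 survival function for even df = 2k:
    #   P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    p_combined = math.exp(-half) * sum(half ** j / math.factorial(j)
                                       for j in range(k))
    return stat, p_combined
```

A sanity check on the closed form: with a single p-value, the combined p-value equals the input, since $P(\chi^2_2 > -2\ln p) = p$.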

Meta-Regression

When heterogeneity exists, meta-regression models the effect size as a function of study-level covariates. For example: effect = baseline + coefficient * dataset_size. This is analogous to a weighted regression where each study contributes one data point.
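The weighted-regression analogy can be made concrete. A minimal weighted least squares sketch for one study-level covariate; the function name and data are illustrative, and this omits the random-effects variance component a full meta-regression would add:

```python
import numpy as np

def meta_regression(theta_hat, se, covariate):
    """Weighted least squares: effect size vs. one study-level covariate."""
    y = np.asarray(theta_hat, dtype=float)
    w = 1.0 / np.asarray(se, dtype=float) ** 2            # inverse-variance weights
    X = np.column_stack([np.ones_like(y), np.asarray(covariate, dtype=float)])
    sw = np.sqrt(w)                                       # scale rows by sqrt(w_i)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta  # [baseline, slope]
```

Each study contributes one weighted data point, exactly as the analogy in the text suggests.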

Connection to ML

In ML, meta-analysis appears in several guises:

  1. Benchmark aggregation: Comparing methods across multiple datasets. A table of accuracy numbers across 10 datasets is raw material for meta-analysis.
  2. Systematic reviews: Papers like "A Survey of X" often implicitly perform meta-analysis when summarizing results.
  3. Hyperparameter sensitivity: How much does performance vary across random seeds, datasets, or hyperparameter choices? This is heterogeneity analysis.
  4. Reproducibility studies: When a result fails to replicate, formal meta-analysis of the original and replication attempts is the correct way to update your belief.

Common Confusions

Watch Out

Fixed-effect does not mean fixed effects in the regression sense

In meta-analysis, "fixed-effect" means assuming one true effect across all studies. This is different from "fixed effects" in panel data regression (which means study-specific intercepts). The terminology collision is unfortunate.

Watch Out

I-squared does not measure the magnitude of heterogeneity

$I^2 = 75\%$ means 75% of the observed variability is due to between-study differences, not that the effect sizes vary by 75%. A set of studies could have $I^2 = 75\%$ with all effects between 0.50 and 0.55. The prediction interval (not the confidence interval) gives the range of plausible effects.
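The prediction interval for a new study's true effect widens the confidence interval by the between-study variance. A rough sketch using a normal quantile for simplicity (a $t$ quantile with $k-2$ degrees of freedom is the more careful choice):

```python
import math

def prediction_interval(mu_re, se_re, tau2, z=1.96):
    """Approximate 95% prediction interval for the true effect in a new study."""
    # Total uncertainty = between-study variance + variance of the mean estimate
    half_width = z * math.sqrt(tau2 + se_re ** 2)
    return mu_re - half_width, mu_re + half_width
```

Whenever $\tau^2 > 0$, this interval is strictly wider than the confidence interval $\hat{\mu} \pm z \cdot \text{SE}(\hat{\mu})$.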

Key Takeaways

  • Fixed-effect model: one true effect, weight by inverse variance
  • Random-effects model: effects vary across studies, accounts for between-study variance $\tau^2$
  • Heterogeneity: measured by $Q$ and $I^2$. High $I^2$ means the effect is not consistent
  • Publication bias: funnel plots and trim-and-fill for diagnosis
  • For ML: treat benchmark tables as raw meta-analytic data and think critically about missing results

Exercises

ExerciseCore

Problem

Three studies estimate an effect with $\hat{\theta}_1 = 0.5$, $\hat{\theta}_2 = 0.3$, $\hat{\theta}_3 = 0.7$ and standard errors $\sigma_1 = 0.1$, $\sigma_2 = 0.2$, $\sigma_3 = 0.15$. Compute the fixed-effect combined estimate $\hat{\theta}_{\text{FE}}$.

ExerciseAdvanced

Problem

Explain why the random-effects combined estimate is always closer to the unweighted average than the fixed-effect estimate. Under what condition are they equal?

References

Canonical:

  • Borenstein, Hedges, Higgins, Rothstein, Introduction to Meta-Analysis (2009), Chapters 1-16
  • DerSimonian & Laird, "Meta-Analysis in Clinical Trials" (1986), Controlled Clinical Trials

Current:

  • Higgins et al., Cochrane Handbook for Systematic Reviews of Interventions (2019), Chapter 10

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (2021)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Last reviewed: April 2026
