Methodology
Meta-Analysis
Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.
Why This Matters
ML papers report results on benchmarks. Different papers use different datasets, different hyperparameters, different compute budgets. When five papers compare method A to method B and get different effect sizes, how do you draw a conclusion? Meta-analysis provides the formal framework for combining evidence across studies.
The same problems that plague medical meta-analysis (publication bias, heterogeneity, p-hacking) plague ML. Papers that show improvement get published; papers that show no improvement do not. Understanding meta-analysis helps you read the ML literature critically.
Setup and Notation
Suppose we have $k$ studies, each estimating some effect size $\theta$ with estimate $\hat{\theta}_i$ and variance $\sigma_i^2$ (assumed known or well-estimated).
Effect Size
The effect size is the quantity each study estimates. In medical research, this might be a treatment effect. In ML, it is typically the performance difference between two methods (e.g., accuracy of method A minus accuracy of method B) on a particular benchmark.
Inverse-Variance Weight
Studies with smaller variance (more precise estimates) get more weight. The inverse-variance weight for study $i$ is $w_i = 1/\sigma_i^2$.
Fixed-Effect Model
The fixed-effect model assumes all studies estimate the same true effect $\theta$. Differences across studies are due to sampling variability only.
Fixed-Effect Meta-Analytic Estimator
Statement
The optimal combined estimator under the fixed-effect model is:

$$\hat{\theta}_{\mathrm{FE}} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}$$

with variance $\mathrm{Var}(\hat{\theta}_{\mathrm{FE}}) = 1/\sum_{i=1}^{k} w_i$, where $w_i = 1/\sigma_i^2$.
Intuition
This is a weighted average of the study estimates, weighting each by its precision. More precise studies contribute more. If all studies had equal precision, it reduces to the simple average.
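As a concrete sketch, the inverse-variance weighted average takes a few lines of NumPy (the function name and the example numbers are illustrative, not from the text):

```python
import numpy as np

def fixed_effect(estimates, variances):
    """Inverse-variance weighted fixed-effect estimate and its variance."""
    w = 1.0 / np.asarray(variances, dtype=float)
    est = np.asarray(estimates, dtype=float)
    theta_hat = np.sum(w * est) / np.sum(w)
    var_hat = 1.0 / np.sum(w)
    return theta_hat, var_hat

# Three hypothetical studies: effect estimates and sampling variances
theta, var = fixed_effect([0.30, 0.10, 0.20], [0.01, 0.04, 0.02])
# theta ≈ 0.243: the most precise study (variance 0.01) dominates
```

The first study gets weight 100 against 25 for the noisiest one, which pulls the combined estimate toward 0.30.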
Proof Sketch
This is the minimum-variance unbiased estimator for the common mean given independent estimates with known variances. It follows from the general theory of weighted least squares, or equivalently, from maximum likelihood under Gaussian sampling distributions.
Why It Matters
The fixed-effect estimator is the simplest way to combine results. Its variance, $1/\sum_i w_i$, decreases as $k$ grows, so adding more studies always increases precision. But the model is only valid when the true effect is the same across all studies.
Failure Mode
If the true effect varies across studies (heterogeneity), the fixed-effect model underestimates the uncertainty. The confidence interval is too narrow, and you get false precision. In ML, different datasets, architectures, and compute budgets almost certainly introduce heterogeneity.
Random-Effects Model
The random-effects model assumes each study has its own true effect $\theta_i$, drawn from a distribution with mean $\mu$ and between-study variance $\tau^2$.
The goal is to estimate $\mu$ (the average effect across the population of studies).
DerSimonian-Laird Random-Effects Estimator
Statement
The random-effects combined estimator is:

$$\hat{\mu}_{\mathrm{RE}} = \frac{\sum_{i=1}^{k} w_i^* \hat{\theta}_i}{\sum_{i=1}^{k} w_i^*}$$

where $w_i^* = 1/(\sigma_i^2 + \hat{\tau}^2)$ and $\hat{\tau}^2$ is estimated from the data.
Intuition
The weights now account for both within-study variance $\sigma_i^2$ and between-study variance $\tau^2$. When $\tau^2$ is large, all weights become more equal because between-study variability dominates within-study precision. When $\hat{\tau}^2 = 0$, the random-effects model reduces to the fixed-effect model.
Proof Sketch
The marginal distribution of $\hat{\theta}_i$ is $N(\mu, \sigma_i^2 + \tau^2)$. Apply inverse-variance weighting with the total variance $\sigma_i^2 + \tau^2$. The DerSimonian-Laird method estimates $\tau^2$ using the method of moments from Cochran's $Q$ statistic.
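A minimal sketch of the DerSimonian-Laird procedure, combining the moment estimator of $\tau^2$ with the re-weighted average (the function name is illustrative):

```python
import numpy as np

def dersimonian_laird(estimates, variances):
    """Random-effects estimate with the DL moment estimator of tau^2."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    theta_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - theta_fe) ** 2)        # Cochran's Q
    k = len(y)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)         # moment estimate, truncated at zero
    w_star = 1.0 / (v + tau2)                  # total-variance weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return mu, se, tau2
```

When the study estimates are identical, $Q = 0$, the truncation kicks in, and the result coincides with the fixed-effect estimate.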
Why It Matters
The random-effects model is more realistic for ML. Different benchmarks, architectures, and experimental setups introduce genuine variability. The random-effects estimator gives wider (more honest) confidence intervals that account for this heterogeneity.
Failure Mode
The DerSimonian-Laird estimator can produce $\hat{\tau}^2 = 0$ even when true heterogeneity exists, especially with few studies. Alternative estimators (REML, Paule-Mandel) are more robust. Also, the random-effects model assumes normality of the study-level effects, which may not hold.
Measuring Heterogeneity
Cochran's Q Statistic
Under the null of no heterogeneity,

$$Q = \sum_{i=1}^{k} w_i (\hat{\theta}_i - \hat{\theta}_{\mathrm{FE}})^2 \sim \chi^2_{k-1}.$$

A large $Q$ indicates the studies are not estimating the same effect.
I-squared Statistic
$I^2 = \max\!\left(0, \frac{Q - (k-1)}{Q}\right)$ estimates the percentage of total variability due to between-study heterogeneity rather than sampling error. Rough guidelines: below $25\%$ low, $25\%$–$75\%$ moderate, above $75\%$ high.
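Both statistics follow directly from the definitions above; a sketch, assuming the same estimate/variance inputs as before:

```python
import numpy as np

def heterogeneity(estimates, variances):
    """Cochran's Q and the I^2 statistic."""
    y = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    theta_fe = np.sum(w * y) / np.sum(w)
    q = float(np.sum(w * (y - theta_fe) ** 2))
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2
```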
Publication Bias
Studies with statistically significant results are more likely to be published. This creates a biased sample of the literature.
Funnel plot: Plot effect size against study precision (or sample size). Under no bias, the plot should be symmetric around the combined estimate. Asymmetry suggests smaller, less precise studies with null results are missing.
Trim and fill: A method that imputes missing studies to make the funnel plot symmetric, then recomputes the combined estimate. Useful for sensitivity analysis, not definitive proof of bias.
Fisher's method for combining p-values: Given independent p-values $p_1, \ldots, p_k$:

$$X = -2 \sum_{i=1}^{k} \ln p_i \sim \chi^2_{2k}$$

under the null that all effects are zero. This tests whether there is any signal, but does not estimate the effect size.
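Fisher's statistic is simple enough to compute with the standard library; comparing it against the $\chi^2_{2k}$ tail then needs a chi-square table or a stats package (sketch, assuming p-values strictly in $(0, 1]$):

```python
import math

def fisher_combine(p_values):
    """Fisher's combined statistic; compare against chi^2 with 2k df."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    df = 2 * len(p_values)
    return stat, df

stat, df = fisher_combine([0.04, 0.20, 0.08])
```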
Meta-Regression
When heterogeneity exists, meta-regression models the effect size as a function of study-level covariates. For example: effect = baseline + coefficient * dataset_size. This is analogous to a weighted regression where each study contributes one data point.
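With one covariate, this is just inverse-variance weighted least squares; a sketch (function name and the normal-equations approach are illustrative):

```python
import numpy as np

def meta_regression(effects, variances, covariate):
    """Weighted least squares: effect_i ≈ b0 + b1 * covariate_i."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    X = np.column_stack([np.ones_like(y), np.asarray(covariate, dtype=float)])
    WX = X * w[:, None]
    # Solve the weighted normal equations (X' W X) b = X' W y
    return np.linalg.solve(X.T @ WX, WX.T @ y)  # [baseline, coefficient]
```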
Connection to ML
In ML, meta-analysis appears in several guises:
- Benchmark aggregation: Comparing methods across multiple datasets. A table of accuracy numbers across 10 datasets is raw material for meta-analysis.
- Systematic reviews: Papers like "A Survey of X" often implicitly perform meta-analysis when summarizing results.
- Hyperparameter sensitivity: How much does performance vary across random seeds, datasets, or hyperparameter choices? This is heterogeneity analysis.
- Reproducibility studies: When a result fails to replicate, formal meta-analysis of the original and replication attempts is the correct way to update your belief.
Common Confusions
Fixed-effect does not mean fixed effects in the regression sense
In meta-analysis, "fixed-effect" means assuming one true effect across all studies. This is different from "fixed effects" in panel data regression (which means study-specific intercepts). The terminology collision is unfortunate.
I-squared does not measure the magnitude of heterogeneity
$I^2 = 75\%$ means 75% of the observed variability is due to between-study differences, not that the effect sizes vary by 75%. A set of studies could have $I^2 = 75\%$ with all effects between 0.50 and 0.55. The prediction interval (not the confidence interval) gives the range of plausible effects.
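Under the random-effects model, an approximate 95% prediction interval widens the confidence interval by the between-study variance; a sketch using a normal quantile (a $t_{k-2}$ quantile is more common in practice):

```python
import math

def prediction_interval(mu, se_mu, tau2, z=1.96):
    """Approximate 95% prediction interval for a new study's true effect."""
    half_width = z * math.sqrt(tau2 + se_mu ** 2)
    return mu - half_width, mu + half_width
```

Even when the confidence interval for $\mu$ is narrow, a large $\hat{\tau}^2$ keeps this interval wide.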
Key Takeaways
- Fixed-effect model: one true effect, weight by inverse variance
- Random-effects model: effects vary across studies, accounts for between-study variance
- Heterogeneity: measured by $Q$ and $I^2$. High $I^2$ means the effect is not consistent
- Publication bias: funnel plots and trim-and-fill for diagnosis
- For ML: treat benchmark tables as raw meta-analytic data and think critically about missing results
Exercises
Problem
Three studies estimate an effect with estimates $\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3$ and standard errors $\sigma_1, \sigma_2, \sigma_3$. Compute the fixed-effect combined estimate $\hat{\theta}_{\mathrm{FE}}$.
Problem
Explain why the random-effects combined estimate is always closer to the unweighted average than the fixed-effect estimate. Under what condition are they equal?
References
Canonical:
- Borenstein, Hedges, Higgins, Rothstein, Introduction to Meta-Analysis (2009), Chapters 1-16
- DerSimonian & Laird, "Meta-Analysis in Clinical Trials" (1986), Controlled Clinical Trials
Current:
- Higgins et al., Cochrane Handbook for Systematic Reviews of Interventions (2019), Chapter 10
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (2021)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Reproducibility and experimental rigor: why replication matters and how to do it
- P-hacking and multiple testing: the statistical sins that meta-analysis must contend with
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)