
Methodology

Reproducibility and Experimental Rigor

What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.

Why This Matters

The ML reproducibility crisis is real. Studies have shown that a large fraction of published results cannot be replicated, even by the original authors. If your experiments are not reproducible, your conclusions are not trustworthy.

This is not just an academic concern. If you ship a model to production based on a single lucky run, you will be surprised when retraining produces something worse. Reproducibility connects directly to hypothesis testing: a result that cannot be replicated is a result that would fail a proper statistical test. It also intersects with model evaluation, because evaluation metrics are meaningless without controlled experimental conditions.

Mental Model

A reproducible experiment is one where a stranger, given your code, data, and instructions, can obtain the same results you reported. Not approximately the same. The same, within the variance you documented.

This requires controlling every source of randomness, documenting every decision, and reporting results honestly.

Sources of Randomness

ML experiments have many sources of randomness that must be controlled:

Definition

Random Seed

A random seed is an integer that initializes a pseudorandom number generator to a deterministic state. Setting the same seed produces the same sequence of "random" numbers. In ML, you must set seeds for: the language/framework RNG (Python, NumPy), the deep learning framework (PyTorch, TensorFlow), and CUDA operations (which may use non-deterministic algorithms by default).

Key sources of randomness in a typical ML pipeline:

  1. Weight initialization: different initial weights lead to different optima
  2. Data shuffling: the order of training batches affects optimization
  3. Data augmentation: random crops, flips, and noise differ across runs
  4. Dropout: random mask patterns change every forward pass
  5. CUDA non-determinism: some GPU operations (e.g., atomicAdd in reductions) are non-deterministic by default
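The sources above map to concrete seeding calls. A minimal sketch in Python (the PyTorch calls are guarded so the snippet also runs where torch is not installed; in a real pipeline the guard is unnecessary):

```python
import random

import numpy as np

def set_seed(seed: int) -> None:
    """Seed every RNG a typical training pipeline touches."""
    random.seed(seed)      # Python's built-in RNG (shuffling, sampling)
    np.random.seed(seed)   # NumPy RNG (augmentation, some initializers)
    try:
        import torch
        torch.manual_seed(seed)            # CPU RNG
        torch.cuda.manual_seed_all(seed)   # all GPU RNGs
        # Request deterministic kernels; this may cost speed
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; framework seeding skipped

# Same seed, same "random" numbers:
set_seed(0)
first = np.random.rand(4)
set_seed(0)
assert np.allclose(first, np.random.rand(4))
```

Even with all of this set, some GPU kernels remain non-deterministic; PyTorch's `torch.use_deterministic_algorithms(True)` can be used to error out on them.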

Multi-Run Evaluation

Definition

Multi-Run Reporting Standard

Report results as mean and standard deviation over $N$ independent runs with different random seeds:

$$\text{metric} = \bar{x} \pm s = \frac{1}{N}\sum_{i=1}^{N} x_i \pm \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

Use $N \geq 3$ at minimum. For claims of small improvements, use $N \geq 5$. The standard error of the mean is $s/\sqrt{N}$, which determines how precisely you know the true mean. This connects to the central limit theorem: the sample mean is approximately normal for large $N$, justifying the confidence interval construction.
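As a sketch, producing this report for five hypothetical accuracy scores (the critical value $t_{4,\,0.025} \approx 2.776$ is taken from standard $t$-tables rather than computed, to keep the example dependency-free):

```python
import numpy as np

accs = np.array([87.9, 88.4, 87.6, 88.8, 88.1])  # hypothetical: N=5 seeds

n = len(accs)
mean = accs.mean()
std = accs.std(ddof=1)      # sample std: N-1 in the denominator
sem = std / np.sqrt(n)      # standard error of the mean

t_crit = 2.776              # t_{4, 0.025}, from standard t-tables
ci = (mean - t_crit * sem, mean + t_crit * sem)

print(f"metric = {mean:.2f} ± {std:.2f} (N={n}, SEM={sem:.2f})")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Note `ddof=1`: NumPy's default `std()` divides by $N$, not $N-1$, and silently underestimates the sample standard deviation for small $N$.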

Proposition

Standard Error of the Mean

Statement

Let $X_1, \ldots, X_N$ be independent experiment runs with mean $\mu$ and variance $\sigma^2$. The sample mean $\bar{X} = \frac{1}{N}\sum X_i$ satisfies:

$$\text{Var}(\bar{X}) = \frac{\sigma^2}{N}, \quad \text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{N}}$$

A 95% confidence interval for $\mu$ using the $t$-distribution (appropriate for small $N$) is $\bar{X} \pm t_{N-1,\,0.025} \cdot s/\sqrt{N}$, where $s$ is the sample standard deviation.

Intuition

With $N$ runs, you know the mean $\sqrt{N}$ times more precisely than from a single run. To halve your uncertainty, you need $4\times$ as many runs. This is why going from 3 runs to 5 runs is a big improvement (uncertainty drops by about 23%), but going from 20 to 50 runs gives less marginal benefit.

Why It Matters

This determines whether a claimed improvement is real. If method A scores $89.2 \pm 0.8$ and method B scores $88.5 \pm 0.7$ (each over 5 runs), the standard errors are $0.8/\sqrt{5} \approx 0.36$ and $0.7/\sqrt{5} \approx 0.31$. The gap is $0.7$ with combined SE $\approx 0.47$, giving a $t$-statistic of $0.7/0.47 \approx 1.5$. This is not significant at $\alpha = 0.05$ (critical value $\approx 2.3$ for 8 df). Many published ML "improvements" would not survive this test.
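The arithmetic in this example can be checked directly. A sketch of the two-sample comparison, following the text's approximation of 8 degrees of freedom (the critical value $t_{8,\,0.025} \approx 2.306$ comes from standard tables):

```python
import math

# Means and sample stds over N=5 runs each (numbers from the text)
mean_a, std_a, n_a = 89.2, 0.8, 5
mean_b, std_b, n_b = 88.5, 0.7, 5

se_a = std_a / math.sqrt(n_a)
se_b = std_b / math.sqrt(n_b)
se_diff = math.sqrt(se_a**2 + se_b**2)   # combined SE of the gap

t_stat = (mean_a - mean_b) / se_diff
t_crit = 2.306                           # t_{8, 0.025}, two-sided alpha=0.05

print(f"t = {t_stat:.2f} vs critical value {t_crit}")
print("significant" if abs(t_stat) > t_crit else "not significant")
```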

Failure Mode

Assumes runs are truly independent. If all runs use the same data split and only differ in initialization seed, the variance underestimates the true uncertainty (because data-split variance is not captured). For the most robust estimate, vary both seeds and data splits.

Why Single-Run Results Are Unreliable

A single training run is a single sample from a distribution of possible outcomes. The variance can be surprisingly large:

  • Fine-tuning BERT on small datasets: accuracy can vary by 2-5% across seeds
  • Reinforcement learning: reward can vary by 50% or more across seeds
  • Small datasets amplify variance; large datasets reduce it

If your claimed improvement is 0.5% and your cross-seed standard deviation is 1.0%, your result is noise.

Data Split Hygiene

Definition

Train/Validation/Test Split

The training set is used to fit model parameters. The validation set is used to select hyperparameters and make modeling decisions. The test set is used exactly once to report final performance. Violating this protocol invalidates your reported numbers.

The Cardinal Sin: Touching the Test Set

Every time you evaluate on the test set and then make a decision based on the result (change a hyperparameter, try a different model), you leak information from the test set into your modeling process. The test set is no longer an unbiased estimate of future performance.

Rule: evaluate on the test set exactly once, at the very end.
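As a minimal sketch, a seeded three-way split by index (the sizes are illustrative):

```python
import numpy as np

n_samples = 1_000
rng = np.random.default_rng(0)   # fixed seed -> the split is reproducible
idx = rng.permutation(n_samples)

train_idx = idx[:700]            # fit model parameters
val_idx   = idx[700:850]         # all hyperparameter decisions happen here
test_idx  = idx[850:]            # evaluated exactly once, at the very end

# The three sets must partition the data with no overlap
assert len(set(train_idx) | set(val_idx) | set(test_idx)) == n_samples
```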

Data Leakage

Definition

Data Leakage

Data leakage occurs when information from outside the training set (validation data, test data, or future data) influences model training. This inflates reported performance and produces models that fail in deployment.

Common forms of data leakage:

  1. Preprocessing on full data: fitting a scaler or tokenizer on the full dataset (including test) before splitting
  2. Temporal leakage: using future data to predict the past in time-series
  3. Group leakage: splitting data points from the same patient/user/document across train and test sets
  4. Feature leakage: including features derived from the target variable
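Form 1 is the easiest to demonstrate. A sketch with synthetic data, contrasting leaky and correct normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features
train, test = X[:80], X[80:]

# WRONG: statistics computed on the full dataset see the test rows
mu_leaky, sd_leaky = X.mean(axis=0), X.std(axis=0)

# RIGHT: fit the normalizer on the training split only...
mu, sd = train.mean(axis=0), train.std(axis=0)
train_norm = (train - mu) / sd
test_norm = (test - mu) / sd   # ...then apply the same transform to test
```

The same principle applies to any fitted preprocessor (tokenizers, PCA, feature selectors): fit on train, apply everywhere.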

Configuration Management

Every experiment should be fully specified by a configuration file. This means:

  • All hyperparameters (learning rate, batch size, epochs, etc.)
  • Data paths and preprocessing steps
  • Model architecture details
  • Random seeds
  • Software versions (Python, PyTorch, CUDA)

Use tools like Hydra, MLflow, or Weights & Biases to track configurations automatically. Never rely on command-line arguments you will forget.
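A lightweight version of the same idea, without external tools: capture every knob in one serializable object and log it next to the results. The field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ExperimentConfig:
    # every value that could change between experiments lives here
    learning_rate: float = 3e-4
    batch_size: int = 32
    epochs: int = 100
    seed: int = 42
    data_path: str = "data/v2"   # illustrative path

cfg = ExperimentConfig(seed=7)
config_json = json.dumps(asdict(cfg), indent=2)  # write this next to results
print(config_json)
```

`frozen=True` makes the config immutable, so a stray assignment mid-run cannot silently diverge from what was logged.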

Levels of Reproducibility

Not all reproducibility is equal. The following table distinguishes three levels, each with increasing requirements and value.

| Level | What it guarantees | Requirements | Common failure mode |
| --- | --- | --- | --- |
| Script reproducibility | Same code + same machine = same output | Fixed seeds, pinned dependencies, deterministic CUDA ops | Works on the author's machine, fails elsewhere due to CUDA version mismatch or floating-point non-determinism across GPU architectures |
| Environment reproducibility | Same code + any compatible machine = same output (within tolerance) | Docker container or conda environment with exact versions, documented hardware requirements, stated tolerance bounds | Container builds but produces different numerics on different GPU generations (e.g., A100 vs V100) due to different tensor core behavior |
| Methodological reproducibility | Independent implementation of the described method = consistent conclusions | Clear algorithmic description, stated assumptions, reported variance, ablation studies isolating each contribution | Paper omits a critical implementation detail (e.g., learning rate warmup schedule, gradient clipping threshold) needed to match results |

Most ML papers achieve at best script reproducibility. Methodological reproducibility is the gold standard because it validates the idea, not just the code.

Configuration Anti-Patterns

Several common practices undermine configuration management:

Hardcoded magic numbers. A learning rate buried inside a training loop is invisible to configuration tracking. Every value that could change between experiments belongs in a config file, not in source code.

Implicit defaults. Framework defaults change between versions. PyTorch 1.x and 2.x have different default behaviors for dropout, weight initialization, and gradient computation. If your config does not explicitly set these values, your experiment depends on the framework version in a way that is not documented.

Incomplete logging. Logging hyperparameters but not the data preprocessing pipeline is a common oversight. If you change how you tokenize text or normalize images, that is a different experiment. The preprocessing hash (a checksum of the preprocessing code and configuration) should be logged alongside model hyperparameters.

Manual overrides. Running a script with --lr 0.001 on the command line and then forgetting what you used is the most common reproducibility failure in practice. Config files checked into version control solve this. Each experiment corresponds to a git commit plus a config file, and nothing else.

Checkpointing

Save model checkpoints at regular intervals. This serves three purposes:

  1. Recovery: if training crashes at epoch 90, you do not restart from scratch
  2. Model selection: you can pick the best checkpoint by validation performance
  3. Analysis: you can study training dynamics after the fact

Always save the optimizer state alongside model weights. Without it, you cannot resume training correctly.
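A framework-agnostic sketch of the checkpoint payload (with PyTorch you would pass `model.state_dict()` and `optimizer.state_dict()` and use `torch.save`/`torch.load` in place of pickle):

```python
import pickle

def save_checkpoint(path, model_state, optimizer_state, epoch):
    # Optimizer state is essential: Adam moment estimates and scheduler
    # progress cannot be reconstructed from the weights alone.
    payload = {
        "model": model_state,
        "optimizer": optimizer_state,
        "epoch": epoch,
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip with dummy states
save_checkpoint("ckpt.pkl", {"w": [0.1, 0.2]}, {"m": [0.0, 0.0]}, epoch=90)
ckpt = load_checkpoint("ckpt.pkl")
```

Saving the RNG state in the same payload makes a resumed run bit-identical to an uninterrupted one, not just approximately equivalent.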

Code and Data Release

Watch Out

Reproducible does not mean "I can re-run my script"

Reproducibility means someone else (who is not you, does not have your machine, and cannot ask you questions) can obtain the same results. This requires: (1) released code that runs without modification, (2) released data or clear instructions to obtain it, (3) a requirements file with pinned dependency versions, (4) documented random seeds, and (5) expected output numbers to verify against.

A reproducibility checklist:

  • Code is version-controlled (git) with a tagged release
  • Dependencies are pinned (requirements.txt or conda environment.yml)
  • Data is publicly available or generation scripts are provided
  • Configuration files for all reported experiments are included
  • README includes instructions to reproduce each table/figure
  • Expected results (numbers) are documented for verification

Common Mistakes

Watch Out

Reporting best run instead of average

If you run 10 seeds and report the best one, you are overfitting to randomness. This is a form of p-hacking: selecting the best from multiple runs inflates your reported metric the same way testing multiple hypotheses and reporting only the significant one inflates your false positive rate. Always report mean $\pm$ std. If you also want to report the best run, label it as "best of N" and explain that it is not representative of typical performance.

Watch Out

Tuning on the test set

If you try 50 hyperparameter configurations and pick the one with the best test accuracy, your test accuracy is not a valid estimate of generalization. Use a separate validation set for all hyperparameter decisions.

Watch Out

Forgetting framework non-determinism

Setting a Python random seed is not enough. You must also set NumPy seeds, framework seeds (torch.manual_seed), and enable deterministic CUDA operations. Even then, some operations may have residual non-determinism on GPU.

Summary

  • Reproducibility means someone else can get the same results, not just you
  • Set all random seeds: Python, NumPy, PyTorch/TensorFlow, CUDA
  • Report mean $\pm$ std over $N \geq 3$ independent runs
  • Never make decisions based on test set performance
  • Watch for data leakage: preprocessing, temporal, group, and feature leakage
  • Pin all dependency versions and release code with configurations
  • Single-run results are unreliable; variance can be surprisingly large

Exercises

ExerciseCore

Problem

You normalize your features by computing the mean and standard deviation over the entire dataset, then split into train/test. Explain why this is data leakage and how to fix it.

ExerciseCore

Problem

You fine-tune a model with 3 random seeds and get accuracies of 87.2, 89.1, and 85.8. Your competitor reports 88.0 from a single run. Can you claim your method is better or worse?

ExerciseAdvanced

Problem

You have a medical dataset where each patient has multiple scans. You randomly split scans into train and test sets. Why is this problematic, and how should you split instead?

References

Canonical:

  • Pineau et al., "The Machine Learning Reproducibility Checklist" (2020)
  • Henderson et al., "Deep Reinforcement Learning that Matters" (2018)

Current:

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (2021)

  • Dodge et al., "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" (2020)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Last reviewed: April 2026
