
Methodology

Reproducibility and Experimental Rigor

What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.

Why This Matters

The ML reproducibility crisis is real. Studies have shown that a large fraction of published results cannot be replicated, even by the original authors. If your experiments are not reproducible, your conclusions are not trustworthy.

This is not just an academic concern. If you ship a model to production based on a single lucky run, you will be surprised when retraining produces something worse. Reproducibility connects directly to hypothesis testing: a result that cannot be replicated is a result that would fail a proper statistical test. It also intersects with model evaluation, because evaluation metrics are meaningless without controlled experimental conditions.

Mental Model

A reproducible experiment is one where a stranger, given your code, data, and instructions, can obtain the same results you reported. Not approximately the same. The same, within the variance you documented.

This requires controlling every source of randomness, documenting every decision, and reporting results honestly.

Sources of Randomness

ML experiments have many sources of randomness that must be controlled:

Definition

Random Seed

A random seed is an integer that initializes a pseudorandom number generator to a deterministic state. Setting the same seed produces the same sequence of "random" numbers. In ML, you must set seeds for: the language/framework RNG (Python, NumPy), the deep learning framework (PyTorch, TensorFlow), and CUDA operations (which may use non-deterministic algorithms by default).

Key sources of randomness in a typical ML pipeline:

  1. Weight initialization: different initial weights lead to different optima
  2. Data shuffling: the order of training batches affects optimization
  3. Data augmentation: random crops, flips, and noise differ across runs
  4. Dropout: random mask patterns change every forward pass
  5. CUDA non-determinism: some GPU operations (e.g., atomicAdd in reductions) are non-deterministic by default
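The sources above map to concrete seeding calls. A minimal sketch in Python (the PyTorch calls are guarded so the snippet also runs where torch is not installed; in a real pipeline the guard is unnecessary):

```python
import random

import numpy as np

def set_seed(seed: int) -> None:
    """Seed every RNG a typical training pipeline touches."""
    random.seed(seed)      # Python's built-in RNG (shuffling, sampling)
    np.random.seed(seed)   # NumPy RNG (augmentation, some initializers)
    try:
        import torch
        torch.manual_seed(seed)            # CPU RNG
        torch.cuda.manual_seed_all(seed)   # all GPU RNGs
        # Request deterministic kernels; this may cost speed
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; framework seeding skipped

# Same seed, same "random" numbers:
set_seed(0)
first = np.random.rand(4)
set_seed(0)
assert np.allclose(first, np.random.rand(4))
```

Even with all of this set, some GPU kernels remain non-deterministic; PyTorch's `torch.use_deterministic_algorithms(True)` can be used to error out on them.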

Multi-Run Evaluation

Definition

Multi-Run Reporting Standard

Report results as mean and standard deviation over $N$ independent runs with different random seeds:

$$\text{metric} = \bar{x} \pm s = \frac{1}{N}\sum_{i=1}^{N} x_i \pm \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

Use $N \geq 3$ at minimum. For claims of small improvements, use $N \geq 5$. The standard error of the mean is $s/\sqrt{N}$, which determines how precisely you know the true mean. This connects to the central limit theorem: the sample mean is approximately normal for large $N$, justifying the confidence interval construction.
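As a sketch, producing this report for five hypothetical accuracy scores (the critical value $t_{4,\,0.025} \approx 2.776$ is taken from standard $t$-tables rather than computed, to keep the example dependency-free):

```python
import numpy as np

accs = np.array([87.9, 88.4, 87.6, 88.8, 88.1])  # hypothetical: N=5 seeds

n = len(accs)
mean = accs.mean()
std = accs.std(ddof=1)      # sample std: N-1 in the denominator
sem = std / np.sqrt(n)      # standard error of the mean

t_crit = 2.776              # t_{4, 0.025}, from standard t-tables
ci = (mean - t_crit * sem, mean + t_crit * sem)

print(f"metric = {mean:.2f} ± {std:.2f} (N={n}, SEM={sem:.2f})")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Note `ddof=1`: NumPy's default `std()` divides by $N$, not $N-1$, and silently underestimates the sample standard deviation for small $N$.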

Proposition

Standard Error of the Mean

Statement

Let $X_1, \ldots, X_N$ be independent experiment runs with mean $\mu$ and variance $\sigma^2$. The sample mean $\bar{X} = \frac{1}{N}\sum X_i$ satisfies:

$$\text{Var}(\bar{X}) = \frac{\sigma^2}{N}, \quad \text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{N}}$$

A 95% confidence interval for $\mu$ using the $t$-distribution (appropriate for small $N$) is $\bar{X} \pm t_{N-1,\,0.025} \cdot s/\sqrt{N}$, where $s$ is the sample standard deviation.

Intuition

With $N$ runs, you know the mean $\sqrt{N}$ times more precisely than from a single run. To halve your uncertainty, you need $4\times$ as many runs. This is why going from 3 runs to 5 runs is a big improvement (uncertainty drops by about 23%), but going from 20 to 50 runs gives less marginal benefit.

Why It Matters

This determines whether a claimed improvement is real. If method A scores $89.2 \pm 0.8$ and method B scores $88.5 \pm 0.7$ (each over 5 runs), the standard errors are $0.8/\sqrt{5} \approx 0.36$ and $0.7/\sqrt{5} \approx 0.31$. The gap is $0.7$ with combined SE $\approx 0.47$, giving a $t$-statistic of $0.7/0.47 \approx 1.5$. This is not significant at $\alpha = 0.05$ (critical value $\approx 2.3$ for 8 df). Many published ML "improvements" would not survive this test.
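The arithmetic in this example can be checked directly. A sketch of the two-sample comparison, following the text's approximation of 8 degrees of freedom (the critical value $t_{8,\,0.025} \approx 2.306$ comes from standard tables):

```python
import math

# Means and sample stds over N=5 runs each (numbers from the text)
mean_a, std_a, n_a = 89.2, 0.8, 5
mean_b, std_b, n_b = 88.5, 0.7, 5

se_a = std_a / math.sqrt(n_a)
se_b = std_b / math.sqrt(n_b)
se_diff = math.sqrt(se_a**2 + se_b**2)   # combined SE of the gap

t_stat = (mean_a - mean_b) / se_diff
t_crit = 2.306                           # t_{8, 0.025}, two-sided alpha=0.05

print(f"t = {t_stat:.2f} vs critical value {t_crit}")
print("significant" if abs(t_stat) > t_crit else "not significant")
```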

Failure Mode

Assumes runs are truly independent. If all runs use the same data split and only differ in initialization seed, the variance underestimates the true uncertainty (because data-split variance is not captured). For the most robust estimate, vary both seeds and data splits.

Why Single-Run Results Are Unreliable

A single training run is a single sample from a distribution of possible outcomes. The variance can be surprisingly large:

  • Fine-tuning BERT on small datasets: accuracy can vary by 2-5% across seeds
  • Reinforcement learning: reward can vary by 50% or more across seeds
  • Small datasets amplify variance; large datasets reduce it

If your claimed improvement is 0.5% and your cross-seed standard deviation is 1.0%, your result is noise.

Data Split Hygiene

Definition

Train/Validation/Test Split

The training set is used to fit model parameters. The validation set is used to select hyperparameters and make modeling decisions. The test set is used exactly once to report final performance. Violating this protocol invalidates your reported numbers.

The Cardinal Sin: Touching the Test Set

Every time you evaluate on the test set and then make a decision based on the result (change a hyperparameter, try a different model), you leak information from the test set into your modeling process. The test set is no longer an unbiased estimate of future performance.

Rule: evaluate on the test set exactly once, at the very end.
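As a minimal sketch, a seeded three-way split by index (the sizes are illustrative):

```python
import numpy as np

n_samples = 1_000
rng = np.random.default_rng(0)   # fixed seed -> the split is reproducible
idx = rng.permutation(n_samples)

train_idx = idx[:700]            # fit model parameters
val_idx   = idx[700:850]         # all hyperparameter decisions happen here
test_idx  = idx[850:]            # evaluated exactly once, at the very end

# The three sets must partition the data with no overlap
assert len(set(train_idx) | set(val_idx) | set(test_idx)) == n_samples
```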

Data Leakage

Definition

Data Leakage

Data leakage occurs when information from outside the training set (validation data, test data, or future data) influences model training. This inflates reported performance and produces models that fail in deployment.

Common forms of data leakage:

  1. Preprocessing on full data: fitting a scaler or tokenizer on the full dataset (including test) before splitting
  2. Temporal leakage: using future data to predict the past in time-series
  3. Group leakage: splitting data points from the same patient/user/document across train and test sets
  4. Feature leakage: including features derived from the target variable
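Form 1 is the easiest to demonstrate. A sketch with synthetic data, contrasting leaky and correct normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features
train, test = X[:80], X[80:]

# WRONG: statistics computed on the full dataset see the test rows
mu_leaky, sd_leaky = X.mean(axis=0), X.std(axis=0)

# RIGHT: fit the normalizer on the training split only...
mu, sd = train.mean(axis=0), train.std(axis=0)
train_norm = (train - mu) / sd
test_norm = (test - mu) / sd   # ...then apply the same transform to test
```

The same principle applies to any fitted preprocessor (tokenizers, PCA, feature selectors): fit on train, apply everywhere.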

Configuration Management

Every experiment should be fully specified by a configuration file. This means:

  • All hyperparameters (learning rate, batch size, epochs, etc.)
  • Data paths and preprocessing steps
  • Model architecture details
  • Random seeds
  • Software versions (Python, PyTorch, CUDA)

Use tools like Hydra, MLflow, or Weights & Biases to track configurations automatically. Never rely on command-line arguments you will forget.
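A lightweight version of the same idea, without external tools: capture every knob in one serializable object and log it next to the results. The field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ExperimentConfig:
    # every value that could change between experiments lives here
    learning_rate: float = 3e-4
    batch_size: int = 32
    epochs: int = 100
    seed: int = 42
    data_path: str = "data/v2"   # illustrative path

cfg = ExperimentConfig(seed=7)
config_json = json.dumps(asdict(cfg), indent=2)  # write this next to results
print(config_json)
```

`frozen=True` makes the config immutable, so a stray assignment mid-run cannot silently diverge from what was logged.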

Levels of Reproducibility

Not all reproducibility is equal. The following table distinguishes three levels, each with increasing requirements and value.

| Level | What it guarantees | Requirements | Common failure mode |
| --- | --- | --- | --- |
| Script reproducibility | Same code + same machine = same output | Fixed seeds, pinned dependencies, deterministic CUDA ops | Works on the author's machine, fails elsewhere due to CUDA version mismatch or floating-point non-determinism across GPU architectures |
| Environment reproducibility | Same code + any compatible machine = same output (within tolerance) | Docker container or conda environment with exact versions, documented hardware requirements, stated tolerance bounds | Container builds but produces different numerics on different GPU generations (e.g., A100 vs V100) due to different tensor core behavior |
| Methodological reproducibility | Independent implementation of the described method = consistent conclusions | Clear algorithmic description, stated assumptions, reported variance, ablation studies isolating each contribution | Paper omits a critical implementation detail (e.g., learning rate warmup schedule, gradient clipping threshold) needed to match results |

Most ML papers achieve at best script reproducibility. Methodological reproducibility is the gold standard because it validates the idea, not just the code.

Configuration Anti-Patterns

Several common practices undermine configuration management:

Hardcoded magic numbers. A learning rate buried inside a training loop is invisible to configuration tracking. Every value that could change between experiments belongs in a config file, not in source code.

Implicit defaults. Framework defaults change between versions. PyTorch 1.x and 2.x have different default behaviors for dropout, weight initialization, and gradient computation. If your config does not explicitly set these values, your experiment depends on the framework version in a way that is not documented.

Incomplete logging. Logging hyperparameters but not the data preprocessing pipeline is a common oversight. If you change how you tokenize text or normalize images, that is a different experiment. The preprocessing hash (a checksum of the preprocessing code and configuration) should be logged alongside model hyperparameters.

Manual overrides. Running a script with --lr 0.001 on the command line and then forgetting what you used is the most common reproducibility failure in practice. Config files checked into version control solve this. Each experiment corresponds to a git commit plus a config file, and nothing else.

Checkpointing

Save model checkpoints at regular intervals. This serves three purposes:

  1. Recovery: if training crashes at epoch 90, you do not restart from scratch
  2. Model selection: you can pick the best checkpoint by validation performance
  3. Analysis: you can study training dynamics after the fact

Always save the optimizer state alongside model weights. Without it, you cannot resume training correctly.
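A framework-agnostic sketch of the checkpoint payload (with PyTorch you would pass `model.state_dict()` and `optimizer.state_dict()` and use `torch.save`/`torch.load` in place of pickle):

```python
import pickle

def save_checkpoint(path, model_state, optimizer_state, epoch):
    # Optimizer state is essential: Adam moment estimates and scheduler
    # progress cannot be reconstructed from the weights alone.
    payload = {
        "model": model_state,
        "optimizer": optimizer_state,
        "epoch": epoch,
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip with dummy states
save_checkpoint("ckpt.pkl", {"w": [0.1, 0.2]}, {"m": [0.0, 0.0]}, epoch=90)
ckpt = load_checkpoint("ckpt.pkl")
```

Saving the RNG state in the same payload makes a resumed run bit-identical to an uninterrupted one, not just approximately equivalent.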

Code and Data Release

Watch Out

Reproducible does not mean "I can re-run my script"

Reproducibility means someone else (who is not you, does not have your machine, and cannot ask you questions) can obtain the same results. This requires: (1) released code that runs without modification, (2) released data or clear instructions to obtain it, (3) a requirements file with pinned dependency versions, (4) documented random seeds, and (5) expected output numbers to verify against.

A reproducibility checklist:

  • Code is version-controlled (git) with a tagged release
  • Dependencies are pinned (requirements.txt or conda environment.yml)
  • Data is publicly available or generation scripts are provided
  • Configuration files for all reported experiments are included
  • README includes instructions to reproduce each table/figure
  • Expected results (numbers) are documented for verification

Common Mistakes

Watch Out

Reporting best run instead of average

If you run 10 seeds and report the best one, you are overfitting to randomness. This is a form of p-hacking: selecting the best from multiple runs inflates your reported metric the same way testing multiple hypotheses and reporting only the significant one inflates your false positive rate. Always report mean $\pm$ std. If you also want to report the best run, label it as "best of N" and explain that it is not representative of typical performance.

Watch Out

Tuning on the test set

If you try 50 hyperparameter configurations and pick the one with the best test accuracy, your test accuracy is not a valid estimate of generalization. Use a separate validation set for all hyperparameter decisions.

Watch Out

Forgetting framework non-determinism

Setting a Python random seed is not enough. You must also set NumPy seeds, framework seeds (torch.manual_seed), and enable deterministic CUDA operations. Even then, some operations may have residual non-determinism on GPU.

Summary

  • Reproducibility means someone else can get the same results, not just you
  • Set all random seeds: Python, NumPy, PyTorch/TensorFlow, CUDA
  • Report mean $\pm$ std over $N \geq 3$ independent runs
  • Never make decisions based on test set performance
  • Watch for data leakage: preprocessing, temporal, group, and feature leakage
  • Pin all dependency versions and release code with configurations
  • Single-run results are unreliable; variance can be surprisingly large

Exercises

ExerciseCore

Problem

You normalize your features by computing the mean and standard deviation over the entire dataset, then split into train/test. Explain why this is data leakage and how to fix it.

ExerciseCore

Problem

You fine-tune a model with 3 random seeds and get accuracies of 87.2, 89.1, and 85.8. Your competitor reports 88.0 from a single run. Can you claim your method is better or worse?

ExerciseAdvanced

Problem

You have a medical dataset where each patient has multiple scans. You randomly split scans into train and test sets. Why is this problematic, and how should you split instead?

References

Canonical:

  • Pineau et al., "The Machine Learning Reproducibility Checklist" (2020)
  • Henderson et al., "Deep Reinforcement Learning that Matters" (2018)

Current:

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (2021)

  • Dodge et al., "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" (2020)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Last reviewed: April 2026
