Methodology
Experiment Tracking and Tooling
MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.
Prerequisites
Why This Matters
You cannot reproduce what you did not record. Every ML practitioner has faced this situation: a model from three months ago performed well, but nobody remembers the exact hyperparameters, data version, or code commit that produced it. Experiment tracking tools solve this by automatically recording every detail of every run.
The difference between a research group that can reproduce its own results and one that cannot is almost always tooling discipline, not talent.
Mental Model
An experiment tracker is a structured lab notebook. For each run, it records: the inputs (code, data, config), the process (training curves, resource usage), and the outputs (metrics, artifacts, model checkpoints). Given any past result, you should be able to look up the exact conditions that produced it and re-run them.
What to Log
Experiment Record
A complete experiment record contains the following for each training run:
- Hyperparameters: learning rate, batch size, optimizer settings, architecture choices, regularization strength, number of epochs
- Metrics over time: training loss, validation loss, evaluation metrics at each logging step
- Data version: hash or identifier for the exact dataset used, including preprocessing
- Code version: git commit hash of the training code
- Environment: Python version, library versions, GPU type, CUDA version
- Artifacts: saved model checkpoints, generated outputs, evaluation predictions
- Random seeds: all seeds used for reproducibility
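The record above can be assembled with nothing beyond the standard library. The following is a minimal sketch (field names are illustrative; a real project would delegate this to a tracker such as MLflow or W&B):

```python
import json
import platform
import random
import subprocess
import sys
import time

def capture_experiment_record(hyperparams: dict, seed: int) -> dict:
    """Assemble a record of everything needed to reproduce a run."""
    try:
        # Code version: the exact commit the run was launched from.
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not inside a git checkout
    return {
        "hyperparams": hyperparams,    # lr, batch size, optimizer, ...
        "seed": seed,                  # every source of randomness
        "code_version": commit,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "started_at": time.time(),
        "metrics": [],                 # appended at each logging step
    }

record = capture_experiment_record({"lr": 3e-4, "batch_size": 64}, seed=42)
random.seed(record["seed"])
record["metrics"].append({"step": 0, "train_loss": 2.31})
print(json.dumps(record["hyperparams"]))
```

The data version and artifact paths would be added the same way: anything that can change between runs belongs in this record.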
Major Tracking Platforms
MLflow
MLflow provides four components: Tracking (logging parameters, metrics, and artifacts), Projects (packaging code for reproducibility), Models (a standard format for packaging trained models), and Model Registry (centralized model versioning and stage management). The tracking server stores runs in a backend database with an artifact store (local filesystem or S3).
Key design choice: MLflow is open-source and self-hosted. You own your data. The tradeoff is that you manage the infrastructure.
Weights and Biases (W&B)
W&B is a hosted platform that logs metrics, hyperparameters, system metrics (GPU utilization, memory), and artifacts. It provides interactive dashboards for comparing runs, a sweep agent for hyperparameter search, and a report system for sharing results.
Key design choice: W&B is hosted (with a self-hosted option). The hosted version requires sending data to external servers, which matters for proprietary work.
TensorBoard
TensorBoard is a visualization tool that reads event files written during training. It supports scalar metrics, histograms, images, text, and computation graphs. It is tightly integrated with TensorFlow and has PyTorch support via torch.utils.tensorboard.
Key design choice: TensorBoard is a local visualization tool, not a full tracking platform. It lacks built-in experiment comparison, hyperparameter logging, and artifact management. It is useful for monitoring a single run but insufficient for managing a research program.
Configuration Management
Configuration management is the practice of specifying all experiment parameters in structured config files rather than command-line arguments or hardcoded values. Tools like Hydra and OmegaConf provide:
- Hierarchical configs: nested YAML files for model, data, training, and evaluation settings
- Config composition: combine partial configs (e.g., model=resnet + optimizer=adam)
- Override from command line: change any parameter without editing files
- Automatic logging: the resolved config is saved with each run
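A hierarchical config following these conventions might look like the fragment below (a hypothetical layout in the Hydra style; file names and values are illustrative):

```yaml
# conf/config.yaml — top-level config composing partial configs
defaults:
  - model: resnet        # loads conf/model/resnet.yaml
  - optimizer: adam      # loads conf/optimizer/adam.yaml

training:
  epochs: 90
  batch_size: 256
  seed: 42

# conf/optimizer/adam.yaml — one swappable partial config:
#   lr: 1.0e-3
#   weight_decay: 0.0
```

Swapping `optimizer=sgd` or overriding `training.batch_size=128` on the command line then changes the resolved config without editing any file, and the resolved config is what gets logged with the run.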
Hydra (by Meta) is the standard. It creates a timestamped output directory for each run, saves the full resolved config, and integrates with logging frameworks. The key discipline: never specify a hyperparameter in code that is not also in the config file.
Experiment Comparison
Tracking Completeness Principle
Statement
An experiment tracking system is complete if, for any two runs r1 and r2 with different outcomes (e.g., different final metrics), the system contains sufficient information to identify at least one difference in inputs (hyperparameters, data, code, or random seed) that explains the outcome difference. Formally: if M(r1) ≠ M(r2), then there exists a recorded parameter p such that p(r1) ≠ p(r2).
Intuition
If two runs produce different results and you cannot find any recorded difference between them, your tracking is incomplete. Something changed that you did not log. Common culprits: unrecorded library version changes, floating-point nondeterminism on GPU, or untracked data preprocessing changes.
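The completeness check described above amounts to a diff over two run records. A minimal sketch (field names are illustrative):

```python
def explain_difference(run_a: dict, run_b: dict) -> list[str]:
    """Return the recorded input fields that differ between two runs.

    If the runs had different outcomes and this list is empty, the
    tracking is incomplete: something changed that was not logged.
    """
    keys = set(run_a) | set(run_b)
    return sorted(k for k in keys if run_a.get(k) != run_b.get(k))

run_a = {"lr": 3e-4, "seed": 1, "data_hash": "abc123", "commit": "f00d"}
run_b = {"lr": 3e-4, "seed": 1, "data_hash": "abc123", "commit": "beef"}

print(explain_difference(run_a, run_b))  # → ['commit']
```

Here the differing commit explains a metric gap; if the two records had been identical, the honest conclusion would be that an unlogged input changed.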
Why It Matters
Completeness is what separates useful tracking from checkbox tracking. Logging learning rate and batch size is not enough if the data preprocessing pipeline changed between runs. The goal is total accountability: every difference in output can be traced to a difference in input.
Failure Mode
Perfect completeness is unachievable in practice. GPU floating-point operations are nondeterministic, library internals change between minor versions, and some randomness is irreducible. The practical standard is: log enough to reproduce results within the variance you have measured and documented.
Hyperparameter Sweep Management
Tracking tools typically integrate sweep (hyperparameter search) functionality:
- Grid search: enumerate all combinations. Logged as a group of runs with a shared sweep ID
- Random search: sample configurations from distributions. More efficient than grid for high-dimensional spaces (Bergstra and Bengio, 2012)
- Bayesian optimization: use past results to guide future configurations. W&B Sweeps and Optuna support this
- Early stopping: kill underperforming runs early. Requires real-time metric access, which tracking tools provide
The sweep metadata (search space, sampling strategy, stopping criteria) should be logged alongside the individual runs.
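The grid and random strategies, together with the sweep metadata worth logging, can be sketched in a few lines (a stdlib-only illustration; a real sweep would be driven by a tool such as W&B Sweeps or Optuna):

```python
import itertools
import random

# Sweep metadata: log this alongside the individual runs.
search_space = {
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64],
}
sweep_meta = {"strategy": "grid", "space": search_space}

# Grid search: enumerate every combination, tagged with a shared sweep id.
grid = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]

# Random search: sample configurations from the same space.
rng = random.Random(0)  # seed the sweep itself, too
random_configs = [
    {k: rng.choice(v) for k, v in search_space.items()} for _ in range(4)
]

print(len(grid))  # → 6 (3 learning rates x 2 batch sizes)
```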
Common Confusions
Logging is not the same as tracking
Writing print statements to stdout is logging. Tracking means structured storage that supports querying, comparison, and retrieval. If you cannot programmatically find the run with the best validation loss from last month, you have logging but not tracking.
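The distinction is concrete: tracking means queries like "best validation loss last month" are one line of code. A sketch over JSON-lines run records (the storage format is an assumption for illustration):

```python
import json

# Three structured run records, as a tracker might store them on disk.
runs_jsonl = "\n".join(
    json.dumps(r)
    for r in [
        {"run_id": "a1", "params": {"lr": 1e-3}, "val_loss": 0.41},
        {"run_id": "b2", "params": {"lr": 3e-4}, "val_loss": 0.35},
        {"run_id": "c3", "params": {"lr": 1e-4}, "val_loss": 0.47},
    ]
)

# Programmatic retrieval: impossible with print statements in stdout.
runs = [json.loads(line) for line in runs_jsonl.splitlines()]
best = min(runs, key=lambda r: r["val_loss"])
print(best["run_id"])  # → b2
```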
Version control is not experiment tracking
Git tracks code versions. Experiment tracking records the mapping from (code version, data version, config) to (metrics, artifacts). You need both. A git commit tells you what code was available; an experiment record tells you which configuration of that code produced which result.
Dashboards are not a substitute for raw logs
Interactive dashboards are useful for exploration but unreliable for archival. Always ensure the underlying data (metrics, configs, artifacts) is stored in a durable format that survives platform migrations. Export to JSON or CSV periodically.
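A periodic export of this kind needs nothing beyond the standard library (a sketch; in practice the run records would come from the tracker's API rather than an inline list):

```python
import csv
import io

runs = [
    {"run_id": "a1", "lr": 1e-3, "val_loss": 0.41},
    {"run_id": "b2", "lr": 3e-4, "val_loss": 0.35},
]

# Durable, platform-independent archive of the underlying run data.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["run_id", "lr", "val_loss"])
writer.writeheader()
writer.writerows(runs)
print(buf.getvalue().splitlines()[0])  # → run_id,lr,val_loss
```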
Summary
- Log everything: hyperparameters, metrics over time, data version, code commit, environment, seeds
- Use structured config management (Hydra/OmegaConf), not command-line arguments
- MLflow is open-source and self-hosted; W&B is hosted with richer visualization
- TensorBoard is a visualization tool, not a complete tracking platform
- Sweep metadata (search space, strategy) should be tracked alongside runs
- You cannot reproduce what you did not record
Exercises
Problem
You trained a model three months ago that achieved 94% accuracy. You want to reproduce it. You have the code in git and the final accuracy logged. What additional information do you need, and which of it would a proper tracking system have recorded?
Problem
You are choosing between MLflow (self-hosted) and W&B (hosted) for a team of 10 ML engineers working on proprietary medical data. List three specific technical factors that should influence this decision, beyond general preference.
References
Canonical:
- Zaharia et al., "Accelerating the Machine Learning Lifecycle with MLflow" (IEEE Data Engineering Bulletin, 2018)
- Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization" (JMLR 2012)
Current:
- Biewald, "Experiment Tracking with Weights and Biases" (2020)
- Yadan, "Hydra: A Framework for Elegantly Configuring Complex Applications" (2019)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz and Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Ablation study design: systematic experiments to understand component contributions
Last reviewed: April 2026