
Methodology

Experiment Tracking and Tooling

MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.


Why This Matters

You cannot reproduce what you did not record. Every ML practitioner has faced this situation: a model from three months ago performed well, but nobody remembers the exact hyperparameters, data version, or code commit that produced it. Experiment tracking tools solve this by automatically recording every detail of every run.

The difference between a research group that can reproduce its own results and one that cannot is almost always tooling discipline, not talent.

Mental Model

An experiment tracker is a structured lab notebook. For each run, it records: the inputs (code, data, config), the process (training curves, resource usage), and the outputs (metrics, artifacts, model checkpoints). Given any past result, you should be able to look up the exact conditions that produced it and re-run them.

What to Log

Definition

Experiment Record

A complete experiment record contains the following for each training run:

  1. Hyperparameters: learning rate, batch size, optimizer settings, architecture choices, regularization strength, number of epochs
  2. Metrics over time: training loss, validation loss, evaluation metrics at each logging step
  3. Data version: hash or identifier for the exact dataset used, including preprocessing
  4. Code version: git commit hash of the training code
  5. Environment: Python version, library versions, GPU type, CUDA version
  6. Artifacts: saved model checkpoints, generated outputs, evaluation predictions
  7. Random seeds: all seeds used for reproducibility
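The seven fields above can be assembled with nothing but the standard library. This is a sketch, not any particular tool's schema: the field names and the `get_git_commit` helper are illustrative choices, not a real API.

```python
# Sketch: assembling a complete experiment record with only the stdlib.
# Field names and helpers are illustrative, not a real tracking schema.
import hashlib
import json
import platform
import subprocess
import sys


def dataset_hash(raw: bytes) -> str:
    """Content hash identifying the exact (preprocessed) dataset."""
    return hashlib.sha256(raw).hexdigest()


def get_git_commit() -> str:
    """Commit hash of the training code; 'unknown' outside a repo."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except Exception:
        return "unknown"


record = {
    "hyperparameters": {"lr": 3e-4, "batch_size": 64, "epochs": 10},
    "data_version": dataset_hash(b"...raw bytes of the dataset..."),
    "code_version": get_git_commit(),
    "environment": {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    },
    "seeds": {"python": 0, "numpy": 0, "torch": 0},
    "metrics": [],    # appended to at each logging step
    "artifacts": [],  # paths/URIs of checkpoints and outputs
}

# A JSON-serializable record can be stored, diffed, and queried later.
serialized = json.dumps(record, indent=2)
```

Anything that cannot round-trip through a dump like this (an object only alive in memory, an unversioned data file) is exactly what goes missing three months later.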

Major Tracking Platforms

MLflow

MLflow provides four components: Tracking (logging parameters, metrics, and artifacts), Projects (packaging code for reproducibility), Models (a packaging format for deployment), and the Model Registry (versioned model management). The tracking server stores runs in a backend database with an artifact store (local filesystem or S3).

Key design choice: MLflow is open-source and self-hosted. You own your data. The tradeoff is that you manage the infrastructure.

Weights and Biases (W&B)

W&B is a hosted platform that logs metrics, hyperparameters, system metrics (GPU utilization, memory), and artifacts. It provides interactive dashboards for comparing runs, a sweep agent for hyperparameter search, and a report system for sharing results.

Key design choice: W&B is hosted (with a self-hosted option). The hosted version requires sending data to external servers, which matters for proprietary work.

TensorBoard

TensorBoard is a visualization tool that reads event files written during training. It supports scalar metrics, histograms, images, text, and computation graphs. It is tightly integrated with TensorFlow and has PyTorch support via torch.utils.tensorboard.

Key design choice: TensorBoard is a local visualization tool, not a full tracking platform. It lacks built-in experiment comparison, hyperparameter logging, and artifact management. It is useful for monitoring a single run but insufficient for managing a research program.

Configuration Management

Definition

Configuration Management

Configuration management is the practice of specifying all experiment parameters in structured config files rather than command-line arguments or hardcoded values. Tools like Hydra and OmegaConf provide:

  1. Hierarchical configs: nested YAML files for model, data, training, and evaluation settings
  2. Config composition: combine partial configs (e.g., model=resnet + optimizer=adam)
  3. Override from command line: change any parameter without editing files
  4. Automatic logging: the resolved config is saved with each run

Hydra (by Meta) is the standard. It creates a timestamped output directory for each run, saves the full resolved config, and integrates with logging frameworks. The key discipline: never specify a hyperparameter in code that is not also in the config file.
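The two core ideas (composition of partial configs, override of any key) can be sketched with plain dictionaries. This is a stdlib imitation of what Hydra does, not Hydra itself: real Hydra composes YAML files and takes `key=value` overrides on the command line, and the group names here are made up.

```python
# Stdlib sketch of Hydra-style composition and overrides.
# Group/option names are illustrative; real Hydra uses YAML + CLI syntax.
import copy

CONFIG_GROUPS = {
    "model": {"resnet": {"depth": 50}, "mlp": {"hidden": 256}},
    "optimizer": {"adam": {"lr": 3e-4}, "sgd": {"lr": 1e-1, "momentum": 0.9}},
}


def compose(**choices):
    """Combine partial configs, e.g. compose(model='resnet', optimizer='adam')."""
    return {
        group: copy.deepcopy(CONFIG_GROUPS[group][name])
        for group, name in choices.items()
    }


def override(cfg, dotted_key, value):
    """Apply a dotted-path override, like 'optimizer.lr=1e-3' on the CLI."""
    *path, leaf = dotted_key.split(".")
    node = cfg
    for part in path:
        node = node[part]
    node[leaf] = value
    return cfg


cfg = compose(model="resnet", optimizer="adam")
cfg = override(cfg, "optimizer.lr", 1e-3)  # CLI-style override
# The fully resolved `cfg` is what gets saved with the run.
```

The point of saving the *resolved* config, rather than the pieces, is that the record stays interpretable even if the config groups change later.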

Experiment Comparison

Proposition

Tracking Completeness Principle

Statement

An experiment tracking system is complete if, for any two runs A and B with different outcomes (e.g., different final metrics), the system contains sufficient information to identify at least one difference in inputs (hyperparameters, data, code, or random seed) that explains the outcome difference. Formally: if metric(A) ≠ metric(B), then there exists a recorded parameter θ such that θ_A ≠ θ_B.

Intuition

If two runs produce different results and you cannot find any recorded difference between them, your tracking is incomplete. Something changed that you did not log. Common culprits: unrecorded library version changes, floating-point nondeterminism on GPU, or untracked data preprocessing changes.
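The completeness check is mechanical once records are structured: diff the recorded inputs of the two runs. A sketch, with illustrative record dicts rather than any tool's real schema:

```python
# Sketch of the completeness check: given two run records with different
# outcomes, find at least one recorded input that differs.
# Record structure here is illustrative, not a real tool's schema.
def explain_difference(run_a, run_b):
    """Return the recorded input keys whose values differ between runs."""
    keys = set(run_a["inputs"]) | set(run_b["inputs"])
    return sorted(
        k for k in keys
        if run_a["inputs"].get(k) != run_b["inputs"].get(k)
    )


run_a = {"inputs": {"lr": 3e-4, "seed": 0, "data": "abc123"}, "metric": 0.94}
run_b = {"inputs": {"lr": 3e-4, "seed": 0, "data": "def456"}, "metric": 0.91}

diff = explain_difference(run_a, run_b)
# A non-empty diff means tracking can explain the outcome gap; an empty
# diff with different metrics signals an unlogged change somewhere.
```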

Why It Matters

Completeness is what separates useful tracking from checkbox tracking. Logging learning rate and batch size is not enough if the data preprocessing pipeline changed between runs. The goal is total accountability: every difference in output can be traced to a difference in input.

Failure Mode

Perfect completeness is unachievable in practice. GPU floating-point operations are nondeterministic, library internals change between minor versions, and some randomness is irreducible. The practical standard is: log enough to reproduce results within the variance you have measured and documented.

Hyperparameter Sweep Management

Tracking tools typically integrate sweep (hyperparameter search) functionality:

  1. Grid search: enumerate all combinations. Logged as a group of runs with a shared sweep ID
  2. Random search: sample configurations from distributions. More efficient than grid for high-dimensional spaces (Bergstra and Bengio, 2012)
  3. Bayesian optimization: use past results to guide future configurations. W&B Sweeps and Optuna support this
  4. Early stopping: kill underperforming runs early. Requires real-time metric access, which tracking tools provide

The sweep metadata (search space, sampling strategy, stopping criteria) should be logged alongside the individual runs.
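A random-search sweep with its metadata logged alongside the runs might look like the following stdlib sketch; the search-space encoding and field names are invented for illustration.

```python
# Sketch: random-search sweep whose metadata (space, strategy) is logged
# under a shared sweep ID. Field names and space encoding are illustrative.
import math
import random
import uuid

sweep_meta = {
    "sweep_id": str(uuid.uuid4()),
    "strategy": "random",
    "space": {
        "lr": ("log_uniform", 1e-5, 1e-2),  # sample lr on a log scale
        "batch_size": [32, 64, 128],        # sample uniformly from choices
    },
    "num_trials": 4,
}


def sample(space, rng):
    """Draw one configuration from the declared search space."""
    cfg = {}
    for name, spec in space.items():
        if isinstance(spec, tuple) and spec[0] == "log_uniform":
            _, lo, hi = spec
            cfg[name] = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
        else:
            cfg[name] = rng.choice(spec)
    return cfg


rng = random.Random(0)  # seed the sweep itself, and record that seed too
runs = [
    {"sweep_id": sweep_meta["sweep_id"], "config": sample(sweep_meta["space"], rng)}
    for _ in range(sweep_meta["num_trials"])
]
```

Because every run carries the sweep ID, the group can later be queried as a unit and the search space reconstructed from `sweep_meta` rather than reverse-engineered from the runs.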

Common Confusions

Watch Out

Logging is not the same as tracking

Writing print statements to stdout is logging. Tracking means structured storage that supports querying, comparison, and retrieval. If you cannot programmatically find the run with the best validation loss from last month, you have logging but not tracking.
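The "tracking, not logging" test in code: with structured records, finding the best run is one expression; with print-to-stdout logging it means grepping old console output by hand. Run records here are illustrative.

```python
# The tracking-vs-logging test: retrieve the best run programmatically.
# Run records are illustrative stand-ins for a tracking backend's query API.
runs = [
    {"run_id": "r1", "params": {"lr": 1e-3}, "val_loss": 0.42},
    {"run_id": "r2", "params": {"lr": 3e-4}, "val_loss": 0.31},
    {"run_id": "r3", "params": {"lr": 1e-4}, "val_loss": 0.37},
]

best = min(runs, key=lambda r: r["val_loss"])
```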

Watch Out

Version control is not experiment tracking

Git tracks code versions. Experiment tracking records the mapping from (code version, data version, config) to (metrics, artifacts). You need both. A git commit tells you what code was available; an experiment record tells you which configuration of that code produced which result.

Watch Out

Dashboards are not a substitute for raw logs

Interactive dashboards are useful for exploration but unreliable for archival. Always ensure the underlying data (metrics, configs, artifacts) is stored in a durable format that survives platform migrations. Export to JSON or CSV periodically.

Summary

  • Log everything: hyperparameters, metrics over time, data version, code commit, environment, seeds
  • Use structured config management (Hydra/OmegaConf), not command-line arguments
  • MLflow is open-source and self-hosted; W&B is hosted with richer visualization
  • TensorBoard is a visualization tool, not a complete tracking platform
  • Sweep metadata (search space, strategy) should be tracked alongside runs
  • You cannot reproduce what you did not record

Exercises

ExerciseCore

Problem

You trained a model three months ago that achieved 94% accuracy. You want to reproduce it. You have the code in git and the final accuracy logged. What additional information do you need, and which of it would a proper tracking system have recorded?

ExerciseAdvanced

Problem

You are choosing between MLflow (self-hosted) and W&B (hosted) for a team of 10 ML engineers working on proprietary medical data. List three specific technical factors that should influence this decision, beyond general preference.

References

Canonical:

  • Zaharia et al., "Accelerating the Machine Learning Lifecycle with MLflow" (IEEE Data Engineering Bulletin, 2018)
  • Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization" (JMLR 2012)

Current:

  • Biewald, "Experiment Tracking with Weights and Biases" (2020)
  • Yadan, "Hydra: A Framework for Elegantly Configuring Complex Applications" (2019)
  • Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
  • Shalev-Shwartz and Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Last reviewed: April 2026
