
Weights and Biases for Experiment Tracking

Practitioner reference for wandb: runs, sweeps, artifacts, hyperparameter search modes, and how it compares to MLflow, Neptune, and TensorBoard.


What It Is

Weights and Biases (wandb) is a hosted experiment-tracking service built by the company of the same name, founded in 2017 by Lukas Biewald and Chris Van Pelt. The Python client logs scalars, gradients, system metrics, media, and arbitrary artifacts to a cloud workspace; a web UI compares runs across hyperparameters and metrics in real time.

The core unit is a run: one execution of a training script, identified by a generated id and grouped under a project. Runs can be tagged, joined into groups (e.g. one group per distributed-training job), and gathered into reports for write-ups. Beyond plain logging, wandb provides three layered features: Sweeps (hyperparameter search controllers), Artifacts (versioned datasets and model checkpoints with lineage), and Workspaces (saved chart layouts shared across a team).
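The anatomy of a run can be sketched in a few lines. The project and group names below are illustrative, and the wandb calls are left commented so the scaffolding runs anywhere; uncomment them in an environment where wandb is installed and logged in.

```python
# Metadata a run carries: project, group, tags, and a config dict
# (hypothetical names; adjust to your workspace).
run_settings = {
    "project": "image-classifier",      # groups runs in the web UI
    "group": "ddp-job-42",              # e.g. one group per distributed job
    "tags": ["baseline", "resnet50"],
    "config": {"lr": 3e-4, "batch_size": 128, "epochs": 10},
}

# import wandb
# run = wandb.init(**run_settings)
# for epoch in range(run_settings["config"]["epochs"]):
#     train_loss = train_one_epoch()          # hypothetical training helper
#     wandb.log({"epoch": epoch, "train/loss": train_loss})
# run.finish()
```

Everything logged between init and finish lands in one run; the config dict is what the UI uses for parameter-vs-metric comparison.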

The category is hosted ML observability. Direct competitors are MLflow (open-source, self-hosted, owned by Databricks), Neptune.ai (hosted, lighter UI, stronger metadata model), Comet, and TensorBoard (local, no cross-run UI without TensorBoard.dev which Google sunset in 2023).

When You'd Use It

Use wandb when a project has more than one collaborator, runs more than ten experiments per week, or needs side-by-side parameter-vs-metric comparison. It is also the path of least resistance for distributed training: a single wandb.init per process plus wandb.log calls produces unified per-rank charts.
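One way to wire up the per-process init described above is to derive the group from the job and the run name from the rank. The helper below is a hypothetical sketch (project name and env-var convention assumed; `RANK` matches what launchers like torchrun export).

```python
import os

def rank_init_kwargs(job_name: str, rank: int, world_size: int) -> dict:
    """Build wandb.init(...) kwargs so every rank logs into one group.

    Hypothetical helper: one wandb.init per process, all grouped under
    the same job, so the UI shows unified per-rank charts.
    """
    return {
        "project": "big-model",             # assumed project name
        "group": job_name,                  # one group per distributed job
        "name": f"{job_name}-rank{rank}",   # distinguishes processes
        "config": {"rank": rank, "world_size": world_size},
    }

# On each process (rank comes from the launcher, e.g. the RANK env var):
rank = int(os.environ.get("RANK", 0))
kwargs = rank_init_kwargs("ddp-job-42", rank, world_size=8)
# import wandb; wandb.init(**kwargs)   # then wandb.log as usual
```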

Anti-patterns: do not use wandb as a private long-term metric store on the free tier (academic and personal projects get only 100 GB of storage, and on free plans runs older than roughly two years can become read-throttled). Do not log every tensor at every step of a fast training loop; the client's HTTP backoff will throttle the run. For purely local debugging where you need a chart in the next 30 seconds, TensorBoard or matplotlib is faster.
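The simplest defense against logging-induced throttling is to gate log calls on the step counter. A minimal sketch, with the cadence (`every=50`) chosen arbitrarily:

```python
def should_log(step: int, every: int = 50) -> bool:
    """Gate wandb.log calls to every N steps (hypothetical cadence)."""
    return step % every == 0

# In the training loop you would write:
#   if should_log(step):
#       wandb.log({"train/loss": loss}, step=step)
logged = [s for s in range(200) if should_log(s, every=50)]
# logs at steps 0, 50, 100, 150 instead of all 200
```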

Sweep configs support three search methods: grid (full Cartesian product), random (uniform sampling over the search space), and bayes (a Gaussian-process surrogate optimizing a declared metric). A hyperband early-termination policy (the early_terminate block) can be layered on top of any method and is useful when training cost dominates. Bayesian sweeps need both a metric name and a goal (minimize or maximize) in the config; getting either wrong silently produces random search.
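A sweep config can be written as a Python dict instead of YAML and passed straight to the client. The parameter names and bounds below are illustrative:

```python
# Bayesian sweep configuration (dict form, equivalent to the YAML file).
sweep_config = {
    "method": "bayes",                                    # grid | random | bayes
    "metric": {"name": "val/loss", "goal": "minimize"},   # required for bayes
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [64, 128, 256]},
    },
    # Optional early stopping via hyperband brackets:
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

# import wandb
# sweep_id = wandb.sweep(sweep_config, project="image-classifier")
# wandb.agent(sweep_id, function=train)   # train is your objective function
```

Omitting the metric block, or misspelling the logged metric name, is what silently degrades a bayes sweep to random search.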

Common logging patterns worth memorizing: wandb.watch(model, log="all", log_freq=100) for parameter and gradient histograms, wandb.log({"grad_norm": total_grad_norm}) for stability monitoring, and wandb.Artifact("dataset", type="dataset") for dataset versioning so a run links back to the exact data hash it consumed.
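The grad-norm pattern reduces to an L2 norm over all gradient values. A pure-Python sketch (in PyTorch you would instead use torch.nn.utils.clip_grad_norm_ or sum per-parameter norms):

```python
import math

def global_grad_norm(grads) -> float:
    """L2 norm over all gradient values, flattened.

    Pure-Python sketch: `grads` is a list of per-parameter gradient
    value lists, standing in for a model's parameter gradients.
    """
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

total_grad_norm = global_grad_norm([[3.0, 4.0], [0.0]])   # -> 5.0
# wandb.log({"grad_norm": total_grad_norm})   # the stability-monitoring call
```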

Notable Gotchas

Watch Out

Free tier and public projects

The wandb free tier requires projects to be public. Many users discover this only after logging proprietary hyperparameters. The "personal" plan keeps projects private but caps storage; "teams" pricing scales per seat plus storage. Always set WANDB_MODE=offline for sensitive runs you have not yet decided to upload, then wandb sync later.

Watch Out

Sweeps run agents, not jobs

A wandb sweep does not launch compute; it generates configurations that an agent polls. Forgetting to actually start an agent on a GPU box leaves the sweep stuck in "pending" forever. For multi-GPU sweeps, run one agent per device, each pinned with CUDA_VISIBLE_DEVICES.
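The one-agent-per-GPU pattern can be sketched as building one command and environment override per device; the sweep id below is a placeholder for whatever `wandb sweep` printed.

```python
def agent_commands(sweep_id: str, n_gpus: int):
    """Build (command, env-override) pairs: one wandb agent per GPU,
    each pinned to a single device via CUDA_VISIBLE_DEVICES."""
    return [
        (["wandb", "agent", sweep_id], {"CUDA_VISIBLE_DEVICES": str(gpu)})
        for gpu in range(n_gpus)
    ]

# To actually launch (on a machine with GPUs and the wandb CLI installed):
# import os, subprocess
# for cmd, extra_env in agent_commands("entity/project/abc123", n_gpus=4):
#     subprocess.Popen(cmd, env={**os.environ, **extra_env})
```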


Last reviewed: April 18, 2026
