
Weights and Biases for Experiment Tracking

Practitioner reference for wandb: runs, sweeps, artifacts, hyperparameter search modes, and how it compares to MLflow, Neptune, and TensorBoard.


What It Is

Weights and Biases (wandb) is a hosted experiment-tracking service built by the company of the same name, founded in 2017 by Lukas Biewald and Chris Van Pelt. The Python client logs scalars, gradients, system metrics, media, and arbitrary artifacts to a cloud workspace; a web UI compares runs across hyperparameters and metrics in real time.

The core unit is a run: one execution of a training script, identified by a generated id and grouped under a project. Runs can be tagged, joined into groups (e.g. one group per distributed-training job), and gathered into reports for write-ups. Beyond plain logging, wandb provides three layered features: Sweeps (hyperparameter search controllers), Artifacts (versioned datasets and model checkpoints with lineage), and Workspaces (saved chart layouts shared across a team).
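The anatomy of a run can be sketched in a few lines. The project and group names below are illustrative, and the wandb calls are left commented so the scaffolding runs anywhere; uncomment them in an environment where wandb is installed and logged in.

```python
# Metadata a run carries: project, group, tags, and a config dict
# (hypothetical names; adjust to your workspace).
run_settings = {
    "project": "image-classifier",      # groups runs in the web UI
    "group": "ddp-job-42",              # e.g. one group per distributed job
    "tags": ["baseline", "resnet50"],
    "config": {"lr": 3e-4, "batch_size": 128, "epochs": 10},
}

# import wandb
# run = wandb.init(**run_settings)
# for epoch in range(run_settings["config"]["epochs"]):
#     train_loss = train_one_epoch()          # hypothetical training helper
#     wandb.log({"epoch": epoch, "train/loss": train_loss})
# run.finish()
```

Everything logged between init and finish lands in one run; the config dict is what the UI uses for parameter-vs-metric comparison.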

The category is hosted ML observability. Direct competitors are MLflow (open-source, self-hosted, owned by Databricks), Neptune.ai (hosted, lighter UI, stronger metadata model), Comet, and TensorBoard (local, no cross-run UI without TensorBoard.dev which Google sunset in 2023).

When You'd Use It

Use wandb when a project has more than one collaborator, runs more than ten experiments per week, or needs side-by-side parameter-vs-metric comparison. It is also the path of least resistance for distributed training: a single wandb.init per process plus wandb.log calls produces unified per-rank charts.
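One way to wire up the per-process init described above is to derive the group from the job and the run name from the rank. The helper below is a hypothetical sketch (project name and env-var convention assumed; `RANK` matches what launchers like torchrun export).

```python
import os

def rank_init_kwargs(job_name: str, rank: int, world_size: int) -> dict:
    """Build wandb.init(...) kwargs so every rank logs into one group.

    Hypothetical helper: one wandb.init per process, all grouped under
    the same job, so the UI shows unified per-rank charts.
    """
    return {
        "project": "big-model",             # assumed project name
        "group": job_name,                  # one group per distributed job
        "name": f"{job_name}-rank{rank}",   # distinguishes processes
        "config": {"rank": rank, "world_size": world_size},
    }

# On each process (rank comes from the launcher, e.g. the RANK env var):
rank = int(os.environ.get("RANK", 0))
kwargs = rank_init_kwargs("ddp-job-42", rank, world_size=8)
# import wandb; wandb.init(**kwargs)   # then wandb.log as usual
```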

Anti-patterns: do not use wandb as a private long-term metric store on the free tier (academic and personal projects get only 100 GB of storage, and on free plans runs older than roughly two years can become read-throttled). Do not log every tensor at every step of a fast training loop; the client's HTTP backoff will throttle the run. For purely local debugging where you need a chart in the next 30 seconds, TensorBoard or matplotlib is faster.
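The simplest defense against logging-induced throttling is to gate log calls on the step counter. A minimal sketch, with the cadence (`every=50`) chosen arbitrarily:

```python
def should_log(step: int, every: int = 50) -> bool:
    """Gate wandb.log calls to every N steps (hypothetical cadence)."""
    return step % every == 0

# In the training loop you would write:
#   if should_log(step):
#       wandb.log({"train/loss": loss}, step=step)
logged = [s for s in range(200) if should_log(s, every=50)]
# logs at steps 0, 50, 100, 150 instead of all 200
```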

Sweep configs support three search methods: grid (full Cartesian product), random (uniform sampling over the search space), and bayes (a Gaussian-process surrogate optimizing a declared metric). A hyperband early-termination policy (the early_terminate block) can be layered on top of any method and is useful when training cost dominates. Bayesian sweeps need both a metric name and a goal (minimize or maximize) in the config; getting either wrong silently produces random search.
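A sweep config can be written as a Python dict instead of YAML and passed straight to the client. The parameter names and bounds below are illustrative:

```python
# Bayesian sweep configuration (dict form, equivalent to the YAML file).
sweep_config = {
    "method": "bayes",                                    # grid | random | bayes
    "metric": {"name": "val/loss", "goal": "minimize"},   # required for bayes
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [64, 128, 256]},
    },
    # Optional early stopping via hyperband brackets:
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

# import wandb
# sweep_id = wandb.sweep(sweep_config, project="image-classifier")
# wandb.agent(sweep_id, function=train)   # train is your objective function
```

Omitting the metric block, or misspelling the logged metric name, is what silently degrades a bayes sweep to random search.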

Common logging patterns worth memorizing: wandb.watch(model, log="all", log_freq=100) for parameter and gradient histograms, wandb.log({"grad_norm": total_grad_norm}) for stability monitoring, and wandb.Artifact("dataset", type="dataset") for dataset versioning so a run links back to the exact data hash it consumed.
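The grad-norm pattern reduces to an L2 norm over all gradient values. A pure-Python sketch (in PyTorch you would instead use torch.nn.utils.clip_grad_norm_ or sum per-parameter norms):

```python
import math

def global_grad_norm(grads) -> float:
    """L2 norm over all gradient values, flattened.

    Pure-Python sketch: `grads` is a list of per-parameter gradient
    value lists, standing in for a model's parameter gradients.
    """
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

total_grad_norm = global_grad_norm([[3.0, 4.0], [0.0]])   # -> 5.0
# wandb.log({"grad_norm": total_grad_norm})   # the stability-monitoring call
```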

Notable Gotchas

Watch Out

Free tier and public projects

The wandb free tier requires projects to be public. Many users discover this only after logging proprietary hyperparameters. The "personal" plan keeps projects private but caps storage; "teams" pricing scales per seat plus storage. Always set WANDB_MODE=offline for sensitive runs you have not yet decided to upload, then wandb sync later.

Watch Out

Sweeps run agents, not jobs

A wandb sweep does not launch compute; it generates configurations that an agent polls. Forgetting to actually start an agent on a GPU box leaves the sweep stuck in "pending" forever. For multi-GPU sweeps, run one agent per device, each pinned with CUDA_VISIBLE_DEVICES.
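The one-agent-per-GPU pattern can be sketched as building one command and environment override per device; the sweep id below is a placeholder for whatever `wandb sweep` printed.

```python
def agent_commands(sweep_id: str, n_gpus: int):
    """Build (command, env-override) pairs: one wandb agent per GPU,
    each pinned to a single device via CUDA_VISIBLE_DEVICES."""
    return [
        (["wandb", "agent", sweep_id], {"CUDA_VISIBLE_DEVICES": str(gpu)})
        for gpu in range(n_gpus)
    ]

# To actually launch (on a machine with GPUs and the wandb CLI installed):
# import os, subprocess
# for cmd, extra_env in agent_commands("entity/project/abc123", n_gpus=4):
#     subprocess.Popen(cmd, env={**os.environ, **extra_env})
```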


Last reviewed: April 18, 2026
