
Infrastructure

Git and GitLab for ML Research

Reference for version control in ML projects: branching, rebase, monorepo vs multirepo, GitLab CI for training, Git LFS vs DVC for data, and why most ML teams use GitHub plus a model registry.

Core · Tier 3 · Current · ~12 min

What It Is

Git is a distributed version control system, created by Linus Torvalds in 2005. Every clone is a full repository: the entire history, every branch, every commit lives on disk. Operations are local and fast; pushing to a remote (origin, typically) is how clones synchronize. The unit of change is a commit, identified by a SHA-1 hash over its tree (the file snapshot), its parent commits, and its author and message metadata.
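
This structure is easy to inspect in a throwaway repo (paths and messages below are arbitrary):

```shell
# Scratch repo; everything here is illustrative.
rm -rf /tmp/git-demo && mkdir -p /tmp/git-demo && cd /tmp/git-demo
git init -q -b main
git config user.email demo@example.com
git config user.name Demo

echo "hello" > a.txt
git add a.txt && git commit -qm "first"
echo "world" >> a.txt
git commit -qam "second"

# The raw commit object: a tree hash, a parent hash, author, message.
# Change any one of these and the commit's own SHA changes with it.
git cat-file -p HEAD
```

The `parent` line is why history is a DAG: each commit's identity depends on its ancestors, so rewriting any commit rewrites everything after it.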

The branching model: a branch is a movable pointer to a commit; merge combines two branches and produces a merge commit with two parents; rebase replays a sequence of commits onto a different base and rewrites their hashes. Conflicts arise when both sides have modified the same region of the same file; Git marks the conflict with <<<<<<<, =======, >>>>>>> and waits for a manual edit.
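
The conflict markers are easy to reproduce in a scratch repo (branch and file names below are made up):

```shell
rm -rf /tmp/conflict-demo && mkdir /tmp/conflict-demo && cd /tmp/conflict-demo
git init -q -b main
git config user.email demo@example.com
git config user.name Demo

echo "learning rate: 1e-3" > config.txt
git add config.txt && git commit -qm "base"

git switch -qc feature
echo "learning rate: 3e-4" > config.txt     # feature lowers the lr one way
git commit -qam "feature: lower lr"

git switch -q main
echo "learning rate: 1e-4" > config.txt     # main lowers it differently
git commit -qam "main: lower lr"

git merge feature || true    # both sides edited the same line: CONFLICT
cat config.txt               # shows <<<<<<< HEAD / ======= / >>>>>>> feature
```

Resolution is a manual edit of the marked region followed by `git add` and `git commit`.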

GitLab and GitHub are competing hosted Git platforms with built-in CI/CD, code review, and issue tracking. GitLab self-hosts more cleanly (the Omnibus install) and bundles Container Registry and CI runners; GitHub is dominant in open source and has the larger third-party action ecosystem. Both run pipelines from a YAML file in the repo (.gitlab-ci.yml or .github/workflows/*.yml).

When You'd Use It

For ML research, the recurring decisions are repository layout, large-file handling, and CI scope.

Monorepo vs multirepo: a monorepo (one repo for data pipelines, training code, evaluation harness, deployment scripts) wins for cross-cutting refactors and atomic commits across components. Multirepo wins when teams have separate release cadences. ML research projects under 5 contributors almost always benefit from a monorepo.
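
A typical single-repo layout for a small research project might look like this (directory names are illustrative, not prescriptive):

```
repo/
├── data/          # DVC pointer files, not raw data
├── training/      # model code, configs, launch scripts
├── eval/          # evaluation harness
├── deploy/        # serving / export scripts
└── pyproject.toml # one shared environment
```

The payoff is that a refactor touching the training loop and the eval harness lands as one atomic, reviewable commit.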

Large files: Git handles binaries poorly. It stores every version of a file as a full blob; the pack format delta-compresses text well, but delta compression degrades on binary data, so repositories with churning binaries balloon. Git LFS replaces large files with small text pointers and stores the blobs on a separate server. LFS works for occasional binaries (logos, small fixtures) but is the wrong tool for model checkpoints or training data: bandwidth costs add up, history rewriting is painful, and there is no concept of dataset versioning.
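
For reference, the pointer file that LFS checks into Git is just three lines of text (the oid and size below are placeholders, not real values):

```
version https://git-lfs.github.com/spec/v1
oid sha256:0000000000000000000000000000000000000000000000000000000000000000
size 133574
```

Git versions this tiny file; the actual blob lives on the LFS server, fetched on checkout.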

DVC (Data Version Control, by Iterative.ai) is purpose-built for ML artifacts. It commits a small .dvc file to Git that points to a content-addressed blob in object storage (S3, GCS, Azure Blob, SSH). Pipeline stages declared in dvc.yaml form a DAG with input/output hashes, so dvc repro re-executes only the stages whose inputs changed. For any project with model checkpoints over ~100 MB or datasets that change, DVC is the right answer.
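
The pointer DVC commits to Git is a small YAML file; a sketch of the shape (hash, size, and filename are placeholders):

```
# model.ckpt.dvc -- committed to Git. The checkpoint itself lives in the
# DVC remote (S3/GCS/...), addressed by its content hash.
outs:
- md5: 00000000000000000000000000000000
  size: 1474560512
  path: model.ckpt
```

`git checkout` of an old commit plus `dvc pull` restores exactly the checkpoint that commit was trained against.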

GitLab CI for ML training: GitLab CI can launch GPU runners and run training as a pipeline stage, which is useful for nightly evaluation and integration tests. As of 2025-2026 most production ML teams have moved to a split: GitHub for code review and unit-test CI, and a model registry / orchestrator (W&B Model Registry, MLflow Model Registry, Hugging Face Hub, Modal, Argo Workflows) for actual training and deployment. GitLab CI runners do not handle multi-day jobs gracefully, and model lineage tracking in CI pipelines is weak.
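
A minimal nightly-evaluation job in .gitlab-ci.yml might look like this (the `gpu` runner tag, the script, and the timeout are assumptions about your setup, not a recipe):

```
nightly-eval:
  stage: test
  tags: [gpu]              # route to a GPU-equipped runner you registered
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # triggered by a pipeline schedule
  timeout: 4h              # runner job caps are why multi-day training doesn't fit here
  script:
    - pip install -r requirements.txt
    - python eval.py --checkpoint checkpoints/latest.ckpt   # hypothetical script
```

The `timeout` line is the crux: CI jobs are bounded and restartable-from-zero, which is exactly the wrong shape for long training runs.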

Pre-commit hooks: the pre-commit framework (a Python tool) runs linters, formatters, and secret scanners before each commit. Standard ML setup: ruff (lint and format), mypy (type check), nbstripout (strip notebook outputs to keep diffs reviewable), and detect-secrets or trufflehog for credential scanning.
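
That stack translates into a .pre-commit-config.yaml along these lines (pin `rev` to releases you have verified; the versions below are examples, not recommendations):

```
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9            # example pin -- check for the current release
    hooks:
      - id: ruff
      - id: ruff-format
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2
    hooks:
      - id: mypy
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout
```

After `pre-commit install`, every `git commit` runs the hooks and blocks the commit on failure.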

Notable Gotchas

Watch Out

git rebase rewrites hashes

Rebasing a branch that someone else has pulled creates two parallel histories and a recurring "your branch and origin have diverged" message. Rule of thumb: rebase only branches that exist on your machine. Once a branch is shared, use merge commits. Force-push (--force-with-lease) on a personal feature branch is fine; force-push on main is destructive.
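
You can watch the hashes change in a scratch repo (names below are arbitrary):

```shell
rm -rf /tmp/rebase-demo && mkdir /tmp/rebase-demo && cd /tmp/rebase-demo
git init -q -b main
git config user.email demo@example.com
git config user.name Demo

echo base > f.txt && git add f.txt && git commit -qm "base"
git switch -qc feature
echo feat > g.txt && git add g.txt && git commit -qm "feature work"
git rev-parse HEAD > before.txt          # hash of the feature commit

git switch -q main
echo more >> f.txt && git commit -qam "main moves on"

git switch -q feature
git rebase -q main                       # replay "feature work" onto new main
git rev-parse HEAD > after.txt           # same change, different hash

diff before.txt after.txt || echo "hashes differ: anyone who pulled the old one now diverges"
```

The rebased commit has a new parent, so its SHA is new; a collaborator holding the old SHA now has a history Git cannot fast-forward.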

Watch Out

Notebook diffs make code review painful

A .ipynb file is JSON containing both code and outputs (including base64-encoded images). A one-line code change can produce a 10,000-line diff. Always run nbstripout --install in any repo that hosts notebooks, or use Jupytext to pair each notebook with a .py file that becomes the version-controlled artifact.
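
The effect of stripping is easy to see on a hand-made minimal notebook (the inline Python mimics what nbstripout does; it is an illustration, not the tool itself):

```shell
rm -rf /tmp/nb-demo && mkdir /tmp/nb-demo && cd /tmp/nb-demo

# Minimal one-cell notebook with a stored output.
cat > demo.ipynb <<'EOF'
{"cells": [{"cell_type": "code", "execution_count": 3, "metadata": {},
  "source": ["1 + 1"],
  "outputs": [{"output_type": "execute_result", "execution_count": 3,
               "data": {"text/plain": ["2"]}, "metadata": {}}]}],
 "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
EOF

# Strip outputs and execution counts -- the parts of the JSON that churn
# on every re-run and bloat diffs.
python3 - <<'EOF'
import json
with open("demo.ipynb") as f:
    nb = json.load(f)
for cell in nb["cells"]:
    if cell["cell_type"] == "code":
        cell["outputs"] = []
        cell["execution_count"] = None
with open("demo.ipynb", "w") as f:
    json.dump(nb, f, indent=1)
EOF

grep execute_result demo.ipynb || echo "stored outputs removed"
```

With outputs gone, a diff of the notebook shows only the source change, which is what a reviewer needs to see.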


Last reviewed: April 18, 2026
