
Infrastructure

Docker and Containers for ML

Docker fundamentals for ML practitioners: image, layer cache, multi-stage build, GPU passthrough via NVIDIA Container Toolkit, and the security gotchas (root user, secrets in layers).

Core · Tier 3 · Current · ~12 min

What It Is

A Docker container is a process running in isolated Linux namespaces (PID, network, mount, UTS, IPC, user) with cgroup-based resource limits, started from a read-only filesystem image. The image is a stack of layers, each a tar of filesystem changes. Layers are content-addressed by SHA-256, so identical layers are cached and shared across images. A Dockerfile is the recipe; each RUN, COPY, or ADD line creates one new layer. The build cache reuses any layer whose inputs (preceding layer plus instruction text) are unchanged, which is why the order of Dockerfile instructions matters for build speed.
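A minimal sketch of cache-friendly instruction ordering (assuming a pip-based project with a `requirements.txt`; filenames are illustrative) — the dependency manifest is copied before the source tree, so editing code invalidates only the final cheap layer, not the expensive install layer:

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11-slim

WORKDIR /app

# Changes rarely -> this layer and the install below stay cached
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Changes on every edit -> keep it last so only this layer rebuilds
COPY . .

CMD ["python", "train.py"]
```

Reversing the two COPY steps would force a full dependency reinstall on every source change, because the cache is invalidated from the first changed layer onward.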

Multi-stage builds (FROM ... AS builder followed by a second FROM whose instructions COPY --from=builder only the artifacts they need) separate the build environment from the runtime image. The builder stage installs CUDA toolchains, compilers, and full Python development headers; the runtime stage copies only the compiled wheels and shared libraries it needs. A typical PyTorch training image can shrink from roughly 8 GB to 2 GB this way. BuildKit is the modern Docker build engine (the default since Docker 23.0, released in 2023), adding parallel stage execution, build secrets, and a richer cache-mount API.
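A sketch of the devel-builds/runtime-copies split using the CUDA base images mentioned below (the `--target` install dir and `train.py` are illustrative):

```dockerfile
# syntax=docker/dockerfile:1
# Builder stage: full CUDA toolchain and headers for compiling wheels
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/opt/deps -r requirements.txt

# Runtime stage: only the CUDA runtime libraries plus the installed packages
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /opt/deps /opt/deps
ENV PYTHONPATH=/opt/deps
COPY train.py .
CMD ["python3", "train.py"]
```

Everything installed in the builder stage (compilers, `nvcc`, dev headers) is discarded; only `/opt/deps` crosses into the final image.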

Containers vs VMs: a container shares the host kernel and starts in milliseconds; a VM virtualizes hardware and boots a full guest OS in seconds. Containers trade isolation for speed and density. For ML, this matters mostly for one reason: GPU passthrough. The NVIDIA Container Toolkit (formerly nvidia-docker2) installs a runtime hook that mounts the host's NVIDIA driver libraries and /dev/nvidia* device files into the container, so a container built on nvidia/cuda:12.4.0-runtime-ubuntu22.04 can use the host GPU with docker run --gpus all.
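The same GPU request can be declared in Compose rather than on the command line. A minimal sketch equivalent to `docker run --gpus all` (service name and command are illustrative):

```yaml
# compose.yaml — equivalent of `docker run --gpus all`
services:
  train:
    image: nvidia/cuda:12.4.0-runtime-ubuntu22.04
    command: nvidia-smi   # lists the host GPUs if the toolkit hook is installed
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

If `nvidia-smi` fails inside the container, the usual culprits are a missing NVIDIA Container Toolkit on the host or a host driver older than the image's CUDA major version.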

Common base images: nvidia/cuda (official, devel and runtime variants per CUDA version), pytorch/pytorch (PyTorch + CUDA preinstalled), python:3.11-slim (~150 MB), gcr.io/distroless/python3 (no shell, minimal attack surface for production serving). Match the CUDA major version of the base image to the host driver.

When You'd Use It

Use a container any time the deployment target is not the laptop the model was trained on: cloud GPU instances, on-prem clusters, CI runners. Bind-mount source code (-v $(pwd):/workspace) for active development so edits are live without rebuilding; use named volumes for persistent data (model checkpoints, datasets). Push images to a registry — Docker Hub, GitHub Container Registry (GHCR), Amazon ECR, or Google Artifact Registry — to share with cluster runtimes (Kubernetes, ECS, Modal).
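The bind-mount-for-code, named-volume-for-data split can be sketched in Compose (paths, service name, and volume name are illustrative):

```yaml
# compose.yaml for local development
services:
  dev:
    image: pytorch/pytorch
    working_dir: /workspace
    volumes:
      - .:/workspace            # bind mount: edits on the host are live, no rebuild
      - checkpoints:/ckpts      # named volume: survives container removal
    command: python train.py

volumes:
  checkpoints:
```

The bind mount shadows whatever the image baked into `/workspace`, which is exactly what you want during development and exactly what you don't want in production.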

Notable Gotchas

Watch Out

Secrets in layers leak even after deletion

COPY .env /app/.env followed by RUN rm /app/.env does not remove the secret from the image. The deleted file still exists in the layer that added it; anyone who pulls the image can run docker history and extract it. Use BuildKit's --mount=type=secret,id=mysecret (the secret is mounted only during that RUN and never written to a layer), or pass secrets as runtime environment variables, never bake them into the build.
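A sketch of the BuildKit secret mount (the secret id, index URL, and package name are illustrative):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11-slim
# The secret is mounted at /run/secrets/pip_token only for the duration of
# this RUN; it is never written to a layer and never shows in `docker history`.
RUN --mount=type=secret,id=pip_token \
    pip install --no-cache-dir \
      --index-url "https://user:$(cat /run/secrets/pip_token)@pypi.example.com/simple" \
      somepackage
```

Built with `docker build --secret id=pip_token,src=token.txt .` — the file on the host supplies the value, and the resulting image contains no trace of it.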

Watch Out

Containers run as root by default, which is wrong for production

The default USER is root, and a root process in a container started with --privileged or with host-path mounts can modify the host filesystem. Add RUN useradd -m app followed by a USER app instruction to the Dockerfile to drop privileges. Distroless and Chainguard images run as non-root by default. Kubernetes Pod Security Standards and runAsNonRoot: true in the pod's securityContext enforce this at the orchestrator level.
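A minimal non-root sketch (user name and entrypoint are illustrative):

```dockerfile
FROM python:3.11-slim
RUN useradd --create-home app
WORKDIR /home/app
COPY --chown=app:app . .
USER app
# Everything after USER runs unprivileged
CMD ["python", "serve.py"]
```

Note that `USER` is its own Dockerfile instruction, not something chained inside a RUN; the `--chown` on COPY matters too, since files copied as root would otherwise be unwritable by the app user.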



Last reviewed: April 18, 2026
