Infrastructure
Kubernetes for ML Workloads
Kubernetes pod / deployment / service model, GPU scheduling via device plugins, and the ML-specific layers (Kubeflow, KServe, Volcano). Standard for production model serving; overkill for solo research.
What It Is
Kubernetes is a cluster orchestrator that runs containerized workloads across a fleet of machines. The core abstractions: a pod is one or more containers scheduled together on a node; a deployment is a controller that maintains a target replica count of identical pods; a service is a stable virtual IP and DNS name that load-balances across the pods behind it; a namespace is a logical partition for naming, RBAC, and quota. The control plane (API server, scheduler, controller manager, etcd) reconciles desired state from YAML manifests against observed cluster state. kubectl is the CLI; kubectl apply -f deploy.yaml and kubectl get pods are the two commands you run constantly.
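The deployment/service pairing above can be sketched as a minimal manifest. All names, the image, and the ports here are placeholders, not a recommended configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server            # placeholder name
spec:
  replicas: 3                   # controller keeps three identical pods running
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/model-server:v1   # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: model-server            # stable DNS: model-server.<namespace>.svc.cluster.local
spec:
  selector:
    app: model-server           # load-balances across pods carrying this label
  ports:
    - port: 80
      targetPort: 8080
```

Saved as deploy.yaml, kubectl apply -f deploy.yaml submits the desired state and the control plane reconciles toward it; kubectl get pods shows the replicas it created.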
GPU scheduling works through the device plugin interface. The NVIDIA k8s-device-plugin runs as a DaemonSet, advertises nvidia.com/gpu: <count> as a schedulable resource on each node, and exposes the host devices to pods that request the resource. A pod spec with resources.limits: { "nvidia.com/gpu": 1 } will only be scheduled on a node with a free GPU. Multi-Instance GPU (MIG) on A100/H100 lets one physical GPU advertise as multiple smaller resources; time-slicing and MPS are the alternatives when MIG is not available.
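A GPU request in a pod spec looks like the sketch below. The pod name, image, and command are placeholders; note that GPU resources are requested under limits only (the request is implied and cannot differ from the limit):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer             # placeholder name
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # any CUDA-enabled image works
      command: ["python", "train.py"]           # placeholder entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1     # scheduler places this pod only on a node with a free GPU
```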
ML-specific layers sit on top of vanilla Kubernetes. Kubeflow is an umbrella project for ML pipelines (Argo-based DAGs), notebook serving, and hyperparameter tuning (Katib). KServe (formerly KFServing) is a serverless model-serving framework with autoscaling, canary rollouts, and built-in support for TensorFlow Serving, TorchServe, and Triton. Volcano adds gang scheduling (all-or-nothing pod groups, needed for distributed training where one missing worker stalls the job), queue priorities, and fair sharing.
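For orientation, a KServe InferenceService manifest is roughly this shape (a sketch against the v1beta1 API; the name, model format, and storage URI are placeholders, and the exact schema should be checked against your KServe version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn           # placeholder; e.g. tensorflow, pytorch, triton
      storageUri: gs://my-bucket/model   # placeholder bucket path
```

KServe pulls the model from storageUri, picks a serving runtime matching modelFormat, and manages autoscaling (including scale-to-zero) and canary traffic splits for you.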
When You'd Use It
Kubernetes is the right answer for production model serving at a company with multiple services, a platform team, and traffic that needs autoscaling, blue-green deploys, and observability. It is the wrong answer for a solo researcher running training jobs: the YAML overhead and cluster operating cost dwarf the benefit when the alternative is python train.py on a rented GPU box.
Lightweight alternatives. HashiCorp Nomad is a simpler orchestrator with a single binary. AWS ECS and Google Cloud Run are managed container runtimes; Cloud Run scales to zero and is the cheapest for spiky low-volume serving. Modal and Beam are serverless GPU platforms targeted at ML; you write a Python function with hardware requirements and they handle scheduling, scaling, and cold start.
Notable Gotchas
GPU pods cannot share a GPU by default
The NVIDIA device plugin allocates whole GPUs by default. Two pods both requesting nvidia.com/gpu: 1 will land on different GPUs, and a pod requesting one GPU will block all other GPU requests on that device until it terminates. To share, enable MIG on supported GPUs (A100, H100), enable time-slicing in the device plugin config, or run NVIDIA MPS. Without one of these, an inference pod that uses 10 percent of a GPU still occupies the whole device on the cluster's accounting.
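Time-slicing is enabled through the device plugin's config file. A sketch of that config, assuming a recent k8s-device-plugin release (key names should be verified against the plugin version you run):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4             # each physical GPU is advertised as 4 schedulable slices
```

With replicas: 4, a node with one physical GPU advertises nvidia.com/gpu: 4, so four pods each requesting one unit can land on it. The slices provide no memory or fault isolation; processes simply time-share the device.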
Distributed training without a gang scheduler deadlocks under contention
A four-worker PyTorch DDP job needs all four pods running before any can make progress. Kubernetes' default scheduler places pods one at a time. Under contention, three pods may land while the fourth waits for a node, holding three GPUs idle. Volcano, Kueue, or YuniKorn implement gang scheduling: the four pods are scheduled atomically or not at all. This is essentially mandatory for any multi-node training on a shared cluster.
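With Volcano, the all-or-nothing constraint is expressed as a PodGroup; a sketch (the group name is a placeholder, and the API version should be checked against your Volcano release):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ddp-job                 # placeholder name
spec:
  minMember: 4                  # schedule all four worker pods atomically, or none
```

The four worker pods then opt in by setting schedulerName: volcano and referencing the group in their annotations; Volcano admits the set only when resources for all minMember pods are available, so no GPUs sit idle behind a partially scheduled job.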
References
Related Topics
Last reviewed: April 18, 2026