Infrastructure
Kubernetes for ML Workloads
Kubernetes pod / deployment / service model, GPU scheduling via device plugins, and the ML-specific layers (Kubeflow, KServe, Volcano). Standard for production model serving; overkill for solo research.
What It Is
Kubernetes is a cluster orchestrator that runs containerized workloads across a fleet of machines. The core abstractions: a pod is one or more containers scheduled together on a node; a deployment is a controller that maintains a target replica count of identical pods; a service is a stable virtual IP and DNS name that load-balances across the pods behind it; a namespace is a logical partition for naming, RBAC, and quota. The control plane (API server, scheduler, controller manager, etcd) reconciles desired state from YAML manifests against observed cluster state. kubectl is the CLI; kubectl apply -f deploy.yaml and kubectl get pods are the two commands you run constantly.
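The deployment/service pairing above can be sketched as a minimal manifest. All names, the image, and the ports here are placeholders, not a recommended configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server            # placeholder name
spec:
  replicas: 3                   # controller keeps three identical pods running
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/model-server:v1   # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: model-server            # stable DNS: model-server.<namespace>.svc.cluster.local
spec:
  selector:
    app: model-server           # load-balances across pods carrying this label
  ports:
    - port: 80
      targetPort: 8080
```

Saved as deploy.yaml, kubectl apply -f deploy.yaml submits the desired state and the control plane reconciles toward it; kubectl get pods shows the replicas it created.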
GPU scheduling works through the device plugin interface. The NVIDIA k8s-device-plugin runs as a DaemonSet, advertises nvidia.com/gpu: <count> as a schedulable resource on each node, and exposes the host devices to pods that request the resource. A pod spec with resources.limits: { "nvidia.com/gpu": 1 } will only be scheduled on a node with a free GPU. Multi-Instance GPU (MIG) on A100/H100 lets one physical GPU advertise as multiple smaller resources; time-slicing and MPS are the alternatives when MIG is not available.
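A GPU request in a pod spec looks like the sketch below. The pod name, image, and command are placeholders; note that GPU resources are requested under limits only (the request is implied and cannot differ from the limit):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer             # placeholder name
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # any CUDA-enabled image works
      command: ["python", "train.py"]           # placeholder entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1     # scheduler places this pod only on a node with a free GPU
```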
ML-specific layers sit on top of vanilla Kubernetes. Kubeflow is an umbrella project for ML pipelines (Argo-based DAGs), notebook serving, and hyperparameter tuning (Katib). KServe (formerly KFServing) is a serverless model-serving framework with autoscaling, canary rollouts, and built-in support for TensorFlow Serving, TorchServe, and Triton. Volcano adds gang scheduling (all-or-nothing pod groups, needed for distributed training where one missing worker stalls the job), queue priorities, and fair sharing.
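For orientation, a KServe InferenceService manifest is roughly this shape (a sketch against the v1beta1 API; the name, model format, and storage URI are placeholders, and the exact schema should be checked against your KServe version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn           # placeholder; e.g. tensorflow, pytorch, triton
      storageUri: gs://my-bucket/model   # placeholder bucket path
```

KServe pulls the model from storageUri, picks a serving runtime matching modelFormat, and manages autoscaling (including scale-to-zero) and canary traffic splits for you.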
When You'd Use It
Kubernetes is the right answer for production model serving at a company with multiple services, a platform team, and traffic that needs autoscaling, blue-green deploys, and observability. It is the wrong answer for a solo researcher running training jobs: the YAML overhead and cluster operating cost dwarf the benefit when the alternative is python train.py on a rented GPU box.
Lightweight alternatives. HashiCorp Nomad is a simpler orchestrator with a single binary. AWS ECS and Google Cloud Run are managed container runtimes; Cloud Run scales to zero and is the cheapest for spiky low-volume serving. Modal and Beam are serverless GPU platforms targeted at ML; you write a Python function with hardware requirements and they handle scheduling, scaling, and cold start.
Notable Gotchas
GPU pods cannot share a GPU by default
The NVIDIA device plugin allocates whole GPUs by default. Two pods both requesting nvidia.com/gpu: 1 will land on different GPUs, and a pod requesting one GPU will block all other GPU requests on that device until it terminates. To share, enable MIG on supported GPUs (A100, H100), enable time-slicing in the device plugin config, or run NVIDIA MPS. Without one of these, an inference pod that uses 10 percent of a GPU still occupies the whole device on the cluster's accounting.
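Time-slicing is enabled through the device plugin's config file. A sketch of that config, assuming a recent k8s-device-plugin release (key names should be verified against the plugin version you run):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4             # each physical GPU is advertised as 4 schedulable slices
```

With replicas: 4, a node with one physical GPU advertises nvidia.com/gpu: 4, so four pods each requesting one unit can land on it. The slices provide no memory or fault isolation; processes simply time-share the device.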
Distributed training without a gang scheduler deadlocks under contention
A four-worker PyTorch DDP job needs all four pods running before any can make progress. Kubernetes' default scheduler places pods one at a time. Under contention, three pods may land while the fourth waits for a node, holding three GPUs idle. Volcano, Kueue, or YuniKorn implement gang scheduling: the four pods are scheduled atomically or not at all. This is essentially mandatory for any multi-node training on a shared cluster.
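With Volcano, the all-or-nothing constraint is expressed as a PodGroup; a sketch (the group name is a placeholder, and the API version should be checked against your Volcano release):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ddp-job                 # placeholder name
spec:
  minMember: 4                  # schedule all four worker pods atomically, or none
```

The four worker pods then opt in by setting schedulerName: volcano and referencing the group in their annotations; Volcano admits the set only when resources for all minMember pods are available, so no GPUs sit idle behind a partially scheduled job.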
References
Related Topics
Last reviewed: April 18, 2026