
GPU Kubernetes Consulting

Running GPU workloads on Kubernetes requires specialized expertise that goes far beyond standard container orchestration. We help teams design, deploy, and operate production GPU clusters on Kubernetes, whether on EKS, GKE, AKS, or bare metal.

70%+ GPU Utilization
3x Better Resource Efficiency
Zero Scheduling Conflicts

The Challenge of GPUs on Kubernetes

Kubernetes was designed for stateless web services, not GPU-intensive AI workloads. While the Kubernetes ecosystem has evolved to support GPUs, getting it right requires deep understanding of device plugins, resource scheduling, network configuration, and storage systems that most platform teams simply do not have. The result is GPU clusters running at 20-30% utilization, teams waiting days for GPU access, and training jobs failing due to misconfigured infrastructure.

The core challenges include: GPU resource fragmentation, where expensive accelerators sit idle while other teams queue for access; schedulers with no understanding of GPU topology and affinity; missing observability into GPU health and utilization; networking that cannot meet the RDMA requirements of distributed training; and storage systems that cannot feed data fast enough to keep GPUs busy. Solving these requires a purpose-built approach to Kubernetes GPU infrastructure.

What We Deliver

GPU Operator and Device Plugin Setup

The NVIDIA GPU Operator automates the management of GPU drivers, container runtime, device plugins, and monitoring components on Kubernetes. We deploy and configure the full GPU Operator stack: driver containers for consistent driver versions across nodes, the NVIDIA Container Toolkit for GPU access in containers, the Kubernetes device plugin that advertises GPU resources, DCGM Exporter for GPU metrics in Prometheus, and GPU Feature Discovery for topology-aware scheduling labels. We also handle the edge cases that cause failures in production: driver version conflicts, containerd versus CRI-O runtime differences, Secure Boot compatibility, and upgrade strategies that avoid disrupting running workloads.
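Once the operator stack is healthy, workloads consume GPUs through the standard extended-resource request. A minimal smoke-test pod might look like the following; the pod name and image tag are illustrative, not a recommendation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test          # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example CUDA base image
    command: ["nvidia-smi"]      # prints driver and GPU info if the stack works
    resources:
      limits:
        nvidia.com/gpu: 1        # resource advertised by the device plugin
```

If the device plugin, container toolkit, and drivers are correctly installed, this pod schedules onto a GPU node and `nvidia-smi` succeeds; if any layer is broken, it fails fast, which makes it a useful post-upgrade check.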

GPU Scheduling with KAI Scheduler

The default Kubernetes scheduler treats GPUs as simple integer resources with no understanding of topology, affinity, or fairness. For multi-tenant GPU clusters, you need a scheduler that understands GPU-specific requirements. We implement advanced GPU scheduling using KAI Scheduler or similar solutions, providing: topology-aware placement that keeps a job's GPUs on the same NVLink domain; fair-share scheduling across teams with guaranteed minimums and burst capacity; gang scheduling for distributed training jobs that require all-or-nothing allocation; preemption policies that let high-priority jobs reclaim resources from lower-priority workloads; and queue management with priority classes and resource quotas per namespace or team.
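The mechanics of opting a workload into such a scheduler are simple: the pod names the alternate scheduler and is tagged with its queue. The sketch below follows the conventions of the open-source KAI Scheduler, but the exact label key and scheduler name vary by version and deployment, so treat every identifier here as an assumption to verify against your install:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker            # illustrative name
  labels:
    kai.scheduler/queue: team-a   # assumed queue label; check your scheduler's docs
spec:
  schedulerName: kai-scheduler    # route this pod past the default scheduler
  containers:
  - name: trainer
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime  # example image
    resources:
      limits:
        nvidia.com/gpu: 8
```

Queues, quotas, and preemption policies are then defined cluster-side in the scheduler's own CRDs rather than on individual pods.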

MIG and Fractional GPU Sharing

Not every workload needs a full GPU. Development, testing, inference serving, and data preprocessing can often run on a fraction of a GPU. We configure Multi-Instance GPU (MIG) on A100 and H100 GPUs to partition a single GPU into isolated instances with dedicated compute, memory, and memory bandwidth. We also set up time-slicing for GPUs that do not support MIG, enabling multiple workloads to share a GPU with configurable time quotas. The right partitioning strategy depends on your workload mix. We analyze your actual GPU usage patterns to recommend the optimal MIG profiles and sharing policies that maximize utilization without impacting performance-sensitive workloads.
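For GPUs without MIG support, time-slicing is configured through the NVIDIA device plugin. A sketch of the sharing config, following the pattern in NVIDIA's documentation (the ConfigMap name, namespace, config key, and replica count are all illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config      # illustrative name
  namespace: gpu-operator        # assumed operator namespace
data:
  any: |-                        # config key is illustrative
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4            # each physical GPU is advertised as 4 schedulable GPUs
```

With this applied, four pods each requesting `nvidia.com/gpu: 1` can land on one physical GPU. Note that unlike MIG, time-slicing provides no memory or fault isolation between sharers, which is why we match the sharing strategy to the workload mix.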

EKS, GKE, and Bare-Metal GPU Clusters

Each deployment environment has its own set of challenges and best practices. On Amazon EKS, we configure GPU node groups with proper AMIs, EFA networking for distributed training, and integration with FSx for Lustre or S3 for training data. On Google GKE, we set up GPU node pools with the correct machine types, configure multi-networking for RDMA, and integrate with GCS and Filestore. For bare-metal deployments, we handle everything from the OS and driver installation through the full Kubernetes stack, including Calico or Cilium networking, MetalLB for load balancing, and Rook-Ceph or similar for storage. Bare metal provides the best performance for GPU workloads, especially when combined with RDMA networking and GPUDirect Storage.
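On EKS, for example, much of this reduces to getting the node group definition right. A hedged eksctl sketch (cluster name, region, instance type, and sizes are illustrative) showing an EFA-enabled GPU node group:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gpu-cluster              # illustrative name
  region: us-east-1              # illustrative region
managedNodeGroups:
- name: p4d-workers
  instanceType: p4d.24xlarge     # 8x A100 instance, example choice
  minSize: 0
  maxSize: 4
  efaEnabled: true               # attach EFA interfaces for low-latency collectives
  availabilityZones: ["us-east-1a"]  # EFA-enabled groups must sit in a single AZ
```

eksctl selects a GPU-capable AMI for GPU instance types; the GPU Operator or device plugin still needs to be deployed on top before pods can request `nvidia.com/gpu`.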

Multi-Tenancy and Resource Management

Running multiple teams on a shared GPU cluster requires proper isolation and resource management. We implement namespace-based tenancy with GPU resource quotas, RBAC policies that restrict GPU access to authorized teams, network policies for workload isolation, priority classes that ensure production inference workloads are never preempted by development jobs, and cost allocation through GPU usage tracking and chargeback reporting. This enables organizations to consolidate GPU resources into a shared pool that delivers higher utilization and lower costs than dedicated per-team clusters.
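GPU quotas per tenant use the standard Kubernetes ResourceQuota mechanism, since `nvidia.com/gpu` is an extended resource. A minimal example (namespace, quota name, and cap are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                # illustrative name
  namespace: team-a              # assumed tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8" # hard cap on GPUs this namespace may request
```

Pods in `team-a` whose combined GPU requests would exceed eight are rejected at admission time, which keeps one team from starving the shared pool regardless of scheduler behavior.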

Training Job Orchestration

Running distributed training jobs on Kubernetes requires orchestration beyond what standard Kubernetes controllers provide. We set up and configure the Kubeflow Training Operator for PyTorchJob, TFJob, and MPIJob resources, Volcano batch scheduler for gang scheduling and queue management, custom job controllers for organization-specific workflows, and integration with experiment tracking systems like MLflow and Weights & Biases. We ensure that distributed training jobs are properly configured with the correct number of workers, appropriate resource requests, RDMA-capable network interfaces, and shared storage volumes for checkpointing.
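As a concrete sketch, a distributed PyTorch run under the Kubeflow Training Operator is declared as a PyTorchJob with master and worker replicas. The job name, image, entrypoint script, and replica counts below are illustrative; the container must be named `pytorch` for the operator to inject the distributed rendezvous environment:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet-ddp               # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch        # required container name for the operator
            image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
            command: ["torchrun", "train.py"]   # hypothetical training entrypoint
            resources:
              limits:
                nvidia.com/gpu: 8
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
            command: ["torchrun", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 8
```

Pairing this with a gang scheduler ensures all four pods start together or not at all, so a partially scheduled job never holds GPUs idle waiting for peers.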

Technologies We Work With

Kubernetes, GPU Operator, KAI Scheduler, MIG, Time-Slicing, EKS, GKE, AKS, Kubeflow, Volcano, Network Operator, Multus, SR-IOV, Helm, ArgoCD

Need Help Running GPUs on Kubernetes?

Whether you are building a new GPU cluster or optimizing an existing one, we can help you get Kubernetes GPU infrastructure right the first time.

Schedule a Call