GPU Kubernetes Consulting
Your GPU cluster is running at 25% utilization. Teams wait days for GPU access. Training jobs fail because the scheduler doesn't understand GPU topology. Kubernetes can run GPUs well — it just needs someone who's done it before.
What We Do
- GPU Operator stack — Driver containers, Container Toolkit, Device Plugin, DCGM Exporter, GPU Feature Discovery. We handle driver conflicts, runtime differences, secure boot, upgrade rollouts
- KAI Scheduler — Topology-aware placement, fair-share scheduling, gang scheduling for distributed training, preemption policies, queue management. We're an active contributor
- MIG & fractional GPU sharing — A100/H100 MIG partitioning, time-slicing for non-MIG GPUs, workload-aware partition profiles
- Multi-tenancy — Namespace isolation, GPU resource quotas, RBAC, priority classes, cost allocation and chargeback
- EKS / GKE / bare metal — GPU node groups with EFA, GKE GPU pools with multi-networking, bare-metal with Calico/Cilium + MetalLB. We've shipped all three
- Training job orchestration — Kubeflow Training Operator, PyTorchJob, integration with MLflow and W&B
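To make the last point concrete, a distributed training run under the Kubeflow Training Operator is declared as a PyTorchJob. This is a minimal sketch — the job name, image, and replica counts are placeholders, not a client deployment:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-example              # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch      # the Training Operator expects this container name
              image: registry.example.com/train:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3                # gang scheduling matters here: all 4 pods or none
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/train:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

The operator wires up `MASTER_ADDR`, `WORLD_SIZE`, and friends so `torchrun`-style distributed init works out of the box; pairing it with a gang scheduler prevents half-scheduled jobs from deadlocking the cluster.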
Proof
We deployed a 3-node Kubespray cluster with GPU Operator, KAI Scheduler, JupyterHub, and full RDMA networking for a client — bare metal, 2x RTX 5000 Ada + 1x RTX A5500, with Traefik Gateway API and NFS CSI storage. Production-ready in weeks, not months.
We also contributed the queue validation webhook to KAI Scheduler — we know this codebase from the inside.
How We Work
Assess
Audit your K8s GPU setup, scheduler config, and utilization.
Design
Right-size the operator stack, scheduling policy, and tenancy model.
Implement
Deploy, configure, validate with real workloads.
Transfer
Runbooks, dashboards, and training for your platform team.
Frequently Asked Questions
What is the NVIDIA GPU Operator?
A Kubernetes operator that automates the lifecycle of GPU drivers, the Container Toolkit, device plugin, DCGM exporter, and MIG manager across every GPU node. You need it any time you want GPUs scheduled as Kubernetes resources.
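Once the operator stack is healthy, workloads request GPUs as an extended resource. A minimal smoke-test pod looks like this (the CUDA image tag is an assumption — pin one that matches your installed driver):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag; match your driver version
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # schedules only onto nodes advertising GPUs
```

If `kubectl logs gpu-smoke-test` shows the `nvidia-smi` table, the driver, runtime, and device plugin are all wired correctly.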
How does GPU sharing work in Kubernetes?
Three modes: MIG for hardware partitioning on A100/H100 (hard isolation, fixed sizes), time-slicing for simple time-multiplexing (no isolation), and MPS for CUDA-level process sharing. Pick MIG for multi-tenant production; time-slicing for dev/inference.
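Time-slicing, for example, is configured through a ConfigMap that the GPU Operator's device plugin consumes. A minimal sketch, assuming the operator's default namespace and a single config profile:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator      # assumes the operator's default namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4        # each physical GPU is advertised as 4 schedulable GPUs
```

The ClusterPolicy's `devicePlugin.config` then references this ConfigMap. Keep in mind the trade-off stated above: time-sliced replicas share GPU memory with no fault isolation, so one misbehaving pod can take down its neighbors.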
What is the KAI Scheduler?
KAI Scheduler (formerly Run:ai scheduler) is a Kubernetes-native gang scheduler purpose-built for GPU workloads: queues, fair-share, gang scheduling, and preemption with GPU-awareness. We're an active contributor to this project.
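Submitting a workload to a KAI queue is a matter of a label and a scheduler name. A minimal sketch — the queue label key and scheduler name below match recent KAI releases, but both depend on your installation and may differ on older Run:ai deployments:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  labels:
    kai.scheduler/queue: team-ml   # queue must already exist; label key varies by KAI version
spec:
  schedulerName: kai-scheduler     # scheduler name as installed by the KAI Helm chart
  containers:
    - name: trainer
      image: registry.example.com/train:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 2
```

Fair-share and preemption then operate per queue, so each team's entitlement is enforced without manual GPU bookkeeping.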
Can I run GPU workloads on EKS, GKE, or AKS?
Yes. All three support GPU node groups, and the GPU Operator runs on top. The complications are around driver versions, instance-type-specific CUDA images, multi-tenancy isolation, and in-cluster networking (especially RDMA or EFA).
Do I need Slurm if I already run Kubernetes?
Not usually. Kubernetes with GPU Operator, KAI or Volcano scheduler, and Kubeflow Training Operator covers most distributed-training workloads. Slurm still wins for traditional HPC or organizations with deep Slurm operational expertise.
How do you approach multi-tenant GPU clusters?
Namespace quotas, ResourceQuotas on nvidia.com/gpu, a GPU-aware scheduler for fair-share, MIG or SR-IOV for hardware isolation where needed, node taints/tolerations for workload separation, and per-namespace DCGM metrics for visibility.
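The quota piece is standard Kubernetes — extended resources like `nvidia.com/gpu` can be capped per namespace. A minimal sketch, one quota per tenant namespace (names are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a              # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8" # hard cap on GPUs this namespace can request
```

Pods exceeding the cap are rejected at admission time, which turns GPU contention from a scheduling mystery into an explicit, auditable policy.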
Need Help Running GPUs on Kubernetes?
We've deployed GPU Operator and KAI Scheduler on EKS, GKE, and bare metal. Let's look at your cluster.
Schedule a Call