2 posts tagged with "open-source"

7 min read

If you've ever written a NicClusterPolicy manifest for the NVIDIA Network Operator, you know the pain: the same repository, version, and imagePullSecrets copied and pasted across every single sub-component. OFED driver, RDMA shared device plugin, SR-IOV device plugin, Multus, CNI plugins, IPAM plugin, NV-IPAM — each one needs its own repository: nvcr.io/nvidia/mellanox and version: network-operator-v25.7.0. Change the version during an upgrade, and you're editing 8+ places in the same YAML. Miss one, and you get a partially upgraded cluster with mismatched component versions.
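To make the duplication concrete, here is an abbreviated NicClusterPolicy sketch (component image names are illustrative and the real spec carries more fields per component; the point is the repeated repository/version pairs):

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox   # repeated
    version: network-operator-v25.7.0     # repeated
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox   # repeated again
    version: network-operator-v25.7.0     # repeated again
  # ...and again for the SR-IOV device plugin, Multus, CNI plugins,
  # the IPAM plugin, NV-IPAM, ...
```

Every upgrade means touching each of those `version` fields in lockstep.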

We recently contributed a fix for this: global config support for NicClusterPolicy (PR #2070). It's now merged into the NVIDIA Network Operator, and this post explains the problem, the implementation, and why it matters for anyone operating RDMA-capable GPU clusters on Kubernetes.
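The idea, roughly, is to declare the shared values once and let sub-components inherit them unless they override. The sketch below is hypothetical: the field names and placement are our illustration of the concept, not the exact schema the merged PR uses.

```yaml
# Hypothetical sketch of the global-config idea; see PR #2070 for the
# actual field names and merge semantics.
spec:
  repository: nvcr.io/nvidia/mellanox   # default inherited by sub-components
  version: network-operator-v25.7.0     # set once, used everywhere
  imagePullSecrets:
    - ngc-secret
  ofedDriver:
    image: doca-driver                  # repository/version inherited from above
```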

6 min read

If you run GPU workloads on Kubernetes at any meaningful scale, you've probably hit a point where the default scheduler isn't enough. Fractional GPU requests, quota enforcement, gang scheduling, preemption — none of that comes out of the box. That's the gap KAI Scheduler fills.

KAI Scheduler is NVIDIA's open-source Kubernetes-native GPU scheduler, originally built inside the Run:ai platform and released under the Apache 2.0 license in April 2025. It's now a CNCF Sandbox project with over 1,200 GitHub stars, and it's quickly becoming the go-to scheduler for teams running AI workloads on Kubernetes — whether on-prem, colo, or cloud.
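As a taste of what that looks like in practice, a workload opts into KAI by naming it as its scheduler and declaring which queue it belongs to. The label and annotation keys below are our recollection of KAI's conventions (fractional requests via a `gpu-fraction` annotation); verify them against the project's README before use.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
  labels:
    kai.scheduler/queue: team-a   # assumed queue label key; check KAI docs
  annotations:
    gpu-fraction: "0.5"           # request half a GPU (fractional GPUs)
spec:
  schedulerName: kai-scheduler    # hand the pod to KAI instead of the default scheduler
  containers:
    - name: trainer
      image: my-training-image:latest   # placeholder image
```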

At BaaZ, we work with KAI Scheduler in production GPU clusters. We recently contributed a queue validation webhook (PR #857) that prevents a class of misconfiguration bugs in hierarchical queue setups. This post explains the problem, the fix, and why it matters for anyone operating multi-tenant GPU infrastructure.
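For context, hierarchical setups are expressed as Queue resources that reference a parent queue, and a broken reference in that chain is exactly the sort of thing admission-time validation can catch. The sketch below reflects the Queue shape as we understand it from KAI's docs; treat the API group and field names as approximate.

```yaml
# Illustrative hierarchical queue; field names approximate.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: department-1   # must point at an existing queue
  resources:
    gpu:
      quota: 8                # deserved share for this team
      limit: 16               # hard cap
```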