4 posts tagged with "kubernetes"

· 7 min read

If you've ever written a NicClusterPolicy manifest for the NVIDIA Network Operator, you know the pain: the same repository, version, and imagePullSecrets copied and pasted across every single sub-component. OFED driver, RDMA shared device plugin, SR-IOV device plugin, Multus, CNI plugins, IPAM plugin, NV-IPAM — each one needs its own repository: nvcr.io/nvidia/mellanox and version: network-operator-v25.7.0. Change the version during an upgrade, and you're editing 8+ places in the same YAML. Miss one, and you get a partially upgraded cluster with mismatched component versions.
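To make the duplication concrete, here is a trimmed, illustrative sketch of such a manifest (image names and fields abbreviated; the repeated `repository`/`version` pairs are the point):

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox     # copy 1
    version: network-operator-v25.7.0       # copy 1
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox     # copy 2
    version: network-operator-v25.7.0       # copy 2
  secondaryNetwork:
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox   # copy 3...
      version: network-operator-v25.7.0
    # ...and the same pattern repeats for cniPlugins, ipamPlugin,
    # nvIpam, and every other sub-component you enable
```

An upgrade means touching every one of those `version` fields, which is exactly the failure mode described above.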

We recently contributed a fix for this: global config support for NicClusterPolicy (PR #2070). It's now merged into the NVIDIA Network Operator, and this post explains the problem, the implementation, and why it matters for anyone operating RDMA-capable GPU clusters on Kubernetes.
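Conceptually, the feature lets you declare the shared values once and have each sub-component inherit them unless it sets its own override. A hypothetical sketch of that shape (field names here are illustrative only; check the merged API in PR #2070 for the exact schema):

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  # illustrative global block: declared once, inherited by sub-components
  imageSpec:
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
  ofedDriver:
    image: doca-driver        # no repository/version needed here anymore
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
```

With a layout like this, an upgrade becomes a one-line change instead of an error-prone search-and-replace.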

· 6 min read

If you run GPU workloads on Kubernetes at any meaningful scale, you've probably hit a point where the default scheduler isn't enough. Fractional GPU requests, quota enforcement, gang scheduling, preemption — none of that comes out of the box. That's the gap KAI Scheduler fills.

KAI Scheduler is NVIDIA's open-source Kubernetes-native GPU scheduler, originally built inside the Run:ai platform and released under the Apache 2.0 license in April 2025. It's now a CNCF Sandbox project with over 1,200 GitHub stars, and it's quickly becoming the go-to scheduler for teams running AI workloads on Kubernetes — whether on-prem, colo, or cloud.
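Opting a workload into KAI is a small change to the pod spec. A minimal sketch, assuming the `kai-scheduler` scheduler name and the `kai.scheduler/queue` label key from the upstream quick-start:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
  labels:
    kai.scheduler/queue: team-a   # which queue this pod is charged against
spec:
  schedulerName: kai-scheduler    # route scheduling to KAI, not the default scheduler
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 1
```

Everything else (queue quotas, preemption, gang scheduling) is driven by cluster-side configuration rather than per-pod changes.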

At BaaZ, we work with KAI Scheduler in production GPU clusters. We recently contributed a queue validation webhook (PR #857) that prevents a class of misconfiguration bugs in hierarchical queue setups. This post explains the problem, the fix, and why it matters for anyone operating multi-tenant GPU infrastructure.
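For context on what "hierarchical queue setup" means here, a sketch of a parent/child queue pair, assuming the `scheduling.run.ai/v2` Queue API that KAI ships:

```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default          # top-level parent queue
spec:
  resources:
    gpu:
      quota: 16          # deserved share for the whole subtree
      limit: 32          # hard cap
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: default   # child queues reference their parent by name
  resources:
    gpu:
      quota: 8
      limit: 16
```

A typo in `parentQueue`, or a child whose quota exceeds its parent's, is exactly the kind of silent misconfiguration a validation webhook can reject at admission time instead of letting it degrade scheduling later.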

· 3 min read

In this article, we explore the use cases of Helm and Kubernetes operators, demystifying when to use each tool, drawing on BaaZ's experience developing and maintaining Kubernetes operators, controllers, and libraries.

Introduction

In software development, the principle of separation of concerns carries great weight: divide a system into distinct parts, each responsible for a specific aspect of functionality. This promotes modularity, maintainability, and scalability, letting developers focus on specific areas without unnecessary dependencies. When it comes to managing Kubernetes deployments, Helm and the operator pattern each play a crucial role in honoring this principle. In this blog post, we explore the separation of concerns in Helm and the benefits of using operators in Kubernetes deployments.