3 posts tagged with "rdma"

· 7 min read

If you've ever written a NicClusterPolicy manifest for the NVIDIA Network Operator, you know the pain: the same repository, version, and imagePullSecrets copied and pasted across every single sub-component. OFED driver, RDMA shared device plugin, SR-IOV device plugin, Multus, CNI plugins, IPAM plugin, NV-IPAM — each one needs its own repository: nvcr.io/nvidia/mellanox and version: network-operator-v25.7.0. Change the version during an upgrade, and you're editing 8+ places in the same YAML. Miss one, and you get a partially upgraded cluster with mismatched component versions.
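To make the duplication concrete, here is an abbreviated sketch of what such a manifest looks like — the image names and exact field layout are illustrative, but note the identical `repository` and `version` on every component:

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver                      # illustrative image name
    repository: nvcr.io/nvidia/mellanox     # repeated...
    version: network-operator-v25.7.0       # ...and repeated
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
  secondaryNetwork:
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    # ...cniPlugins, ipamPlugin: same three fields again
  nvIpam:
    image: nvidia-k8s-ipam
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
```

An upgrade means touching every one of those `version` fields, which is exactly the failure mode described above.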

We recently contributed a fix for this: global config support for NicClusterPolicy (PR #2070). It's now merged into the NVIDIA Network Operator, and this post explains the problem, the implementation, and why it matters for anyone operating RDMA-capable GPU clusters on Kubernetes.
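For a rough idea of the shape such a fix takes, here is a hedged sketch of a manifest with defaults hoisted to one place — the field name `global` and the inheritance behavior shown here are assumptions for illustration, not the actual schema merged in PR #2070:

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  # Hypothetical top-level defaults block; per-component fields
  # would override these where set.
  global:
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    imagePullSecrets:
      - ngc-secret
  ofedDriver:
    image: doca-driver        # repository/version inherited from global
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
```

With this shape, an upgrade is a one-line change instead of eight.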

· 6 min read

If you're building a multi-node GPU cluster for distributed training, you've probably run into a confusing mess of terminology — NVLink, NVSwitch, InfiniBand, RoCE, GPUDirect. Half the blog posts out there mix these up, and vendor documentation assumes you already know what you're doing.

So let's sort this out.