
5 posts tagged with "networking"


7 min read

If you've ever written a NicClusterPolicy manifest for the NVIDIA Network Operator, you know the pain: the same repository, version, and imagePullSecrets copied and pasted across every single sub-component. OFED driver, RDMA shared device plugin, SR-IOV device plugin, Multus, CNI plugins, IPAM plugin, NV-IPAM — each one needs its own repository: nvcr.io/nvidia/mellanox and version: network-operator-v25.7.0. Change the version during an upgrade, and you're editing 8+ places in the same YAML. Miss one, and you get a partially upgraded cluster with mismatched component versions.
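To make the duplication concrete, here's a trimmed NicClusterPolicy sketch (component list abbreviated; the image names and exact field layout are illustrative, following the operator's documented `image`/`repository`/`version` pattern rather than a verbatim production manifest):

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox   # repeated here…
    version: network-operator-v25.7.0     # …and here…
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox   # …and again…
    version: network-operator-v25.7.0     # …and again
  # …and so on for the SR-IOV device plugin, Multus, CNI plugins,
  # the IPAM plugin, and NV-IPAM — each with its own copy of
  # repository, version, and imagePullSecrets.
```

Every `version` field above has to be edited in lockstep during an upgrade, which is exactly the failure mode the global config addresses.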

We recently contributed a fix for this: global config support for NicClusterPolicy (PR #2070). It's now merged into the NVIDIA Network Operator, and this post explains the problem, the implementation, and why it matters for anyone operating RDMA-capable GPU clusters on Kubernetes.

6 min read

If you're building a multi-node GPU cluster for distributed training, you've probably run into a confusing mess of terminology — NVLink, NVSwitch, InfiniBand, RoCE, GPUDirect. Half the blog posts out there mix these up, and vendor documentation assumes you already know what you're doing.

So let's sort this out.

6 min read

I was lucky to work on a computer vision setup that involved NVIDIA RTX 6000 GPUs, Mellanox ONYX switches, and high-resolution cameras. The system was designed for real-time video capture and processing, pushing massive amounts of data through our network infrastructure. Here's a rundown of debugging Mellanox switch metrics that led to some surprising discoveries about how traffic actually flowed through the network.

When monitoring network equipment like switches, routers, or network cards, you'll constantly encounter two metrics: RX and TX. These simple abbreviations are fundamental to understanding how data flows through your network, yet they often cause confusion. Let's demystify them with real-world examples from our production Mellanox ONYX switch.
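As a quick illustration of the RX/TX distinction (a sketch for any Linux host, not the ONYX switch CLI itself), you can read the kernel's cumulative per-interface byte counters straight from sysfs:

```python
from pathlib import Path


def rx_tx_bytes(iface: str) -> tuple[int, int]:
    """Read cumulative RX/TX byte counters for a network interface.

    RX counts bytes the interface has *received*; TX counts bytes it
    has *transmitted*. Counters are cumulative since the interface
    (or host) last came up.
    """
    stats = Path(f"/sys/class/net/{iface}/statistics")
    rx = int((stats / "rx_bytes").read_text())
    tx = int((stats / "tx_bytes").read_text())
    return rx, tx


# The loopback interface ("lo") exists on virtually every Linux host.
rx, tx = rx_tx_bytes("lo")
print(f"lo: RX {rx} bytes, TX {tx} bytes")
```

The same RX/TX perspective rule applies whether you're reading sysfs on a server or port counters on a switch: the direction is always relative to the device reporting the metric.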