6 posts tagged with "networking"

Secondary network types in the NVIDIA Network Operator

· 15 min read

Macvlan, host-device, SR-IOV, and IPoIB — what they are, how they differ, and when to use each for RDMA and NCCL traffic in GPU training clusters.

Why GPU pods need two networks

Every GPU training pod needs two distinct network paths. The management network is the standard Kubernetes pod network — Flannel, Calico, Cilium — carrying API traffic, health checks, metrics, and DNS. The training network is a dedicated, high-bandwidth path that carries NCCL collective operations (all-reduce, all-gather, broadcast) between GPUs across nodes.

Mixing both on a single interface doesn't work for serious GPU workloads. NCCL traffic is latency-sensitive and bandwidth-hungry. It needs to bypass kube-proxy, skip the overlay network, and in the case of RDMA, bypass the kernel entirely. The management network, by contrast, is low-bandwidth but needs reliable service discovery and DNS.

The Kubernetes-native way to give a pod two interfaces is Multus CNI. The primary CNI plugin provides eth0 for management. Multus attaches a secondary interface (net1) backed by a physical RDMA-capable NIC, giving the pod direct access to the high-speed fabric. Getting this dual-interface plumbing right is the foundation of all our GPU networking and RDMA consulting engagements.

The question is how that secondary interface gets attached to the physical NIC. There are four options, each with different isolation, performance, and complexity characteristics. These four options map directly to the four secondary network CRDs in the NVIDIA Network Operator, which automates deploying Multus, MOFED drivers, device plugins, and secondary network definitions across a cluster through a single NicClusterPolicy CR.


Macvlan — shared access, software isolation

Macvlan secondary network architecture

Macvlan is the simplest secondary network type. It creates a virtual sub-interface on top of a physical NIC. The host keeps the original interface, and each pod gets a new virtual interface with its own MAC address, backed by the same physical port. Multiple pods share the same NIC simultaneously.

At the kernel level, macvlan operates in bridge mode by default, which means macvlan sub-interfaces on the same host can communicate directly without hitting the physical wire. Traffic to external hosts goes through the PF (Physical Function) normally. Inside the pod, the secondary interface appears as net1@ifN — the @ifN suffix is the giveaway that it's a software sub-interface, not a real device.

Macvlan by itself only provides L2/L3 networking. To get RDMA, you pair it with the RDMA shared device plugin, which discovers RDMA-capable NICs on the host and exposes /dev/infiniband/* device files as Kubernetes extended resources. When a pod requests both the macvlan network and an RDMA resource, it gets a secondary interface and access to RDMA verbs.

The key detail: all pods on the same node share the same physical RDMA HCA. Each pod creates its own Queue Pairs on the PF, but the NIC hardware is not partitioned. Isolation is purely at the QP level in software.

In the Network Operator, you define a MacvlanNetwork CR specifying the master interface, mode, MTU, and IPAM configuration. The operator renders this into a Multus NetworkAttachmentDefinition automatically. For the full set of operator-wide defaults and image settings that make this work, see Global configuration in the NVIDIA Network Operator.
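A minimal sketch of that CR, applied with kubectl (the master interface, MTU, and IPAM range are placeholders for your environment):

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-macvlan
spec:
  networkNamespace: "default"
  master: "ens2f0"        # RDMA-capable uplink on each node (placeholder)
  mode: "bridge"
  mtu: 9000
  ipam: |
    {"type": "whereabouts", "range": "192.168.2.0/24"}
EOF
```

The operator watches this CR and renders the matching NetworkAttachmentDefinition in the target namespace.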

When to use macvlan. It's the right choice for POCs, development environments, or any setup where you need a quick dual-interface pod without fussing over firmware or operator complexity. It's also the only option when you have a single physical NIC port — macvlan can share the same port that carries your primary CNI overlay traffic. The tradeoff is no hardware isolation between pods. All pods talk through the same PF and share NIC resources.


Host-device — exclusive access, the NIC moves into the pod

Host-device secondary network architecture

Host-device takes a fundamentally different approach. Instead of creating a virtual sub-interface, the host-device CNI plugin moves the physical network interface itself from the host's network namespace into the pod's network namespace. The NIC literally disappears from ip link on the host while the pod is running and reappears when the pod terminates.

This gives the pod exclusive, unshared access to the physical device. No other pod — and not even the host — can use that interface. Inside the pod, the secondary interface appears as a plain net1 without the @ifN suffix, because it's not a virtual sub-interface. It's the real hardware.

Since the pod owns the entire physical device, it also owns the associated RDMA HCA. Every QP, every hardware flow table, every byte of NIC memory belongs to that one pod. This is the strongest isolation model short of giving a node to a single workload.

Like macvlan, host-device uses the RDMA shared device plugin to expose RDMA resources. The difference is entirely in the CNI plugin — macvlan creates a sub-interface, host-device moves the real device. In the Network Operator, a HostDeviceNetwork CR defines the secondary network, and the resourceName field links it to the device plugin's resource rather than naming a master interface directly.
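A sketch of the corresponding CR; the resourceName must match whatever the device plugin advertises in your NicClusterPolicy (the value below is illustrative):

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: nvidia.com/hostdev   # illustrative; must match the device plugin resource
  ipam: |
    {"type": "whereabouts", "range": "192.168.3.0/24"}
EOF
```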

When to use host-device. This is the production choice when each node has a dedicated NIC for training traffic. The standard GPU cluster topology is two NICs per node: one for Kubernetes management (backing Flannel/Calico), one for NCCL training (given entirely to the training pod via host-device). The pod gets the full NIC bandwidth and dedicated RDMA resources with zero contention. We walk through this exact pattern end-to-end in Dual-network RDMA on Kubernetes with GH200.

The obvious limitation is that only one pod can use each NIC at a time. If you need multiple training pods per node sharing the same fabric, you need macvlan (software sharing) or SR-IOV (hardware partitioning). Host-device also doesn't work for secondary networks on single-NIC nodes — moving the only NIC into a pod kills the host's connectivity to the Kubernetes API server.


SR-IOV — hardware-partitioned virtual functions

SR-IOV secondary network architecture

SR-IOV (Single Root I/O Virtualization) is a PCIe specification that lets a single physical NIC present itself as multiple independent virtual devices. The physical device is the Physical Function (PF). Each virtual device is a Virtual Function (VF). VFs are real PCIe functions — they show up in lspci, have their own driver bindings, their own MAC addresses, and their own RDMA contexts.

When a pod requests an SR-IOV VF, the SR-IOV CNI plugin moves that VF's netdev into the pod's network namespace — mechanically similar to host-device, but operating on a VF rather than the PF. The PF stays on the host, manages all VFs, and handles switching between them at line rate in NIC hardware.

This gives you the middle ground between macvlan and host-device: hardware-level isolation (each pod gets its own PCIe function with dedicated RDMA resources) combined with sharing (many pods use VFs carved from the same physical NIC). Each VF appears as its own mlx5_X RDMA device inside the pod, completely independent from other VFs and from the PF.

SR-IOV requires upfront configuration. The NIC firmware must have SR-IOV enabled (SRIOV_EN=1 via mlxconfig), and VFs must be created on each node either manually through sysfs or automatically through the SR-IOV Network Operator. For Mellanox ConnectX NICs, VFs use native-bifurcating SR-IOV — they stay bound to the mlx5_core kernel driver and appear as regular netdevs. This is different from Intel NICs, which require VFs to be bound to vfio-pci for userspace access. Getting the deviceType right (netdevice for Mellanox, vfio-pci for Intel) is one of the most common SR-IOV configuration mistakes.
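On the node side, that prep usually looks something like the following sketch; the MST device path and interface name are placeholders, and VF counts vary by NIC:

```bash
# Enable SR-IOV in ConnectX firmware (requires mst start; reboot or FW reset to apply)
mlxconfig -d /dev/mst/mt4125_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8

# Create VFs at runtime through sysfs (the SR-IOV Network Operator automates this)
echo 8 > /sys/class/net/ens2f0/device/sriov_numvfs
```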

In the Network Operator, SR-IOV is managed through the embedded SR-IOV Network Operator (deployed as a Helm sub-chart). You define a SriovNetworkNodePolicy CR to specify which PFs to partition and how many VFs to create, and a SriovNetwork CR to define the secondary network. The operator handles VF creation, SR-IOV device plugin deployment, and NetworkAttachmentDefinition generation. Unlike macvlan and host-device, SR-IOV uses its own dedicated device plugin (sriovDevicePlugin) that discovers VFs specifically, rather than the shared RDMA device plugin that discovers PFs.
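Sketched as a pair of CRs (names, selectors, VF count, and IPAM range are placeholders; the namespace depends on where the sub-chart is deployed):

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-rdma-vfs
  namespace: nvidia-network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"  # Mellanox PCI vendor ID
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens2f0"]     # PF to partition (placeholder)
  numVfs: 8
  deviceType: netdevice     # Mellanox VFs stay on mlx5_core
  isRdma: true
  resourceName: sriov_rdma
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-net
  namespace: nvidia-network-operator
spec:
  networkNamespace: "default"
  resourceName: sriov_rdma
  ipam: |
    {"type": "whereabouts", "range": "192.168.4.0/24"}
EOF
```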

VF count is a hard firmware limit — ConnectX-7 supports up to 127 VFs per port, but each VF consumes NIC resources (queues, memory, steering rules). If you create 8 VFs and 9 pods need SR-IOV resources, the 9th pod stays Pending. This is different from the RDMA shared device plugin's rdmaHcaMax, which is a soft configuration limit.

When to use SR-IOV. It's the production choice for multi-tenant GPU clusters where multiple training jobs run on the same node and need hardware-isolated access to the RDMA fabric. Each pod gets its own VF with dedicated NIC resources, and the PF firmware handles switching at line rate. The cost is complexity: firmware configuration, operator deployment, and VF capacity planning.


IPoIB — IP over InfiniBand

IPoIB secondary network architecture

IPoIB is not the same thing as the other three. Macvlan, host-device, and SR-IOV all operate on Ethernet networks (including RoCE — RDMA over Converged Ethernet). IPoIB operates on a completely different link layer: InfiniBand.

InfiniBand is a lossless, credit-based fabric with its own L2 and L3 layers, managed by a Subnet Manager (SM) running on a managed switch or a dedicated node. Devices identify themselves by GUIDs and port GIDs rather than MAC addresses. Partitions (PKEYs) provide L2 isolation, analogous to VLANs on Ethernet but enforced at the fabric level.

IPoIB is a kernel module (ib_ipoib) that creates a standard IP network interface on top of InfiniBand transport. It encapsulates IP packets into InfiniBand messages so that normal TCP/UDP applications — health checks, monitoring agents, SSH — can communicate over the IB fabric without needing native RDMA verbs. The kernel interface appears as ib0, ib1, etc.

In Kubernetes, the IPoIB CNI plugin creates child IPoIB interfaces (sub-partitions) and moves them into pod network namespaces. Conceptually this is similar to macvlan — the host keeps the parent IPoIB interface, and pods get child interfaces backed by the same IB port — but the underlying mechanism uses IB partition keys rather than MAC-based sub-interfaces.

The Network Operator adds a component called ib-kubernetes for IPoIB deployments. It integrates with the InfiniBand Subnet Manager (typically NVIDIA UFM) to manage PKEY memberships for pods. When a pod joins an IPoIB network, ib-kubernetes ensures the pod's GUID is added to the correct partition in the SM and removes it on termination. This fabric-level tenant isolation is unique to InfiniBand — Ethernet has no equivalent.

An IPoIBNetwork CR in the Network Operator specifies the master IB interface and IPAM configuration. The operator deploys the IPoIB CNI plugin as part of the secondaryNetwork section in NicClusterPolicy.
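A sketch, with the parent IB interface and address range as placeholders:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: ipoib-net
spec:
  networkNamespace: "default"
  master: "ibs1f0"          # parent IPoIB interface on the host (placeholder)
  ipam: |
    {"type": "whereabouts", "range": "192.168.5.0/24"}
EOF
```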

For RDMA workloads on IB, the IPoIB interface is mainly for management-plane traffic. NCCL and other RDMA-aware applications bypass IPoIB entirely and use native IB verbs for GPU-to-GPU communication. Pods still need access to IB RDMA device files via the RDMA shared device plugin. If per-pod VF isolation is needed on InfiniBand, the ib-sriov-cni plugin handles SR-IOV VFs on IB interfaces — combining IB's native RDMA with SR-IOV's hardware partitioning.

When IPoIB applies. It's for clusters with InfiniBand fabric — typically national labs, large-scale HPC environments, and on-prem AI training clusters that use managed IB switches with a Subnet Manager. If your ConnectX NICs are running in Ethernet/RoCE mode — which covers most cloud, colo, and enterprise GPU clusters today — IPoIB is not relevant to your setup. DGX SuperPOD reference architectures historically used InfiniBand, though newer designs increasingly use RoCE with Spectrum-X switches.


Choosing the right type

| | Macvlan | Host-device | SR-IOV | IPoIB |
|---|---|---|---|---|
| Link layer | Ethernet / RoCE | Ethernet / RoCE | Ethernet / RoCE | InfiniBand |
| Isolation | Software (shared QPs on PF) | Hardware (exclusive PF) | Hardware (dedicated VF per pod) | Software (shared IB port, PKEY isolation) |
| NIC sharing | Multiple pods per NIC | One pod per NIC | Multiple VFs per NIC | Multiple child interfaces per port |
| RDMA device | Shared PF (mlx5_0) | Exclusive PF (mlx5_0) | Dedicated VF (mlx5_2, mlx5_3…) | Shared IB HCA |
| Complexity | Low | Low | Medium (firmware, operator) | High (Subnet Manager, PKEYs) |
| Firmware changes | None | None | Required | None (SR-IOV on IB requires it) |
| Device plugin | RDMA shared | RDMA shared | SR-IOV device plugin | RDMA shared |

The decision usually comes down to two questions: what link layer is your high-speed fabric, and how many pods per node need access to it.

If your fabric is InfiniBand, IPoIB is your secondary network type and you add ib-kubernetes for partition management. Everything else assumes Ethernet/RoCE.

On Ethernet, if you have a dedicated training NIC per node and run one training pod at a time, host-device is the simplest path — exclusive NIC access with zero overhead. If multiple pods per node need hardware-isolated access to the same NIC, SR-IOV gives each pod its own VF with dedicated PCIe resources. If you don't need hardware isolation or you're running a quick POC on a single-NIC node, macvlan is the easiest starting point.

Macvlan and host-device use the same device plugin (RDMA shared device plugin) and differ only in the CNI — meaning you can switch between them by changing a single CRD. SR-IOV uses its own device plugin and requires firmware-level changes, so the switch is bigger. All four types can coexist in a single cluster through the Network Operator's NicClusterPolicy, and you can mix them — host-device for training pods, macvlan for monitoring sidecars — on the same set of nodes.


How these map to the Network Operator

The NVIDIA Network Operator orchestrates all four types through a single NicClusterPolicy CR. Each sub-state in the policy corresponds to a component:

```
NicClusterPolicy
├── ofedDriver                  ← all types (MOFED driver container)
├── rdmaSharedDevicePlugin      ← macvlan, host-device, IPoIB
├── sriovDevicePlugin           ← SR-IOV
├── ibKubernetes                ← IPoIB (PKEY management)
├── secondaryNetwork
│   ├── multus                  ← all types
│   ├── cniPlugins              ← all types (macvlan, host-device binaries)
│   └── ipoib                   ← IPoIB (ipoib CNI plugin)
└── sriovNetworkOperator (Helm) ← SR-IOV (sub-chart)
```

Once the NicClusterPolicy is applied and healthy, you create the appropriate network CRD — MacvlanNetwork, HostDeviceNetwork, IPoIBNetwork, or SriovNetworkNodePolicy + SriovNetwork — and the operator generates the NetworkAttachmentDefinition that Multus needs. Pods reference it by name in their k8s.v1.cni.cncf.io/networks annotation and get their secondary interface when they're scheduled.
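For a HostDeviceNetwork named hostdev-net, the pod side looks roughly like this; the image tag and resource names are illustrative:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nccl-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net   # secondary interface appears as net1
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.07-py3    # illustrative tag
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]                      # RDMA memory registration needs this
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1                  # matches the HostDeviceNetwork resourceName
EOF
```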

The operator is not strictly required — for small clusters or POCs, you can deploy each component manually for full visibility into every layer. The operator earns its keep at scale, where managing MOFED driver versions, device plugin configs, and Multus across dozens of nodes by hand becomes a maintenance problem.


For hands-on walkthroughs of these network types with working manifests, see our GH200 dual-network RDMA guide (macvlan) and SR-IOV POC guide (SR-IOV).


Frequently Asked Questions

Why does a GPU training pod need a secondary network in Kubernetes?

The default pod network (Flannel, Calico, Cilium) is optimized for management traffic, not collective communication. NCCL AllReduce and AllGather between GPUs require line-rate bandwidth and microsecond latency, which means bypassing kube-proxy, the overlay, and ideally the kernel via RDMA. A secondary network attached with Multus gives the pod direct access to an RDMA-capable NIC for training traffic.

What is the difference between macvlan and host-device secondary networks?

Macvlan creates a virtual sub-interface on top of a physical NIC and can be shared by multiple pods on the same host; isolation is software-level. Host-device moves the entire physical NIC out of the host network namespace into a single pod, giving that pod exclusive ownership of the device and its RDMA HCA. Macvlan is easier; host-device is stronger isolation.

When should I use SR-IOV instead of host-device for GPU training?

Use SR-IOV when multiple training pods per node need hardware-isolated access to the same RDMA fabric. SR-IOV carves a physical NIC into multiple Virtual Functions (VFs), each with dedicated PCIe queues and RDMA context. Host-device only works when one pod per node takes the entire NIC, so SR-IOV is the right choice for multi-tenant clusters or multi-job per node scheduling.

Does IPoIB apply to RoCE clusters?

No. IPoIB (IP over InfiniBand) is specific to InfiniBand fabrics managed by a Subnet Manager with PKEY-based isolation. On Ethernet/RoCE clusters — which covers most cloud, colo, and enterprise GPU deployments — you use macvlan, host-device, or SR-IOV over Ethernet. IPoIB mostly shows up in HPC environments and older DGX SuperPOD reference designs.

Can I mix multiple secondary network types in the same cluster?

Yes. The NVIDIA Network Operator orchestrates macvlan, host-device, SR-IOV, and IPoIB together through a single `NicClusterPolicy` CR. A common pattern is host-device for training pods and macvlan for monitoring or sidecar pods on the same nodes. You can also run SR-IOV on some nodes and host-device on others to match workload shape.


· 7 min read

If you've ever written a NicClusterPolicy manifest for the NVIDIA Network Operator, you know the pain: the same repository, version, and imagePullSecrets copied and pasted across every single sub-component. OFED driver, RDMA shared device plugin, SR-IOV device plugin, Multus, CNI plugins, IPAM plugin, NV-IPAM — each one needs its own repository: nvcr.io/nvidia/mellanox and version: network-operator-v25.7.0. Change the version during an upgrade, and you're editing 8+ places in the same YAML. Miss one, and you get a partially upgraded cluster with mismatched component versions.

We recently contributed a fix for this: global config support for NicClusterPolicy (PR #2070). It's now merged into the NVIDIA Network Operator, and this post explains the problem, the implementation, and why it matters for anyone operating RDMA-capable GPU clusters on Kubernetes.

· 10 min read

A practical guide to understanding why your multi-node GPU training might be slower than expected.



You've set up distributed training across multiple GPU servers. PyTorch DDP is configured. The job is running. But something's wrong—it's not faster than single-node training. It might even be slower.

Before you blame the framework, the drivers, or the model, check your network. In most cases, the network is the bottleneck. This is exactly the class of problem our distributed training optimization engagements were built around.

This post will teach you how to calculate whether your network is limiting your distributed training performance, and by how much.


The Training Loop: Where Time Goes

Every training iteration has the same structure:

  1. Forward pass → GPU computes predictions
  2. Loss calculation → Compare predictions to ground truth
  3. Backward pass → GPU computes gradients
  4. AllReduce → Sync gradients across all GPUs (NETWORK)
  5. Weight update → Apply gradients to model

Steps 1, 2, 3, and 5 happen on the GPU. Step 4 happens over the network.

The question is: how much time does each step take?


What is AllReduce?

In distributed training, each GPU processes a different batch of data and computes its own gradients. But all GPUs need to end up with the same gradients to keep the model in sync.

AllReduce is the collective operation that:

  1. Collects gradients from all GPUs
  2. Computes the average
  3. Distributes the result back to all GPUs

AllReduce Diagram

The key insight: the entire gradient payload must cross the network every iteration. For a deeper look at what actually moves between nodes during this step, see GPU-to-GPU communication across nodes.


Step 1: Calculate Your Gradient Size

Gradient size is determined by your model:

Gradient size = Number of parameters × Bytes per parameter

Bytes per parameter:

  • FP32 (single precision): 4 bytes
  • FP16 (half precision): 2 bytes
  • BF16 (brain float): 2 bytes

Common models:

| Model | Parameters | Gradient (FP32) | Gradient (FP16) |
|---|---|---|---|
| ResNet-50 | 25M | 100 MB | 50 MB |
| BERT-base | 110M | 440 MB | 220 MB |
| BERT-large | 340M | 1.4 GB | 700 MB |
| GPT-2 (1.5B) | 1.5B | 6 GB | 3 GB |
| LLaMA-7B | 7B | 28 GB | 14 GB |

For your model, simply multiply the parameter count by 4 (for FP32) or 2 (for FP16).
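As a quick shell sanity check, using BERT-base's ~110M parameters from the table:

```bash
# Gradient size = parameters × bytes per parameter
awk 'BEGIN {
  params = 110e6                               # BERT-base
  printf "FP32: %.0f MB\n", params * 4 / 1e6   # -> 440 MB
  printf "FP16: %.0f MB\n", params * 2 / 1e6   # -> 220 MB
}'
```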


Step 2: Know Your Network Throughput

Common network configurations and their real-world throughput:

| Network | Reported Speed | Actual Throughput |
|---|---|---|
| 1 GbE | ~940 Mbps | ~110 MB/s |
| 10 GbE | ~9.4 Gbps | ~1.1 GB/s |
| 25 GbE | ~23.5 Gbps | ~2.8 GB/s |
| 100 GbE (TCP) | ~80 Gbps | ~9 GB/s |
| 100 GbE (RDMA) | ~95 Gbps | ~11 GB/s |

Why is actual throughput lower than line rate?

Network protocols have overhead. A 1 GbE link runs at 1000 Mbps (megabits), which equals 125 MB/s (megabytes). But Ethernet framing, TCP/IP headers, and protocol overhead reduce this to roughly 110 MB/s in practice.

You can verify your throughput with iperf between your nodes.
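For example, with iperf3 (the receiver address is a placeholder; parallel streams help saturate fast links):

```bash
# On the receiving node:
iperf3 -s

# On the sending node:
iperf3 -c <receiver-ip> -P 4 -t 30
```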


Step 3: Calculate AllReduce Time

Now you can estimate your AllReduce time:

AllReduce time = Gradient size ÷ Network throughput

Example: BERT-base on 1 GbE

Gradient size: 440 MB
Network throughput: 110 MB/s

AllReduce time = 440 ÷ 110 = 4.0 seconds

Example: BERT-base on 100 GbE RDMA

Gradient size: 440 MB
Network throughput: 11,000 MB/s

AllReduce time = 440 ÷ 11,000 = 0.04 seconds = 40 ms

That's a 100x difference in network time alone.
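The same division in shell, reproducing both examples:

```bash
awk 'BEGIN {
  grad_mb = 440                                             # BERT-base, FP32
  printf "1 GbE:         %.1f s\n",  grad_mb / 110          # ~4.0 s
  printf "100 GbE RDMA:  %.0f ms\n", grad_mb / 11000 * 1e3  # ~40 ms
}'
```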


Step 4: Estimate GPU Compute Time

GPU compute time (forward + backward pass) varies significantly based on model, batch size, sequence length, precision, and hardware.

For transformer models like BERT, a single iteration typically takes tens to hundreds of milliseconds on modern data center GPUs. The backward pass generally takes 2-3x longer than the forward pass.

For accurate numbers on your specific hardware, measure a few iterations directly (e.g. with simple timers or torch.profiler) or consult published MLPerf results.

Relative performance between GPU generations (validated by MLPerf):

| Comparison | Speedup | Source |
|---|---|---|
| H100 vs A100 (BERT) | ~2-3x | NVIDIA MLPerf Blog |
| A100 vs V100 (language models) | ~2-2.5x | Lambda Labs Benchmarks |

The exact compute time matters less than the ratio. What you need to know: is your network time significantly larger than your compute time?


Step 5: Calculate GPU Utilization

Now put it together:

Total iteration time = GPU compute time + AllReduce time

GPU utilization = GPU compute time ÷ Total iteration time

Example calculation:

Let's say your GPU compute time is 150 ms (measured or estimated from benchmarks).

With 1 GbE (AllReduce = 4,000 ms for 440 MB):

GPU utilization = 150 ÷ (150 + 4,000) = 150 ÷ 4,150 = 3.6%

Your GPUs are 96% idle, waiting on the network.

With 100 GbE RDMA (AllReduce = 40 ms):

GPU utilization = 150 ÷ (150 + 40) = 150 ÷ 190 = 79%

The same GPUs are now productive 79% of the time.
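Both utilization numbers in one short script (the 150 ms compute time is the assumption from above):

```bash
awk 'BEGIN {
  compute = 150              # ms per iteration, assumed GPU compute time
  ar["1 GbE"] = 4000         # AllReduce ms from Step 3
  ar["100 GbE RDMA"] = 40
  for (n in ar) printf "%-13s utilization = %.1f%%\n", n, 100 * compute / (compute + ar[n])
}'
# 1 GbE         utilization = 3.6%
# 100 GbE RDMA  utilization = 78.9%
```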


Step 6: Calculate Training Time Impact

Once you know your iteration time, multiply by the number of iterations:

Total training time = Time per iteration × Number of iterations

Example: 100,000 iterations (typical for fine-tuning)

Assuming 150 ms GPU compute + network time from your setup:

| Network | AllReduce | Total/Iteration | Training Time |
|---|---|---|---|
| 1 GbE | 4,000 ms | 4,150 ms | 115 hours (4.8 days) |
| 100 GbE RDMA | 40 ms | 190 ms | 5.3 hours |

That's ~22x faster just by upgrading the network.

The iteration count depends on your task. For reference:

  • Fine-tuning: 10,000 - 100,000 iterations
  • Pre-training: 100,000 - 1,000,000+ iterations

See Hugging Face Training Documentation for typical training configurations.


Quick Reference: AllReduce Time by Network

Use this table to estimate your AllReduce time based on gradient size and network:

| Gradient Size | 1 GbE (~110 MB/s) | 10 GbE (~1.1 GB/s) | 100 GbE RDMA (~11 GB/s) |
|---|---|---|---|
| 100 MB (ResNet-50) | 0.9 sec | 90 ms | 9 ms |
| 440 MB (BERT-base) | 4.0 sec | 400 ms | 40 ms |
| 1.4 GB (BERT-large) | 13 sec | 1.3 sec | 130 ms |
| 6 GB (GPT-2 1.5B) | 55 sec | 5.5 sec | 550 ms |

Key takeaways:

  • 1 GbE is not viable for any serious distributed training
  • 10 GbE is marginal — still seconds of wait time for larger models
  • 100 GbE RDMA is the minimum for keeping GPUs productive



The Formula

Network Bottleneck Formula

Rule of thumb: If GPU utilization is below 50%, your network is the bottleneck.

When to worry:

  • Network time > Compute time → You're network-bound
  • Network time > 10x Compute time → Severely network-bound
  • GPU utilization < 20% → You're paying for idle GPUs

The Fix: High-Speed RDMA Networking

If the math shows your network is the problem, the solution is high-speed RDMA networking:

  • NICs: NVIDIA ConnectX-6 or ConnectX-7 (100-400 GbE)
  • Protocol: RoCE v2 (RDMA over Converged Ethernet) or InfiniBand
  • Why RDMA: Bypasses the CPU, enables zero-copy transfers, delivers 10-50x lower latency than TCP

We've written a detailed walkthrough of what this looks like on Kubernetes in Dual-network RDMA on GH200, and a full case study at From 10x Slower to Line-Rate: Building RDMA-Enabled Kubernetes for HPC and Distributed GPU Training.


Summary

Before optimizing your model, your batch size, or your learning rate—check your network:

  1. Calculate gradient size: parameters × 4 bytes (FP32)
  2. Know network throughput: 1 GbE ≈ 110 MB/s, 100 GbE ≈ 11 GB/s
  3. Calculate AllReduce time: gradient size ÷ throughput
  4. Estimate GPU compute time: forward + backward pass
  5. Calculate GPU utilization: compute ÷ (compute + network)

If your GPUs are spending more time waiting than computing, no amount of software optimization will help. You need faster networking.


Have questions about your distributed training setup? Schedule a call—we're happy to help you diagnose the bottleneck.


Frequently Asked Questions

How do I know if my distributed training is bottlenecked by the network?

Estimate per-iteration AllReduce time (gradient size ÷ network throughput) and compare it to per-iteration GPU compute time. If AllReduce is larger than compute, you're network-bound. A quick rule of thumb: if measured GPU utilization during multi-node training is under 50%, the network is almost certainly the bottleneck.

How big is the AllReduce payload in distributed training?

It equals your model's full gradient: number of parameters × bytes per parameter (4 for FP32, 2 for FP16/BF16). BERT-base is about 440 MB in FP32; BERT-large is 1.4 GB; a 1.5B-parameter model is 6 GB. That entire payload must cross the network every iteration, so network throughput directly determines training speed.

Why is 1 GbE or 10 GbE not enough for GPU training?

At ~110 MB/s effective throughput, 1 GbE needs 4 seconds to AllReduce a 440 MB BERT-base gradient — while the GPU compute for that iteration is under 200 ms. Even 10 GbE leaves GPUs idle for hundreds of milliseconds per step. You need 100 GbE with RDMA (or better) to keep modern GPUs productive in multi-node training.

What does RDMA actually change vs plain TCP for training?

RDMA (RoCE v2 or InfiniBand) bypasses the CPU and kernel, enables zero-copy GPU-to-GPU transfers via GPUDirect, and delivers 10-50x lower latency and significantly higher effective bandwidth than TCP on the same physical link. On a 100 GbE link, that's the difference between ~9 GB/s with TCP and ~11 GB/s with RDMA — plus massively lower per-message overhead.

How much faster can distributed training get once the network is fixed?

It depends on how network-bound you are today. In our engagements, moving from TCP to properly configured RDMA typically delivers 3-8x higher training throughput. For severely network-bound setups (e.g. BERT-scale models on 1 GbE), the gap between old and new networking can be 20x or more on end-to-end training wall-clock time.


· 6 min read

If you're building a multi-node GPU cluster for distributed training, you've probably run into a confusing mess of terminology — NVLink, NVSwitch, InfiniBand, RoCE, GPUDirect. Half the blog posts out there mix these up, and vendor documentation assumes you already know what you're doing.

So let's sort this out.

· 8 min read

I was lucky to work on a computer vision setup which involved NVIDIA RTX 6000 GPUs, Mellanox ONYX switches, and high-resolution cameras. The system was designed for real-time video capture and processing, pushing massive amounts of data through our network infrastructure. Here's a gist of debugging Mellanox switch metrics that led to some surprising discoveries about network traffic flow.

When monitoring network equipment like switches, routers, or network cards, you'll constantly encounter two metrics: RX and TX. These simple abbreviations are fundamental to understanding how data flows through your network, yet they often cause confusion. Let's demystify them with real-world examples from our production Mellanox ONYX switch. If you're operating GPU infrastructure, this is exactly the kind of low-level visibility you need — it's a recurring theme in our GPU monitoring and observability engagements.


The Basics: What Do RX and TX Mean?

RX = Receive - data coming INTO a network port
TX = Transmit - data going OUT of a network port

Think of each network port like a doorway. RX counts everyone walking in, while TX counts everyone walking out. Simple enough, right? The confusion often comes when trying to understand what these patterns mean for your specific setup.

Real Network Example: Mellanox ONYX Switch Analysis

Let's look at actual output from a production Mellanox ONYX switch to see how this works in practice. First, let's check which ports are actually up:

```bash
curl -k -b cookie.txt -X POST \
  "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "show interfaces ethernet description",
    "execution_type": "sync"
  }'
```

Output shows our active ports:

```json
{
  "Eth1/1":  { "Operational state": "Up", "Speed": "25G" },
  "Eth1/7":  { "Operational state": "Up", "Speed": "25G" },
  "Eth1/19": { "Operational state": "Up", "Speed": "100G" },
  "Eth1/21": { "Operational state": "Up", "Speed": "100G" }
}
```

Reading Traffic Patterns from Real Data

Now let's examine the actual traffic statistics:

"Eth1/7": {
"Rx": {
"packets": "6078365878",
"bytes": "46755253366062",
"packets Jumbo": "6047872646"
},
"Tx": {
"packets": "21008230",
"bytes": "3877663298"
}
}

This port receives 289× more packets than it sends (and roughly 12,000× more bytes) - a classic collector pattern!

"Eth1/19": {
"Rx": {
"packets": "5619576",
"bytes": "370489758"
},
"Tx": {
"packets": "41154237203",
"bytes": "316279712716706"
}
}

This port transmits 7,324× more packets than it receives - a massive distribution hub!

Why Direction Matters: Full-Duplex Explained

Every network port has two independent data paths. When we see "100G" port speed, that means:

  • 100 Gbps receiving capacity AND
  • 100 Gbps transmitting capacity
  • Total theoretical throughput: 200 Gbps bidirectional

Common Misconceptions Revealed

Initially, you might assume camera-to-server traffic would look like:

  • Camera ports: High TX (sending video)
  • Server ports: High RX (receiving video)

But our real data shows the opposite! Here's the actual traffic flow:

```
Port 7 (25G) ──RX(6B packets)──> Switch ──TX(41B packets)──> Port 19 (100G)
                                    └────TX(2.6B packets)──> Port 21 (100G)
```

This suggests Port 7 is receiving from an aggregation point, and Ports 19/21 are distributing to multiple endpoints. The same "don't just look at the packets, look at the fabric" mindset applies when you're tracing AllReduce traffic — see How to calculate if your network is bottlenecking distributed training for the GPU-training equivalent.

Monitoring in Action: Getting Real-Time Metrics

Here's how to monitor these metrics on your switch:

```bash
# Get current counters for all interfaces (the URL must be quoted:
# unquoted ampersands would background the command in a shell)
curl -k -b cookie.txt -X POST \
  "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "show interfaces ethernet counters",
    "execution_type": "sync"
  }'
```

Calculating Actual Utilization

With the data we collected, let's calculate average bandwidth usage:

Port 19 (100G capacity):

  • TX: 316,279,712,716,706 bytes total
  • If accumulated over 30 days: ~1 Gbps average

Port 7 (25G capacity):

  • RX: 46,755,253,366,062 bytes total
  • If accumulated over 30 days: ~144 Mbps average

Long-window averages like these hide burstiness: a port averaging 1 Gbps over a month can still spend hours at or near line rate. That's why the rate-based polling shown below matters more than lifetime counters.
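The conversion itself is mechanical. A quick shell check for Port 19, assuming the 30-day window from above:

```bash
# average rate = bytes × 8 / seconds
awk 'BEGIN {
  bytes = 316279712716706    # Port 19 lifetime TX counter
  secs  = 30 * 24 * 3600     # assumed accumulation window (30 days)
  printf "%.2f Gbps average\n", bytes * 8 / secs / 1e9
}'
# -> 0.98 Gbps average
```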

Key Metrics to Monitor

From our actual switch output, focus on these fields:

```json
{
  "Primary Metrics": {
    "packets": "Total packet count",
    "bytes": "Total byte count"
  },
  "Health Indicators": {
    "error packets": "0",
    "discard packets": "0",
    "fcs errors": "0"
  },
  "Traffic Types": {
    "unicast packets": "41123912532",
    "multicast packets": "30090769",
    "broadcast packets": "233902"
  }
}
```

Practical Monitoring Script

To continuously monitor RX/TX rates:

```bash
# Poll every 30 seconds and calculate rate
PREV_RX=0
PREV_TX=0
FIRST=1

while true; do
  # Get current counters (example for Port 19; curl arguments elided).
  # jq -r strips the JSON quotes so the values work in shell arithmetic.
  CURR_RX=$(curl -sk ... | jq -r '.["Eth1/19"][0]["Rx"][0]["bytes"]')
  CURR_TX=$(curl -sk ... | jq -r '.["Eth1/19"][1]["Tx"][0]["bytes"]')

  # Skip the first sample (no previous value to diff against yet)
  if [ "$FIRST" -eq 0 ]; then
    # Rate in Mbps: delta bytes * 8 bits / 30 s / 10^6
    RX_RATE=$(( (CURR_RX - PREV_RX) * 8 / 30 / 1000000 ))
    TX_RATE=$(( (CURR_TX - PREV_TX) * 8 / 30 / 1000000 ))
    echo "Port 19: RX=${RX_RATE} Mbps, TX=${TX_RATE} Mbps"
  fi

  FIRST=0
  PREV_RX=$CURR_RX
  PREV_TX=$CURR_TX
  sleep 30
done
```

The PTP Clue: Understanding the Context

Our switch configuration reveals another important detail:

"show running-config" output:
```
##
## PTP protocol
##
protocol ptp
interface ethernet 1/1 ptp enable
interface ethernet 1/7 ptp enable
interface ethernet 1/19 ptp enable
```

PTP (Precision Time Protocol) on all ports indicates this is a video production environment where precise timing synchronization is critical for frame-accurate video capture.

Key Takeaways from Real Data

  1. Don't assume traffic direction - Our "video" ports showed opposite patterns than expected
  2. Jumbo frames indicate video - 6 billion jumbo packets on Port 7 suggest video traffic
  3. 100G ports as distributors - Both 100G ports primarily transmit, indicating fan-out architecture
  4. Monitor both directions - Full-duplex means both paths matter for capacity planning
  5. Context matters - PTP configuration revealed this was video production, explaining the traffic patterns

Remember: These RX/TX metrics are always from the port's perspective. When troubleshooting, physically trace cables or clear counters to see fresh traffic patterns rather than historical accumulation. If you're trying to correlate switch-side counters with distributed training throughput end-to-end, our distributed training optimization work routinely starts at exactly this layer.


Pro tip: If your switch API is slow (>1 second response time), use SNMP polling instead - it's 10-100x faster for retrieving interface counters!


Frequently Asked Questions

What do RX and TX mean on a network switch?

RX (receive) counts traffic coming into a switch port from whatever is cabled to it, and TX (transmit) counts traffic leaving the port. Both counters are from the switch port's point of view, so a server sending data to the switch shows as RX on the switch port and TX on the server's NIC. Every port is full-duplex, so both can run at line rate simultaneously.

Why do my switch ports show such asymmetric RX vs TX traffic?

Asymmetry is normal and tells you the role of the port in your topology. Ports aggregating data from many sources (collectors) show RX far larger than TX. Uplinks and fan-out ports that push data to multiple endpoints show TX far larger than RX. For GPU training clusters, well-behaved AllReduce traffic tends to be balanced — a persistent imbalance during training usually indicates a topology or NCCL ring issue.

How do I calculate actual bandwidth utilization from byte counters?

Sample the byte counter twice at a known interval, take the delta, convert to bits, and divide by the interval to get bps. A simple 30-second polling loop gives you a bandwidth-over-time view. Compare to the port's rated capacity (e.g. 100 Gbps) to get utilization. For short bursts, poll every second or use sFlow/IPFIX — accumulated counters hide burstiness.

Are RX/TX counters reliable for detecting network problems?

Byte and packet counters show throughput, not health. Pair them with error counters (CRC errors, discards, pause frames, ECN marks) and link-state events. A port at 80% utilization with zero errors is healthy; a port at 20% utilization with rising discards and pause frames is a much bigger problem. For RDMA/RoCE fabrics, PFC and ECN counters matter more than raw bandwidth.

Should I use REST/API or SNMP to poll switch counters?

For high-frequency polling use SNMP — it's typically 10-100x faster than vendor REST APIs and designed for exactly this workload. Use REST/API when you need structured data (e.g. per-queue counters) or to drive automation. Many monitoring stacks combine both: SNMP for continuous counter scraping, REST for configuration and richer topology metadata.