Technical Case Study

8.5x Faster Distributed Training

RDMA on Bare Metal Kubernetes

How we helped a computer vision company deploy bare metal Kubernetes with GPUDirect RDMA over RoCE for high-performance ML workloads

  • 10x AllReduce latency improvement (from ~100 ms to ~10 ms)
  • 8.5x training throughput increase (for BERT-sized models)
  • 78% GPU utilization (up from 35%)
  • 4 days → 11 hrs time-to-train reduction (for production models)

Executive Summary

A mid-sized computer vision company had invested in GPU infrastructure for their ML platform but hit a wall: distributed training jobs that should complete in hours were taking days. Their data science team was frustrated, GPU utilization was under 40%, and leadership was questioning the ROI of their hardware investment.

The root cause: network bottlenecks. Without RDMA, gradient synchronization between GPUs was crawling over TCP/IP, burning expensive GPU cycles waiting on the network. Their RTX A5000 and A5500 cards sat idle during AllReduce operations while the CPU copied data between system memory and network buffers.

We implemented a complete network redesign using GPUDirect RDMA over RoCE (RDMA over Converged Ethernet), integrated with their existing Kubernetes infrastructure through the NVIDIA Network Operator and Multus CNI.

The Challenge

Client Context

The client, a mid-sized computer vision company, had built an on-premises ML platform to train object detection and image segmentation models. Privacy requirements and data residency regulations made cloud training impractical for their most sensitive workloads.

Their Infrastructure

  • 2 Dell Precision workstations: one with RTX A5000 (2 GPUs), one with RTX A5500 (2 GPUs)
  • Bare metal Kubernetes cluster (Ubuntu 24.04, K8s 1.29)
  • NVIDIA GPU Operator for device management
  • Calico CNI for pod networking
  • NFS for shared storage, local NVMe for scratch space
  • KAI Scheduler for GPU-aware job scheduling

The Problem

Training jobs that used all 4 GPUs across both nodes were painfully slow. A BERT-base fine-tuning job that benchmarked at 6 hours on a single 8-GPU cloud instance was taking over 50 hours on their 4-GPU cluster.

| Symptom | Observed | Expected |
|---|---|---|
| GPU utilization during training | 30-40% | >80% |
| AllReduce time per iteration | ~400 ms | <50 ms |
| Network throughput during sync | ~800 Mbps | >10 Gbps |
| CPU utilization during sync | 85%+ (single core) | <10% |
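
To reproduce the GPU utilization symptom above, per-GPU utilization can be sampled during a training run. A minimal sketch, assuming the `pynvml` bindings (NVIDIA Management Library) are installed; the one-second interval and one-minute duration are illustrative:

```python
# Hypothetical monitoring snippet: sample per-GPU utilization once per second.
# Assumes the pynvml package (NVIDIA Management Library bindings) is installed.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    for _ in range(60):  # sample for about a minute; duration is illustrative
        gpu_util = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        print("GPU utilization (%):", gpu_util)
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```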

Root Cause Analysis

We identified three compounding issues:

1. No RDMA capability. The integrated NICs didn't support RDMA. Every gradient sync required: GPU memory → PCIe → System RAM → CPU (TCP/IP stack) → NIC → Wire → NIC → CPU → System RAM → PCIe → GPU memory. The CPU was in the critical path for every byte transferred.
2. No GPUDirect. Without GPUDirect RDMA, NCCL fell back to the Socket transport. Each AllReduce operation involved multiple memory copies and CPU intervention, adding 50-100 μs of latency per operation.
3. Inadequate bandwidth. Even ignoring latency, 1GbE (125 MB/s) couldn't move a 400MB gradient payload fast enough. At theoretical maximum, that's 3.2 seconds per AllReduce—just for the network transfer.
[Figure: TCP/IP Data Path vs RDMA Data Path. The TCP/IP path requires multiple memory copies through the CPU; the RDMA path bypasses the CPU entirely.]

The math was clear: to make distributed training viable, they needed 100x bandwidth improvement and 10-50x latency reduction. That meant RDMA.
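
A quick sketch of that arithmetic, using the payload size and link rates from the list above (real AllReduce algorithms move somewhat more than one payload per rank, so these are lower bounds):

```python
# Back-of-the-envelope transfer time for a 400 MB gradient payload.
payload_bytes = 400e6  # ~400 MB of gradients, as discussed above

links_bytes_per_s = {
    "1 GbE (TCP/IP)": 1e9 / 8,       # ~125 MB/s
    "100 GbE (RoCE v2)": 100e9 / 8,  # ~12.5 GB/s
}

for name, rate in links_bytes_per_s.items():
    seconds = payload_bytes / rate
    print(f"{name}: {seconds * 1000:.0f} ms per transfer")

# 1 GbE   -> ~3200 ms (the 3.2 seconds noted above)
# 100 GbE -> ~32 ms
```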

Solution Architecture

Technology Selection: RoCE vs InfiniBand

For RDMA, there are two main options: InfiniBand and RoCE (RDMA over Converged Ethernet). We evaluated both:

| Factor | InfiniBand (HDR/NDR) | RoCE v2 (100GbE) |
|---|---|---|
| Bandwidth | 200-400 Gb/s | 100 Gb/s |
| Latency | ~0.5-1 μs | ~1-2 μs |
| Switch cost | $15-40K (IB switch) | $3-8K (DCB Ethernet) |
| NIC cost | ~$1,500-3,000 | ~$500-1,000 |
| Expertise required | Specialized | Familiar to network teams |

Decision: RoCE v2

For a 2-4 node deployment, RoCE offers 95% of InfiniBand's performance at 30% of the cost. The slight latency penalty (1-2μs vs 0.5-1μs) is negligible for gradient payloads measured in hundreds of megabytes.

Hardware Specification

| Component | Specification | Purpose |
|---|---|---|
| NIC | NVIDIA ConnectX-6 Dx 100GbE (dual-port) | RDMA-capable network interface |
| Switch | NVIDIA SN2201 or Dell S5248F-ON | DCB-capable Ethernet with PFC/ECN |
| Cabling | DAC (Direct Attach Copper) or 100GbE QSFP28 | Node interconnect |

Network Topology Design

We designed a physically separated network architecture with two distinct planes:

[Figure: Physical Network Topology. Separate management and RDMA networks connect the two GPU servers.]

Management Network (existing)

Integrated NICs connected to the existing management switch. Handles Kubernetes control plane, SSH access, monitoring, and NFS traffic. No changes required.

RDMA Network (new)

ConnectX-6 Dx NICs connected to a dedicated DCB switch. Handles only GPU-to-GPU NCCL traffic. Flat L2 network with PFC/ECN enabled.

Why Flat L2 (No VLANs, No VXLAN)

  • No VLANs needed: With only 2 nodes on a dedicated switch, there's nothing to segment.
  • No VXLAN: Encapsulation overhead kills RDMA performance. VXLAN adds headers and processing that defeat the purpose of zero-copy transfers.
  • Simple PFC configuration: Priority Flow Control is easier to configure and debug on a flat network.

Kubernetes Integration

The Multi-Network Challenge

Kubernetes assumes a single network per pod. Our design requires pods to have two networks: the primary Calico network for Kubernetes services and a secondary RDMA network for NCCL traffic. This is where Multus CNI comes in.

[Figure: Pod Network Architecture with Multus CNI]

  • eth0: primary interface (Calico) for Kubernetes services, DNS, and API server communication
  • net1: secondary interface (RDMA) for NCCL collective operations
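
A minimal sketch of how to confirm, from inside a training pod, that both interfaces were attached as expected. `psutil` is an assumed dependency here, and the interface names follow the eth0/net1 convention above:

```python
# Hypothetical in-pod check that Multus attached both interfaces.
import socket
import psutil

addrs = psutil.net_if_addrs()
for name in ("eth0", "net1"):
    ipv4 = [a.address for a in addrs.get(name, []) if a.family == socket.AF_INET]
    status = ipv4[0] if ipv4 else "MISSING"
    print(f"{name}: {status}")

# NCCL traffic should be steered at the secondary interface, e.g. NCCL_SOCKET_IFNAME=net1
```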

Component Stack

The complete Kubernetes stack for RDMA-enabled GPU training:

| Component | Purpose |
|---|---|
| Calico | Primary CNI for pod networking |
| Multus CNI | Meta-CNI for multiple network interfaces |
| NVIDIA Network Operator | RDMA drivers, device plugin, secondary networks |
| whereabouts | IPAM for the secondary network |
| NVIDIA GPU Operator | GPU drivers, device plugin |
| KAI Scheduler | Gang scheduling for distributed jobs |

SR-IOV vs Host-Device: A Critical Decision

| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Host-Device | Entire NIC moved into pod namespace | Simple, full performance | Exclusive access: one job per NIC |
| SR-IOV | Virtual Functions (VFs) carved from physical NIC | Multiple jobs share NIC | More complex setup, ~5% overhead |

Decision: Host-Device with Dual Ports

For this 2-node deployment running one distributed training job at a time, host-device provides the simplest path. The dual-port ConnectX-6 Dx gives us two RDMA resources per node, allowing two concurrent RDMA-enabled jobs if needed.

Distributed Training Setup

Framework: Kubeflow Training Operator

For running distributed PyTorch jobs on Kubernetes, we deployed the Kubeflow Training Operator. It provides the PyTorchJob CRD for distributed PyTorch training, automatic worker discovery and rendezvous coordination, and integration with the KAI Scheduler for gang scheduling.
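
As a sketch of what a worker runs, the entrypoint below is a minimal DistributedDataParallel loop relying on the environment variables the Training Operator injects (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK). The model, data, and loop are placeholders, not the client's workload:

```python
# A minimal sketch of a worker entrypoint for a PyTorchJob.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # reads the injected env vars
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(100):                                    # placeholder loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()      # gradient AllReduce happens here, over NCCL
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```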

[Figure: Distributed Training Stack. Kubeflow Training Operator and PyTorchJob integrate with the KAI Scheduler.]

NCCL Configuration for RoCE

NCCL (NVIDIA Collective Communication Library) must be explicitly configured to use the RDMA interface. Key environment variables include enabling the InfiniBand/RoCE path, specifying the ConnectX-6 device name, and setting the correct RoCEv2 GID index.
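
A minimal sketch of that configuration, applied before torch.distributed initializes. The device name (mlx5_0) and interface name (net1) are assumptions for this environment; the GID lookup reads the sysfs layout exposed by the ConnectX driver:

```python
import os
from pathlib import Path

def find_rocev2_gid_index(device: str = "mlx5_0", port: int = 1) -> int:
    """Return the first GID index on the device whose type is RoCE v2."""
    types_dir = Path(f"/sys/class/infiniband/{device}/ports/{port}/gid_attrs/types")
    for entry in sorted(types_dir.iterdir(), key=lambda p: int(p.name)):
        try:
            if entry.read_text().strip() == "RoCE v2":
                return int(entry.name)
        except OSError:
            continue  # unpopulated GID table slots fail to read; skip them
    raise RuntimeError(f"No RoCE v2 GID found on {device} port {port}")

os.environ["NCCL_IB_DISABLE"] = "0"        # enable the InfiniBand/RoCE transport
os.environ["NCCL_IB_HCA"] = "mlx5_0"       # ConnectX-6 Dx device name (assumed)
os.environ["NCCL_IB_GID_INDEX"] = str(find_rocev2_gid_index())
os.environ["NCCL_SOCKET_IFNAME"] = "net1"  # Multus secondary interface
os.environ["NCCL_DEBUG"] = "INFO"          # verify NET/IB is selected at startup
```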

Results

Performance Benchmarks

We benchmarked before and after using nccl-tests (all_reduce_perf) and real training workloads:

| Metric | Before (TCP) | After (RoCE) | Improvement |
|---|---|---|---|
| AllReduce latency (100 MB) | ~95 ms | ~9 ms | 10.5x |
| AllReduce bandwidth | ~800 Mbps | ~89 Gbps | 111x |
| GPU utilization (training) | 35% | 78% | 2.2x |
| BERT-base fine-tuning time | 52 hours | 6.1 hours | 8.5x |
| Checkpoint save (2 GB) | 12 seconds | 1.8 seconds | 6.7x |
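
As a cross-check, the first two "after" rows are consistent with each other, which suggests the bandwidth row is the algorithmic bandwidth nccl-tests reports (message size divided by completion time). A quick sketch of that arithmetic:

```python
# Relating the AllReduce latency and bandwidth rows above.
def algbw_gbps(size_bytes: float, seconds: float) -> float:
    return size_bytes / seconds * 8 / 1e9  # bits per second, expressed in Gb/s

print(f"{algbw_gbps(100e6, 9e-3):.0f} Gb/s")  # ~89 Gb/s for 100 MB in ~9 ms
```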

Key Insight

The 8.5x training speedup was greater than the AllReduce improvement alone would predict. Reducing synchronization time didn't just make each AllReduce faster; it let the GPUs process more batches per unit time instead of waiting on the network, compounding the gains.
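
One way to see why this was surprising: a purely serial compute-plus-communication model predicts a much smaller speedup. A sketch using the ~400 ms per-iteration AllReduce time from the symptom table, a roughly 10x communication improvement, and an assumed (illustrative) compute time:

```python
# Illustrative step-time model; the compute time per iteration is an assumption.
compute_s = 0.150        # assumed on-GPU compute per iteration (illustrative)
comm_before_s = 0.400    # AllReduce per iteration before the redesign
comm_after_s = 0.040     # roughly 10x faster after RoCE + GPUDirect

speedup = (compute_s + comm_before_s) / (compute_s + comm_after_s)
print(f"serial-model speedup: {speedup:.1f}x")  # ~2.9x, versus the observed 8.5x
```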

Operational Impact

Model iteration time

Data scientists can now run 5-6 experiments per week instead of 1-2.

GPU ROI

Effective utilization more than doubled, improving the cost-per-trained-model significantly.

Team morale

No more babysitting multi-day training jobs or debugging mysterious slowdowns.

When to Use This Architecture

This solution is appropriate when:

  • You're running multi-node distributed training (DDP, FSDP, DeepSpeed)
  • Gradient payloads exceed 100MB (most modern models)
  • GPU utilization during training is under 60%
  • You control the hardware (on-prem, colo, bare metal cloud)

This solution is overkill when:

  • Training fits on a single node
  • You're doing inference only
  • You're on shared cloud infrastructure without RDMA support

Facing Similar Challenges?

If you're dealing with GPU infrastructure challenges—utilization, performance, reliability, or building something new—we should talk. No sales pitch. Just a conversation about what you're trying to do and whether we can help.

Schedule a Call