Service

GPU Networking & RDMA

The network between your GPUs is the single biggest performance lever in distributed training. A misconfigured switch port or missing PFC config silently kills throughput for the entire cluster. We design and implement RDMA networks that run at wire rate.

400 Gb/s per-port bandwidth
<2 μs RDMA latency
Zero CPU overhead

What We Do

  • InfiniBand fabric — Quantum switch deployment, subnet manager configuration, adaptive routing, fat-tree/dragonfly topology design, partition keys
  • RoCE v2 fabric — Lossless Ethernet with PFC, ECN/DCQCN tuning, leaf-spine design, ECMP multi-path, jumbo frames, DSCP trust
  • GPUDirect RDMA — Zero-copy GPU-to-GPU transfers bypassing CPU, peer memory module setup, GDR copy validation, firmware tuning
  • Switch configuration — Spectrum-X / Quantum switch deployment, port speed validation, error counter monitoring, QoS policies, MTU config
  • Network Operator on Kubernetes — NicClusterPolicy setup, Multus secondary networks, SR-IOV, RDMA device plugin. We contributed the global config feature
  • Dual-network architectures — Separate management and RDMA training networks, MACVLAN/IPVLAN secondary interfaces, network isolation for multi-tenant clusters
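On the host side, the RoCE v2 work above comes down to a handful of NIC settings. A minimal sketch for one ConnectX port, assuming Mellanox OFED tools are installed; the interface name eth2 and device name mlx5_0 are placeholders, and the switch-side PFC/ECN configuration must match:

```shell
# Enable PFC on priority 3 only -- the lossless traffic class carrying RoCE
mlnx_qos -i eth2 --pfc 0,0,0,1,0,0,0,0

# Trust DSCP markings instead of PCP so QoS survives L3 routing
mlnx_qos -i eth2 --trust dscp

# Jumbo frames for RDMA payload efficiency
ip link set eth2 mtu 9000

# Default RDMA CM connections to RoCE v2 on port 1 of the adapter
cma_roce_mode -d mlx5_0 -p 1 -m 2
```

The same priority and DSCP values have to be configured on every switch in the path; a lossless class that exists only on the host drops packets at the first hop.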

Proof

We configured GPUDirect RDMA over RoCE on bare-metal Kubernetes with ConnectX-6 NICs. Result: 10x inter-node latency reduction and 8.5x training throughput improvement over the previous TCP configuration.

We also deployed dual-network Kubernetes pods with RDMA on NVIDIA GH200 — separate management and training networks with working RDMA verbs.
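Numbers like these can be sanity-checked with the perftest suite before and after any change. A minimal sketch, assuming ConnectX NICs; mlx5_0 and the server address are placeholders:

```shell
# On the server node: wait for a connection on the RDMA device
ib_write_lat -d mlx5_0 -R

# On the client node: measure RDMA write latency to the server
ib_write_lat -d mlx5_0 -R 10.0.0.1

# Same pair of commands with ib_write_bw measures sustained bandwidth;
# it should land near line rate if PFC/ECN and MTU are correct
ib_write_bw -d mlx5_0 -R 10.0.0.1
```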

InfiniBand vs RoCE

                      InfiniBand               RoCE v2
Lossless guarantee    Built-in (credit-based)  Requires PFC config
Congestion handling   Native                   ECN/DCQCN must be tuned
Bandwidth             400 Gb/s (NDR)           400 Gb/s (ConnectX-7)
Switch vendors        NVIDIA Quantum only      Multiple vendors
Ops complexity        Needs subnet manager     Standard Ethernet ops
Cost                  Higher                   Lower

How We Work

1. Assess: Audit link speeds, error counters, PFC/ECN, PCIe topology.

2. Design: Fabric topology, oversubscription, QoS, traffic separation.

3. Implement: Configure switches, NICs, RDMA, GPUDirect. Validate end-to-end.

4. Transfer: Network monitoring dashboards, runbooks, documentation.
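The assessment step can be sketched as a short host-side audit; interface and device names below are placeholders, assuming iproute2, OFED tools, and an NVIDIA GPU node:

```shell
# Negotiated link speed and state per port
ethtool eth2 | grep -E 'Speed|Link detected'

# Pause/PFC and error counters; climbing discards point at missing PFC
ethtool -S eth2 | grep -iE 'pause|discard|drop'

# RDMA device to netdev mapping and port state
rdma link show
ibstat

# GPU-to-NIC PCIe affinity: each RDMA NIC should share a PCIe
# switch (PIX/PXB) with the GPUs it serves
nvidia-smi topo -m
```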

Technologies

InfiniBand · RoCE v2 · GPUDirect RDMA · ConnectX-6/7 · Spectrum-X · Quantum · NCCL · Network Operator · Multus · SR-IOV · MACVLAN

Frequently Asked Questions

What is RDMA and why does it matter for GPU training?

RDMA lets NICs read and write remote memory directly, bypassing the CPU and kernel. Combined with GPUDirect RDMA, it enables zero-copy GPU-to-GPU transfers across nodes — 5-10x higher bandwidth and an order-of-magnitude lower latency than TCP on the same hardware.

Should I use InfiniBand or RoCE?

Both deliver RDMA performance. InfiniBand is a purpose-built lossless fabric standard in DGX SuperPOD deployments. RoCE v2 runs RDMA over Ethernet — cheaper, more flexible, and the right choice for most cloud, colo, and bare-metal clusters when PFC and ECN are configured correctly.

Do I need PFC and ECN for RoCE?

Yes, if you want lossless RoCE v2. PFC prevents packet drops during microbursts; ECN signals congestion before buffers overflow. Without both configured end-to-end on NICs, switches, and host settings, RoCE falls over under load and NCCL silently underperforms.

What is GPUDirect RDMA?

GPUDirect RDMA lets the NIC DMA directly to and from GPU memory without an intermediate CPU copy. It requires matched driver support, peer-memory modules, and PCIe affinity between GPU and NIC. When enabled, inter-node GPU communication drops to single-digit microseconds.
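Whether those prerequisites are met can be checked from the shell. A sketch, assuming an NVIDIA GPU node with OFED and a CUDA-enabled perftest build; mlx5_0 and the GPU index are placeholders:

```shell
# Peer-memory module must be loaded: nvidia_peermem on current drivers,
# nv_peer_mem on legacy OFED installs
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'

# PCIe affinity: GPUDirect RDMA stays on one PCIe switch only for
# GPU/NIC pairs shown as PIX or PXB
nvidia-smi topo -m

# Bandwidth test using GPU 0's memory as the buffer (requires a
# perftest build compiled with CUDA support)
ib_write_bw -d mlx5_0 --use_cuda=0
```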

Can you fix existing GPU networking problems?

Yes. A lot of our work is forensic: NCCL falling back to TCP, PFC dropping packets under load, GPU-NIC PCIe affinity mismatches, wrong NCCL_IB_HCA, incorrect DSCP marking. We bring perftest, nccl-tests, and switch counter experience to find and fix these without replacing hardware.
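The first forensic step is usually making NCCL report what it actually chose. A sketch, assuming nccl-tests is built locally; device names are placeholders:

```shell
# Make NCCL log its transport selection: "NET/IB" means RDMA is in use,
# "NET/Socket" means it silently fell back to TCP
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Pin NCCL to the intended RDMA devices
export NCCL_IB_HCA=mlx5_0,mlx5_1

# Reproduce with nccl-tests: all-reduce sweep from 8 B to 1 GiB on 8 GPUs
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
```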

Do I need two NICs per GPU node?

For production distributed training, yes. One NIC for Kubernetes management (pod CNI, API traffic, metrics) and one or more RDMA-capable NICs dedicated to NCCL/training traffic via Multus secondary networks. Single-NIC works for POCs but degrades at scale.
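A dual-network pod is easy to verify from outside. A sketch, assuming a pod named train-pod with a Multus secondary attachment (both the pod name and interface names are placeholders):

```shell
# eth0 is the default CNI/management interface; net1 should be the
# Multus secondary interface on the RDMA network
kubectl exec train-pod -- ip -br addr

# The RDMA device plugin should expose a working verbs device in the pod
kubectl exec train-pod -- ibv_devinfo | grep -E 'hca_id|state'
```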

Network Holding Back Your GPUs?

We'll profile your fabric, find the bottleneck, and fix it. RDMA, GPUDirect, switch configs — all of it.

Schedule a Call