Service

GPU Networking & RDMA

The network between your GPUs is the single biggest performance lever in distributed training. A misconfigured switch port or missing PFC config silently kills throughput for the entire cluster. We design and implement RDMA networks that run at wire rate.

400 Gb/s per-port bandwidth
<2 μs RDMA latency
Zero CPU overhead

What We Do

  • InfiniBand fabric — Quantum switch deployment, subnet manager configuration, adaptive routing, fat-tree/dragonfly topology design, partition keys
  • RoCE v2 fabric — Lossless Ethernet with PFC, ECN/DCQCN tuning, leaf-spine design, ECMP multi-path, jumbo frames, DSCP trust
  • GPUDirect RDMA — Zero-copy GPU-to-GPU transfers bypassing CPU, peer memory module setup, GDR copy validation, firmware tuning
  • Switch configuration — Spectrum-X / Quantum switch deployment, port speed validation, error counter monitoring, QoS policies, MTU config
  • Network Operator on Kubernetes — NicClusterPolicy setup, Multus secondary networks, SR-IOV, RDMA device plugin. We contributed the global config feature
  • Dual-network architectures — Separate management and RDMA training networks, MACVLAN/IPVLAN secondary interfaces, network isolation for multi-tenant clusters
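On the host side, the RoCE v2 work above comes down to a handful of NIC settings. A minimal sketch for one ConnectX port, assuming Mellanox OFED tools are installed; the interface name eth2 and device name mlx5_0 are placeholders, and the switch-side PFC/ECN configuration must match:

```shell
# Enable PFC on priority 3 only -- the lossless traffic class carrying RoCE
mlnx_qos -i eth2 --pfc 0,0,0,1,0,0,0,0

# Trust DSCP markings instead of PCP so QoS survives L3 routing
mlnx_qos -i eth2 --trust dscp

# Jumbo frames for RDMA payload efficiency
ip link set eth2 mtu 9000

# Default RDMA CM connections to RoCE v2 on port 1 of the adapter
cma_roce_mode -d mlx5_0 -p 1 -m 2
```

The same priority and DSCP values have to be configured on every switch in the path; a lossless class that exists only on the host drops packets at the first hop.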

Proof

We configured GPUDirect RDMA over RoCE on bare-metal Kubernetes with ConnectX-6 NICs. Result: 10x inter-node latency reduction and 8.5x training throughput improvement over the previous TCP configuration.

We also deployed dual-network Kubernetes pods with RDMA on NVIDIA GH200 — separate management and training networks with working RDMA verbs.
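Numbers like these can be sanity-checked with the perftest suite before and after any change. A minimal sketch, assuming ConnectX NICs; mlx5_0 and the server address are placeholders:

```shell
# On the server node: wait for a connection on the RDMA device
ib_write_lat -d mlx5_0 -R

# On the client node: measure RDMA write latency to the server
ib_write_lat -d mlx5_0 -R 10.0.0.1

# Same pair of commands with ib_write_bw measures sustained bandwidth;
# it should land near line rate if PFC/ECN and MTU are correct
ib_write_bw -d mlx5_0 -R 10.0.0.1
```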

InfiniBand vs RoCE

                      InfiniBand               RoCE v2
Lossless guarantee    Built-in (credit-based)  Requires PFC config
Congestion handling   Native                   ECN/DCQCN must be tuned
Bandwidth             400 Gb/s (NDR)           400 Gb/s (ConnectX-7)
Switch vendors        NVIDIA Quantum only      Multiple vendors
Ops complexity        Needs subnet manager     Standard Ethernet ops
Cost                  Higher                   Lower

How We Work

1. Assess: Audit link speeds, error counters, PFC/ECN, PCIe topology.

2. Design: Fabric topology, oversubscription, QoS, traffic separation.

3. Implement: Configure switches, NICs, RDMA, GPUDirect. Validate end-to-end.

4. Transfer: Network monitoring dashboards, runbooks, documentation.
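The assessment step can be sketched as a short host-side audit; interface and device names below are placeholders, assuming iproute2, OFED tools, and an NVIDIA GPU node:

```shell
# Negotiated link speed and state per port
ethtool eth2 | grep -E 'Speed|Link detected'

# Pause/PFC and error counters; climbing discards point at missing PFC
ethtool -S eth2 | grep -iE 'pause|discard|drop'

# RDMA device to netdev mapping and port state
rdma link show
ibstat

# GPU-to-NIC PCIe affinity: each RDMA NIC should share a PCIe
# switch (PIX/PXB) with the GPUs it serves
nvidia-smi topo -m
```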

Technologies

InfiniBand · RoCE v2 · GPUDirect RDMA · ConnectX-6/7 · Spectrum-X · Quantum · NCCL · Network Operator · Multus · SR-IOV · MACVLAN

Frequently Asked Questions

What is RDMA and why does it matter for GPU training?

RDMA lets NICs read and write remote memory directly, bypassing the CPU and kernel. Combined with GPUDirect RDMA, it enables zero-copy GPU-to-GPU transfers across nodes — 5-10x higher bandwidth and an order-of-magnitude lower latency than TCP on the same hardware.

Should I use InfiniBand or RoCE?

Both deliver RDMA performance. InfiniBand is a purpose-built lossless fabric standard in DGX SuperPOD deployments. RoCE v2 runs RDMA over Ethernet — cheaper, more flexible, and the right choice for most cloud, colo, and bare-metal clusters when PFC and ECN are configured correctly.

Do I need PFC and ECN for RoCE?

Yes, if you want lossless RoCE v2. PFC prevents packet drops during microbursts; ECN signals congestion before buffers overflow. Without both configured end-to-end on NICs, switches, and host settings, RoCE falls over under load and NCCL silently underperforms.

What is GPUDirect RDMA?

GPUDirect RDMA lets the NIC DMA directly to and from GPU memory without an intermediate CPU copy. It requires matched driver support, peer-memory modules, and PCIe affinity between GPU and NIC. When enabled, inter-node GPU communication drops to single-digit microseconds.
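Whether those prerequisites are met can be checked from the shell. A sketch, assuming an NVIDIA GPU node with OFED and a CUDA-enabled perftest build; mlx5_0 and the GPU index are placeholders:

```shell
# Peer-memory module must be loaded: nvidia_peermem on current drivers,
# nv_peer_mem on legacy OFED installs
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'

# PCIe affinity: GPUDirect RDMA stays on one PCIe switch only for
# GPU/NIC pairs shown as PIX or PXB
nvidia-smi topo -m

# Bandwidth test using GPU 0's memory as the buffer (requires a
# perftest build compiled with CUDA support)
ib_write_bw -d mlx5_0 --use_cuda=0
```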

Can you fix existing GPU networking problems?

Yes. A lot of our work is forensic: NCCL falling back to TCP, PFC dropping packets under load, GPU-NIC PCIe affinity mismatches, wrong NCCL_IB_HCA, incorrect DSCP marking. We bring perftest, nccl-tests, and switch counter experience to find and fix these without replacing hardware.
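The first forensic step is usually making NCCL report what it actually chose. A sketch, assuming nccl-tests is built locally; device names are placeholders:

```shell
# Make NCCL log its transport selection: "NET/IB" means RDMA is in use,
# "NET/Socket" means it silently fell back to TCP
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Pin NCCL to the intended RDMA devices
export NCCL_IB_HCA=mlx5_0,mlx5_1

# Reproduce with nccl-tests: all-reduce sweep from 8 B to 1 GiB on 8 GPUs
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
```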

Do I need two NICs per GPU node?

For production distributed training, yes. One NIC for Kubernetes management (pod CNI, API traffic, metrics) and one or more RDMA-capable NICs dedicated to NCCL/training traffic via Multus secondary networks. Single-NIC works for POCs but degrades at scale.
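A dual-network pod is easy to verify from outside. A sketch, assuming a pod named train-pod with a Multus secondary attachment (both the pod name and interface names are placeholders):

```shell
# eth0 is the default CNI/management interface; net1 should be the
# Multus secondary interface on the RDMA network
kubectl exec train-pod -- ip -br addr

# The RDMA device plugin should expose a working verbs device in the pod
kubectl exec train-pod -- ibv_devinfo | grep -E 'hca_id|state'
```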

Network Holding Back Your GPUs?

We'll profile your fabric, find the bottleneck, and fix it. RDMA, GPUDirect, switch configs — all of it.

Schedule a Call