
GPU Networking & RDMA Consulting

The network connecting your GPUs determines whether distributed training scales linearly or plateaus. We design, implement, and optimize RDMA network fabrics for GPU clusters, turning the interconnect from a bottleneck into an enabler.

400 Gb/s per-port bandwidth
<2 µs RDMA latency
Zero CPU overhead

Why GPU Networking Is Different

Traditional data center networking was designed for web applications and databases where latency requirements are measured in milliseconds and bandwidth needs are modest. GPU cluster networking operates in a fundamentally different regime. Distributed training workloads generate burst traffic patterns where all nodes simultaneously transmit gradient data during collective operations, demanding full bisection bandwidth from the network fabric. A single congested link or misconfigured switch port can degrade the performance of an entire training job.

The difference between a properly configured RDMA network and a standard TCP/IP network is not incremental. It is transformational. RDMA enables GPU-to-GPU transfers at wire rate with single-digit-microsecond latency and zero CPU involvement. Without it, every inter-node transfer requires multiple memory copies through the CPU, adding hundreds of microseconds of latency and consuming CPU cycles that should be feeding the GPUs. This is why networking expertise is one of the highest-leverage investments you can make in GPU cluster performance.

InfiniBand vs. RoCE: Choosing the Right Fabric

The first architectural decision for any GPU cluster network is choosing between InfiniBand and RDMA over Converged Ethernet (RoCE). Both provide RDMA capability, but they differ significantly in deployment complexity, cost, and operational characteristics.

| Characteristic          | InfiniBand                         | RoCE v2                       |
| ----------------------- | ---------------------------------- | ----------------------------- |
| Lossless guarantee      | Built-in credit-based flow control | Requires PFC configuration    |
| Congestion management   | Native, automatic                  | ECN/DCQCN must be configured  |
| Bandwidth (current gen) | 400 Gb/s (NDR)                     | 400 Gb/s (ConnectX-7)         |
| Switch ecosystem        | NVIDIA Quantum only                | Multiple vendors              |
| Operational complexity  | Requires IB subnet manager         | Standard Ethernet operations  |
| Cost                    | Higher (dedicated fabric)          | Lower (shared Ethernet)       |

InfiniBand provides guaranteed lossless transport through credit-based flow control, making it inherently reliable for RDMA. RoCE runs over standard Ethernet switches, offering lower cost and operational familiarity, but requires careful configuration of Priority Flow Control and ECN to achieve lossless behavior. We help you choose the right fabric based on your scale, budget, and operational capabilities, then implement it correctly.
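A quick way to see which transport a host's RDMA NICs actually present is the link_layer field reported by the rdma-core tools; a minimal check, assuming the userspace RDMA utilities are installed:

```shell
# List RDMA devices and their link layer: "InfiniBand" or "Ethernet" (RoCE).
ibv_devinfo | grep -E 'hca_id|link_layer'

# On RoCE NICs, confirm that RoCE v2 GIDs exist in the GID table.
# show_gids ships with NVIDIA/Mellanox OFED; its location varies by install.
show_gids
```

On a RoCE host you should see link_layer: Ethernet together with GID entries marked RoCE v2; on an InfiniBand host, link_layer: InfiniBand.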

What We Implement

RDMA Network Setup

A working RDMA network requires correct configuration at every layer, from the NIC firmware through the switch fabric to the host operating system. We:

- Configure ConnectX NIC firmware and driver parameters for optimal RDMA performance
- Set up RoCE v2 with proper GID indexes, traffic class, and DSCP marking
- Configure Priority Flow Control (PFC) on switches with pause thresholds that prevent packet loss without causing head-of-line blocking
- Tune Explicit Congestion Notification (ECN) and DCQCN parameters for proactive congestion management
- Validate end-to-end RDMA connectivity with perftest tools
- Set MTU to 9000 bytes (jumbo frames) across the entire path for maximum throughput
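Much of this work ends in validation. A minimal end-of-setup check might look like the sketch below, where the interface name eth0 and device name mlx5_0 are placeholders for your environment:

```shell
# Jumbo frames end to end; every host and switch hop on the path must match.
ip link set dev eth0 mtu 9000

# Confirm the RDMA device is up and which link layer it presents.
ibv_devinfo -d mlx5_0 | grep -E 'state|link_layer'

# Bandwidth sanity check with perftest (run the server first, then the client):
#   server: ib_write_bw -d mlx5_0 --report_gbits
#   client: ib_write_bw -d mlx5_0 --report_gbits <server-ip>
# A healthy 400 Gb/s link should report close to line rate for large messages.
```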

GPUDirect RDMA

GPUDirect RDMA enables network adapters to directly read from and write to GPU memory, completely bypassing CPU memory. This eliminates two memory copies per transfer and reduces latency by an order of magnitude. We:

- Install and configure the nvidia-peermem kernel module
- Verify PCIe topology to ensure GPUs and NICs share the same PCIe root complex for optimal DMA performance
- Configure NCCL to use GPUDirect with the appropriate NCCL_NET_GDR_LEVEL setting
- Validate GPU-to-GPU RDMA bandwidth with NCCL tests and real workloads

On systems where the GPU and NIC sit on different PCIe trees, we evaluate the performance tradeoff and configure NCCL accordingly.
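A rough sequence for these checks, assuming a recent NVIDIA driver stack and locally built nccl-tests binaries; device names and flag values below are illustrative, not recommendations:

```shell
# Load and verify the peer-memory module that enables GPUDirect RDMA.
modprobe nvidia-peermem
lsmod | grep nvidia_peermem

# Inspect GPU/NIC PCIe topology. GPU-NIC pairs connected via PIX or PXB
# (rather than SYS) in the matrix are best placed for direct DMA.
nvidia-smi topo -m

# Validate with nccl-tests (github.com/NVIDIA/nccl-tests). With INFO-level
# debug output, NCCL typically logs transports containing "GDRDMA" when
# GPUDirect RDMA is actually in use.
NCCL_NET_GDR_LEVEL=PIX NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8
```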

Network Fabric Design

For new GPU cluster builds, we design the network fabric from scratch. Our designs typically include:

- Leaf-spine or fat-tree topologies that provide full bisection bandwidth
- Rail-optimized layouts in which each GPU connects to a dedicated network rail for maximum per-GPU bandwidth
- Multi-rail configurations on platforms such as HGX and DGX that carry multiple NICs per node
- Adaptive routing on InfiniBand, or ECMP on Ethernet, to distribute traffic across multiple paths
- Dedicated compute and storage networks so training traffic never competes with data loading

We size the fabric for the expected communication patterns of your workloads, ensuring that collective operations like AllReduce can run at full bandwidth across the cluster.
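The bisection-bandwidth sizing involved can be sketched with simple arithmetic. A back-of-envelope calculation for a non-blocking two-tier leaf-spine fabric built from identical switches; the 64-port radix is an assumption for illustration, not a recommendation:

```shell
# Non-blocking two-tier leaf-spine sizing from a single switch radix.
PORTS=64                          # ports per switch (e.g. a 64x400G switch)
HOSTS_PER_LEAF=$((PORTS / 2))     # half the ports down to hosts, half up to spines
MAX_HOSTS=$((PORTS * PORTS / 2))  # k^2/2 endpoints at full bisection bandwidth
echo "hosts per leaf:   $HOSTS_PER_LEAF"
echo "max cluster size: $MAX_HOSTS"
```

With 64-port switches this yields 32 hosts per leaf and a ceiling of 2,048 endpoints before a third tier (or oversubscription) becomes necessary.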

NVIDIA Spectrum-X

For Ethernet-based GPU clusters, NVIDIA Spectrum-X combines Spectrum-4 switches with BlueField-3 DPUs to deliver InfiniBand-class performance over Ethernet. We design and deploy Spectrum-X fabrics, including:

- Switch configuration with adaptive routing and congestion control
- BlueField-3 DPU setup for hardware-accelerated RoCE
- Tenant isolation on multi-tenant GPU clusters
- Integration with existing Ethernet infrastructure

Spectrum-X is particularly valuable for cloud providers and enterprises that want RDMA performance without the operational overhead of a separate InfiniBand fabric.

NCCL Network Tuning

NCCL is the bridge between your training framework and the network, and proper NCCL configuration is essential to leveraging the full capability of your RDMA fabric. We:

- Set NCCL_IB_HCA to pin NCCL to the correct network interfaces with proper affinity
- Configure NCCL_IB_GID_INDEX for RoCE v2 operation with the right GID
- Set NCCL_ALGO and NCCL_PROTO based on cluster topology and message-size distribution
- Enable NCCL_IB_ADAPTIVE_ROUTING on InfiniBand fabrics that support it
- Tune buffer sizes and thread counts for the specific GPU and NIC combination

Every environment is different: we do not apply generic configurations, we measure and tune for your specific hardware.
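As an illustration only, a tuned job environment for a dual-rail RoCE v2 cluster might start from settings like these; every value below is a placeholder that would be measured and adjusted per site:

```shell
# Illustrative NCCL environment for a dual-rail RoCE v2 cluster.
export NCCL_IB_HCA=mlx5_0,mlx5_1   # pin NCCL to the compute-rail NICs
export NCCL_IB_GID_INDEX=3         # RoCE v2 GID index (verify with show_gids)
export NCCL_IB_TC=106              # traffic class mapping to the lossless queue
export NCCL_SOCKET_IFNAME=eth0     # control-path interface for bootstrap traffic
export NCCL_DEBUG=INFO             # log the chosen transport to confirm the setup
```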

Technologies We Work With

InfiniBand NDR · RoCE v2 · GPUDirect RDMA · ConnectX-6/7 · BlueField-3 · Spectrum-X · Quantum-2 · NCCL · PFC/ECN · DCQCN · SR-IOV · Multus

Need Help with GPU Networking?

Whether you are designing a new RDMA fabric or troubleshooting an existing one, we bring hands-on expertise to get your GPU network running at full bandwidth.

Schedule a Call