Distributed Training Optimization
Expert NCCL tuning, RDMA configuration, and InfiniBand/RoCE optimization for multi-node GPU training. We help teams achieve near-linear scaling across GPU nodes by eliminating the network bottlenecks that cause distributed training to underperform.
Why Multi-Node Training Underperforms
Most organizations moving from single-GPU to multi-node distributed training expect linear performance gains. Instead, they encounter a frustrating reality: scaling from 8 to 64 GPUs delivers only a 2-3x speedup rather than the expected 8x improvement. The problem is almost never the GPUs themselves. It is the network.
Distributed training frameworks like PyTorch DDP, FSDP, DeepSpeed, and Megatron-LM rely on collective communication operations (AllReduce, AllGather, ReduceScatter) to synchronize gradients and parameters across nodes. When the network fabric connecting those nodes is misconfigured, under-provisioned, or running over TCP instead of RDMA, these collective operations become the dominant bottleneck. GPUs sit idle waiting for gradient synchronization while expensive compute time is wasted.
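To see why gradient synchronization dominates, consider the data volume a ring AllReduce moves: each GPU must send and receive roughly 2(N-1)/N times the gradient size per step. The sketch below is a back-of-the-envelope calculation; the model size, GPU count, and link speed are illustrative assumptions, not measurements from any specific cluster:

```shell
# Illustrative figures: a 7B-parameter model with fp16 gradients
# (2 bytes/param) synchronized via ring AllReduce across N=8 GPUs.
PARAMS=7000000000
BYTES_PER_PARAM=2
N=8
awk -v p="$PARAMS" -v b="$BYTES_PER_PARAM" -v n="$N" 'BEGIN {
  s = p * b                       # gradient bytes per optimizer step
  per_gpu = 2 * (n - 1) / n * s   # ring AllReduce traffic per GPU
  printf "gradient size: %.1f GB\n", s / 1e9
  printf "per-GPU AllReduce traffic: %.1f GB/step\n", per_gpu / 1e9
  printf "comm time at 12.5 GB/s (100 Gb/s line rate): %.2f s/step\n", per_gpu / 12.5e9
}'
```

With these assumed figures, each GPU moves about 24.5 GB per step; even at a full 100 Gb/s that is nearly two seconds of communication per step, which is why any bandwidth lost to TCP fallback or misconfiguration shows up directly as idle GPUs.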
Common symptoms include GPU utilization dropping below 50% during multi-node training, NCCL timeout errors, jobs hanging during AllReduce operations, and training throughput that barely improves as you add more nodes. These issues stem from a set of well-understood but frequently misdiagnosed root causes.
The Root Causes
- NCCL falling back to TCP sockets instead of using RDMA verbs, reducing bandwidth by 5-10x
- Missing or misconfigured PFC (Priority Flow Control) causing packet drops on RoCE networks
- Incorrect ECN (Explicit Congestion Notification) thresholds leading to congestion-induced retransmits
- PCIe topology misalignment where GPUs and NICs are on different NUMA nodes, adding microseconds of latency per transfer
- GPUDirect RDMA not enabled, forcing all inter-node traffic through CPU memory copies
- Suboptimal NCCL algorithm and protocol selection for the given network topology
- Switch-level misconfigurations including incorrect MTU, missing DSCP trust, or improper QoS policies
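A quick first check for the most common root cause (TCP fallback) is to enable NCCL's debug logging for one run and inspect which transport it selects; log lines mentioning NET/Socket rather than NET/IB indicate the RDMA path is not in use. The launcher invocation and file names below are illustrative:

```shell
# Enable NCCL's initialization logging for a single training run.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Launch as usual (torchrun shown as an example), capturing the log.
torchrun --nnodes=2 --nproc_per_node=8 train.py 2>&1 | tee nccl.log

# "NET/IB" means RDMA verbs are in use; "NET/Socket" means TCP fallback.
grep -E "NET/(IB|Socket)" nccl.log
```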
Our Approach to Distributed Training Optimization
We take a systematic, bottom-up approach to diagnosing and resolving distributed training performance issues. Rather than guessing at configurations, we measure, analyze, and validate at each layer of the stack.
Network Fabric Assessment
We start by profiling the physical and logical network topology. This includes verifying link speeds and error counters on every NIC and switch port, testing raw RDMA bandwidth with perftest tools (ib_write_bw, ib_send_bw), validating PFC and ECN configuration at the switch level, and mapping the PCIe topology to understand GPU-NIC affinity. This assessment alone frequently reveals the primary bottleneck. We have seen cases where a single misconfigured switch port was silently downgrading an entire rack to 25 Gbps instead of 100 Gbps.
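A typical first pass with the perftest tools looks like the following sketch; device names and addresses are placeholders that depend on your hardware:

```shell
# On the server node (device and interface names are site-specific):
ibstat                          # verify link layer, rate, and port state
ethtool -S eth0 | grep -i err   # check NIC error counters (RoCE)
ib_write_bw -d mlx5_0 --report_gbits    # start the bandwidth test server

# On the client node, point at the server's address:
ib_write_bw -d mlx5_0 --report_gbits <server-ip>

# Compare the reported bandwidth against line rate: a healthy 100 Gb/s
# link should sustain roughly 90+ Gb/s for large messages.
```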
NCCL Tuning
NCCL (NVIDIA Collective Communications Library) exposes dozens of environment variables that control its behavior. The right configuration depends on your specific hardware topology, network fabric, and workload characteristics. We:
- Tune NCCL_ALGO to select the optimal algorithm (Ring, Tree, or CollNetDirect) for your topology
- Configure NCCL_PROTO to choose between the Simple, LL, and LL128 protocols based on message sizes
- Set NCCL_IB_HCA and NCCL_IB_GID_INDEX to ensure NCCL uses the correct network interfaces
- Adjust NCCL_BUFFSIZE and NCCL_NTHREADS for maximum throughput
- Enable GPUDirect RDMA via NCCL_NET_GDR_LEVEL when the hardware supports it
We validate each change with nccl-tests and real training workloads to confirm measurable improvement.
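As a concrete illustration, a starting configuration for a RoCE fabric with ConnectX NICs might look like the sketch below. The specific values (HCA names, GID index, node list) are examples only; the correct settings are cluster-specific and must be benchmarked, not copied:

```shell
# Example starting point for a RoCE v2 fabric; validate every value.
export NCCL_IB_HCA=mlx5_0,mlx5_1     # restrict NCCL to the intended HCAs
export NCCL_IB_GID_INDEX=3           # RoCEv2 GID index (site-specific)
export NCCL_NET_GDR_LEVEL=PHB        # allow GPUDirect RDMA via the PCIe host bridge
export NCCL_ALGO=Ring                # pin one algorithm while benchmarking
export NCCL_PROTO=Simple             # large-message protocol

# Validate with nccl-tests before touching the real workload:
mpirun -np 16 -H node1:8,node2:8 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
# Watch the "busbw" column: for large message sizes it should approach
# the fabric's line rate.
```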
RDMA and GPUDirect Configuration
Enabling RDMA properly requires coordination across multiple layers. We:
- Configure Mellanox/NVIDIA ConnectX NICs with optimal firmware settings
- Set up RDMA-CM or raw verbs depending on the deployment
- Enable and validate GPUDirect RDMA for zero-copy GPU-to-GPU transfers
- Configure peer memory modules and verify GDRCopy functionality
- Tune RoCE-specific parameters including GID indexes, traffic class, and DSCP marking
For InfiniBand environments, we configure subnet managers, partition keys, and adaptive routing. For RoCE environments, we ensure lossless Ethernet is properly configured end-to-end.
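Verifying that the pieces are actually in place is itself a multi-layer check. The commands below are a diagnostic sketch; module and interface names vary by driver generation and platform:

```shell
# Verify the kernel modules GPUDirect RDMA depends on (nvidia_peermem
# ships with recent NVIDIA drivers; older stacks used nv_peer_mem):
lsmod | grep -E "nvidia_peermem|ib_core|mlx5_ib"

# Confirm the RDMA devices, their link layer, and active MTU:
ibv_devinfo | grep -E "hca_id|link_layer|active_mtu"

# On RoCE, check that PFC is enabled on the lossless priority and that
# DSCP trust and priority-to-TC mapping are correct:
mlnx_qos -i eth0
```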
Topology-Aware Optimization
Modern GPU servers like the DGX H100 and HGX platforms have complex internal topologies built from NVSwitch, NVLink, and PCIe interconnects. We map this topology and ensure that NCCL leverages it optimally. This includes steering intra-node communication over NVLink/NVSwitch rather than PCIe, aligning inter-node communication with the optimal NIC-GPU pairs based on PCIe affinity, and setting CUDA_VISIBLE_DEVICES and process placement to match the physical topology. On platforms like the GH200 with dual-network configurations, we configure NCCL to use separate network rails for different collective operations, maximizing aggregate bandwidth.
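The topology mapping starts from the interconnect matrix the driver itself reports. The pinning below is an illustrative sketch; the real GPU-to-NIC and NUMA assignments must be derived from your own matrix:

```shell
# Dump the GPU/NIC interconnect matrix: NV# entries mean NVLink hops,
# PIX/PXB mean PCIe switch paths, and SYS means crossing the
# inter-socket link (the slowest path).
nvidia-smi topo -m

# Then pin each rank to the GPU and NIC that share a PCIe switch
# (example mapping only):
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1          # one rail per GPU group
numactl --cpunodebind=0 --membind=0 python train.py   # keep ranks NUMA-local
```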
Results We Deliver
Our optimization work consistently delivers dramatic improvements in distributed training performance. In a recent engagement with a computer vision company running multi-node training on bare-metal Kubernetes with ConnectX-6 NICs, we achieved an 8.5x improvement in training throughput and a 10x reduction in inter-node communication latency by properly configuring GPUDirect RDMA over RoCE.
Typical results across our engagements include training throughput improvements of 3-8x when moving from TCP to properly configured RDMA, inter-node latency reduction from hundreds of microseconds to single-digit microseconds with GPUDirect RDMA, scaling efficiency above 90% across 8-16 nodes (up from 30-50% before optimization), and elimination of NCCL timeout errors and job failures caused by network issues.
Read our detailed case study on how we achieved these results:
Case Study: 8.5x Faster Distributed Training with RDMA on Kubernetes
Struggling with Multi-Node Training Performance?
Let us diagnose your distributed training bottlenecks. We will analyze your network fabric, NCCL configuration, and GPU topology to deliver measurable throughput improvements.
Schedule a Call