Distributed Training Optimization
Expert NCCL tuning, RDMA configuration, and InfiniBand/RoCE optimization for multi-node GPU training. We help teams achieve near-linear scaling across GPU nodes by eliminating the network bottlenecks that cause distributed training to underperform.
Why Multi-Node Training Underperforms
Most organizations moving from single-GPU to multi-node distributed training expect linear performance gains. Instead, they encounter a frustrating reality: scaling from 8 to 64 GPUs delivers only a 2-3x speedup rather than the expected 8x improvement. The problem is almost never the GPUs themselves. It is the network.
Distributed training frameworks like PyTorch DDP, FSDP, DeepSpeed, and Megatron-LM rely on collective communication operations (AllReduce, AllGather, ReduceScatter) to synchronize gradients and parameters across nodes. When the network fabric connecting those nodes is misconfigured, under-provisioned, or running over TCP instead of RDMA, these collective operations become the dominant bottleneck. GPUs sit idle waiting for gradient synchronization while expensive compute time is wasted.
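To see why gradient synchronization dominates, consider the data volume a ring AllReduce moves: each GPU must send and receive roughly 2(N-1)/N times the gradient size per step. The sketch below is a back-of-the-envelope calculation; the model size, GPU count, and link speed are illustrative assumptions, not measurements from any specific cluster:

```shell
# Illustrative figures: a 7B-parameter model with fp16 gradients
# (2 bytes/param) synchronized via ring AllReduce across N=8 GPUs.
PARAMS=7000000000
BYTES_PER_PARAM=2
N=8
awk -v p="$PARAMS" -v b="$BYTES_PER_PARAM" -v n="$N" 'BEGIN {
  s = p * b                       # gradient bytes per optimizer step
  per_gpu = 2 * (n - 1) / n * s   # ring AllReduce traffic per GPU
  printf "gradient size: %.1f GB\n", s / 1e9
  printf "per-GPU AllReduce traffic: %.1f GB/step\n", per_gpu / 1e9
  printf "comm time at 12.5 GB/s (100 Gb/s line rate): %.2f s/step\n", per_gpu / 12.5e9
}'
```

With these assumed figures, each GPU moves about 24.5 GB per step; even at a full 100 Gb/s that is nearly two seconds of communication per step, which is why any bandwidth lost to TCP fallback or misconfiguration shows up directly as idle GPUs.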
Common symptoms include GPU utilization dropping below 50% during multi-node training, NCCL timeout errors, jobs hanging during AllReduce operations, and training throughput that barely improves as you add more nodes. These issues stem from a set of well-understood but frequently misdiagnosed root causes.
The Root Causes
- NCCL falling back to TCP sockets instead of using RDMA verbs, reducing bandwidth by 5-10x
- Missing or misconfigured PFC (Priority Flow Control) causing packet drops on RoCE networks
- Incorrect ECN (Explicit Congestion Notification) thresholds leading to congestion-induced retransmits
- PCIe topology misalignment where GPUs and NICs are on different NUMA nodes, adding microseconds of latency per transfer
- GPUDirect RDMA not enabled, forcing all inter-node traffic through CPU memory copies
- Suboptimal NCCL algorithm and protocol selection for the given network topology
- Switch-level misconfigurations including incorrect MTU, missing DSCP trust, or improper QoS policies
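A quick first check for the most common root cause (TCP fallback) is to enable NCCL's debug logging for one run and inspect which transport it selects; log lines mentioning NET/Socket rather than NET/IB indicate the RDMA path is not in use. The launcher invocation and file names below are illustrative:

```shell
# Enable NCCL's initialization logging for a single training run.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Launch as usual (torchrun shown as an example), capturing the log.
torchrun --nnodes=2 --nproc_per_node=8 train.py 2>&1 | tee nccl.log

# "NET/IB" means RDMA verbs are in use; "NET/Socket" means TCP fallback.
grep -E "NET/(IB|Socket)" nccl.log
```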
Our Approach to Distributed Training Optimization
We take a systematic, bottom-up approach to diagnosing and resolving distributed training performance issues. Rather than guessing at configurations, we measure, analyze, and validate at each layer of the stack.
Network Fabric Assessment
We start by profiling the physical and logical network topology. This includes verifying link speeds and error counters on every NIC and switch port, testing raw RDMA bandwidth with perftest tools (ib_write_bw, ib_send_bw), validating PFC and ECN configuration at the switch level, and mapping the PCIe topology to understand GPU-NIC affinity. This assessment alone frequently reveals the primary bottleneck. We have seen cases where a single misconfigured switch port was silently downgrading an entire rack to 25 Gbps instead of 100 Gbps.
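A typical first pass with the perftest tools looks like the following sketch; device names and addresses are placeholders that depend on your hardware:

```shell
# On the server node (device and interface names are site-specific):
ibstat                          # verify link layer, rate, and port state
ethtool -S eth0 | grep -i err   # check NIC error counters (RoCE)
ib_write_bw -d mlx5_0 --report_gbits    # start the bandwidth test server

# On the client node, point at the server's address:
ib_write_bw -d mlx5_0 --report_gbits <server-ip>

# Compare the reported bandwidth against line rate: a healthy 100 Gb/s
# link should sustain roughly 90+ Gb/s for large messages.
```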
NCCL Tuning
NCCL (NVIDIA Collective Communications Library) exposes dozens of environment variables that control its behavior. The right configuration depends on your specific hardware topology, network fabric, and workload characteristics. We:
- Tune NCCL_ALGO to select the optimal algorithm (Ring, Tree, or CollNetDirect) for your topology
- Configure NCCL_PROTO to choose between the Simple, LL, and LL128 protocols based on message sizes
- Set NCCL_IB_HCA and NCCL_IB_GID_INDEX to ensure NCCL uses the correct network interfaces
- Adjust NCCL_BUFFSIZE and NCCL_NTHREADS for maximum throughput
- Enable GPUDirect RDMA via NCCL_NET_GDR_LEVEL when the hardware supports it
We validate each change with nccl-tests and real training workloads to confirm measurable improvement.
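As a concrete illustration, a starting configuration for a RoCE fabric with ConnectX NICs might look like the sketch below. The specific values (HCA names, GID index, node list) are examples only; the correct settings are cluster-specific and must be benchmarked, not copied:

```shell
# Example starting point for a RoCE v2 fabric; validate every value.
export NCCL_IB_HCA=mlx5_0,mlx5_1     # restrict NCCL to the intended HCAs
export NCCL_IB_GID_INDEX=3           # RoCEv2 GID index (site-specific)
export NCCL_NET_GDR_LEVEL=PHB        # allow GPUDirect RDMA via the PCIe host bridge
export NCCL_ALGO=Ring                # pin one algorithm while benchmarking
export NCCL_PROTO=Simple             # large-message protocol

# Validate with nccl-tests before touching the real workload:
mpirun -np 16 -H node1:8,node2:8 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
# Watch the "busbw" column: for large message sizes it should approach
# the fabric's line rate.
```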
RDMA and GPUDirect Configuration
Enabling RDMA properly requires coordination across multiple layers. We:
- Configure Mellanox/NVIDIA ConnectX NICs with optimal firmware settings
- Set up RDMA-CM or raw verbs depending on the deployment
- Enable and validate GPUDirect RDMA for zero-copy GPU-to-GPU transfers
- Configure peer memory modules and verify GDRCopy functionality
- Tune RoCE-specific parameters including GID indexes, traffic class, and DSCP marking
For InfiniBand environments, we configure subnet managers, partition keys, and adaptive routing. For RoCE environments, we ensure lossless Ethernet is properly configured end-to-end.
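Verifying that the pieces are actually in place is itself a multi-layer check. The commands below are a diagnostic sketch; module and interface names vary by driver generation and platform:

```shell
# Verify the kernel modules GPUDirect RDMA depends on (nvidia_peermem
# ships with recent NVIDIA drivers; older stacks used nv_peer_mem):
lsmod | grep -E "nvidia_peermem|ib_core|mlx5_ib"

# Confirm the RDMA devices, their link layer, and active MTU:
ibv_devinfo | grep -E "hca_id|link_layer|active_mtu"

# On RoCE, check that PFC is enabled on the lossless priority and that
# DSCP trust and priority-to-TC mapping are correct:
mlnx_qos -i eth0
```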
Topology-Aware Optimization
Modern GPU servers like the DGX H100 and HGX platforms have complex internal topologies built from NVSwitch, NVLink, and PCIe interconnects. We map this topology and ensure that NCCL leverages it optimally. This includes steering intra-node communication over NVLink/NVSwitch rather than PCIe, aligning inter-node communication with the optimal NIC-GPU pairs based on PCIe affinity, and setting CUDA_VISIBLE_DEVICES and process placement to match the physical topology. On platforms like the GH200 with dual-network configurations, we configure NCCL to use separate network rails for different collective operations, maximizing aggregate bandwidth.
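The topology mapping starts from the interconnect matrix the driver itself reports. The pinning below is an illustrative sketch; the real GPU-to-NIC and NUMA assignments must be derived from your own matrix:

```shell
# Dump the GPU/NIC interconnect matrix: NV# entries mean NVLink hops,
# PIX/PXB mean PCIe switch paths, and SYS means crossing the
# inter-socket link (the slowest path).
nvidia-smi topo -m

# Then pin each rank to the GPU and NIC that share a PCIe switch
# (example mapping only):
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1          # one rail per GPU group
numactl --cpunodebind=0 --membind=0 python train.py   # keep ranks NUMA-local
```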
Results We Deliver
Our optimization work consistently delivers dramatic improvements in distributed training performance. In a recent engagement with a computer vision company running multi-node training on bare-metal Kubernetes with ConnectX-6 NICs, we achieved an 8.5x improvement in training throughput and a 10x reduction in inter-node communication latency by properly configuring GPUDirect RDMA over RoCE.
Typical results across our engagements include training throughput improvements of 3-8x when moving from TCP to properly configured RDMA, inter-node latency reduction from hundreds of microseconds to single-digit microseconds with GPUDirect RDMA, scaling efficiency above 90% across 8-16 nodes (up from 30-50% before optimization), and elimination of NCCL timeout errors and job failures caused by network issues.
Read our detailed case study on how we achieved these results:
Case Study: 8.5x Faster Distributed Training with RDMA on Kubernetes
Struggling with Multi-Node Training Performance?
Let us diagnose your distributed training bottlenecks. We will analyze your network fabric, NCCL configuration, and GPU topology to deliver measurable throughput improvements.
Schedule a Call