Founder, BaaZ · Apache Software Foundation Member · NVIDIA NCP-AII
If you're building a multi-node GPU cluster for distributed training, you've probably run into a confusing mess of terminology — NVLink, NVSwitch, InfiniBand, RoCE, GPUDirect. Half the blog posts out there mix these up, and vendor documentation assumes you already know what you're doing.
I was lucky to work on a computer vision setup involving NVIDIA RTX 6000 GPUs, Mellanox ONYX switches, and high-resolution cameras. The system was designed for real-time video capture and processing, pushing massive amounts of data through our network infrastructure. What follows is a walkthrough of debugging Mellanox switch metrics that led to some surprising discoveries about network traffic flow.
When monitoring network equipment like switches, routers, or network cards, you'll constantly encounter two metrics: RX and TX. These simple abbreviations are fundamental to understanding how data flows through your network, yet they often cause confusion. Let's demystify them with real-world examples from our production Mellanox ONYX switch. If you're operating GPU infrastructure, this is exactly the kind of low-level visibility you need — it's a recurring theme in our GPU monitoring and observability engagements.
RX = Receive - Data coming INTO a network port
TX = Transmit - Data going OUT of a network port
Think of each network port like a doorway. RX counts everyone walking in, while TX counts everyone walking out. Simple enough, right? The confusion often comes when trying to understand what these patterns mean for your specific setup.
Real Network Example: Mellanox ONYX Switch Analysis
Let's look at actual output from a production Mellanox ONYX switch to see how this works in practice, starting with which ports are actually up.
Initially, you might assume camera-to-server traffic would look like:
Camera ports: High TX (sending video)
Server ports: High RX (receiving video)
But our real data shows the opposite! Here's the actual traffic flow:
Port 7 (25G) ──RX(6B packets)──> Switch ──TX(41B packets)──> Port 19 (100G)
                                        └──TX(2.6B packets)──> Port 21 (100G)
This suggests Port 7 is receiving from an aggregation point, and Ports 19/21 are distributing to multiple endpoints. The same "don't just look at the packets, look at the fabric" mindset applies when you're tracing AllReduce traffic — see How to calculate if your network is bottlenecking distributed training for the GPU-training equivalent.
PTP (Precision Time Protocol) on all ports indicates this is a video production environment where precise timing synchronization is critical for frame-accurate video capture.
Don't assume traffic direction - Our "video" ports showed opposite patterns than expected
Jumbo frames indicate video - 6 billion jumbo packets on Port 7 suggest video traffic
100G ports as distributors - Both 100G ports primarily transmit, indicating fan-out architecture
Monitor both directions - Full-duplex means both paths matter for capacity planning
Context matters - PTP configuration revealed this was video production, explaining the traffic patterns
Remember: These RX/TX metrics are always from the port's perspective. When troubleshooting, physically trace cables or clear counters to see fresh traffic patterns rather than historical accumulation. If you're trying to correlate switch-side counters with distributed training throughput end-to-end, our distributed training optimization work routinely starts at exactly this layer.
Pro tip: If your switch API is slow (>1 second response time), use SNMP polling instead - it's 10-100x faster for retrieving interface counters!
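Whichever transport you poll over, the arithmetic for turning counters into bandwidth is the same. Here's a minimal sketch; the counter values are illustrative (in practice they would come from SNMP, e.g. IF-MIB `ifHCInOctets`/`ifHCOutOctets`, or the switch API):

```python
def bandwidth_gbps(bytes_t0: int, bytes_t1: int, interval_s: float) -> float:
    """Convert a byte-counter delta over `interval_s` seconds into Gbps."""
    delta_bits = (bytes_t1 - bytes_t0) * 8
    return delta_bits / interval_s / 1e9

# Two samples of a 100G port's RX byte counter, 30 seconds apart
# (made-up numbers for the example):
rx_t0 = 412_000_000_000
rx_t1 = rx_t0 + 337_500_000_000

bw = bandwidth_gbps(rx_t0, rx_t1, 30.0)
print(f"{bw:.1f} Gbps, {bw / 100 * 100:.0f}% of a 100G port")  # 90.0 Gbps, 90%
```

Sampling like this every second instead of every 30 seconds exposes bursts that longer intervals average away.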
Frequently Asked Questions
What do RX and TX mean on a network switch?
RX (receive) counts traffic coming into a switch port from whatever is cabled to it, and TX (transmit) counts traffic leaving the port. Both counters are from the switch port's point of view, so a server sending data to the switch shows as RX on the switch port and TX on the server's NIC. Every port is full-duplex, so both can run at line rate simultaneously.
Why do my switch ports show such asymmetric RX vs TX traffic?
Asymmetry is normal and tells you the role of the port in your topology. Ports aggregating data from many sources (collectors) show RX far larger than TX. Uplinks and fan-out ports that push data to multiple endpoints show TX far larger than RX. For GPU training clusters, well-behaved AllReduce traffic tends to be balanced — a persistent imbalance during training usually indicates a topology or NCCL ring issue.
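As a rough sketch of that reasoning, you can classify a port's role directly from its cumulative counters. The 3x skew threshold here is arbitrary, chosen only for illustration:

```python
def port_role(rx_bytes: int, tx_bytes: int, skew: float = 3.0) -> str:
    """Rough classification of a port's role from cumulative RX/TX counters.

    `skew` is an illustrative threshold: how lopsided the counters must be
    before we call the port a collector or distributor.
    """
    if rx_bytes == 0 and tx_bytes == 0:
        return "idle"
    if rx_bytes > skew * max(tx_bytes, 1):
        return "collector"    # aggregating traffic from downstream devices
    if tx_bytes > skew * max(rx_bytes, 1):
        return "distributor"  # fanning traffic out to endpoints
    return "balanced"

# Port 7 from the post: heavy RX, light TX -> behaves like a collector
print(port_role(6_000_000_000, 100_000_000))  # collector
```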
How do I calculate actual bandwidth utilization from byte counters?
Sample the byte counter twice at a known interval, take the delta, convert to bits, and divide by the interval to get bps. A simple 30-second polling loop gives you a bandwidth-over-time view. Compare to the port's rated capacity (e.g. 100 Gbps) to get utilization. For short bursts, poll every second or use sFlow/IPFIX — accumulated counters hide burstiness.
Are RX/TX counters reliable for detecting network problems?
Byte and packet counters show throughput, not health. Pair them with error counters (CRC errors, discards, pause frames, ECN marks) and link-state events. A port at 80% utilization with zero errors is healthy; a port at 20% utilization with rising discards and pause frames is a much bigger problem. For RDMA/RoCE fabrics, PFC and ECN counters matter more than raw bandwidth.
Should I use REST/API or SNMP to poll switch counters?
For high-frequency polling use SNMP — it's typically 10-100x faster than vendor REST APIs and designed for exactly this workload. Use REST/API when you need structured data (e.g. per-queue counters) or to drive automation. Many monitoring stacks combine both: SNMP for continuous counter scraping, REST for configuration and richer topology metadata.
When monitoring GPU infrastructure, you might occasionally notice something peculiar: your high-end GPUs connected to PCIe 5.0 slots are operating at PCIe Gen 1 speeds. Before raising any alarms, let's walk through a systematic debugging approach that reveals this is often expected behavior rather than a configuration issue.
During routine infrastructure monitoring on our dual-GPU setup, we observed an interesting discrepancy. Running a simple nvidia-smi query revealed our GPUs were operating at PCIe Gen 1:
pcie.link.gen.max: Maximum supported PCIe generation
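A small sketch of how you might flag this programmatically. The CSV text below is illustrative sample output, not captured from a real system; on a live host it would come from `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max --format=csv`:

```python
import csv
import io

# Illustrative nvidia-smi CSV output for a dual-GPU box (values made up):
sample = """pcie.link.gen.current, pcie.link.gen.max
1, 4
1, 4
"""

def downgraded_links(csv_text: str):
    """Return (gpu_index, current_gen, max_gen) for every GPU whose current
    PCIe generation is below its maximum. At idle this is expected behavior."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    flagged = []
    for i, row in enumerate(reader):
        cur, mx = (int(field.strip()) for field in row)
        if cur < mx:
            flagged.append((i, cur, mx))
    return flagged

print(downgraded_links(sample))  # [(0, 1, 4), (1, 1, 4)]
```

A monitoring check built on this should only alert when the downgrade persists under load, for reasons covered below.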
The performance implications seemed significant at first glance. PCIe Gen 1 provides approximately 250 MB/s per lane, while PCIe Gen 4 delivers around 2 GB/s per lane. For a standard x16 configuration, this translates to:
Current state (Gen 1): ~4 GB/s total bandwidth
Potential capacity (Gen 4): ~32 GB/s total bandwidth
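The arithmetic behind those totals, using the approximate per-lane rates (after encoding overhead) quoted above:

```python
# Approximate usable throughput per PCIe lane, GB/s, after encoding overhead.
PER_LANE_GBS = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969}

def link_bandwidth_gbs(gen: int, lanes: int = 16) -> float:
    """Total one-direction bandwidth for a link of `lanes` lanes at `gen`."""
    return PER_LANE_GBS[gen] * lanes

print(round(link_bandwidth_gbs(1)))  # ~4 GB/s  (Gen 1 x16)
print(round(link_bandwidth_gbs(4)))  # ~32 GB/s (Gen 4 x16)
```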
To ensure this wasn't a hardware misconfiguration, we proceeded with a thorough investigation using standard Linux diagnostic tools. This kind of low-level verification is one of the signals we rely on when building a GPU observability and reliability stack — a single downgraded link can otherwise hide behind green dashboards for weeks.
The output confirmed both GPUs were indeed operating at 2.5 GT/s (Gen 1 speed), with the LnkSta field showing a link downgraded from its maximum capability.
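The relevant check from the `lspci -vv` side is comparing the `LnkCap` (what the link can do) and `LnkSta` (what it is doing now) speeds. A minimal parsing sketch; the excerpt below is illustrative, not captured output:

```python
import re

# Illustrative excerpt of `lspci -vv` output for a GPU; only the link
# capability and status lines matter here (text made up for the example).
lspci_excerpt = """\
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)
"""

def link_speeds(text: str):
    """Extract (capability, status) link speeds in GT/s from lspci output."""
    cap = re.search(r"LnkCap:.*?Speed ([\d.]+)GT/s", text)
    sta = re.search(r"LnkSta:.*?Speed ([\d.]+)GT/s", text)
    return float(cap.group(1)), float(sta.group(1))

cap_gts, sta_gts = link_speeds(lspci_excerpt)
print(cap_gts, sta_gts)   # 16.0 2.5
print(sta_gts < cap_gts)  # True -> link is currently running below capability
```

On its own, `sta < cap` at idle proves nothing is wrong; it only becomes a red flag if it stays true under load.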
After confirming the hardware configuration was sound, the explanation became clear: this is intelligent power management at work. Modern GPUs implement sophisticated power-saving mechanisms that dynamically adjust PCIe link speeds based on workload demands.
When GPUs are idle or under minimal load, they automatically scale down to PCIe Gen 1 to reduce power consumption. This behavior is particularly important in datacenter environments where energy efficiency directly impacts operational costs. The GPUs will automatically negotiate higher PCIe generations when computational demands require increased bandwidth.
The same "is it actually a problem?" question comes up for network counters too — see Understanding RX vs TX traffic direction for an example where the raw numbers look alarming but context reveals normal behavior.
This investigation highlights several important considerations for infrastructure monitoring:
Dynamic scaling is a feature, not a bug: PCIe generation scaling is an expected power management behavior in modern GPU systems.
Context matters in monitoring: Performance metrics should be evaluated in the context of current workload demands rather than absolute capabilities.
Systematic debugging pays dividends: Using multiple diagnostic tools (nvidia-smi, lspci, dmidecode) provides a complete picture of system state.
Power efficiency at scale: In production environments with hundreds or thousands of GPUs, these power optimizations can result in substantial energy savings during idle periods.
For teams managing GPU infrastructure, understanding these power management behaviors helps distinguish between actual configuration issues and normal operating states. When troubleshooting performance concerns, always verify PCIe generation under load conditions rather than idle states to get an accurate assessment of your system's operational characteristics.
Frequently Asked Questions
Why does nvidia-smi show my GPU at PCIe Gen 1 when the slot is Gen 5?
Modern NVIDIA GPUs implement Active State Power Management (ASPM) and dynamically downshift the PCIe link to Gen 1 when the device is idle to save power. As soon as a workload hits the GPU, the link renegotiates up to its trained maximum generation. Seeing Gen 1 at idle on a Gen 5 slot is expected behavior, not a fault.
How do I confirm the GPU is running at full PCIe speed under load?
Run a workload that actually drives PCIe traffic (e.g. `nvidia-smi dmon`, an NCCL bandwidth test, or a training job) and poll `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv` during the run. The link generation and width should rise to match the values reported by `pcie.link.gen.max` and `pcie.link.width.max`.
Which tools should I use to debug PCIe link state?
Use three complementary tools: `nvidia-smi` for the GPU's view of the link, `lspci -vv` (look at `LnkCap` vs `LnkSta`) for the PCIe controller's view, and `dmidecode -t slot` to confirm the physical slot capability. If all three agree on Gen 5 max and only the current state is Gen 1 at idle, this is power management, not a defect.
When is a downgraded PCIe link actually a problem?
It is a real problem when the link stays at a reduced generation or width under sustained GPU load. Persistent Gen 3 on a Gen 4/5-capable platform, an unexpected x8 width on an x16 slot, or frequent link retraining events in `dmesg` all indicate physical issues — riser cables, seating, firmware mismatches, or a failing slot — and deserve investigation.
Does PCIe power management affect distributed training performance?
Not meaningfully for well-tuned training jobs. NCCL AllReduce and gradient transfers generate enough sustained PCIe traffic to keep the link at its maximum generation. The exception is very bursty, small-message workloads where the link may transition in and out of low-power states — in that case, enabling persistence mode (`nvidia-smi -pm 1`) and disabling PCIe ASPM in the BIOS can help keep the link trained at its maximum generation.