
3 posts tagged with "infrastructure"


· 6 min read

If you're building a multi-node GPU cluster for distributed training, you've probably run into a confusing mess of terminology — NVLink, NVSwitch, InfiniBand, RoCE, GPUDirect. Half the blog posts out there mix these up, and vendor documentation assumes you already know what you're doing.

So let's sort this out.

· 8 min read

I was lucky to work on a computer vision setup which involved NVIDIA RTX 6000 GPUs, Mellanox ONYX switches, and high-resolution cameras. The system was designed for real-time video capture and processing, pushing massive amounts of data through our network infrastructure. Here's a walkthrough of debugging Mellanox switch metrics that led to some surprising discoveries about network traffic flow.

When monitoring network equipment like switches, routers, or network cards, you'll constantly encounter two metrics: RX and TX. These simple abbreviations are fundamental to understanding how data flows through your network, yet they often cause confusion. Let's demystify them with real-world examples from our production Mellanox ONYX switch. If you're operating GPU infrastructure, this is exactly the kind of low-level visibility you need — it's a recurring theme in our GPU monitoring and observability engagements.


The Basics: What Do RX and TX Mean?

RX = Receive - Data coming INTO a network port
TX = Transmit - Data going OUT of a network port

Think of each network port like a doorway. RX counts everyone walking in, while TX counts everyone walking out. Simple enough, right? The confusion often comes when trying to understand what these patterns mean for your specific setup.

Real Network Example: Mellanox ONYX Switch Analysis

Let's look at actual output from a production Mellanox ONYX switch to see how this works in practice. First, let's check which ports are actually up:

curl -k -b cookie.txt -X POST "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "show interfaces ethernet description",
    "execution_type": "sync"
  }'

Output shows our active ports:

{
  "Eth1/1": {
    "Operational state": "Up",
    "Speed": "25G"
  },
  "Eth1/7": {
    "Operational state": "Up",
    "Speed": "25G"
  },
  "Eth1/19": {
    "Operational state": "Up",
    "Speed": "100G"
  },
  "Eth1/21": {
    "Operational state": "Up",
    "Speed": "100G"
  }
}
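One note on authentication: every query in this post sends a session cookie with -b cookie.txt. A minimal sketch of obtaining that cookie is below - the endpoint is taken from the query URL above, but the credential payload is an assumption, so check your ONYX JSON API documentation for the exact login flow:

# Log in once and save the session cookie that later queries send with -b cookie.txt
# (admin/admin are placeholders; the JSON credential format is an assumption for this sketch)
curl -k -c cookie.txt -X POST "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin"}'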

Reading Traffic Patterns from Real Data

Now let's examine the actual traffic statistics:

"Eth1/7": {
"Rx": {
"packets": "6078365878",
"bytes": "46755253366062",
"packets Jumbo": "6047872646"
},
"Tx": {
"packets": "21008230",
"bytes": "3877663298"
}
}

This port receives 289× more packets than it sends (and, by bytes, over 12,000× more) - a classic collector pattern!

"Eth1/19": {
"Rx": {
"packets": "5619576",
"bytes": "370489758"
},
"Tx": {
"packets": "41154237203",
"bytes": "316279712716706"
}
}

This port transmits 7,324× more packets than it receives - a massive distribution hub!

Why Direction Matters: Full-Duplex Explained

Every network port has two independent data paths. When we see "100G" port speed, that means:

  • 100 Gbps receiving capacity AND
  • 100 Gbps transmitting capacity
  • Total theoretical throughput: 200 Gbps bidirectional

Common Misconceptions Revealed

Initially, you might assume camera-to-server traffic would look like:

  • Camera ports: High TX (sending video)
  • Server ports: High RX (receiving video)

But our real data shows the opposite! Here's the actual traffic flow:

Port 7 (25G) ──RX(6B packets)──> Switch ──TX(41B packets)──> Port 19 (100G)
                                       └──TX(2.6B packets)──> Port 21 (100G)

This suggests Port 7 is receiving from an aggregation point, and Ports 19/21 are distributing to multiple endpoints. The same "don't just look at the packets, look at the fabric" mindset applies when you're tracing AllReduce traffic — see How to calculate if your network is bottlenecking distributed training for the GPU-training equivalent.

Monitoring in Action: Getting Real-Time Metrics

Here's how to monitor these metrics on your switch:

# Get current counters for all interfaces
curl -k -b cookie.txt -X POST "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "show interfaces ethernet counters",
    "execution_type": "sync"
  }'
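The full counters response covers every interface; to pull out just the RX and TX byte counters for a single port, pipe it through jq. The paths below mirror the ones used in the polling script later in this post - the exact nesting varies between ONYX releases, so verify against your own output first:

# Extract RX/TX byte counters for Eth1/19 (JSON paths assumed, matching the polling script below)
curl -sk -b cookie.txt -X POST "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{"cmd": "show interfaces ethernet counters", "execution_type": "sync"}' \
  | jq '{rx_bytes: .["Eth1/19"][0]["Rx"][0]["bytes"], tx_bytes: .["Eth1/19"][1]["Tx"][0]["bytes"]}'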

Calculating Actual Utilization

With the data we collected, let's calculate average bandwidth usage. The byte counters alone don't tell you the rate; you also need to know how long they have been accumulating:

Port 19 (100G capacity):

  • TX: 316,279,712,716,706 bytes total (~2.5 petabits)
  • Accumulated over 30 days: ~976 Mbps average, about 1% of capacity
  • Accumulated over roughly 7 hours: ~97 Gbps average, near line rate

Port 7 (25G capacity):

  • RX: 46,755,253,366,062 bytes total (~374 terabits)
  • Accumulated over 30 days: ~144 Mbps average
  • Accumulated over roughly 7 hours: ~14.4 Gbps, about 58% of link capacity
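The conversion itself fits in a couple of shell lines. Here's a minimal sketch using the Port 19 counter from above; the 30-day window is an assumption about how long the counter had been accumulating, which is exactly why clearing counters before a measurement (or polling deltas, as in the script below) gives far more useful numbers:

# Turn an accumulated byte counter into an average rate over an assumed window
BYTES=316279712716706            # Port 19 TX bytes from the output above
WINDOW=$((30 * 24 * 3600))       # assumed accumulation window: 30 days = 2,592,000 s
echo "average: $(( BYTES * 8 / WINDOW / 1000000 )) Mbps"   # prints ~976 Mbps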

Key Metrics to Monitor

From our actual switch output, focus on these fields:

{
  "Primary Metrics": {
    "packets": "Total packet count",
    "bytes": "Total byte count"
  },
  "Health Indicators": {
    "error packets": "0",
    "discard packets": "0",
    "fcs errors": "0"
  },
  "Traffic Types": {
    "unicast packets": "41123912532",
    "multicast packets": "30090769",
    "broadcast packets": "233902"
  }
}

Practical Monitoring Script

To continuously monitor RX/TX rates:

# Poll every 30 seconds and calculate rate
PREV_RX=0
PREV_TX=0

while true; do
  # Get current counters (example for Port 19); -r makes jq print the value without quotes
  CURR_RX=$(curl -sk ... | jq -r '.["Eth1/19"][0]["Rx"][0]["bytes"]')
  CURR_TX=$(curl -sk ... | jq -r '.["Eth1/19"][1]["Tx"][0]["bytes"]')

  # Convert the 30-second byte delta to Mbps (the first iteration is meaningless since PREV starts at 0)
  RX_RATE=$(( ($CURR_RX - $PREV_RX) * 8 / 30 / 1000000 ))
  TX_RATE=$(( ($CURR_TX - $PREV_TX) * 8 / 30 / 1000000 ))

  echo "Port 19: RX=${RX_RATE} Mbps, TX=${TX_RATE} Mbps"

  PREV_RX=$CURR_RX
  PREV_TX=$CURR_TX
  sleep 30
done

The PTP Clue: Understanding the Context

Our switch configuration reveals another important detail:

"show running-config" output:
##
## PTP protocol
##
protocol ptp
interface ethernet 1/1 ptp enable
interface ethernet 1/7 ptp enable
interface ethernet 1/19 ptp enable

PTP (Precision Time Protocol) enabled on these interfaces indicates this is a video production environment where precise timing synchronization is critical for frame-accurate video capture.

Key Takeaways from Real Data

  1. Don't assume traffic direction - Our "video" ports showed the opposite of the patterns we expected
  2. Jumbo frames hint at video - 6 billion jumbo packets on Port 7 are consistent with high-throughput video traffic
  3. 100G ports as distributors - Both 100G ports primarily transmit, indicating fan-out architecture
  4. Monitor both directions - Full-duplex means both paths matter for capacity planning
  5. Context matters - PTP configuration revealed this was video production, explaining the traffic patterns

Remember: These RX/TX metrics are always from the port's perspective. When troubleshooting, physically trace cables or clear counters to see fresh traffic patterns rather than historical accumulation. If you're trying to correlate switch-side counters with distributed training throughput end-to-end, our distributed training optimization work routinely starts at exactly this layer.


Pro tip: If your switch API is slow (>1 second response time), use SNMP polling instead - it's 10-100x faster for retrieving interface counters!
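For example, the standard 64-bit interface octet counters live in IF-MIB and can be scraped with one walk per direction. This sketch assumes SNMP v2c is enabled on the switch with a read community configured ("public" here is just a placeholder):

# 64-bit byte counters from IF-MIB: ifHCInOctets = RX bytes, ifHCOutOctets = TX bytes
snmpwalk -v2c -c public 192.168.3.5 1.3.6.1.2.1.31.1.1.1.6    # IF-MIB::ifHCInOctets  (RX)
snmpwalk -v2c -c public 192.168.3.5 1.3.6.1.2.1.31.1.1.1.10   # IF-MIB::ifHCOutOctets (TX)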


Frequently Asked Questions

What do RX and TX mean on a network switch?

RX (receive) counts traffic coming into a switch port from whatever is cabled to it, and TX (transmit) counts traffic leaving the port. Both counters are from the switch port's point of view, so a server sending data to the switch shows as RX on the switch port and TX on the server's NIC. Every port is full-duplex, so both can run at line rate simultaneously.

Why do my switch ports show such asymmetric RX vs TX traffic?

Asymmetry is normal and tells you the role of the port in your topology. Ports aggregating data from many sources (collectors) show RX far larger than TX. Uplinks and fan-out ports that push data to multiple endpoints show TX far larger than RX. For GPU training clusters, well-behaved AllReduce traffic tends to be balanced — a persistent imbalance during training usually indicates a topology or NCCL ring issue.

How do I calculate actual bandwidth utilization from byte counters?

Sample the byte counter twice at a known interval, take the delta, convert to bits, and divide by the interval to get bps. A simple 30-second polling loop gives you a bandwidth-over-time view. Compare to the port's rated capacity (e.g. 100 Gbps) to get utilization. For short bursts, poll every second or use sFlow/IPFIX — accumulated counters hide burstiness.

Are RX/TX counters reliable for detecting network problems?

Byte and packet counters show throughput, not health. Pair them with error counters (CRC errors, discards, pause frames, ECN marks) and link-state events. A port at 80% utilization with zero errors is healthy; a port at 20% utilization with rising discards and pause frames is a much bigger problem. For RDMA/RoCE fabrics, PFC and ECN counters matter more than raw bandwidth.

Should I use REST/API or SNMP to poll switch counters?

For high-frequency polling use SNMP — it's typically 10-100x faster than vendor REST APIs and designed for exactly this workload. Use REST/API when you need structured data (e.g. per-queue counters) or to drive automation. Many monitoring stacks combine both: SNMP for continuous counter scraping, REST for configuration and richer topology metadata.


· 6 min read

When monitoring GPU infrastructure, you might occasionally notice something peculiar: your high-end GPUs connected to PCIe 5.0 slots are operating at PCIe Gen 1 speeds. Before raising any alarms, let's walk through a systematic debugging approach that reveals this is often expected behavior rather than a configuration issue.


The Discovery

During routine infrastructure monitoring on our dual-GPU setup, we observed an interesting discrepancy. Running a simple nvidia-smi query revealed our GPUs were operating at PCIe Gen 1:

nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max --format=csv

The output fields tell an important story:

  • index: GPU identifier
  • name: GPU model designation
  • pcie.link.gen.current: Active PCIe generation
  • pcie.link.gen.max: Maximum supported PCIe generation

The performance implications seemed significant at first glance. PCIe Gen 1 provides approximately 250 MB/s per lane, while PCIe Gen 4 delivers around 2 GB/s per lane. For a standard x16 configuration, this translates to:

  • Current state (Gen 1): ~4 GB/s total bandwidth
  • Potential capacity (Gen 4): ~32 GB/s total bandwidth

Verifying the Configuration

To ensure this wasn't a hardware misconfiguration, we proceeded with a thorough investigation using standard Linux diagnostic tools. This kind of low-level verification is one of the signals we rely on when building a GPU observability and reliability stack — a single downgraded link can otherwise hide behind green dashboards for weeks.

First, we identified the NVIDIA devices and examined their link capabilities:

sudo lspci | grep NVIDIA
sudo lspci -vv -s 05:00.0 | grep -E "LnkCap:|LnkSta:"
sudo lspci -vv -s e1:00.0 | grep -E "LnkCap:|LnkSta:"

The output confirmed both GPUs were indeed operating at 2.5GT/s (Gen 1 speed), with LnkSta showing the link as downgraded from the maximum advertised in LnkCap.

Motherboard Slot Analysis

To rule out physical slot limitations, we examined the system's slot configuration:

sudo dmidecode -t slot | grep -A5 -B5 "PCI"

The results were illuminating:

  • Slot PCIEx16(G5)_1 at Bus 0000:e1:00.0 hosting GPU 1
  • Slot PCIEx16(G5)_7 at Bus 0000:05:00.0 hosting GPU 0

Both GPUs were properly seated in PCIe 5.0 x16 slots, confirming the hardware configuration was correct.

Understanding the Behavior

After confirming the hardware configuration was sound, the explanation became clear: this is intelligent power management at work. Modern GPUs implement sophisticated power-saving mechanisms that dynamically adjust PCIe link speeds based on workload demands.

When GPUs are idle or under minimal load, they automatically scale down to PCIe Gen 1 to reduce power consumption. This behavior is particularly important in datacenter environments where energy efficiency directly impacts operational costs. The GPUs will automatically negotiate higher PCIe generations when computational demands require increased bandwidth.
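An easy way to see this behavior for yourself is to watch the link generation while a workload ramps up. A simple one-second polling loop works; any GPU-heavy job in another terminal will do as the load:

# Poll link generation and GPU utilization once per second; start a training job or
# bandwidth test in another terminal and watch pcie.link.gen.current climb from 1
watch -n 1 'nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,utilization.gpu --format=csv,noheader'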

The same "is it actually a problem?" question comes up for network counters too — see Understanding RX vs TX traffic direction for an example where the raw numbers look alarming but context reveals normal behavior.

Key Takeaways

This investigation highlights several important considerations for infrastructure monitoring:

  1. Dynamic scaling is a feature, not a bug: PCIe generation scaling is an expected power management behavior in modern GPU systems.

  2. Context matters in monitoring: Performance metrics should be evaluated in the context of current workload demands rather than absolute capabilities.

  3. Systematic debugging pays dividends: Using multiple diagnostic tools (nvidia-smi, lspci, dmidecode) provides a complete picture of system state.

  4. Power efficiency at scale: In production environments with hundreds or thousands of GPUs, these power optimizations can result in substantial energy savings during idle periods.

For teams managing GPU infrastructure, understanding these power management behaviors helps distinguish between actual configuration issues and normal operating states. When troubleshooting performance concerns, always verify PCIe generation under load conditions rather than idle states to get an accurate assessment of your system's operational characteristics.


Frequently Asked Questions

Why does nvidia-smi show my GPU at PCIe Gen 1 when the slot is Gen 5?

Modern NVIDIA GPUs implement Active State Power Management and dynamically downshift the PCIe link to Gen 1 when the device is idle to save power. As soon as a workload hits the GPU, the link renegotiates up to its trained maximum generation. Seeing Gen 1 at idle on a Gen 5 slot is expected behavior, not a fault.

How do I confirm the GPU is running at full PCIe speed under load?

Run a workload that actually drives PCIe traffic (e.g. `nvidia-smi dmon`, an NCCL bandwidth test, or a training job) and poll `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv` during the run. The link generation and width should rise to match the values reported by `pcie.link.gen.max` and `pcie.link.width.max`.

Which tools should I use to debug PCIe link state?

Use three complementary tools: `nvidia-smi` for the GPU's view of the link, `lspci -vv` (look at `LnkCap` vs `LnkSta`) for the PCIe controller's view, and `dmidecode -t slot` to confirm the physical slot capability. If all three agree on Gen 5 max and only the current state is Gen 1 at idle, this is power management, not a defect.

When is a downgraded PCIe link actually a problem?

It is a real problem when the link stays at a reduced generation or width under sustained GPU load. Persistent Gen 3 on a Gen 4/5-capable platform, an unexpected x8 width on an x16 slot, or frequent link retraining events in `dmesg` all indicate physical issues — riser cables, seating, firmware mismatches, or a failing slot — and deserve investigation.

Does PCIe power management affect distributed training performance?

Not meaningfully for well-tuned training jobs. NCCL AllReduce and gradient transfers generate enough sustained PCIe traffic to keep the link at its maximum generation. The exception is very bursty, small-message workloads where the link may transition in and out of low-power states - in that case, enabling persistence mode (`nvidia-smi -pm 1`) and tuning ASPM in BIOS/ACPI settings can reduce those transitions.