
I was lucky to work on a computer vision setup which involved NVIDIA RTX 6000 GPUs, Mellanox ONYX switches, and high-resolution cameras. The system was designed for real-time video capture and processing, pushing massive amounts of data through our network infrastructure. Here's a gist of debugging Mellanox switch metrics that led to some surprising discoveries about network traffic flow.

When monitoring network equipment like switches, routers, or network cards, you'll constantly encounter two metrics: RX and TX. These simple abbreviations are fundamental to understanding how data flows through your network, yet they often cause confusion. Let's demystify them with real-world examples from our production Mellanox ONYX switch. If you're operating GPU infrastructure, this is exactly the kind of low-level visibility you need — it's a recurring theme in our GPU monitoring and observability engagements.

{/* truncate */}

The Basics: What Do RX and TX Mean?

RX = Receive - data coming INTO a network port
TX = Transmit - data going OUT of a network port

Think of each network port like a doorway. RX counts everyone walking in, while TX counts everyone walking out. Simple enough, right? The confusion often comes when trying to understand what these patterns mean for your specific setup.
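The same per-direction bookkeeping exists on any Linux host, which makes a handy sanity check before touching switch APIs: the kernel exposes per-interface byte counters under `/sys/class/net`. A quick sketch (the interface name is an assumption - substitute your own, e.g. `eth0`):

```shell
# Read the kernel's per-direction counters for one interface.
# "lo" is only an example; replace with your real NIC (eth0, enp3s0, ...).
IFACE=lo
RX=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
TX=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
echo "$IFACE: RX=$RX bytes, TX=$TX bytes"
```

Note the perspective: these counters are from the host's point of view, just as the switch counters below are from the switch port's point of view.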

Real Network Example: Mellanox ONYX Switch Analysis

Let's look at actual output from a production Mellanox ONYX switch to see how this works in practice. First, let's check which ports are actually up:

curl -k -b cookie.txt -X POST \
  "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "show interfaces ethernet description",
    "execution_type": "sync"
  }'

Output shows our active ports:

{
  "Eth1/1": {
    "Operational state": "Up",
    "Speed": "25G"
  },
  "Eth1/7": {
    "Operational state": "Up",
    "Speed": "25G"
  },
  "Eth1/19": {
    "Operational state": "Up",
    "Speed": "100G"
  },
  "Eth1/21": {
    "Operational state": "Up",
    "Speed": "100G"
  }
}
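If you save that response to a file, jq can reduce it to just the live ports. A minimal sketch assuming the exact JSON shape shown above (the file name `ports.json` is arbitrary):

```shell
# Write the sample response to a file, then list only ports whose
# operational state is "Up", together with their speed
cat > ports.json <<'EOF'
{
  "Eth1/1":  { "Operational state": "Up",   "Speed": "25G" },
  "Eth1/2":  { "Operational state": "Down", "Speed": "25G" },
  "Eth1/19": { "Operational state": "Up",   "Speed": "100G" }
}
EOF

jq -r 'to_entries[]
       | select(.value["Operational state"] == "Up")
       | "\(.key): \(.value.Speed)"' ports.json
```

This prints one `port: speed` line per active port, which is much easier to feed into a monitoring pipeline than the raw nested JSON.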

Reading Traffic Patterns from Real Data

Now let's examine the actual traffic statistics:

"Eth1/7": {
"Rx": {
"packets": "6078365878",
"bytes": "46755253366062",
"packets Jumbo": "6047872646"
},
"Tx": {
"packets": "21008230",
"bytes": "3877663298"
}
}

This port receives 289× more packets than it sends (and roughly 12,000× more bytes) - a classic collector pattern!
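The ratio falls straight out of the counters - plain shell arithmetic is enough:

```shell
# Packet counts for Eth1/7, copied from the counter output above
RX_PKTS=6078365878
TX_PKTS=21008230

# Integer ratio of received to transmitted packets
echo "RX/TX packet ratio: $(( RX_PKTS / TX_PKTS ))"   # → 289
```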

"Eth1/19": {
"Rx": {
"packets": "5619576",
"bytes": "370489758"
},
"Tx": {
"packets": "41154237203",
"bytes": "316279712716706"
}
}

This port transmits roughly 7,300× more packets than it receives - a massive distribution hub!

Why Direction Matters: Full-Duplex Explained

Every network port has two independent data paths. When we see "100G" port speed, that means:

  • 100 Gbps receiving capacity AND
  • 100 Gbps transmitting capacity
  • Total theoretical throughput: 200 Gbps bidirectional

Common Misconceptions Revealed

Initially, you might assume camera-to-server traffic would look like:

  • Camera ports: High TX (sending video)
  • Server ports: High RX (receiving video)

But our real data shows the opposite! Here's the actual traffic flow:

Port 7 (25G) ──RX(6B packets)──> Switch ──TX(41B packets)──> Port 19 (100G)
                                        └──TX(2.6B packets)──> Port 21 (100G)

This suggests Port 7 is receiving from an aggregation point, and Ports 19/21 are distributing to multiple endpoints. The same "don't just look at the packets, look at the fabric" mindset applies when you're tracing AllReduce traffic — see How to calculate if your network is bottlenecking distributed training for the GPU-training equivalent.

Monitoring in Action: Getting Real-Time Metrics

Here's how to monitor these metrics on your switch:

# Get current counters for all interfaces
curl -k -b cookie.txt -X POST \
  "https://192.168.3.5/admin/launch?script=json&template=json-request&action=json-login" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "show interfaces ethernet counters",
    "execution_type": "sync"
  }'

Calculating Actual Utilization

With the data we collected, let's calculate real bandwidth usage:

Port 19 (100G capacity):

  • TX: 316,279,712,716,706 bytes total
  • If accumulated over 30 days: ~0.98 Gbps average
  • Only ~1% of link capacity - long-window averages hide the bursts

Port 7 (25G capacity):

  • RX: 46,755,253,366,062 bytes total
  • If accumulated over 30 days: ~144 Mbps average
  • Using ~0.6% of link capacity
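The conversion is mechanical: bytes × 8 / seconds in the window. A sketch for Port 19 over a 30-day accumulation window:

```shell
BYTES=316279712716706          # Eth1/19 TX bytes from the counters above
WINDOW=$(( 30 * 24 * 3600 ))   # 30 days in seconds

# Average bits per second over the window, then megabits per second
BPS=$(( BYTES * 8 / WINDOW ))
echo "~$(( BPS / 1000000 )) Mbps average"   # → ~976 Mbps average
```

Beware the units: 976 Mbps is ~0.98 Gbps, about 1% of a 100G port - an easy place to slip a factor of 100 when converting by hand.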

Key Metrics to Monitor

From our actual switch output, focus on these fields:

{
  "Primary Metrics": {
    "packets": "Total packet count",
    "bytes": "Total byte count"
  },
  "Health Indicators": {
    "error packets": "0",
    "discard packets": "0",
    "fcs errors": "0"
  },
  "Traffic Types": {
    "unicast packets": "41123912532",
    "multicast packets": "30090769",
    "broadcast packets": "233902"
  }
}
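A quick way to turn those health indicators into an alert is to select any counter that is non-zero. A sketch assuming the field layout above (`counters.json` is an arbitrary file name, and the sample discard value is made up for illustration):

```shell
# Sample health section with one deliberately non-zero counter
cat > counters.json <<'EOF'
{
  "Health Indicators": {
    "error packets": "0",
    "discard packets": "17",
    "fcs errors": "0"
  }
}
EOF

# Print any health counter that is not zero
jq -r '."Health Indicators"
       | to_entries[]
       | select(.value != "0")
       | "\(.key): \(.value)"' counters.json   # → discard packets: 17
```

An empty result means the port is clean; any output is worth investigating before chasing bandwidth numbers.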

Practical Monitoring Script

To continuously monitor RX/TX rates:

# Poll every 30 seconds and calculate rate
# (fill in the full curl command from the counters example where "..." appears)
PREV_RX=0
PREV_TX=0
FIRST=1

while true; do
  # Get current counters (example for Port 19); jq -r strips the quotes
  # around the counter string so bash arithmetic works on the raw number
  CURR_RX=$(curl -sk ... | jq -r '.["Eth1/19"][0]["Rx"][0]["bytes"]')
  CURR_TX=$(curl -sk ... | jq -r '.["Eth1/19"][1]["Tx"][0]["bytes"]')

  # Skip the first sample - a delta against 0 would report a bogus rate
  if [ "$FIRST" -eq 0 ]; then
    # Delta bytes -> bits -> per-second -> Mbps
    RX_RATE=$(( (CURR_RX - PREV_RX) * 8 / 30 / 1000000 ))
    TX_RATE=$(( (CURR_TX - PREV_TX) * 8 / 30 / 1000000 ))
    echo "Port 19: RX=${RX_RATE} Mbps, TX=${TX_RATE} Mbps"
  fi

  FIRST=0
  PREV_RX=$CURR_RX
  PREV_TX=$CURR_TX
  sleep 30
done

The PTP Clue: Understanding the Context

Our switch configuration reveals another important detail:

"show running-config" output:
##
## PTP protocol
##
protocol ptp
interface ethernet 1/1 ptp enable
interface ethernet 1/7 ptp enable
interface ethernet 1/19 ptp enable

PTP (Precision Time Protocol) on all ports indicates this is a video production environment where precise timing synchronization is critical for frame-accurate video capture.

Key Takeaways from Real Data

  1. Don't assume traffic direction - Our "video" ports showed the opposite of the pattern we expected
  2. Jumbo frames hint at video - 6 billion jumbo packets on Port 7 point to large, video-like payloads
  3. 100G ports as distributors - Both 100G ports primarily transmit, indicating fan-out architecture
  4. Monitor both directions - Full-duplex means both paths matter for capacity planning
  5. Context matters - PTP configuration revealed this was video production, explaining the traffic patterns

Remember: These RX/TX metrics are always from the port's perspective. When troubleshooting, physically trace cables or clear counters to see fresh traffic patterns rather than historical accumulation. If you're trying to correlate switch-side counters with distributed training throughput end-to-end, our distributed training optimization work routinely starts at exactly this layer.


Pro tip: If your switch API is slow (>1 second response time), use SNMP polling instead - it's 10-100x faster for retrieving interface counters!
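For SNMP, the standard 64-bit interface counters live in IF-MIB: `ifHCInOctets` (1.3.6.1.2.1.31.1.1.1.6) and `ifHCOutOctets` (1.3.6.1.2.1.31.1.1.1.10). The ifIndex for a given port is switch-specific, so the `19` below is only a placeholder - map your ports via `IF-MIB::ifDescr` first:

```shell
IFINDEX=19   # hypothetical ifIndex for Eth1/19 - verify on your switch
RX_OID="1.3.6.1.2.1.31.1.1.1.6.$IFINDEX"    # IF-MIB::ifHCInOctets
TX_OID="1.3.6.1.2.1.31.1.1.1.10.$IFINDEX"   # IF-MIB::ifHCOutOctets

# Example poll (requires net-snmp and SNMP enabled on the switch):
# snmpget -v2c -c public 192.168.3.5 "$RX_OID" "$TX_OID"
echo "$RX_OID $TX_OID"
```

Prefer the HC (high-capacity) 64-bit counters over the old 32-bit `ifInOctets`/`ifOutOctets` - at 100 Gbps a 32-bit octet counter wraps in well under a second.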


Frequently Asked Questions

What do RX and TX mean on a network switch?

RX (receive) counts traffic coming into a switch port from whatever is cabled to it, and TX (transmit) counts traffic leaving the port. Both counters are from the switch port's point of view, so a server sending data to the switch shows as RX on the switch port and TX on the server's NIC. Every port is full-duplex, so both can run at line rate simultaneously.

Why do my switch ports show such asymmetric RX vs TX traffic?

Asymmetry is normal and tells you the role of the port in your topology. Ports aggregating data from many sources (collectors) show RX far larger than TX. Uplinks and fan-out ports that push data to multiple endpoints show TX far larger than RX. For GPU training clusters, well-behaved AllReduce traffic tends to be balanced — a persistent imbalance during training usually indicates a topology or NCCL ring issue.

How do I calculate actual bandwidth utilization from byte counters?

Sample the byte counter twice at a known interval, take the delta, convert to bits, and divide by the interval to get bps. A simple 30-second polling loop gives you a bandwidth-over-time view. Compare to the port's rated capacity (e.g. 100 Gbps) to get utilization. For short bursts, poll every second or use sFlow/IPFIX — accumulated counters hide burstiness.

Are RX/TX counters reliable for detecting network problems?

Byte and packet counters show throughput, not health. Pair them with error counters (CRC errors, discards, pause frames, ECN marks) and link-state events. A port at 80% utilization with zero errors is healthy; a port at 20% utilization with rising discards and pause frames is a much bigger problem. For RDMA/RoCE fabrics, PFC and ECN counters matter more than raw bandwidth.

Should I use REST/API or SNMP to poll switch counters?

For high-frequency polling use SNMP — it's typically 10-100x faster than vendor REST APIs and designed for exactly this workload. Use REST/API when you need structured data (e.g. per-queue counters) or to drive automation. Many monitoring stacks combine both: SNMP for continuous counter scraping, REST for configuration and richer topology metadata.