
GPU Monitoring & Observability

GPU infrastructure fails silently. Training jobs hang, GPUs degrade, and utilization drops -- often without anyone knowing until hours of compute time are wasted. We build monitoring systems that give you complete visibility into GPU health, performance, and utilization.

  • 100+ GPU metrics tracked
  • <60s fault detection
  • 99.9% uptime target

The Cost of Blind GPU Operations

GPU hardware is the most expensive component of any AI infrastructure. A single DGX H100 system costs over $300,000. Yet most organizations operate their GPU clusters with less visibility than they have into a $50/month web server. Standard Kubernetes monitoring tools like Prometheus node-exporter and cAdvisor provide CPU, memory, and disk metrics but know nothing about GPU-specific health indicators.

Without proper GPU monitoring, organizations face several costly problems. Training jobs fail after hours of computation because a GPU developed ECC errors that went undetected. Utilization across the cluster averages 30% because no one can see which GPUs are idle and why. Thermal throttling silently reduces performance by 20-40% without any alert. Hardware defects cause intermittent failures that are nearly impossible to debug without historical metrics. And when a job does fail, the lack of telemetry means engineers spend hours or days reproducing and diagnosing issues that proper monitoring would have caught immediately.

Real scenario: A 64-GPU training job runs for 18 hours before crashing due to an XID error on a single GPU. Without monitoring, the team restarts the job on the same node, hits the same error 12 hours later, and loses another day of compute. With DCGM monitoring and XID detection, the failing GPU would have been flagged and drained within 60 seconds of the first error.

Our Monitoring Stack

DCGM Metrics Collection

NVIDIA Data Center GPU Manager (DCGM) is the foundation of GPU observability. We deploy DCGM Exporter across your cluster to collect over 100 GPU metrics at configurable intervals. The key metrics we track include:

  • GPU and memory utilization (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL) for understanding workload efficiency
  • GPU memory usage (DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE) for capacity planning and OOM prevention
  • Temperature and power (DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_POWER_USAGE) for thermal management and power budgeting
  • SM clock and memory clock frequencies for detecting thermal throttling
  • PCIe throughput (DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT) for identifying data transfer bottlenecks
  • NVLink bandwidth and error counters for inter-GPU communication health
  • ECC error counts (single-bit and double-bit) for predicting hardware failures

We configure DCGM with custom field groups tailored to your workload -- training clusters need different metrics than inference serving clusters. We also set appropriate collection intervals: high-frequency (1-second) intervals for debugging sessions and standard (10-15 second) intervals for long-term monitoring, balancing granularity against storage costs.
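As an illustration of such a field group, DCGM Exporter accepts a custom counters CSV that selects which fields to export. The field identifiers below are real DCGM fields; the exact selection and file path are assumptions to adapt to your workload.

```csv
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL,          gauge,   GPU utilization (%).
DCGM_FI_DEV_FB_USED,           gauge,   Framebuffer memory used (MiB).
DCGM_FI_DEV_GPU_TEMP,          gauge,   GPU temperature (C).
DCGM_FI_DEV_POWER_USAGE,       gauge,   Power draw (W).
DCGM_FI_DEV_SM_CLOCK,          gauge,   SM clock frequency (MHz).
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total double-bit volatile ECC errors.
DCGM_FI_DEV_XID_ERRORS,        gauge,   Most recent XID error code.
```

DCGM Exporter loads a file like this via its `-f` flag, replacing the default counter set.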

Prometheus Integration

We integrate DCGM Exporter with Prometheus for metrics storage, querying, and alerting. Our Prometheus configuration includes:

  • ServiceMonitor or PodMonitor resources for automatic DCGM Exporter discovery
  • Recording rules that pre-compute common aggregations such as per-node GPU utilization averages and cluster-wide memory usage
  • Alerting rules for critical GPU conditions (detailed below)
  • Retention and storage configuration sized for your cluster

For large clusters with hundreds or thousands of GPUs, we configure Prometheus remote write to long-term storage solutions like Thanos or Cortex to handle the high cardinality of per-GPU metrics.
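A minimal sketch of what such recording rules can look like. The rule names are hypothetical; the `Hostname` and `gpu` labels are the ones dcgm-exporter attaches by default, and `DCGM_FI_DEV_FB_USED` reports MiB, hence the byte conversion.

```yaml
groups:
  - name: gpu-recording
    rules:
      # Average GPU utilization per node.
      - record: node:gpu_utilization:avg
        expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
      # Cluster-wide framebuffer memory in use, converted from MiB to bytes.
      - record: cluster:gpu_fb_used_bytes:sum
        expr: sum(DCGM_FI_DEV_FB_USED) * 1024 * 1024
```

Pre-computing these aggregations keeps dashboard queries cheap even when each GPU contributes its own time series.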

Grafana Dashboards

We build purpose-built Grafana dashboards for different personas and use cases. The Cluster Overview dashboard shows aggregate GPU utilization, memory usage, and health status across the entire cluster at a glance. The Node Detail dashboard drills into individual nodes showing per-GPU metrics, PCIe topology, and NVLink status. The Job Performance dashboard correlates GPU metrics with training job metadata to identify which jobs are underutilizing resources. The Hardware Health dashboard tracks ECC errors, temperature trends, clock frequencies, and power consumption to predict failures before they impact workloads. And the Capacity Planning dashboard shows utilization trends over time to inform procurement and scheduling decisions.

XID Error Detection and GPU Health Monitoring

XID errors are NVIDIA's error reporting mechanism for GPU hardware and driver issues. They range from benign (XID 13: graphics engine exception, often a user-code bug) to critical (XID 79: GPU fallen off the bus, requiring a node reboot). Our comprehensive XID monitoring:

  • Detects XID errors in real time from kernel logs and DCGM
  • Classifies them by severity to determine the appropriate response
  • Alerts the operations team with actionable context: the affected GPU, node, and running workload
  • Triggers automated remediation for known-recoverable errors
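A minimal sketch of kernel-log XID detection: parse dmesg output for the NVRM Xid pattern and classify by severity. The log format shown is what the NVIDIA driver emits; the critical/warning split below is an illustrative subset, not a complete classification.

```python
import re

# Kernel log lines look like:
#   NVRM: Xid (PCI:0000:3b:00.0): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+),")

# Illustrative subset: XIDs that warrant draining the node vs. just logging.
CRITICAL_XIDS = {31, 43, 45, 48, 64, 69, 74, 79, 92, 119}

def parse_xid(line: str):
    """Return (pci_address, xid, severity) for an XID log line, else None."""
    match = XID_PATTERN.search(line)
    if not match:
        return None
    pci, xid = match.group(1), int(match.group(2))
    severity = "critical" if xid in CRITICAL_XIDS else "warning"
    return pci, xid, severity

if __name__ == "__main__":
    line = "NVRM: Xid (PCI:0000:3b:00.0): 79, pid=1234, GPU has fallen off the bus."
    print(parse_xid(line))  # ('PCI:0000:3b:00.0', 79, 'critical')
```

In production this runs as a log watcher (e.g. via Node Problem Detector) rather than a one-shot parse, and feeds the drain/remediation workflow described later.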

Beyond XID errors, we monitor the broader indicators of GPU health. Rising ECC error rates often precede GPU failure by days or weeks, giving you time to migrate workloads before a crash. Clock frequency drops indicate thermal throttling that may point to cooling system issues. And PCIe link speed degradation can indicate a hardware problem with the GPU, riser card, or motherboard.
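One way to spot-check ECC health is to parse `nvidia-smi` query output; the query fields below are real `nvidia-smi` fields, while the corrected-error threshold is an assumption you would tune against your fleet's baseline.

```python
import subprocess

QUERY = ("index,ecc.errors.corrected.aggregate.total,"
         "ecc.errors.uncorrected.aggregate.total")

def parse_ecc_report(csv_text: str, corrected_threshold: int = 100):
    """Flag GPUs whose aggregate ECC counts look unhealthy.

    Any uncorrected (double-bit) error, or a corrected count above the
    threshold, marks the GPU as suspect. The threshold is illustrative.
    """
    suspects = []
    for line in csv_text.strip().splitlines():
        index, corrected, uncorrected = [f.strip() for f in line.split(",")]
        if int(uncorrected) > 0 or int(corrected) > corrected_threshold:
            suspects.append(int(index))
    return suspects

def read_ecc_counts() -> str:
    # Requires an NVIDIA driver; returns CSV rows like "0, 12, 0".
    return subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
```

In practice we track these counters continuously through DCGM rather than polling nvidia-smi, but the same trend logic applies: rising corrected counts are an early warning, any uncorrected count is actionable.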

Alerting Rules We Implement

  • Critical XID errors (31, 43, 45, 48, 64, 69, 74, 79, 92, 119) trigger immediate node drain and page the on-call engineer
  • ECC double-bit errors trigger GPU isolation and workload migration
  • GPU temperature exceeding thermal threshold triggers throttling alerts
  • GPU utilization below 10% for extended periods flags potential stuck or idle workloads
  • NVLink errors exceeding threshold trigger investigation alerts
  • PCIe bandwidth degradation alerts when a link trains at a lower speed than expected
  • Memory utilization approaching 100% triggers OOM prevention warnings
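Two of the rules above, sketched as Prometheus alerting rules. Alert names and thresholds are illustrative; the metric and label names assume dcgm-exporter defaults.

```yaml
groups:
  - name: gpu-alerts
    rules:
      # Double-bit ECC errors: isolate the GPU and migrate workloads.
      - alert: GpuEccDoubleBitErrors
        expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0
        labels:
          severity: critical
        annotations:
          summary: "Double-bit ECC errors on {{ $labels.Hostname }} GPU {{ $labels.gpu }}"
      # Sustained idleness: possible stuck or idle workload.
      - alert: GpuIdle
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} idle for 30m+"
```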

Automated Recovery

Monitoring is only half the equation. When a GPU fails, the system needs to respond automatically to minimize the blast radius and recover as quickly as possible. Our automated recovery workflows:

  • Detect the fault through DCGM metrics or XID error monitoring
  • Cordon and drain the affected node to prevent new workloads from being scheduled on it
  • Attempt a GPU reset if the error is recoverable
  • Run DCGM diagnostics to validate GPU health after the reset
  • Uncordon the node if diagnostics pass, or escalate to human operators if they fail
  • Notify the team with a complete incident summary including root cause, affected workloads, and recovery status

On Kubernetes, we implement this using a combination of custom controllers, Node Problem Detector, and the GPU Operator's health check capabilities. In Slurm environments, we integrate with the Slurm health check framework and prolog/epilog scripts.
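A dry-run sketch of the recovery sequence, expressed as the shell commands a controller would issue. The kubectl, nvidia-smi, and dcgmi invocations are real commands; in production these steps run through the Kubernetes API with health-check gating between them, not as a linear script.

```python
def plan_recovery(node: str, gpu_index: int, recoverable: bool):
    """Return the ordered commands for recovering a node with a faulty GPU."""
    steps = [
        f"kubectl cordon {node}",
        f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data",
    ]
    if recoverable:
        # GPU reset requires no processes on the device (hence the drain).
        steps.append(f"nvidia-smi --gpu-reset -i {gpu_index}")
    # Quick DCGM diagnostic (level 1); use -r 3 for a deeper check.
    steps.append("dcgmi diag -r 1")
    steps.append(f"kubectl uncordon {node}  # only if diagnostics pass")
    return steps

if __name__ == "__main__":
    for step in plan_recovery("gpu-node-07", 3, recoverable=True):
        print(step)
```

The uncordon step is conditional on the diagnostic result; a non-recoverable error instead ends with escalation to a human operator and a hardware ticket.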

Technologies We Work With

DCGM, DCGM Exporter, Prometheus, Grafana, Alertmanager, Thanos, Node Problem Detector, GPU Operator, nvidia-smi, Kubernetes, Slurm

Tired of GPU Failures Going Undetected?

We build monitoring systems that catch GPU issues before they crash your training jobs. Get complete visibility into your GPU infrastructure with production-grade observability.

Schedule a Call