
GPU Monitoring & Observability

GPU infrastructure fails silently. Training jobs hang, GPUs degrade, and utilization drops -- often without anyone knowing until hours of compute time are wasted. We build monitoring systems that give you complete visibility into GPU health, performance, and utilization.

  • 100+ GPU metrics tracked
  • <60s fault detection
  • 99.9% uptime target

The Cost of Blind GPU Operations

GPU hardware is the most expensive component of any AI infrastructure. A single DGX H100 system costs over $300,000. Yet most organizations operate their GPU clusters with less visibility than they have into a $50/month web server. Standard Kubernetes monitoring tools like Prometheus node-exporter and cAdvisor provide CPU, memory, and disk metrics but know nothing about GPU-specific health indicators.

Without proper GPU monitoring, organizations face several costly problems. Training jobs fail after hours of computation because a GPU developed ECC errors that went undetected. Utilization across the cluster averages 30% because no one can see which GPUs are idle and why. Thermal throttling silently reduces performance by 20-40% without any alert. Hardware defects cause intermittent failures that are nearly impossible to debug without historical metrics. And when a job does fail, the lack of telemetry means engineers spend hours or days reproducing and diagnosing issues that proper monitoring would have caught immediately.

Real scenario: A 64-GPU training job runs for 18 hours before crashing due to an XID error on a single GPU. Without monitoring, the team restarts the job on the same node, hits the same error 12 hours later, and loses another day of compute. With DCGM monitoring and XID detection, the failing GPU would have been flagged and drained within 60 seconds of the first error.

Our Monitoring Stack

DCGM Metrics Collection

NVIDIA Data Center GPU Manager (DCGM) is the foundation of GPU observability. We deploy DCGM Exporter across your cluster to collect over 100 GPU metrics at configurable intervals. The key metrics we track include:

  • GPU and memory utilization (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL) for understanding workload efficiency
  • GPU memory usage (DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE) for capacity planning and OOM prevention
  • Temperature and power (DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_POWER_USAGE) for thermal management and power budgeting
  • SM clock and memory clock frequencies for detecting thermal throttling
  • PCIe throughput (DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT) for identifying data transfer bottlenecks
  • NVLink bandwidth and error counters for inter-GPU communication health
  • ECC error counts (single-bit and double-bit) for predicting hardware failures

We configure DCGM with custom field groups tailored to your workload -- training clusters need different metrics than inference serving clusters. We also set appropriate collection intervals: high-frequency (1-second) intervals for debugging sessions and standard (10-15 second) intervals for long-term monitoring, balancing granularity against storage costs.
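As an illustration of such a field group, DCGM Exporter accepts a custom counters CSV that selects which fields to export. The field identifiers below are real DCGM fields; the exact selection and file path are assumptions to adapt to your workload.

```csv
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL,          gauge,   GPU utilization (%).
DCGM_FI_DEV_FB_USED,           gauge,   Framebuffer memory used (MiB).
DCGM_FI_DEV_GPU_TEMP,          gauge,   GPU temperature (C).
DCGM_FI_DEV_POWER_USAGE,       gauge,   Power draw (W).
DCGM_FI_DEV_SM_CLOCK,          gauge,   SM clock frequency (MHz).
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total double-bit volatile ECC errors.
DCGM_FI_DEV_XID_ERRORS,        gauge,   Most recent XID error code.
```

DCGM Exporter loads a file like this via its `-f` flag, replacing the default counter set.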

Prometheus Integration

We integrate DCGM Exporter with Prometheus for metrics storage, querying, and alerting. Our Prometheus configuration includes:

  • ServiceMonitor or PodMonitor resources for automatic DCGM Exporter discovery
  • Recording rules that pre-compute common aggregations such as per-node GPU utilization averages and cluster-wide memory usage
  • Alerting rules for critical GPU conditions (detailed below)
  • Retention and storage configuration sized for your cluster

For large clusters with hundreds or thousands of GPUs, we configure Prometheus remote write to long-term storage solutions like Thanos or Cortex to handle the high cardinality of per-GPU metrics.
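A minimal sketch of what such recording rules can look like. The rule names are hypothetical; the `Hostname` and `gpu` labels are the ones dcgm-exporter attaches by default, and `DCGM_FI_DEV_FB_USED` reports MiB, hence the byte conversion.

```yaml
groups:
  - name: gpu-recording
    rules:
      # Average GPU utilization per node.
      - record: node:gpu_utilization:avg
        expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
      # Cluster-wide framebuffer memory in use, converted from MiB to bytes.
      - record: cluster:gpu_fb_used_bytes:sum
        expr: sum(DCGM_FI_DEV_FB_USED) * 1024 * 1024
```

Pre-computing these aggregations keeps dashboard queries cheap even when each GPU contributes its own time series.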

Grafana Dashboards

We build purpose-built Grafana dashboards for different personas and use cases. The Cluster Overview dashboard shows aggregate GPU utilization, memory usage, and health status across the entire cluster at a glance. The Node Detail dashboard drills into individual nodes showing per-GPU metrics, PCIe topology, and NVLink status. The Job Performance dashboard correlates GPU metrics with training job metadata to identify which jobs are underutilizing resources. The Hardware Health dashboard tracks ECC errors, temperature trends, clock frequencies, and power consumption to predict failures before they impact workloads. And the Capacity Planning dashboard shows utilization trends over time to inform procurement and scheduling decisions.

XID Error Detection and GPU Health Monitoring

XID errors are NVIDIA's error reporting mechanism for GPU hardware and driver issues. They range from benign (XID 13: graphics engine exception, often a user-code bug) to critical (XID 79: GPU fallen off the bus, requiring a node reboot). Our comprehensive XID monitoring:

  • Detects XID errors in real time from kernel logs and DCGM
  • Classifies them by severity to determine the appropriate response
  • Alerts the operations team with actionable context: the affected GPU, node, and running workload
  • Triggers automated remediation for known-recoverable errors
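A minimal sketch of kernel-log XID detection: parse dmesg output for the NVRM Xid pattern and classify by severity. The log format shown is what the NVIDIA driver emits; the critical/warning split below is an illustrative subset, not a complete classification.

```python
import re

# Kernel log lines look like:
#   NVRM: Xid (PCI:0000:3b:00.0): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+),")

# Illustrative subset: XIDs that warrant draining the node vs. just logging.
CRITICAL_XIDS = {31, 43, 45, 48, 64, 69, 74, 79, 92, 119}

def parse_xid(line: str):
    """Return (pci_address, xid, severity) for an XID log line, else None."""
    match = XID_PATTERN.search(line)
    if not match:
        return None
    pci, xid = match.group(1), int(match.group(2))
    severity = "critical" if xid in CRITICAL_XIDS else "warning"
    return pci, xid, severity

if __name__ == "__main__":
    line = "NVRM: Xid (PCI:0000:3b:00.0): 79, pid=1234, GPU has fallen off the bus."
    print(parse_xid(line))  # ('PCI:0000:3b:00.0', 79, 'critical')
```

In production this runs as a log watcher (e.g. via Node Problem Detector) rather than a one-shot parse, and feeds the drain/remediation workflow described later.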

Beyond XID errors, we monitor the broader indicators of GPU health. Rising ECC error rates often precede GPU failure by days or weeks, giving you time to migrate workloads before a crash. Clock frequency drops indicate thermal throttling that may point to cooling system issues. And PCIe link speed degradation can indicate a hardware problem with the GPU, riser card, or motherboard.
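One way to spot-check ECC health is to parse `nvidia-smi` query output; the query fields below are real `nvidia-smi` fields, while the corrected-error threshold is an assumption you would tune against your fleet's baseline.

```python
import subprocess

QUERY = ("index,ecc.errors.corrected.aggregate.total,"
         "ecc.errors.uncorrected.aggregate.total")

def parse_ecc_report(csv_text: str, corrected_threshold: int = 100):
    """Flag GPUs whose aggregate ECC counts look unhealthy.

    Any uncorrected (double-bit) error, or a corrected count above the
    threshold, marks the GPU as suspect. The threshold is illustrative.
    """
    suspects = []
    for line in csv_text.strip().splitlines():
        index, corrected, uncorrected = [f.strip() for f in line.split(",")]
        if int(uncorrected) > 0 or int(corrected) > corrected_threshold:
            suspects.append(int(index))
    return suspects

def read_ecc_counts() -> str:
    # Requires an NVIDIA driver; returns CSV rows like "0, 12, 0".
    return subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
```

In practice we track these counters continuously through DCGM rather than polling nvidia-smi, but the same trend logic applies: rising corrected counts are an early warning, any uncorrected count is actionable.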

Alerting Rules We Implement

  • Critical XID errors (31, 43, 45, 48, 64, 69, 74, 79, 92, 119) trigger immediate node drain and page the on-call engineer
  • ECC double-bit errors trigger GPU isolation and workload migration
  • GPU temperature exceeding thermal threshold triggers throttling alerts
  • GPU utilization below 10% for extended periods flags potential stuck or idle workloads
  • NVLink errors exceeding threshold trigger investigation alerts
  • PCIe bandwidth degradation alerts when a link trains at a lower speed than expected
  • Memory utilization approaching 100% triggers OOM prevention warnings
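Two of the rules above, sketched as Prometheus alerting rules. Alert names and thresholds are illustrative; the metric and label names assume dcgm-exporter defaults.

```yaml
groups:
  - name: gpu-alerts
    rules:
      # Double-bit ECC errors: isolate the GPU and migrate workloads.
      - alert: GpuEccDoubleBitErrors
        expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0
        labels:
          severity: critical
        annotations:
          summary: "Double-bit ECC errors on {{ $labels.Hostname }} GPU {{ $labels.gpu }}"
      # Sustained idleness: possible stuck or idle workload.
      - alert: GpuIdle
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} idle for 30m+"
```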

Automated Recovery

Monitoring is only half the equation. When a GPU fails, the system needs to respond automatically to minimize the blast radius and recover as quickly as possible. Our automated recovery workflows:

  • Detect the fault through DCGM metrics or XID error monitoring
  • Cordon and drain the affected node to prevent new workloads from being scheduled on it
  • Attempt a GPU reset if the error is recoverable
  • Run DCGM diagnostics to validate GPU health after the reset
  • Uncordon the node if diagnostics pass, or escalate to human operators if they fail
  • Notify the team with a complete incident summary including root cause, affected workloads, and recovery status

On Kubernetes, we implement this using a combination of custom controllers, Node Problem Detector, and the GPU Operator's health check capabilities. In Slurm environments, we integrate with the Slurm health check framework and prolog/epilog scripts.
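A dry-run sketch of the recovery sequence, expressed as the shell commands a controller would issue. The kubectl, nvidia-smi, and dcgmi invocations are real commands; in production these steps run through the Kubernetes API with health-check gating between them, not as a linear script.

```python
def plan_recovery(node: str, gpu_index: int, recoverable: bool):
    """Return the ordered commands for recovering a node with a faulty GPU."""
    steps = [
        f"kubectl cordon {node}",
        f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data",
    ]
    if recoverable:
        # GPU reset requires no processes on the device (hence the drain).
        steps.append(f"nvidia-smi --gpu-reset -i {gpu_index}")
    # Quick DCGM diagnostic (level 1); use -r 3 for a deeper check.
    steps.append("dcgmi diag -r 1")
    steps.append(f"kubectl uncordon {node}  # only if diagnostics pass")
    return steps

if __name__ == "__main__":
    for step in plan_recovery("gpu-node-07", 3, recoverable=True):
        print(step)
```

The uncordon step is conditional on the diagnostic result; a non-recoverable error instead ends with escalation to a human operator and a hardware ticket.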

Technologies We Work With

DCGM, DCGM Exporter, Prometheus, Grafana, Alertmanager, Thanos, Node Problem Detector, GPU Operator, nvidia-smi, Kubernetes, Slurm

Tired of GPU Failures Going Undetected?

We build monitoring systems that catch GPU issues before they crash your training jobs. Get complete visibility into your GPU infrastructure with production-grade observability.

Schedule a Call