
GPU Monitoring & Observability

A 64-GPU training job runs 18 hours, then crashes on an XID error from one bad GPU. Without monitoring, your team restarts on the same node and loses another day. With proper observability, the failing GPU is flagged and drained in under 60 seconds.

100+ GPU metrics tracked
<60s fault detection
99.9% uptime target

What We Do

  • DCGM metrics stack — DCGM Exporter deployment, custom field groups per workload type, collection intervals tuned for training vs inference
  • Prometheus integration — ServiceMonitor/PodMonitor setup, recording rules for cluster aggregations, remote write to Thanos/Cortex for large clusters
  • Grafana dashboards — Cluster overview, per-node GPU detail, job performance correlation, hardware health trends, capacity planning
  • XID error detection — Real-time XID monitoring from kernel logs and DCGM, severity classification, automated node drain for critical errors (31, 43, 45, 48, 64, 69, 74, 79, 92, 119)
  • GPU health monitoring — ECC error trend tracking (predicts failures days in advance), thermal throttling detection, PCIe link speed degradation alerts, NVLink error counters
  • Automated recovery — Detect fault → cordon node → attempt GPU reset → run DCGM diagnostics → uncordon or escalate. Node Problem Detector integration
  • Alerting rules — Critical XID errors, ECC double-bit errors, thermal breaches, stuck/idle workloads, NVLink errors, PCIe bandwidth degradation, OOM prevention
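The alerting rules above can be sketched as Prometheus rules. Metric names follow DCGM Exporter's defaults (`DCGM_FI_DEV_XID_ERRORS`, `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`, `DCGM_FI_DEV_GPU_TEMP`); the thresholds and durations are illustrative, not tuned values:

```yaml
groups:
- name: gpu-health
  rules:
  # DCGM_FI_DEV_XID_ERRORS carries the most recent XID code per GPU;
  # any nonzero value warrants a look. Route the critical codes
  # (31, 43, 45, 48, 64, 69, 74, 79, 92, 119) to paging.
  - alert: GPUXidError
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    labels:
      severity: critical
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} reported XID {{ $value }}"
  # Double-bit ECC errors are uncorrectable; page immediately.
  - alert: GPUDoubleBitECC
    expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
    labels:
      severity: critical
  # Sustained high temperature silently degrades training throughput.
  - alert: GPUThermalBreach
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 5m
    labels:
      severity: warning
```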

Proof

We've deployed GPU monitoring stacks across bare-metal and cloud Kubernetes clusters. Our monitoring setup caught a degrading GPU (rising single-bit ECC errors) 72 hours before it would have caused a job failure — the GPU was drained and replaced during a maintenance window with zero training disruption.

How We Work

1

Assess

Audit current monitoring gaps. Most clusters have zero GPU observability.

2

Deploy

DCGM Exporter, Prometheus, Grafana, alerting, recovery automation.

3

Tune

Adjust thresholds and collection intervals for your SLOs.
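Tuning usually comes down to which DCGM fields the exporter collects and how often. A sketch of a custom counters file, plus the flags that consume it (the field names are standard DCGM counters; the file path and 5-second interval are examples, not recommendations):

```
# custom-counters.csv — fields DCGM Exporter should publish.
# Format: DCGM field, Prometheus type, help text.
DCGM_FI_DEV_GPU_TEMP,          gauge,   GPU temperature (C).
DCGM_FI_DEV_POWER_USAGE,       gauge,   Power draw (W).
DCGM_FI_DEV_XID_ERRORS,        gauge,   Most recent XID error code.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Double-bit ECC errors.
```

Launched with something like `dcgm-exporter -f /etc/dcgm-exporter/custom-counters.csv -c 5000` (collection interval in milliseconds); inference fleets can typically tolerate longer intervals than large training jobs.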

4

Transfer

Dashboards, runbooks, alert playbooks, on-call procedures.

Technologies

DCGM, DCGM Exporter, Prometheus, Grafana, Alertmanager, Thanos, Node Problem Detector, GPU Operator, nvidia-smi, Kubernetes, Slurm


Frequently Asked Questions

What is DCGM?

NVIDIA DCGM (Data Center GPU Manager) is the official toolkit for GPU telemetry, diagnostics, and policy management. It exposes per-GPU utilization, memory, temperature, power, ECC errors, XID events, and PCIe metrics through a Prometheus exporter. It's the reliable source of truth for GPU health.
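DCGM Exporter serves these fields in the Prometheus text exposition format, so they are easy to consume outside Prometheus too. A minimal parsing sketch — the sample scrape below is illustrative, and real scrapes carry more labels (UUID, device, modelName):

```python
# Parse a (sample) DCGM Exporter scrape in Prometheus text exposition format.
import re

SAMPLE_SCRAPE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 88
DCGM_FI_DEV_XID_ERRORS{gpu="1",Hostname="node-a"} 79
"""

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')

def parse_scrape(text):
    """Return a list of (metric, labels-dict, float value) samples."""
    samples = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = LINE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(kv.split("=", 1) for kv in raw_labels.split(","))
        labels = {k: v.strip('"') for k, v in labels.items()}
        samples.append((name, labels, float(value)))
    return samples

if __name__ == "__main__":
    for name, labels, value in parse_scrape(SAMPLE_SCRAPE):
        print(name, labels["gpu"], value)
```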

Which GPU metrics matter for reliability?

SM utilization, memory bandwidth utilization, XID errors, ECC DBE/SBE counts, power draw, thermal throttling events, PCIe replay counts, and NVLink error counters. For training, add NCCL timeouts and AllReduce duration. These catch the majority of hardware and driver issues before jobs crash.
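A toy sketch of how those counters turn into a verdict. The counter names and thresholds here are illustrative placeholders to show the shape of the logic, not tuned recommendations:

```python
# Map a snapshot of per-GPU reliability counters to a health verdict.
# All thresholds are illustrative placeholders, not tuned recommendations.

CRITICAL_XIDS = {31, 43, 45, 48, 64, 69, 74, 79, 92, 119}

def classify(s):
    """s: dict of counter name -> value. Returns (verdict, reasons)."""
    reasons = []
    if s.get("xid", 0) in CRITICAL_XIDS:
        reasons.append(f"critical XID {s['xid']}")
    if s.get("ecc_dbe", 0) > 0:
        reasons.append("double-bit ECC errors (uncorrectable)")
    if s.get("sbe_per_hour", 0) > 100:
        reasons.append("rising single-bit ECC rate (failure precursor)")
    if reasons:
        return "drain", reasons          # pull the node before jobs crash
    if s.get("thermal_throttle", False):
        return "watch", ["thermal throttling"]
    if s.get("pcie_replays", 0) > 1000:
        return "watch", ["elevated PCIe replay count"]
    return "healthy", []
```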

What is an XID error?

XID errors are NVIDIA driver events reported via the kernel log when something goes wrong: ECC failures, a GPU falling off the bus, hardware faults, timeouts. Some are transient; others (like XID 79, "GPU has fallen off the bus") are fatal and require node replacement. Mature monitoring alerts on these and automates node draining for critical codes.
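Detecting these from the kernel log is mostly pattern matching. A sketch of a scanner — the log line format below is typical driver output but can vary slightly across driver versions, and the sample line is illustrative:

```python
import re

# Critical codes mirrored from the detection list above.
CRITICAL_XIDS = {31, 43, 45, 48, 64, 69, 74, 79, 92, 119}

# Typical driver log line (exact format may vary across driver versions):
# NVRM: Xid (PCI:0000:3b:00): 79, pid=4122, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def scan_kernel_log(text):
    """Yield (pci_addr, xid_code, is_critical) for every XID event found."""
    for m in XID_RE.finditer(text):
        pci, xid = m.group(1), int(m.group(2))
        yield pci, xid, xid in CRITICAL_XIDS

SAMPLE = "kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=4122, GPU has fallen off the bus.\n"

if __name__ == "__main__":
    for pci, xid, critical in scan_kernel_log(SAMPLE):
        print(pci, xid, "CRITICAL" if critical else "transient?")
```

In production this would tail `dmesg`/journald continuously and feed critical hits to the node-drain automation.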

How fast can GPU observability detect failures?

With DCGM scraping at 5-15 second intervals and proper alerts, most failures — thermal throttling, ECC storms, XID events, PCIe link downgrade — are detected within a minute. Fail-fast controllers can cordon the node and evict the affected pods automatically, so jobs restart on healthy hardware.
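The recovery flow described earlier (detect fault → cordon → reset → diagnose → uncordon or escalate) can be modeled as a small decision loop with the side-effecting calls (kubectl, nvidia-smi/dcgmi) injected. Everything here is an illustrative skeleton, not a production controller:

```python
def recovery_decision(reset_ok, diag_pass):
    """Decide the next step for a cordoned node after a GPU fault.

    reset_ok:  did the GPU reset attempt succeed?
    diag_pass: did the follow-up DCGM diagnostic run come back clean?
    Returns "uncordon" (node is healthy again) or "escalate" (hand to humans).
    """
    return "uncordon" if (reset_ok and diag_pass) else "escalate"

def handle_fault(node, actions):
    """Run the cordon/reset/diagnose loop via injected actions.

    `actions` supplies cordon/reset_gpu/run_diag/uncordon/escalate callables,
    so the flow can be exercised without touching a real cluster.
    """
    actions.cordon(node)  # stop scheduling new work onto the node
    decision = recovery_decision(actions.reset_gpu(node), actions.run_diag(node))
    if decision == "uncordon":
        actions.uncordon(node)
    else:
        actions.escalate(node)  # e.g. page on-call, open a hardware ticket
    return decision
```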

Can you integrate with our existing monitoring stack?

Yes. DCGM exports Prometheus metrics, so it drops into any stack built on Prometheus, Grafana, VictoriaMetrics, Mimir, Datadog, or Grafana Cloud. We also integrate with PagerDuty, Opsgenie, Loki, and Elastic, and build GPU-specific Grafana dashboards on top of your existing setup.
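Integration is often just Prometheus configuration. A sketch of forwarding only the GPU series to an existing long-term store via remote write — the endpoint URL is a placeholder:

```yaml
remote_write:
  - url: https://thanos-receive.example.internal/api/v1/receive  # placeholder endpoint
    write_relabel_configs:
      # Ship only DCGM GPU metrics to the long-term store.
      - source_labels: [__name__]
        regex: "DCGM_FI_.*"
        action: keep
```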

Tired of GPU Failures Going Undetected?

We build monitoring that catches GPU issues before they crash your training jobs.

Schedule a Call