When operating GPU infrastructure—whether for AI, ML, HPC, rendering, or simulations—your uptime and performance depend on knowing whether your GPUs are healthy before they fail catastrophically. Faulty memory lanes, ECC errors, power instability, or thermal issues can degrade performance or cause silent errors.
In this post and subsequent posts, we explore what NVIDIA and AMD offer and how they can be used in your environment.
NVIDIA’s DCGM (Data Center GPU Manager) is a toolkit that provides real-time health checks, diagnostics, and alerts for NVIDIA GPU fleets. Its health and diagnostic features help you:
- Detect latent hardware or configuration issues early
- Automate routine validation and alerting
- Correlate hardware-level failures with workload anomalies
In this post, I’ll walk through how DCGM enables diagnostics and health monitoring—what it offers, how it works, and what to watch out for.
What DCGM Offers
Continuous Health Monitoring
For infrastructure engineers, health monitoring is baseline hygiene.
DCGM tracks everything that matters at the silicon level:
- Memory bandwidth & ECC checks — catch degradation early
- Thermal drift — detect cooling failures and hotspots
- NVLink integrity — ensure interconnect reliability
- Power stability — monitor rails, transients, and throttling
These continuous checks are non-invasive, low-overhead, and essential for keeping a GPU cluster in steady state.
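As a minimal sketch using the dcgmi CLI: field IDs 150 and 155 below are assumed to map to DCGM’s GPU-temperature and power-usage fields, so verify them on your install with dcgmi dmon -l before relying on them.
dcgmi health -s a                     # enable all background health watches on the default group
dcgmi dmon -e 150,155 -d 1000 -c 10   # sample temperature and power 10 times at 1s intervals
dcgmi health -c                       # report anything the background watches have flagged
Because the watches run passively in the hostengine, this pattern adds negligible load and can run alongside production jobs.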
Diagnostics
Monitoring tells you something’s wrong.
Diagnostics tell you what and why.
DCGM diagnostics are invasive — they stress and validate every GPU subsystem.
Ideal for:
- Maintenance windows
- Burn-in testing
- Root-cause analysis
They uncover:
- Deployment and driver issues
- Integration or container runtime conflicts
- Stress-induced thermal/power anomalies
- Hardware-level faults (PCIe, VRAM, regulators)
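Because deep runs take time, they are usually scripted into a maintenance window rather than run by hand. Here is a minimal sketch (the dcgmi CLI is covered below), assuming dcgmi’s convention of a non-zero exit code on failure, which is worth confirming on your DCGM version; the alert address is purely illustrative:
dcgmi diag -r 3 > /var/log/dcgm_diag_$(date +%F).log 2>&1
if [ $? -ne 0 ]; then
    # non-zero exit: at least one subsystem test failed; page the on-call rotation
    echo "DCGM level 3 diagnostic failed on $(hostname)" | mail -s "GPU diag failure" oncall@example.com
fi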
How Diagnostics Work: Levels & Workflows
Diagnostic Levels
DCGM supports multiple diagnostic “levels” (e.g. Level 1 through Level 4). The idea is:
- Level 1 / 2: lightweight, fast sanity checks (suitable for frequent runs)
- Level 3 / 4: deeper stress / memory / link tests (for maintenance windows or postmortems)
You choose a level depending on how deep you want to go and how long you can afford the test to run.
Running Diagnostics via DCGMI
DCGMI is the CLI front end for DCGM. Example commands:
dcgmi diag -r 1 # run level 1 diagnostic
dcgmi diag -r 4 # run deepest diagnostic (if supported)
dcgmi health -s a # start health monitoring on all GPUs
dcgmi health -c # query current health status
You can also tailor diagnostics by adjusting parameters (e.g., memory thresholds, enabling or disabling specific tests).
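For instance, a hedged sketch of extending the stress phase of a level 3 run. Parameter names follow DCGM’s testname.parameter=value convention, but the exact test and parameter names vary across DCGM releases, so confirm them in your version’s documentation:
dcgmi diag -r 3 -p "diagnostic.test_duration=300"   # assumed parameter: run the stress test for ~5 minutes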
Conclusion
NVIDIA’s DCGM is a toolkit for continuous health monitoring and diagnostics of NVIDIA GPUs. At Asama.ai, we integrate deeply with NVIDIA’s tooling to find issues and remediate them.
And yes, we’re always looking for developers who care about infrastructure, observability, and open source.
