GPU Health & Diagnostics (Part 2): AMD GPUs with AMD SMI

In our previous post, we explored NVIDIA’s DCGM toolkit and how it enables real-time GPU health monitoring, diagnostics, and alerts in data-center environments.
In this post, we’ll look at the AMD ecosystem — specifically, how to manage and monitor AMD GPUs using the AMD System Management Interface (SMI).

What Is AMD SMI?

AMD SMI (System Management Interface) is the command-line and library toolset that gives you visibility into the operational health of AMD GPUs.
It’s part of the ROCm (Radeon Open Compute) platform and provides low-level access to telemetry, configuration, and diagnostic data.

Think of it as AMD’s closest equivalent to DCGM. It lets you:

  • Monitor GPU health and utilization
  • Track power, temperature, and memory stats
  • Perform diagnostics and resets
  • Manage performance states (clocks, power caps)
  • Collect hardware telemetry for observability systems
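
If ROCm is installed, the toolset is typically exposed through the amd-smi CLI (older installs ship rocm-smi instead). As a quick sanity check, assuming a recent ROCm release:

amd-smi version      # report the tool, library, and ROCm versions in use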

What AMD SMI Offers

Basic GPU Information

The first step in understanding any GPU node is gathering hardware identity and topology details.
AMD SMI lets you query model names, firmware versions, and interconnect topology, so you know exactly what’s running in your system.

These queries return data such as:

  • Product name and SKU: confirms whether the node runs an MI300X, an MI250, or an older Instinct-series part.
  • Firmware and VBIOS versions: critical for validating driver compatibility across mixed clusters.
  • PCIe configuration and NUMA locality: shows how each GPU is wired to the CPU or switch fabric, helping diagnose I/O bottlenecks and cross-NUMA latency.
  • Topology information: visualizes multi-GPU interconnects (xGMI or Infinity Fabric links) for debugging peer-to-peer bandwidth issues.

Having this baseline inventory makes it easier to detect firmware drift, inconsistent driver stacks, or nodes provisioned with mismatched GPUs.
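
As a sketch of how this looks in practice (sub-commands and output fields vary between ROCm releases, so confirm against amd-smi --help on your nodes):

amd-smi list           # enumerate GPUs with their IDs and PCIe bus addresses
amd-smi static         # static inventory: product name, VBIOS, driver, and board details
amd-smi topology       # xGMI / PCIe link topology between GPUs in the node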

Health & Telemetry

Health telemetry is at the core of day-to-day GPU monitoring.
AMD SMI exposes real-time operational metrics that reflect the GPU’s thermal, electrical, and workload conditions.

These metrics include:

  • Temperature sensors: report edge, junction, and memory temperatures. Rising averages often signal airflow or thermal-paste degradation.
  • Power draw and voltage rails: track instantaneous and average consumption versus TDP limits; useful for identifying throttling or PSU saturation.
  • Fan speed and control mode: confirm thermal-regulation behavior; fans stuck at a fixed RPM indicate firmware or sensor faults.
  • Performance levels: show current clock domains and frequency-scaling behavior; helpful for detecting down-clocking under load.
  • HBM/VRAM utilization: tracks buffer pressure and helps correlate performance dips with paging or over-subscription.

Collecting these metrics periodically builds the baseline for drift detection, capacity planning, and thermal optimization across large GPU clusters.
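
For example, a minimal sketch of pulling these metrics from the CLI (flag and field names differ between ROCm releases, so treat these as illustrative and check amd-smi metric --help):

amd-smi metric         # one-shot dump of temperatures, power, clocks, fan state, and memory use
amd-smi monitor        # continuously sample utilization, power, and temperature per GPU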

Diagnostics & Events

AMD GPUs incorporate RAS (Reliability, Availability, and Serviceability) features to detect and report hardware errors before they impact workloads.
AMD SMI provides direct access to this diagnostic layer.

These outputs include:

  • ECC error counts: both correctable and uncorrectable events for VRAM or cache. Spikes often indicate memory degradation or cooling issues.
  • RAS feature status: confirms whether ECC, page retirement, and poison handling are active.
  • Event logs: record GPU resets, hangs, and driver-level recoveries with timestamps.
  • Error categories: classify faults (memory, PCIe, fabric, thermal) to help automate root-cause analysis.

Regularly parsing this data helps you detect failing GPUs early, isolate unstable hosts, and correlate hardware faults with workload or environmental changes.
It’s the diagnostic backbone for proactive maintenance and warranty tracking.
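
A hedged sketch of pulling this RAS data from the CLI (the --ecc flag shown here is an assumption; verify the exact options with amd-smi --help for your ROCm version):

amd-smi metric --ecc   # correctable / uncorrectable ECC counters per GPU (flag name may differ)
amd-smi bad-pages      # memory pages retired by the RAS subsystem
amd-smi event          # stream GPU events such as resets and throttling as they occur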

Control & Management

Beyond observation, AMD SMI allows operators to control power, clock, and performance policies directly from the CLI. This is especially valuable for consistent benchmarking, workload tuning, or enforcing cluster-wide limits.

Capabilities include:

  • Power cap management: define GPU-specific wattage ceilings to balance performance and power budgets across dense racks.
  • Clock domain tuning: lock core and memory frequencies for reproducible benchmarks or stress tests.
  • Fan and thermal control: manually override cooling profiles in lab or diagnostic environments.
  • State resets: revert clocks, power limits, and thermal profiles back to defaults after tests.
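
For example, a rough sketch of these controls from the CLI; the flag names and the 450 W value below are assumptions for illustration, so check amd-smi set --help and your board’s limits before applying anything to real hardware:

amd-smi set --gpu 0 --power-cap 450    # cap GPU 0 at 450 W (flag name and value are illustrative)
amd-smi reset --gpu 0 --clocks         # revert GPU 0 clocks to defaults (flag name may differ)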

In production, these controls are typically automated through orchestration agents or monitoring frameworks (like Asama Compass) that apply safe limits, ensure uniform settings, and trigger remediation when thresholds are exceeded.

Integrating with Monitoring & Automation

The AMD SMI library (libamdsmi.so) allows programmatic access to the same telemetry data, enabling:

  • Exporters for Prometheus or Asama Compass agents
  • Custom health checks (temperature drift, RAS error thresholds)
  • Periodic baseline verification across GPU nodes

For example, a monitoring agent can periodically call the SMI API to record temperature and ECC counts, raising alerts when thresholds deviate from baseline.
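
A minimal sketch of that idea using the CLI rather than the library (the --ecc flag and the grep pattern are assumptions about amd-smi’s output; a production agent would call libamdsmi or its Python bindings and parse structured output instead):

#!/bin/sh
# Hypothetical periodic health check: flag any GPU reporting uncorrectable ECC errors.
out=$(amd-smi metric --ecc 2>/dev/null)
if echo "$out" | grep -qiE 'uncorrect.*[1-9]'; then
    echo "ALERT: uncorrectable ECC errors detected on $(hostname)" >&2
    exit 1
fi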

Conclusion

The AMD System Management Interface (SMI) provides continuous health monitoring and diagnostics for AMD GPUs. At Asama.ai, we build on this kind of vendor tooling, enabling us to find and remediate issues before they impact workloads.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

GPU Health & Diagnostics: Why They Matter

When operating GPU infrastructure—whether for AI, ML, HPC, rendering, or simulations—your uptime and performance depend on knowing whether your GPUs are healthy before they fail catastrophically. Faulty memory lanes, ECC errors, power instability, or thermal issues can degrade performance or cause silent errors.

In this post and subsequent posts, we explore what NVIDIA and AMD offer and how they can be used in your environment. 

NVIDIA’s DCGM: the toolkit that provides real-time health checks, diagnostics, and alerts for NVIDIA GPU fleets. Its health and diagnostic features help you:

  • Detect latent hardware or configuration issues early
  • Automate routine validation and alerting
  • Correlate hardware-level failures with workload anomalies

In this post, I’ll walk through how DCGM enables diagnostics and health monitoring—what it offers, how it works, and what to watch out for.

What DCGM Offers

Continuous Health Monitoring

For infrastructure engineers, health monitoring is baseline hygiene.
DCGM tracks everything that matters at the silicon level:

  • Memory bandwidth & ECC checks — catch degradation early
  • Thermal drift — detect cooling failures and hotspots
  • NVLink integrity — ensure interconnect reliability
  • Power stability — monitor rails, transients, and throttling

These continuous checks are non-invasive, low-overhead, and essential for keeping a GPU cluster in steady state.
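
A hedged sketch of watching these values continuously with dcgmi dmon; the field IDs 150 and 155 are the commonly documented ones for GPU temperature and power draw, but verify them with dcgmi dmon -l on your install:

dcgmi dmon -e 150,155 -d 1000    # sample GPU temperature (150) and power usage (155) every second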

Diagnostics

Monitoring tells you something’s wrong.
Diagnostics tell you what and why.

DCGM diagnostics are invasive — they stress and validate every GPU subsystem.
Ideal for:

  • Maintenance windows
  • Burn-in testing
  • Root-cause analysis

They uncover:

  • Deployment and driver issues
  • Integration or container runtime conflicts
  • Stress-induced thermal/power anomalies
  • Hardware-level faults (PCIe, VRAM, regulators)

How Diagnostics Work: Levels & Workflows

Diagnostic Levels

DCGM supports multiple diagnostic “levels” (e.g. Level 1 through Level 4). The idea is:

  • Level 1 / 2: lightweight, fast sanity checks (good for frequent runs)
  • Level 3 / 4: deeper stress, memory, and link tests (for maintenance windows or postmortems)

You choose a level depending on how deep you want to go and how long you can afford the test to run.

Running Diagnostics via DCGMI

DCGMI is the CLI front end for DCGM. Example commands:

dcgmi diag -r 1         # run level 1 diagnostic
dcgmi diag -r 4         # run deepest diagnostic (if supported)
dcgmi health -s a       # start health monitoring on all GPUs
dcgmi health -c         # query current health status

You can also tailor diagnostics by adjusting parameters (e.g., memory thresholds, enabling or disabling specific tests).

Conclusion

NVIDIA’s DCGM is a toolkit that helps with continuous health monitoring and diagnostics of NVIDIA GPUs. We at Asama.ai are deeply integrated with NVIDIA, enabling us to find and remediate issues.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.