GPU Health & Diagnostics (Part 2): AMD GPUs with AMD SMI

In our previous post, we explored NVIDIA’s DCGM toolkit and how it enables real-time GPU health monitoring, diagnostics, and alerting in data-center environments.
In this post, we’ll look at the AMD ecosystem — specifically, how to manage and monitor AMD GPUs using the AMD System Management Interface (SMI).

What Is AMD SMI?

AMD SMI (System Management Interface) is the command-line and library toolset that gives you visibility into the operational health of AMD GPUs.
It’s part of the ROCm (Radeon Open Compute) platform and provides low-level access to telemetry, configuration, and diagnostic data.

Think of it as AMD’s closest equivalent to DCGM. With it, you can:

  • Monitor GPU health and utilization
  • Track power, temperature, and memory stats
  • Perform diagnostics and resets
  • Manage performance states (clocks, power caps)
  • Collect hardware telemetry for observability systems
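If AMD SMI is installed (it ships with recent ROCm releases), two quick commands confirm the CLI can see your GPUs. The subcommand names below are from the ROCm 6.x amd-smi tool; check amd-smi --help on your version.

  # Report the CLI and library versions
  amd-smi version

  # Enumerate detected GPUs (index, PCIe bus ID, UUID)
  amd-smi list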

What AMD SMI Offers

Basic GPU Information

The first step in understanding any GPU node is gathering hardware identity and topology details.
AMD SMI enables you to query model names, firmware versions, and interconnect topology, allowing you to know exactly what’s running in your system.

These queries return details such as:

  • Product name and SKU: confirms whether the node runs MI300X, MI250, or older Instinct series.
  • Firmware and VBIOS versions: critical for validating driver compatibility across mixed clusters.
  • PCIe configuration and NUMA locality: shows how each GPU is wired to the CPU or switch fabric, helping diagnose I/O bottlenecks and cross-NUMA latency.
  • Topology information: visualizes multi-GPU interconnects (xGMI / Infinity Fabric links) for debugging peer-to-peer bandwidth issues.

Having this baseline inventory makes it easier to detect firmware drift, inconsistent driver stacks, or nodes provisioned with mismatched GPUs.
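As a rough sketch of what this looks like in practice, the static and topology subcommands cover most of the inventory above; exact flag names vary between ROCm releases, so confirm with amd-smi static --help before scripting around them.

  # Hardware identity for GPU 0: ASIC, board, VBIOS, and driver details
  amd-smi static -g 0 --asic --board --vbios --driver

  # Interconnect topology (xGMI / Infinity Fabric links) and NUMA affinity
  amd-smi topology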

Health & Telemetry

Health telemetry is at the core of day-to-day GPU monitoring.
AMD SMI exposes real-time operational metrics that reflect the GPU’s thermal, electrical, and workload conditions.

It exposes data such as:

  • Temperature sensors: reports edge, junction, and memory temperatures. Rising averages often signal airflow or paste degradation.
  • Power draw and voltage rails: track instantaneous and average consumption versus TDP limits; useful for identifying throttling or PSU saturation.
  • Fan speed and control mode: confirm thermal regulation behavior; fans stuck at fixed RPM indicate firmware or sensor faults.
  • Performance levels: shows current clock domains and frequency-scaling behavior; helpful for detecting down-clocking under load.
  • HBM/VRAM utilization: monitors buffer pressure and helps correlate performance dips with paging or over-subscription.

Collecting these metrics periodically builds the baseline for drift detection, capacity planning, and thermal optimization across large GPU clusters.
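Here’s a hedged example of pulling those metrics from the CLI; the flags are from the ROCm 6.x tool and may differ on your version, and the --json switch makes the output easy to feed into a collector.

  # Point-in-time thermals, power, fan, and clock state for GPU 0
  amd-smi metric -g 0 --temperature --power --fan --clock

  # The same data as machine-readable JSON, suitable for an exporter or cron job
  amd-smi metric -g 0 --temperature --power --json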

Diagnostics & Events

AMD GPUs incorporate RAS (Reliability, Availability, and Serviceability) features to detect and report hardware errors before they impact workloads.
AMD SMI provides direct access to this diagnostic layer.

These outputs include:

  • ECC error counts: both correctable and uncorrectable events for VRAM or cache. Spikes often indicate memory degradation or cooling issues.
  • RAS feature status: confirms whether ECC, page retirement, and poison handling are active.
  • Event logs: record GPU resets, hangs, and driver-level recoveries with timestamps.
  • Error categories: classify faults (memory, PCIe, fabric, thermal) to help automate root-cause analysis.

Regularly parsing this data helps you detect failing GPUs early, isolate unstable hosts, and correlate hardware faults with workload or environmental changes.
It’s the diagnostic backbone for proactive maintenance and warranty tracking.
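A sketch of how you might poll this layer from the CLI (verify the subcommands against your ROCm release):

  # Correctable and uncorrectable ECC counts
  amd-smi metric --ecc

  # RAS feature state (ECC, page retirement, poison handling)
  amd-smi static --ras

  # Retired/bad memory pages recorded by the driver
  amd-smi bad-pages

  # Stream GPU events (resets, throttling, RAS) as they occur
  amd-smi event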

Control & Management

Beyond observation, AMD SMI allows operators to control power, clock, and performance policies directly from the CLI. This is especially valuable for consistent benchmarking, workload tuning, or enforcing cluster-wide limits.

Capabilities include:

  • Power cap management: define GPU-specific wattage ceilings to balance performance and power budgets across dense racks.
  • Clock domain tuning: lock core and memory frequencies for reproducible benchmarks or stress tests.
  • Fan and thermal control: manually override cooling profiles in lab or diagnostic environments.
  • State resets: revert clocks, power limits, and thermal profiles back to defaults after tests.
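For example, capping power for a benchmark run and reverting afterwards might look like the following. Setting values requires root, and the exact set/reset flags differ across ROCm versions, so treat this as a sketch and double-check amd-smi set --help before running it on production nodes.

  # Cap GPU 0 at 500 W to keep the rack inside its power budget
  sudo amd-smi set -g 0 --power-cap 500

  # Revert clock settings on GPU 0 to driver defaults after the test
  # (flag names vary; see amd-smi reset --help on your version)
  sudo amd-smi reset -g 0 --clocks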

In production, these controls are typically automated through orchestration agents or monitoring frameworks (like Asama Compass) that apply safe limits, ensure uniform settings, and trigger remediation when thresholds are exceeded.

Integrating with Monitoring & Automation

The AMD SMI library (libamdsmi.so) allows programmatic access to the same telemetry data, enabling:

  • Exporters for Prometheus or Asama Compass agents
  • Custom health checks (temperature drift, RAS error thresholds)
  • Periodic baseline verification across GPU nodes

For example, a monitoring agent can periodically call the SMI API to record temperature and ECC counts, raising alerts when readings drift from the baseline or cross a threshold.
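As a minimal sketch of that pattern, here is a shell loop that leans on the CLI’s JSON output rather than linking against libamdsmi directly; a production agent would use the C library or the Python bindings, and the uncorrectable_count field name below is an assumption, so adjust the jq path to whatever your amd-smi version emits.

  #!/usr/bin/env bash
  # Poll GPU 0 temperature and ECC counters every 60 seconds, append the raw
  # JSON to a local log, and warn if any uncorrectable ECC errors show up.
  while true; do
      ts=$(date -Is)
      payload=$(amd-smi metric -g 0 --temperature --ecc --json)
      echo "$ts $payload" >> gpu0-health.jsonl
      # "uncorrectable_count" is an assumed field name; match it to your output.
      uecc=$(echo "$payload" | jq '[.. | .uncorrectable_count? // empty] | add // 0')
      if [ "${uecc:-0}" -gt 0 ]; then
          echo "$ts WARN: ${uecc} uncorrectable ECC errors on GPU 0"
      fi
      sleep 60
  done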

Conclusion

The AMD System Management Interface (SMI) enables continuous health monitoring and diagnostics for AMD GPUs. At Asama.ai, we’re deeply integrated with NVIDIA’s tooling, which lets us find and remediate issues.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.
