GPU Health & Diagnostics (Part 2): AMD GPUs with AMD SMI

In our previous post, we explored NVIDIA’s DCGM toolkit and how it enables real-time GPU health monitoring, diagnostics, and alerts for data-center environments.
In this post, we’ll look at the AMD ecosystem — specifically, how to manage and monitor AMD GPUs using the AMD System Management Interface (SMI).

What Is AMD SMI

AMD SMI (System Management Interface) is the command-line and library toolset that gives you visibility into the operational health of AMD GPUs.
It’s part of the ROCm (Radeon Open Compute) platform and provides low-level access to telemetry, configuration, and diagnostic data.

Think of it as AMD’s closest equivalent to DCGM. With it, you can:

  • Monitor GPU health and utilization
  • Track power, temperature, and memory stats
  • Perform diagnostics and resets
  • Manage performance states (clocks, power caps)
  • Collect hardware telemetry for observability systems

What AMD SMI Offers: 

Basic GPU Information

The first step in understanding any GPU node is gathering hardware identity and topology details.
AMD SMI lets you query model names, firmware versions, and interconnect topology, so you know exactly what’s running in your system.

It provides data such as:

  • Product name and SKU: confirms whether the node runs MI300X, MI250, or older Instinct series.
  • Firmware and VBIOS versions: critical for validating driver compatibility across mixed clusters.
  • PCIe configuration and NUMA locality: shows how each GPU is wired to the CPU or switch fabric, helping diagnose I/O bottlenecks and cross-NUMA latency.
  • Topology information: visualizes multi-GPU interconnects (xGMI or Infinity Fabric links) for debugging peer-to-peer bandwidth issues.

Having this baseline inventory makes it easier to detect firmware drift, inconsistent driver stacks, or nodes provisioned with mismatched GPUs.
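
As a rough sketch of gathering this inventory from a script, you can shell out to the amd-smi CLI and parse its JSON output. This assumes your ROCm release’s amd-smi supports the static subcommand and a --json flag; subcommand names and output layout vary between versions.

import json
import subprocess

def amd_gpu_inventory():
    # Static inventory: product name, VBIOS/firmware versions, PCIe/bus info.
    # Assumes `amd-smi static --json` is supported by your ROCm release.
    result = subprocess.run(
        ["amd-smi", "static", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

# Dump the parsed inventory; the exact JSON layout differs between amd-smi
# versions, so inspect it on your own nodes before building tooling on top.
print(json.dumps(amd_gpu_inventory(), indent=2))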

Health & Telemetry

Health telemetry is at the core of day-to-day GPU monitoring.
AMD SMI exposes real-time operational metrics that reflect the GPU’s thermal, electrical, and workload conditions.

It exposes data such as:

  • Temperature sensors: reports edge, junction, and memory temperatures. Rising averages often signal airflow or paste degradation.
  • Power draw and voltage rails: track instantaneous and average consumption versus TDP limits; useful for identifying throttling or PSU saturation.
  • Fan speed and control mode: confirm thermal regulation behavior; fans stuck at fixed RPM indicate firmware or sensor faults.
  • Performance levels: shows current clock domains and frequency scaling behavior; helpful for detecting down-clocking under load.
  • HBM/VRAM utilization: monitors buffer pressure and helps correlate performance dips with paging or over-subscription.

Collecting these metrics periodically builds the baseline for drift detection, capacity planning, and thermal optimization across large GPU clusters.
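
A hedged sketch of that periodic collection, again wrapping the CLI; the metric subcommand, -g selector, and --json flag are assumptions drawn from current amd-smi documentation and may differ on your ROCm release.

import json
import subprocess
import time

def sample_metrics(gpu_index=0):
    # One telemetry snapshot for a single GPU: temperatures, power, clocks,
    # fan, and VRAM usage. Flag names are assumptions; adjust for your
    # installed amd-smi version.
    result = subprocess.run(
        ["amd-smi", "metric", "-g", str(gpu_index), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

# Append a timestamped sample every 30 seconds; feed the resulting file into
# whatever baseline/drift analysis you already run.
while True:
    sample = {"ts": time.time(), "metrics": sample_metrics(0)}
    with open("gpu0_telemetry.jsonl", "a") as f:
        f.write(json.dumps(sample) + "\n")
    time.sleep(30)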

Diagnostics & Events

AMD GPUs incorporate RAS (Reliability, Availability, and Serviceability) features to detect and report hardware errors before they impact workloads.
AMD SMI provides direct access to this diagnostic layer.

These outputs include:

  • ECC error counts: both correctable and uncorrectable events for VRAM or cache. Spikes often indicate memory degradation or cooling issues.
  • RAS feature status: confirms whether ECC, page retirement, and poison handling are active.
  • Event logs: record GPU resets, hangs, and driver-level recoveries with timestamps.
  • Error categories: classify faults (memory, PCIe, fabric, thermal) to help automate root-cause analysis.

Regularly parsing this data helps you detect failing GPUs early, isolate unstable hosts, and correlate hardware faults with workload or environmental changes.
It’s the diagnostic backbone for proactive maintenance and warranty tracking.
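
The amdsmi Python bindings that ship with ROCm expose these counters directly. The sketch below assumes the amdsmi_get_gpu_total_ecc_count call documented for recent releases; confirm the exact name and return shape against the version installed on your nodes.

from amdsmi import (
    amdsmi_init,
    amdsmi_shut_down,
    amdsmi_get_processor_handles,
    amdsmi_get_gpu_total_ecc_count,
)

# Print aggregate correctable/uncorrectable ECC counts per GPU. A steadily
# rising correctable count is an early warning sign; any uncorrectable count
# deserves immediate attention.
amdsmi_init()
try:
    for idx, handle in enumerate(amdsmi_get_processor_handles()):
        print(f"GPU {idx}:", amdsmi_get_gpu_total_ecc_count(handle))
finally:
    amdsmi_shut_down()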

Control & Management

Beyond observation, AMD SMI allows operators to control power, clock, and performance policies directly from the CLI. This is especially valuable for consistent benchmarking, workload tuning, or enforcing cluster-wide limits.

Capabilities include:

  • Power cap management: define GPU-specific wattage ceilings to balance performance and power budgets across dense racks.
  • Clock domain tuning: lock core and memory frequencies for reproducible benchmarks or stress tests.
  • Fan and thermal control: manually override cooling profiles in lab or diagnostic environments.
  • State resets: revert clocks, power limits, and thermal profiles back to defaults after tests.

In production, these controls are typically automated through orchestration agents or monitoring frameworks (like Asama Compass) that apply safe limits, ensure uniform settings, and trigger remediation when thresholds are exceeded.
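
For a lab session or manual tuning run, a hedged sketch of applying a power cap from a script might look like this. The set subcommand and --power-cap flag are assumptions based on current amd-smi documentation; verify flag names and units with amd-smi set --help on your nodes, and note that these operations typically require root.

import subprocess

def set_power_cap(gpu_index: int, watts: int) -> None:
    # Apply a per-GPU power ceiling. The `set` subcommand, `-g` selector, and
    # `--power-cap` flag (and its unit) are assumptions; confirm them with
    # `amd-smi set --help` for your ROCm release. Typically requires root.
    subprocess.run(
        ["amd-smi", "set", "-g", str(gpu_index), "--power-cap", str(watts)],
        check=True,
    )

set_power_cap(0, 450)   # e.g. cap GPU 0 at 450 W for a benchmarking session

State resets follow the same pattern through amd-smi reset; check amd-smi reset --help for which domains (clocks, power, fans) your release supports.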

Integrating with Monitoring & Automation

The AMD SMI library (libamdsmi.so) allows programmatic access to the same telemetry data, enabling:

  • Exporters for Prometheus or Asama Compass agents
  • Custom health checks (temperature drift, RAS error thresholds)
  • Periodic baseline verification across GPU nodes

For example, a monitoring agent can periodically call the SMI API to record temperature and ECC counts, raising alerts when readings drift from the baseline or cross hard thresholds.
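
Here’s a minimal sketch of such an agent using the amdsmi Python bindings (which wrap libamdsmi.so). The function and enum names follow recent ROCm documentation but are best treated as assumptions; verify them, and the units returned, against the bindings installed on your nodes. The thresholds are purely illustrative.

import time
from amdsmi import (
    amdsmi_init,
    amdsmi_shut_down,
    amdsmi_get_processor_handles,
    amdsmi_get_temp_metric,
    amdsmi_get_gpu_total_ecc_count,
    AmdSmiTemperatureType,
    AmdSmiTemperatureMetric,
)

TEMP_ALERT_C = 95        # illustrative edge-temperature threshold
POLL_INTERVAL_S = 60

def poll_once():
    for idx, handle in enumerate(amdsmi_get_processor_handles()):
        # Edge temperature; junction/HBM sensors follow the same call pattern.
        temp = amdsmi_get_temp_metric(
            handle, AmdSmiTemperatureType.EDGE, AmdSmiTemperatureMetric.CURRENT
        )
        ecc = amdsmi_get_gpu_total_ecc_count(handle)
        print(f"GPU {idx}: temp={temp} ecc={ecc}")
        if temp >= TEMP_ALERT_C:
            print(f"ALERT: GPU {idx} is running hot ({temp} C)")

amdsmi_init()
try:
    while True:
        poll_once()
        time.sleep(POLL_INTERVAL_S)
finally:
    amdsmi_shut_down()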

Conclusion

AMD System Management Interface (SMI) provides continuous health monitoring and diagnostics for AMD GPUs. We at Asama.ai integrate deeply with vendor tooling like this, enabling us to find and remediate issues.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

Clustering for Infrastructure Observability

Imagine walking through a large botanical garden filled with thousands of plants.
Some grow close together in thick patches, others are scattered in smaller clusters, and a few stand alone in remote corners.

If you were asked to group them, you’d probably do it by how similar they look: color, leaf shape, height, or the type of soil they prefer.
You’d quickly notice that some groups are dense and obvious, while others are sparse and loosely related.
A few plants wouldn’t fit anywhere at all.

That’s essentially what clustering does: it identifies natural groupings within a large, unlabeled space.
It doesn’t need prior knowledge or fixed categories. Instead, it observes how things naturally relate to each other and organizes them accordingly.

Some groups are strong and well-defined, others are weaker or short-lived, and some points don’t belong anywhere — they’re outliers.
The goal is simple: find structure inside apparent randomness.

What Is Clustering

Clustering is the process of automatically grouping similar data points.

For engineers, here is what it means:

  • Anomaly detection: grouping normal vs abnormal signals
  • Workload segmentation: clustering GPUs, jobs, or nodes by behavior
  • Log and metric deduplication: grouping repeating fault patterns
  • Embedding analysis: grouping semantically similar vectors

In short, clustering helps convert noisy telemetry into structured, actionable insight.

How Clustering Works

At its core, every clustering method follows three basic steps:

  1. Measure similarity: Define how close two data points are using a distance metric (e.g., Euclidean, cosine).
  2. Group related points: Combine points that are close together into clusters.
  3. Evaluate stability: Check how cohesive and distinct each cluster is, and merge or split clusters as needed.

Ultimately, it’s about finding structure that helps you reason about your data.
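
A tiny sketch of step 1, the similarity measure, since everything downstream depends on it; note how Euclidean and cosine distance can group the same points quite differently.

import numpy as np
from sklearn.metrics import pairwise_distances

# Three GPUs described by two features, e.g. (average utilization %, average power W).
x = np.array([[90.0, 600.0], [45.0, 300.0], [92.0, 610.0]])

print(pairwise_distances(x, metric="euclidean"))  # absolute magnitude differences
print(pairwise_distances(x, metric="cosine"))     # shape/direction of the profile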

Two Broad Types of Clustering

There are many algorithms, but most fall into two broad categories:

1. Flat Clustering

Flat algorithms create a single partition of the data — every point belongs to exactly one cluster.
They’re simple and efficient but can struggle when cluster densities vary or when data has complex shapes.
Examples: k-means, k-medoids.
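
A minimal flat-clustering sketch with scikit-learn’s KMeans, using synthetic points in place of real node metrics.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic behavioural groups, e.g. "busy" vs "idle" nodes,
# described by (utilization %, power W).
data = np.vstack([
    rng.normal((90, 550), 5, size=(40, 2)),
    rng.normal((10, 120), 5, size=(40, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)   # one centroid per behavioural group
print(km.labels_[:10])       # which cluster each point was assigned to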

2. Hierarchical Clustering

Hierarchical algorithms build a tree (dendrogram) of clusters, capturing how groups form, merge, or split at different similarity levels.
This approach helps reveal structure at multiple scales — from fine-grained subgroups to broader categories.
Examples: BIRCH, HDBSCAN.
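
And a density-based, hierarchical counterpart using the hdbscan library (pip install hdbscan); unlike k-means, it picks the number of clusters itself and labels points that fit nowhere as -1 (noise).

import numpy as np
import hdbscan

rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal((0, 0), 0.3, size=(60, 2)),
    rng.normal((4, 4), 0.8, size=(60, 2)),   # a looser, lower-density group
    [[12.0, -6.0]],                          # isolated point -> noise
])

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(data)
print(sorted(set(labels)))   # e.g. [-1, 0, 1]: noise plus two clusters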

Evaluating Clustering Quality

Because clustering is unsupervised, there’s no single “correct” answer.
Common ways to assess cluster quality include:

  • Silhouette score: measures how well points fit within their own cluster versus others.
  • Davies–Bouldin index: compares internal cohesion to external separation.
  • Stability checks: ensure clusters persist across different samples.
  • Visualization: project high-dimensional data into 2D (UMAP or t-SNE) to inspect structure and overlap.
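
A short sketch of two of these scores on synthetic data; both come straight from scikit-learn (for the visual check, UMAP is available via the umap-learn package and t-SNE via sklearn.manifold.TSNE).

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data with a known grouping, just to exercise the metrics.
data, _ = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(data)

print("silhouette:", silhouette_score(data, labels))          # higher is better
print("davies-bouldin:", davies_bouldin_score(data, labels))  # lower is better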

How Asama AI Uses Clustering

At Asama, we constantly observe a diverse range of devices to ensure optimal performance and proactively identify potential issues. The sheer volume and complexity of our infrastructure necessitate sophisticated tools for anomaly detection. This is where clustering proves invaluable.

We apply a range of clustering techniques over operational metrics, hardware specs, and other parameters. By grouping similar devices and their activities, we can effectively identify outliers that deviate significantly from their respective clusters. These deviations often indicate anomalies that could signal performance degradation, security breaches, or other operational issues.

The primary goal of this clustering and anomaly-detection process is to enable logical, efficient remediation. Once an anomaly is detected, our teams can quickly pinpoint the affected devices, understand the nature of the deviation, and take targeted actions to resolve the issue before it escalates.

References

  1. https://www.youtube.com/watch?v=7xHsRkOdVwo
  2. https://www.youtube.com/watch?v=4AW_5nYQkuc
  3. https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html

Summary

Clustering is the foundation of unsupervised learning, enabling us to discover natural order within complexity.
It helps systems, researchers, and engineers uncover structure where none was labelled before.

GPU Health & Diagnostics: Why They Matter

When operating GPU infrastructure—whether for AI, ML, HPC, rendering, or simulations—your uptime and performance depend on knowing whether your GPUs are healthy before they fail catastrophically. Faulty memory lanes, ECC errors, power instability, or thermal issues can degrade performance or cause silent errors.

In this post and subsequent posts, we explore what NVIDIA and AMD offer and how they can be used in your environment. 

NVIDIA’s DCGM (Data Center GPU Manager) is the toolkit that provides real-time health checks, diagnostics, and alerts for NVIDIA GPU fleets. Its health and diagnostic features help you:

  • Detect latent hardware or configuration issues early
  • Automate routine validation and alerting
  • Correlate hardware-level failures with workload anomalies

In this post, I’ll walk through how DCGM enables diagnostics and health monitoring—what it offers, how it works, and what to watch out for.

What DCGM Offers: 

Continuous Health Monitoring

For infrastructure engineers, health monitoring is baseline hygiene.
DCGM tracks everything that matters at the silicon level:

  • Memory bandwidth & ECC checks — catch degradation early
  • Thermal drift — detect cooling failures and hotspots
  • NVLink integrity — ensure interconnect reliability
  • Power stability — monitor rails, transients, and throttling

These continuous checks are non-invasive, low-overhead, and essential for keeping a GPU cluster in steady state.
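
If you want the same signals from a script rather than a dashboard, a minimal sketch is to poll dcgmi dmon. Field IDs 150 and 155 are the commonly documented GPU-temperature and power-usage fields, but confirm them (and the flags) with dcgmi dmon -l and dcgmi dmon --help on your installation.

import subprocess

def sample_dcgm_fields(field_ids="150,155", samples=5, delay_ms=1000):
    # Collect a handful of dmon samples (temperature + power here) and return
    # the raw table; parse or ship it to your metrics pipeline as needed.
    result = subprocess.run(
        ["dcgmi", "dmon", "-e", field_ids, "-c", str(samples), "-d", str(delay_ms)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(sample_dcgm_fields())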

Diagnostics

Monitoring tells you something’s wrong.
Diagnostics tell you what and why.

DCGM diagnostics are invasive — they stress and validate every GPU subsystem.
Ideal for:

  • Maintenance windows
  • Burn-in testing
  • Root-cause analysis

They uncover:

  • Deployment and driver issues
  • Integration or container runtime conflicts
  • Stress-induced thermal/power anomalies
  • Hardware-level faults (PCIe, VRAM, regulators)

How Diagnostics Work: Levels & Workflows

Diagnostic Levels

DCGM supports multiple diagnostic “levels” (e.g. Level 1 through Level 4). The idea is:

  • Level 1 / 2: lightweight, fast sanity checks (good for frequent runs)
  • Level 3 / 4: deeper stress, memory, and link tests (for maintenance windows or postmortems)

You choose a level depending on how deep you want to go and how long you can afford the test to run.

Running Diagnostics via DCGMI

DCGMI is the CLI front end for DCGM. Example commands:

dcgmi diag -r 1         # run level 1 diagnostic
dcgmi diag -r 4         # run deepest diagnostic (if supported)
dcgmi health -s a       # start health monitoring on all GPUs
dcgmi health -c         # query current health status

You can also tailor diagnostics by adjusting parameters (e.g., memory thresholds, enabling or disabling specific tests).

Conclusion

NVIDIA’s DCGM is a toolkit for continuous health monitoring and diagnostics of NVIDIA GPUs. We at Asama.ai integrate deeply with NVIDIA’s tooling, enabling us to find and remediate issues.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

Additional PCI Device Metrics in Prometheus Node Exporter

In continuation of our previous post, we are bringing additional features to the pcidevice collector in node_exporter, through PR #3425.

This enhancement builds on three key themes:

  • Extended PCI Metrics (powered by PR #748)
  • Translating Numeric IDs into Human-Friendly Names
  • Improved Stability via Nil-Pointer Checks

What’s New

1. Extended PCI Metrics

The pcidevice collector is now enriched with several new fields:

  • NUMA node – identifies which NUMA node the device is attached to.
  • SR-IOV details – reports the number of virtual functions, total VFs, etc.
  • Driver autoprobe flag – tracks whether driver probing is enabled.
  • Power state & D3Cold – exposes device power state and low-power capability.

Previously, collecting this data required a custom textfile collector. With this update, these attributes become first-class citizens in Node Exporter.
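
For context on where these values come from, the sketch below reads the same attributes straight from sysfs for a single device. The device address is just an example (pick one from lspci -D), and individual files may be missing depending on hardware, driver, and kernel version, which is exactly why the collector has to read defensively.

import os

# Example PCI address; pick a real one from `lspci -D` on your host.
DEV = "/sys/bus/pci/devices/0000:00:02.0"

# The same attributes the enriched collector exposes. Any of these files may
# be absent on a given kernel, driver, or device, so read defensively.
for attr in ("numa_node", "sriov_numvfs", "sriov_totalvfs",
             "sriov_drivers_autoprobe", "power_state", "d3cold_allowed"):
    path = os.path.join(DEV, attr)
    if os.path.exists(path):
        with open(path) as f:
            print(attr, "=", f.read().strip())
    else:
        print(attr, "not exposed for this device")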

2. ID → Name Conversion

A highly ergonomic addition: numeric PCI IDs (vendor, device, class) can now be optionally mapped to human-readable names.

Example:

  • Before → {vendor_id=0x8086}
  • After → {vendor_id=0x8086, vendor_name="Intel Corporation"}

This mapping relies on the system’s pci.ids file (or a user-specified alternative). It’s disabled by default to minimize overhead, but for dashboards, debug logs, and alerts, the readability boost is huge.
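
Conceptually, the lookup amounts to parsing the vendor lines of a pci.ids file. The sketch below handles only top-level vendor entries and assumes a common path such as /usr/share/hwdata/pci.ids (distros differ; /usr/share/misc/pci.ids is another frequent location).

def load_vendor_names(path="/usr/share/hwdata/pci.ids"):
    # Vendor lines in pci.ids look like "8086  Intel Corporation": a 4-digit
    # hex ID, two spaces, then the name. Device and subsystem lines are
    # tab-indented, comments start with '#', and class entries start with 'C '.
    vendors = {}
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith(("#", "\t", "C ")) or not line.strip():
                continue
            vendor_id, _, name = line.rstrip("\n").partition("  ")
            if len(vendor_id) == 4:
                vendors[vendor_id.lower()] = name.strip()
    return vendors

vendors = load_vendor_names()
print(vendors.get("8086"))   # -> "Intel Corporation"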

3. Nil-Pointer Checks

Alongside the new features, PR #3425 adds nil-pointer checks across optional fields in the sysfs.PciDevice struct.

Why this is important:

  • Not all sysfs entries exist consistently across kernels, drivers, or hardware types.
  • Without these checks, the collector could panic when trying to read missing fields.
  • With them, the collector gracefully skips unavailable data, keeping metrics flowing reliably.

Why This Matters (Especially for Infra Teams)

Better Observability & Context

These enhancements unlock deeper insights into machine hardware, particularly in high-performance, virtualized, or containerized environments:

  • Detect and diagnose NUMA locality issues that impact performance.
  • Monitor power states and wake-up events for PCI devices.
  • Validate and observe SR-IOV setups, critical for NICs and accelerators.
  • Build cleaner dashboards with vendor and device names instead of cryptic hex codes.
  • Rely on robust collectors that won’t break due to missing sysfs entries.

Opt-In by Design

The optional nature of ID-to-name conversion is deliberate: users can enable richer context where needed, without forcing additional dependencies on minimal setups.

Conclusion

At Asama.ai, we believe in strengthening the open-source ecosystem we rely on. Contributing improvements like these ensures broader community benefit and reduces the need to carry private forks.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

Cheers,

Jain

jj@asama.ai

SR-IOV, Power State, and NUMA node info in prometheus/procfs

At Asama.ai, we’ve always been committed to giving back to the open-source projects we rely on. Our contributions ensure that others can benefit from the solutions we’ve already implemented.
This not only supports the community, but also reduces the need for us to constantly rebase and maintain forks.

Our latest contribution improves visibility into PCI devices in the prometheus/procfs project. These additions aim to enhance observability for tools that rely on procfs. We also plan to integrate these enhancements into the node_exporter pcidevice collector.

What Changed: New Fields & Features

Pull Request #748 extends the PciDevice struct with several new fields from sysfs:

  • NUMA node – maps which node the device is attached to
  • SR-IOV support info – number of virtual functions, total VFs, etc.
  • Driver autoprobe flag
  • Offset / Stride values
  • Power state and whether D3Cold is allowed

Before this PR, you needed to write a custom textfile collector with node_exporter to collect these metrics. Once merged, these fields become first-class attributes in the library and can be integrated directly into node_exporter.

What’s Next

Stay tuned — we’re actively working on enhancing the node_exporter pcidevice collector to expose this information as metrics, which we will also contribute back to the community.

And yes — we’re hiring developers who love open source.

Cheers,

Jain

jj@asama.ai