GPU Health & Diagnostics (Part 2): AMD GPUs with AMD SMI

In our previous post, we explored NVIDIA’s DCGM toolkit and how it enables real-time GPU health monitoring, diagnostics, and alerts for data-center environments.
In this post, we’ll look at the AMD ecosystem — specifically, how to manage and monitor AMD GPUs using the AMD System Management Interface (SMI).

What Is AMD SMI

AMD SMI (System Management Interface) is the command-line and library toolset that gives you visibility into the operational health of AMD GPUs.
It’s part of the ROCm (Radeon Open Compute) platform and provides low-level access to telemetry, configuration, and diagnostic data.

Think of it as AMD’s closest equivalent to DCGM. With it, you can:

  • Monitor GPU health and utilization
  • Track power, temperature, and memory stats
  • Perform diagnostics and resets
  • Manage performance states (clocks, power caps)
  • Collect hardware telemetry for observability systems

What AMD SMI Offers: 

Basic GPU Information

The first step in understanding any GPU node is gathering hardware identity and topology details.
AMD SMI enables you to query model names, firmware versions, and interconnect topology, allowing you to know exactly what’s running in your system.

It provides data such as:

  • Product name and SKU: confirms whether the node runs MI300X, MI250, or older Instinct series.
  • Firmware and VBIOS versions: critical for validating driver compatibility across mixed clusters.
  • PCIe configuration and NUMA locality: shows how each GPU is wired to the CPU or switch fabric, helping diagnose I/O bottlenecks and cross-NUMA latency.
  • Topology information: visualizes multi-GPU interconnects (xGMI or Infinity Fabric links) for debugging peer-to-peer bandwidth issues.

Having this baseline inventory makes it easier to detect firmware drift, inconsistent driver stacks, or nodes provisioned with mismatched GPUs.
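
As a quick illustration of how this inventory can be pulled programmatically, the snippet below uses the amdsmi Python bindings that ship with ROCm (the wrapper around libamdsmi.so). The function names follow the upstream examples but can differ between ROCm versions, so treat this as a sketch rather than a reference:

from amdsmi import (
    amdsmi_init, amdsmi_shut_down, amdsmi_get_processor_handles,
    amdsmi_get_gpu_asic_info, amdsmi_get_gpu_vbios_info, AmdSmiException,
)

# Initialise the library before any query and always shut it down afterwards.
amdsmi_init()
try:
    for idx, gpu in enumerate(amdsmi_get_processor_handles()):
        try:
            # Both calls return dictionaries; print them as-is so you can see
            # the exact fields (market name, device ID, VBIOS version, ...)
            # exposed by your ROCm release.
            print(f"GPU {idx} ASIC info:  {amdsmi_get_gpu_asic_info(gpu)}")
            print(f"GPU {idx} VBIOS info: {amdsmi_get_gpu_vbios_info(gpu)}")
        except AmdSmiException as err:
            print(f"GPU {idx}: query failed: {err}")
finally:
    amdsmi_shut_down()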

Health & Telemetry

Health telemetry is at the core of day-to-day GPU monitoring.
AMD SMI exposes real-time operational metrics that reflect the GPU’s thermal, electrical, and workload conditions.

It exposes data such as:

  • Temperature sensors: reports edge, junction, and memory temperatures. Rising averages often signal airflow or paste degradation.
  • Power draw and voltage rails: track instantaneous and average consumption versus TDP limits; useful for identifying throttling or PSU saturation.
  • Fan speed and control mode: confirm thermal regulation behavior; fans stuck at fixed RPM indicate firmware or sensor faults.
  • Performance levels: shows current clock domains and frequency scaling behavior; helpful for detecting down-clocking under load.
  • HBM/VRAM utilization: monitors buffer pressure and helps correlate performance dips with paging or over-subscription.

Collecting these metrics periodically builds the baseline for drift detection, capacity planning, and thermal optimization across large GPU clusters.
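
A minimal one-shot read of these values could look like the sketch below, again with the ROCm amdsmi Python bindings; the enum and function names are taken from the upstream examples and may vary by version, so verify them against your installation:

from amdsmi import (
    amdsmi_init, amdsmi_shut_down, amdsmi_get_processor_handles,
    amdsmi_get_power_info, amdsmi_get_temp_metric,
    AmdSmiTemperatureType, AmdSmiTemperatureMetric, AmdSmiException,
)

amdsmi_init()
try:
    for idx, gpu in enumerate(amdsmi_get_processor_handles()):
        try:
            # Edge and junction (hotspot) temperatures, in degrees Celsius.
            edge = amdsmi_get_temp_metric(
                gpu, AmdSmiTemperatureType.EDGE, AmdSmiTemperatureMetric.CURRENT)
            junction = amdsmi_get_temp_metric(
                gpu, AmdSmiTemperatureType.JUNCTION, AmdSmiTemperatureMetric.CURRENT)
            # Power info is returned as a dictionary (socket power, caps, ...).
            power = amdsmi_get_power_info(gpu)
            print(f"gpu={idx} edge_c={edge} junction_c={junction} power={power}")
        except AmdSmiException as err:
            print(f"gpu={idx} telemetry read failed: {err}")
finally:
    amdsmi_shut_down()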

Diagnostics & Events

AMD GPUs incorporate RAS (Reliability, Availability, and Serviceability) features to detect and report hardware errors before they impact workloads.
AMD SMI provides direct access to this diagnostic layer.

These outputs include:

  • ECC error counts: both correctable and uncorrectable events for VRAM or cache. Spikes often indicate memory degradation or cooling issues.
  • RAS feature status: confirms whether ECC, page retirement, and poison handling are active.
  • Event logs: record GPU resets, hangs, and driver-level recoveries with timestamps.
  • Error categories: classify faults (memory, PCIe, fabric, thermal) to help automate root-cause analysis.

Regularly parsing this data helps you detect failing GPUs early, isolate unstable hosts, and correlate hardware faults with workload or environmental changes.
It’s the diagnostic backbone for proactive maintenance and warranty tracking.
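
As an illustration, the sketch below reads the aggregate ECC counters through the same amdsmi Python bindings; the function and dictionary field names follow the upstream examples and may differ in your ROCm version:

from amdsmi import (
    amdsmi_init, amdsmi_shut_down, amdsmi_get_processor_handles,
    amdsmi_get_gpu_total_ecc_count, AmdSmiException,
)

amdsmi_init()
try:
    for idx, gpu in enumerate(amdsmi_get_processor_handles()):
        try:
            # Aggregate correctable/uncorrectable ECC counters, returned as a
            # dictionary. Print it raw so the exact keys for your release can
            # be mapped into alert rules.
            ecc = amdsmi_get_gpu_total_ecc_count(gpu)
            print(f"gpu={idx} ecc={ecc}")
            # Simple policy sketch: any uncorrectable error is alert-worthy.
            # The key name used here is an assumption; check your output.
            if ecc.get("uncorrectable_count", 0) > 0:
                print(f"gpu={idx}: uncorrectable ECC errors, flag for maintenance")
        except AmdSmiException as err:
            print(f"gpu={idx} ECC query failed: {err}")
finally:
    amdsmi_shut_down()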

Control & Management

Beyond observation, AMD SMI allows operators to control power, clock, and performance policies directly from the CLI. This is especially valuable for consistent benchmarking, workload tuning, or enforcing cluster-wide limits.

Capabilities include:

  • Power cap management:  define GPU-specific wattage ceilings to balance performance and power budgets across dense racks.
  • Clock domain tuning: lock core and memory frequencies for reproducible benchmarks or stress tests.
  • Fan and thermal control: manually override cooling profiles in lab or diagnostic environments.
  • State resets: revert clocks, power limits, and thermal profiles back to defaults after tests.

In production, these controls are typically automated through orchestration agents or monitoring frameworks (like Asama Compass) that apply safe limits, ensure uniform settings, and trigger remediation when thresholds are exceeded.
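
A bare-bones version of such an agent could simply shell out to the CLI, as in the hypothetical sketch below. The amd-smi set/reset flag names and units vary across ROCm releases, so double-check them against the CLI help before using anything like this:

import subprocess

# Hypothetical fleet policy: cap every GPU on this host at the same wattage.
POWER_CAP_WATTS = 450   # example value only; choose limits that fit your hardware
GPUS_PER_NODE = 8       # assumed node shape, for illustration

def set_power_cap(gpu_index: int, watts: int) -> None:
    # Flag names are an assumption based on current amd-smi documentation;
    # confirm them (and whether the value is W or uW) for your ROCm version.
    subprocess.run(
        ["amd-smi", "set", "--gpu", str(gpu_index), "--power-cap", str(watts)],
        check=True)

def reset_gpu_state(gpu_index: int) -> None:
    # Revert clock settings after a benchmark run (again, verify the flags).
    subprocess.run(
        ["amd-smi", "reset", "--gpu", str(gpu_index), "--clocks"],
        check=True)

if __name__ == "__main__":
    for idx in range(GPUS_PER_NODE):
        set_power_cap(idx, POWER_CAP_WATTS)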

Integrating with Monitoring & Automation

The AMD SMI library (libamdsmi.so) allows programmatic access to the same telemetry data, enabling:

  • Exporters for Prometheus or Asama Compass agents
  • Custom health checks (temperature drift, RAS error thresholds)
  • Periodic baseline verification across GPU nodes

For example, a monitoring agent can periodically call the SMI API to record temperature and ECC counts, raising alerts when thresholds deviate from baseline.
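
A stripped-down version of that agent might look like the sketch below, built on the amdsmi Python bindings. The thresholds, the raise_alert hook, and the ECC field names are all placeholders to swap for your own, and the binding names may vary by ROCm version:

import time
from amdsmi import (
    amdsmi_init, amdsmi_shut_down, amdsmi_get_processor_handles,
    amdsmi_get_temp_metric, amdsmi_get_gpu_total_ecc_count,
    AmdSmiTemperatureType, AmdSmiTemperatureMetric, AmdSmiException,
)

# Hypothetical baselines; in practice these come from historical fleet data.
MAX_JUNCTION_TEMP_C = 95
MAX_CORRECTABLE_ECC = 100
POLL_SECONDS = 60

def raise_alert(message: str) -> None:
    # Placeholder: push to Alertmanager, a webhook, a ticketing system, etc.
    print(f"ALERT: {message}")

def check_fleet() -> None:
    for idx, gpu in enumerate(amdsmi_get_processor_handles()):
        try:
            temp = amdsmi_get_temp_metric(
                gpu, AmdSmiTemperatureType.JUNCTION, AmdSmiTemperatureMetric.CURRENT)
            ecc = amdsmi_get_gpu_total_ecc_count(gpu)
        except AmdSmiException as err:
            raise_alert(f"gpu={idx} telemetry unavailable: {err}")
            continue
        if temp > MAX_JUNCTION_TEMP_C:
            raise_alert(f"gpu={idx} junction temperature {temp}C above baseline")
        # ECC field names are an assumption; adjust to your ROCm version's output.
        if ecc.get("uncorrectable_count", 0) > 0:
            raise_alert(f"gpu={idx} uncorrectable ECC errors present")
        elif ecc.get("correctable_count", 0) > MAX_CORRECTABLE_ECC:
            raise_alert(f"gpu={idx} correctable ECC count drifting above baseline")

if __name__ == "__main__":
    amdsmi_init()
    try:
        while True:
            check_fleet()
            time.sleep(POLL_SECONDS)
    finally:
        amdsmi_shut_down()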

Conclusion

The AMD System Management Interface (SMI) enables continuous health monitoring and diagnostics of AMD GPUs. At Asama.ai, we integrate deeply with this tooling to find issues and remediate them.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

Clustering for Infrastructure Observability

Imagine walking through a large botanical garden filled with thousands of plants.
Some grow close together in thick patches, others are scattered in smaller clusters, and a few stand alone in remote corners.

If you were asked to group them, you’d probably do it by how similar they look: color, leaf shape, height, or the type of soil they prefer.
You’d quickly notice that some groups are dense and obvious, while others are sparse and loosely related.
A few plants wouldn’t fit anywhere at all.

That’s essentially what clustering does: it identifies natural groupings within a large, unlabeled space.
It doesn’t need prior knowledge or fixed categories. Instead, it observes how things naturally relate to each other and organises them accordingly.

Some groups are strong and well-defined, others are weaker or short-lived, and some points don’t belong anywhere — they’re outliers.
The goal is simple: find structure inside apparent randomness.

What Is Clustering

Clustering is the process of automatically grouping similar data points.

For engineers, here is what that looks like in practice:

  • Anomaly detection: grouping normal vs abnormal signals
  • Workload segmentation: clustering GPUs, jobs, or nodes by behaviour
  • Log and metric deduplication: grouping repeating fault patterns
  • Embedding analysis: grouping semantically similar vectors

In short, clustering helps convert noisy telemetry into structured, actionable insight.

How Clustering Works

At its core, every clustering method follows three steps:

  1. Measure similarity: Define how close two data points are using a distance metric (e.g., Euclidean, cosine).
  2. Group related points: Combine points that are close together into clusters.
  3. Evaluate stability: Check how cohesive and distinct each cluster is, and merge or split as needed.

It’s about finding structure that helps you reason about your data.
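
To make those steps concrete, here is a minimal sketch using scikit-learn’s k-means on a toy dataset; the feature values are invented purely for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy telemetry: each row is a GPU, columns are (avg load %, avg temp C).
X = np.array([
    [92, 88], [95, 90], [90, 86],   # busy, hot group
    [15, 55], [12, 52], [18, 58],   # mostly idle group
    [50, 99],                       # odd one out: moderate load, very hot
])

# Step 1: put features on a comparable scale so Euclidean distance is meaningful.
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-3: group the points into k clusters and inspect the result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)           # cluster id assigned to each GPU
print(kmeans.cluster_centers_)  # centroids in scaled feature space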

Two Broad Types of Clustering

There are many algorithms, but most fall into two broad categories:

1. Flat Clustering

Flat algorithms create a single partition of the data — every point belongs to exactly one cluster.
They’re simple and efficient but can struggle when cluster densities vary or when data has complex shapes.
Examples: k-means, k-medoids.

2. Hierarchical Clustering

Hierarchical algorithms build a tree (dendrogram) of clusters, capturing how groups form, merge, or split at different similarity levels.
This approach helps reveal structure at multiple scales — from fine-grained subgroups to broader categories.
Examples: BIRCH, HDBSCAN.
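
For contrast with flat partitioning, the sketch below runs HDBSCAN from the hdbscan package (newer scikit-learn releases also ship an HDBSCAN estimator) on synthetic data; points that fit no cluster get the label -1, i.e. they are treated as noise or outliers:

import numpy as np
import hdbscan

# Synthetic 2-D data: two dense blobs plus a few stray points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
strays = np.array([[2.5, 2.5], [8.0, 0.0], [-3.0, 6.0]])
X = np.vstack([blob_a, blob_b, strays])

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)

print(set(labels))                       # e.g. {0, 1, -1}: two clusters plus noise
print((labels == -1).sum(), "outliers")  # stray points typically end up as noise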

Evaluating Clustering Quality

Because clustering is unsupervised, there’s no single “correct” answer.
Common ways to assess cluster quality include:

  • Silhouette score: measures how well points fit within their cluster vs others.
  • Davies–Bouldin index: compares internal cohesion to external separation.
  • Stability checks: ensure clusters persist across different samples.
  • Visualization: project high-dimensional data into 2D (UMAP or t-SNE) to inspect structure and overlap.
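
The two numeric scores are available directly in scikit-learn; here is a small sketch on synthetic data (a silhouette closer to 1 is better, a Davies–Bouldin index closer to 0 is better):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data with three well-separated groups, purely for illustration.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("silhouette:    ", silhouette_score(X, labels))
print("davies-bouldin:", davies_bouldin_score(X, labels))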

How Asama AI Uses Clustering

At Asama, we constantly observe a diverse range of devices to ensure optimal performance and proactively identify potential issues. The sheer volume and complexity of our infrastructure necessitate sophisticated tools for anomaly detection. This is where clustering proves invaluable.

We cluster devices based on operational metrics, hardware specs, and other parameters. By grouping similar devices and their activities, we can effectively identify outliers that deviate significantly from their respective clusters. These deviations often indicate anomalies that could signal performance degradation, security breaches, or other operational issues.

The primary goal of this clustering and anomaly-detection process is to enable logical, efficient remediation. Once an anomaly is detected, our teams can quickly pinpoint the affected devices, understand the nature of the deviation, and take targeted actions to resolve the issue before it escalates.

References

  1. https://www.youtube.com/watch?v=7xHsRkOdVwo
  2. https://www.youtube.com/watch?v=4AW_5nYQkuc
  3. https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html

Summary

Clustering is a foundational technique in unsupervised learning, enabling us to discover natural order within complexity.
It helps systems, researchers, and engineers uncover structure where none was labelled before.

GPU Health & Diagnostics: Why They Matter

When operating GPU infrastructure—whether for AI, ML, HPC, rendering, or simulations—your uptime and performance depend on knowing whether your GPUs are healthy before they fail catastrophically. Faulty memory lanes, ECC errors, power instability, or thermal issues can degrade performance or cause silent errors.

In this post and subsequent posts, we explore what NVIDIA and AMD offer and how they can be used in your environment. 

NVIDIA’s DCGM: the toolkit that provides real-time health checks, diagnostics, and alerts for NVIDIA GPU fleets. Its health and diagnostic features help you:

  • Detect latent hardware or configuration issues early
  • Automate routine validation and alerting
  • Correlate hardware-level failures with workload anomalies

In this post, I’ll walk through how DCGM enables diagnostics and health monitoring—what it offers, how it works, and what to watch out for.

What DCGM Offers: 

Continuous Health Monitoring

For infrastructure engineers, health monitoring is baseline hygiene.
DCGM tracks everything that matters at the silicon level:

  • Memory bandwidth & ECC checks — catch degradation early
  • Thermal drift — detect cooling failures and hotspots
  • NVLink integrity — ensure interconnect reliability
  • Power stability — monitor rails, transients, and throttling

These continuous checks are non-invasive, low-overhead, and essential for keeping a GPU cluster in steady state.

Diagnostics

Monitoring tells you something’s wrong.
Diagnostics tell you what and why.

DCGM diagnostics are invasive — they stress and validate every GPU subsystem.
Ideal for:

  • Maintenance windows
  • Burn-in testing
  • Root-cause analysis

They uncover:

  • Deployment and driver issues
  • Integration or container runtime conflicts
  • Stress-induced thermal/power anomalies
  • Hardware-level faults (PCIe, VRAM, regulators)

How Diagnostics Work: Levels & Workflows

Diagnostic Levels

DCGM supports multiple diagnostic “levels” (e.g. Level 1 through Level 4). The idea is:

  • Level 1 / 2: lightweight, fast sanity checks (good for frequent runs)
  • Level 3 / 4: deeper stress / memory / link tests (for maintenance windows or postmortems)

You choose a level depending on how deep you want to go and how long you can afford the test to run.

Running Diagnostics via DCGMI

DCGMI is the CLI front end for DCGM. Example commands:

dcgmi diag -r 1         # run level 1 diagnostic
dcgmi diag -r 4         # run deepest diagnostic (if supported)
dcgmi health -s a       # start health monitoring on all GPUs
dcgmi health -c         # query current health status

You can also tailor diagnostics by adjusting parameters (e.g., memory thresholds, enabling or disabling specific tests).
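
One way to fold these checks into automation is a thin wrapper that shells out to dcgmi, runs a quick level-1 diagnostic, and flags any failure. The sketch below parses the report very crudely (it just looks for the word “Fail” or a non-zero exit code), so adapt it to the output format of your DCGM version:

import subprocess

def dcgm_diag_passed(level: int = 1) -> bool:
    # Runs "dcgmi diag -r <level>" and returns True if no failure was reported.
    result = subprocess.run(
        ["dcgmi", "diag", "-r", str(level)],
        capture_output=True, text=True)
    report = result.stdout + result.stderr
    # Crude check: a non-zero exit code or a "Fail" entry in the report is
    # treated as a failed diagnostic. Tighten this for production use.
    return result.returncode == 0 and "Fail" not in report

if __name__ == "__main__":
    if dcgm_diag_passed(level=1):
        print("DCGM level-1 diagnostic passed")
    else:
        # Hook point: raise an alert, cordon the node, open a ticket, etc.
        print("DCGM level-1 diagnostic reported a failure")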

Conclusion

NVIDIA’s DCGM is a toolkit that helps with continuous health monitoring and diagnostics of NVIDIA GPUs. We at Asama.ai integrate deeply with DCGM to find issues and remediate them.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

Additional PCI Device Metrics in Prometheus Node Exporter

In continuation of our previous post, we are bringing additional features to the pcidevice collector in node_exporter, through PR #3425.

This enhancement builds on three key themes:

  • Extended PCI Metrics (powered by PR #748)
  • Translating Numeric IDs into Human-Friendly Names
  • Improved Stability via Nil-Pointer Checks

What’s New

1. Extended PCI Metrics

The pcidevice collector is now enriched with several new fields:

  • NUMA node – identifies which NUMA node the device is attached to.
  • SR-IOV details – reports the number of virtual functions, total VFs, etc.
  • Driver autoprobe flag – tracks whether driver probing is enabled.
  • Power state & D3Cold – exposes device power state and low-power capability.

Previously, collecting this data required a custom textfile collector. With this update, these attributes become first-class citizens in Node Exporter.

2. ID → Name Conversion

A highly ergonomic addition: numeric PCI IDs (vendor, device, class) can now be optionally mapped to human-readable names.

Example:

  • Before → {vendor_id=0x8086}
  • After → {vendor_id=0x8086, vendor_name="Intel Corporation"}

This mapping relies on the system’s pci.ids file (or a user-specified alternative). It’s disabled by default to minimize overhead, but for dashboards, debug logs, and alerts, the readability boost is huge.

3. Nil-Pointer Checks

Alongside the new features, PR #3425 adds nil-pointer checks across optional fields in the sysfs.PciDevice struct.

Why this is important:

  • Not all sysfs entries exist consistently across kernels, drivers, or hardware types.
  • Without these checks, the collector could panic when trying to read missing fields.
  • With them, the collector gracefully skips unavailable data, keeping metrics flowing reliably.

Why This Matters (Especially for Infra Teams)

Better Observability & Context

These enhancements unlock deeper insights into machine hardware, particularly in high-performance, virtualized, or containerized environments:

  • Detect and diagnose NUMA locality issues that impact performance.
  • Monitor power states and wake-up events for PCI devices.
  • Validate and observe SR-IOV setups, critical for NICs and accelerators.
  • Build cleaner dashboards with vendor and device names instead of cryptic hex codes.
  • Rely on robust collectors that won’t break due to missing sysfs entries.

Opt-In by Design

The optional nature of ID-to-name conversion is deliberate: users can enable richer context where needed, without forcing additional dependencies on minimal setups.

Conclusion

At Asama.ai, we believe in strengthening the open-source ecosystem we rely on. Contributing improvements like these ensures broader community benefit and reduces the need to carry private forks.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

Cheers,

Jain

jj@asama.ai

Unpopular Opinion: Do not build what customers want

The problem of plenty is a good problem to have. The problem of having plenty of problems? The more the better. The way we see it, each problem is also an opportunity. 

But what do you do when you have plenty of opportunities? Which one do you prioritize first? Well, we have all learnt the Pareto principle in our schools and colleges. Quick revision: choose the 20% of problems that cause 80% of the mayhem. Simply put, choose the one that would drive the maximum impact.

Or is it?

Imagine this: You are a real estate developer. Which part of the building is most lucrative for your customer, or brings the most eyeballs? The club house, the balcony view, or maybe the interiors? Loads of growth gurus will tell you – “build what customers want”. But is that what the customer really wants? A big club house, or something else? The first implied want is that the house should last. The customer does not say it. And no good builder will ask customers whether they want a building that lasts! At least I hope they do not. It is implied and understood.

So, despite the fact that customers might get glittery eyes about that deck balcony or the chic interiors, the builder builds the foundation first. Without the foundation, nothing sells – irrespective of the impact some of the features may have. And that’s what we have chosen to do – to build the foundation first.

Very few understand the convoluted nature of the problem that’s holding the infra industry back, fewer appreciate how crucial it is to build a strong foundation first before we take on the bigger pieces.

Here’s the foundation of the industry that’s missing imho:

No customer relies solely on a single hardware provider. Almost every firm and colleague I have worked with or spoken to has at least a dual-sourcing strategy as a supply-chain risk-management measure, and most have a multi-vendor sourcing strategy – which makes a heterogeneous environment a reality for more than 95% of end customers. The remaining 5% are heavily dependent on a single server supplier, which results in unwarranted arm-twisting because you are at the mercy of your supplier. This could easily have been avoided if the right risk-management policies were in place and followed.

Anyway.

To manage their servers, some server providers (not all) provide their own dashboard to their customers. Some of these dashboards are fairly shallow, while some are marginally better. Most of them let you log in to one server at a time, and most do not let you perform cluster-level operations. The simplest example: restarting 200 servers at once. You literally need another console to do that and cannot do it from the server management dashboard.

Overall, the dashboards provided by server manufacturers are (a) not good enough to manage even the servers they came with, (b) not at all compatible with the rest of a heterogeneous fleet, as they won’t talk to other makes & models; and (c) non-existent in some cases – ODMs fail miserably here despite fairly competitive hardware. So what does the customer do? Well, there are two customer personas here – (1) the IT team, and (2) the IT Infrastructure team. The IT team is the more tactical of the two – focused on managing the distributed office network, laptops, other devices, and sometimes even a back office like the call centre or warehouse. The IT Infrastructure team is the one that manages prod environments/platforms and is slightly more strategic in nature.

Now, to the question of what our customers actually do:

  1. The IT team buys something called an IT Asset Management (ITAM) or a Data Center Infrastructure Management (DCIM) tool. There are some pretty solid asset management tools out there in the market that solve the “asset management” or “inventory management” aspect of the problem – which is what IT teams are generally concerned with. Hence, these ITAM/DCIM tools serve them well.
  2. The IT infra team goes into a loop – struggles with ITAM/server dashboards, writes scripts to manage the problems themselves, the code breaks, and the struggle continues. ITAM/DCIM is not sufficient, because those tools are designed from an inventory-management perspective rather than a health/performance-monitoring one. And the teams can’t dig in deeper, because the tools only dwell on the surface.

So what are we doing?

I will keep this one simple – we are building a single pane of glass that gives IT Infrastructure teams complete visibility into their heterogeneous server fleet. Think of us as no-nonsense, one-stop, deep visibility into your rather overlooked server boxes.

How are we doing it?

That’s a secret! Book a demo to know more. Give us a shout at manish@asama.ai / sg@asama.ai.