GPU Health & Diagnostics: Why They Matter

When operating GPU infrastructure—whether for AI, ML, HPC, rendering, or simulations—your uptime and performance depend on knowing whether your GPUs are healthy before they fail catastrophically. Faulty memory lanes, ECC errors, power instability, or thermal issues can degrade performance or cause silent errors.

In this post and subsequent posts, we explore what NVIDIA and AMD offer and how they can be used in your environment. 

NVIDIA’s DCGM (Data Center GPU Manager) is a toolkit that provides real-time health checks, diagnostics, and alerts for NVIDIA GPU fleets. Its health and diagnostic features help you:

  • Detect latent hardware or configuration issues early
  • Automate routine validation and alerting
  • Correlate hardware-level failures with workload anomalies

In this post, I’ll walk through how DCGM enables diagnostics and health monitoring—what it offers, how it works, and what to watch out for.

What DCGM Offers

Continuous Health Monitoring

For infrastructure engineers, health monitoring is baseline hygiene.
DCGM tracks everything that matters at the silicon level:

  • Memory bandwidth & ECC checks — catch degradation early
  • Thermal drift — detect cooling failures and hotspots
  • NVLink integrity — ensure interconnect reliability
  • Power stability — monitor rails, transients, and throttling

These continuous checks are non-invasive, low-overhead, and essential for keeping a GPU cluster in steady state.
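To make the idea of passive, threshold-based checks concrete, here is a minimal sketch. It does not show how DCGM implements its health watches. The `GpuSample` fields, the `check_sample` helper, and both limits are hypothetical names and values chosen for illustration, and the samples are assumed to have been collected by whatever means you already have.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    """One polled reading for a single GPU (field names are hypothetical)."""
    gpu_id: int
    temperature_c: float
    double_bit_ecc_errors: int
    power_w: float
    nvlink_crc_errors: int


# ASSUMPTION: illustrative limits only. Real limits depend on the GPU model
# and on your site policy.
MAX_TEMPERATURE_C = 85.0
MAX_POWER_W = 400.0


def check_sample(sample: GpuSample) -> List[str]:
    """Return human-readable warnings for one sample (empty list if healthy)."""
    warnings = []
    if sample.temperature_c > MAX_TEMPERATURE_C:
        warnings.append(f"GPU {sample.gpu_id}: temperature above limit")
    if sample.double_bit_ecc_errors > 0:
        warnings.append(f"GPU {sample.gpu_id}: uncorrectable ECC errors seen")
    if sample.power_w > MAX_POWER_W:
        warnings.append(f"GPU {sample.gpu_id}: power draw above limit")
    if sample.nvlink_crc_errors > 0:
        warnings.append(f"GPU {sample.gpu_id}: NVLink CRC errors seen")
    return warnings


healthy = GpuSample(gpu_id=0, temperature_c=62.0, double_bit_ecc_errors=0,
                    power_w=250.0, nvlink_crc_errors=0)
degraded = GpuSample(gpu_id=1, temperature_c=91.0, double_bit_ecc_errors=2,
                     power_w=250.0, nvlink_crc_errors=0)

print(check_sample(healthy))
for line in check_sample(degraded):
    print(line)
```

Because the check only reads values that were already collected, it can run on every poll without touching the workload, which is the property that makes this style of monitoring low-overhead.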

Diagnostics

Monitoring tells you something’s wrong.
Diagnostics tell you what and why.

DCGM diagnostics are invasive: they take over the GPU to stress and validate its subsystems, so they should not run alongside production workloads.
They are ideal for:

  • Maintenance windows
  • Burn-in testing
  • Root-cause analysis

They uncover:

  • Deployment and driver issues
  • Integration or container runtime conflicts
  • Stress-induced thermal/power anomalies
  • Hardware-level faults (PCIe, VRAM, regulators)

How Diagnostics Work: Levels & Workflows

Diagnostic Levels

DCGM supports multiple diagnostic “levels” (e.g. Level 1 through Level 4). The idea is:

  • Level 1 / 2: lightweight, fast sanity checks (good for frequent runs)
  • Level 3 / 4: deeper stress / memory / link tests (for maintenance windows or postmortems)

You choose a level depending on how deep you want to go and how long you can afford the test to run.
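One way to encode that trade-off is to pick the deepest level that fits the time you have. The sketch below is only an illustration: `pick_level` is a hypothetical helper, and the runtimes in `APPROX_RUNTIME_MIN` are assumed ballpark figures, not measured or documented values. Time the levels on your own hardware before relying on them.

```python
# ASSUMPTION: rough runtimes in minutes per diagnostic level, for illustration.
# Actual runtimes vary with GPU model, GPU count, and DCGM version.
APPROX_RUNTIME_MIN = {1: 1, 2: 2, 3: 30, 4: 120}


def pick_level(budget_minutes: float) -> int:
    """Return the deepest level whose approximate runtime fits the budget.

    Falls back to level 1 when even the quickest check does not fit.
    """
    fitting = [level for level, minutes in APPROX_RUNTIME_MIN.items()
               if minutes <= budget_minutes]
    return max(fitting, default=1)


print(pick_level(10))   # short gap between jobs
print(pick_level(45))   # brief maintenance slot
print(pick_level(180))  # full maintenance window
```

With the assumed table, a 10-minute gap maps to level 2, a 45-minute slot to level 3, and a three-hour window to level 4.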

Running Diagnostics via DCGMI

DCGMI is the CLI front end for DCGM. Example commands:

dcgmi diag -r 1         # run level 1 diagnostic
dcgmi diag -r 4         # run deepest diagnostic (if supported)
dcgmi health -s a       # enable all health watches on the default GPU group
dcgmi health -c         # query current health status

You can also tailor diagnostics by adjusting parameters (e.g., memory thresholds, enabling or disabling specific tests).
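If you want to trigger these commands from automation, a thin wrapper is usually enough. The sketch below uses only the `dcgmi diag -r <level>` form shown above. The function names are hypothetical, and treating a zero exit status as a pass is an assumption, so read the printed report as well.

```python
import shutil
import subprocess
from typing import List

VALID_LEVELS = (1, 2, 3, 4)


def diag_command(level: int) -> List[str]:
    """Build the argument list for `dcgmi diag -r <level>`."""
    if level not in VALID_LEVELS:
        raise ValueError(f"diagnostic level must be one of {VALID_LEVELS}, got {level}")
    return ["dcgmi", "diag", "-r", str(level)]


def run_diag(level: int) -> bool:
    """Run the diagnostic and report whether dcgmi exited with status 0.

    ASSUMPTION: a zero exit status is treated as a pass. Inspect the printed
    report too before trusting the result.
    """
    if shutil.which("dcgmi") is None:
        raise FileNotFoundError("dcgmi was not found on PATH")
    result = subprocess.run(diag_command(level), capture_output=True, text=True)
    print(result.stdout)
    return result.returncode == 0


if __name__ == "__main__":
    # Only print the commands here; call run_diag() on a host with DCGM installed.
    for level in VALID_LEVELS:
        print(" ".join(diag_command(level)))
```

Keeping command construction separate from execution means the level validation can be tested on a machine without GPUs, while `run_diag` is reserved for hosts where DCGM is actually installed.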

Conclusion

NVIDIA’s DCGM is a toolkit for continuous health monitoring and diagnostics of NVIDIA GPUs. We at Asama.ai are deeply integrated with NVIDIA, which helps us find issues and remediate them.

And yes, we’re always looking for developers who care about infrastructure, observability, and open source.

Unpopular Opinion: Do not build what customers want

The problem of plenty is a good problem to have. The problem of having plenty of problems? The more the better. The way we see it, each problem is also an opportunity. 

But what do you do when you have plenty of opportunities? Which one do you prioritize first? Well, we all learnt the Pareto principle in school or college. Quick revision: choose the 20% of problems that cause 80% of the mayhem. Simply put, choose the one that would drive the maximum impact.

Or is it?

Imagine this: You are a real estate developer. Which part of the building is most lucrative for your customer or brings the most eyeballs? The club house, the balcony view or maybe the interiors? Loads of growth gurus will tell you to “build what customers want”. But is that what the customer really wants? A big club house, or something else? The first implied want is that the house should last. The customer does not say it. And no good builder will ask customers if they want a building that lasts! At least I hope they do not. It is implied and understood.

So, despite the fact that customers might get glittery eyes about that deck balcony or the chic interiors, the builder builds the foundation first. Without the foundation, nothing sells – irrespective of the impact some of the features may have. And that’s what we have chosen to do – to build the foundation first.

Very few understand the convoluted nature of the problem that’s holding the infra industry back; fewer still appreciate how crucial it is to build a strong foundation before we take on the bigger pieces.

Here’s the foundation of the industry that’s missing imho:

Almost no customer relies solely on a single hardware provider. Nearly all the firms and colleagues I have worked with or spoken to have at least a dual-sourcing strategy as a supply chain risk management measure, and most have a multi-vendor sourcing strategy, which makes a heterogeneous environment a reality for more than 95% of end customers. The remaining 5% depend heavily on a single server supplier, which results in unwarranted arm-twisting because they are at the mercy of that supplier. This could easily have been avoided if the right risk management policies had been in place and followed.

Anyway.

Some server providers (not all) give their customers a dashboard to manage their servers. Some of these dashboards are fairly shallow, while others are marginally better. Most let you log in to only one server at a time and do not let you perform cluster-level operations. The simplest example is restarting 200 servers at once: you literally need a separate console to do that, because it cannot be done from the server management dashboard.

Overall, the dashboards provided by server manufacturers are (a) not good enough to manage the servers they came with, (b) not at all compatible with the rest of a heterogeneous fleet, as they won’t talk to other makes and models, and (c) non-existent in some cases, where ODMs fail miserably despite fairly competitive hardware. So what does the customer do? Well, there are two customer personas here: (1) the IT team, and (2) the IT Infrastructure team. The IT team is the more tactical one, focusing on managing the distributed office network, laptops, other devices and sometimes even a back office like the call centre or warehouse. The IT Infrastructure team is the one that manages prod environments and platforms, and is slightly more strategic in nature.

Now, to the question of what our customers do:

  1. The IT team buys something called an IT Asset Management (ITAM) or a Data Center Infrastructure Management (DCIM) tool. There are some pretty solid asset management tools on the market that solve the “asset management” or “inventory management” aspect of the problem, which is what IT teams are generally concerned with. Hence, these ITAM/DCIM tools suit them well.
  2. The IT Infrastructure team goes into a loop: it struggles with ITAM tools and server dashboards, writes scripts to manage its problems itself, the code breaks, and the struggle continues. ITAM/DCIM tools are not sufficient because they are designed from an inventory management perspective rather than a health and performance monitoring one. They stay on the surface, so the team cannot dig in.

So what are we doing?

I will keep this one simple: we are building a single pane of glass that gives IT Infrastructure teams complete visibility of their heterogeneous server fleet. Think of us as a no-nonsense, one-stop source of deep visibility into your rather overlooked server boxes.

How are we doing it?

That’s a secret! Book a demo to know more. Give us a shout at manish@asama.ai / sg@asama.ai.

How did we name our start-up – Asama.ai?

We were drowning in customer interviews right from day 1. Whatever time was left in the day, we spent discussing interviews and planning follow-up questions. By the time we got to naming ourselves (which was probably 2-3 weeks in), our thoughts had crystallized on the values we wanted customers to associate with us: intelligence, innovation and excellence. With these in mind, we set out to name ourselves and went about noting just about anything that represented these ideas. Eventually, we shortlisted 3 names on the following themes:

1. Spirit Animal: Having spent some time in the auto space, I had seen motorcycles designed around different spirit animals. Think of the Dominar 400 as Taurus (dominating and powerful) and the Pulsar 200 as Cheetah (sleek and fast).
Some brands use their spirit animal directly in their names, for example Red Bull, Puma, Jaguar, SurveyMonkey, Mailchimp and TaskRabbit. Others, like Lacoste and Swarovski, use their spirit animal in their logo (a crocodile and a swan, conveying tenacity and elegance respectively). We decided to translate a similar thought into our name, and that led us to two options:

a. Orca (or killer whale): Orcas belong to the dolphin family and are considered very intelligent. They are known for their innovative hunting techniques and the unique vocal patterns they use to communicate. In short, killer whales are killer smart! We were psyched by the prospect of naming ourselves ‘Smartorca’, among others.

b. Raven: Ravens are considered very intelligent, higher-order birds (no wonder Game of Thrones used them to convey intelligence as a trait). Their cognitive capabilities rival those of apes; they have immense social learning and problem-solving skills and are meticulous planners. We thought of naming ourselves ‘Sharpey’ and using a raven in our logo (‘Sharpey’ being the name of our mascot), to convey the idea of a sharp raven!

2. Sanskrit word: Staying true to our Indian roots, we wanted a name that conveyed the value of excellence and, if possible, came from Sanskrit. Several home-grown brands have used this strategy; think Krutrim, Vistara, Lakme, Myntra etc. And that’s how we landed on ‘Asama’ (असम). In Sanskrit, Asama translates to ‘that which does not have an equal’ or ‘that which is beyond comparison’. The word really resonated with us and closely reflected what we are trying to build: an unparalleled solution to modern infrastructure problems. And that’s how we became Asama.ai.

General trivia – The word ‘Asama’ also seems to have played a role in naming India’s state of ‘Assam’.

Want to know what we are up to? Let’s connect at manish@asama.ai / sg@asama.ai.

Until then.

Keep Slaying,

Manish