NVIDIA NVML GPU Statistics

Introduction

The official NVIDIA utility nvidia-smi provides a lot of useful information about the GPU. It is built on top of the NVIDIA Management Library (NVML), which provides a set of APIs for monitoring NVIDIA GPU statistics. In practice, we sometimes would like to monitor GPU statistics in our own custom applications.

In this blog post, I would like to discuss how to use the NVIDIA NVML library to monitor GPU statistics and replicate nvidia-smi dmon in a custom C++ application.

NVIDIA NVML GPU Statistics

NVIDIA-SMI DMON

nvidia-smi dmon will display basic GPU statistics, including power (pwr), GPU temperature (gtemp), memory temperature (mtemp), GPU utilization (sm) (the percentage of time that at least one SM is being used), memory utilization (mem), encoder utilization (enc), decoder utilization (dec), JPEG utilization (jpg), OFA utilization (ofa), memory clock (mclk), and graphics clock (pclk). The following is an example of nvidia-smi dmon output:

$ nvidia-smi dmon
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
    0      8     42      -      1     11      0      0      0      0    405    502
    0     14     43      -      0      1      0      0      0      0   7001   1492
    0     15     43      -      0      1      0      0      0      0   7001   1492

In addition to the basic statistics, nvidia-smi dmon can also display GPU Performance Metrics (GPM) on Hopper and later GPUs. The following example shows how to display the GPM metrics for graphics activity (gract) (the same as the sm metric), SM utilization (smutil) (the percentage of SMs that are actively being used), and FP16 activity (fp16).

$ nvidia-smi dmon --gpm-metrics 1,2,13
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk      gract     smutil       fp16
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz      GPM:%      GPM:%      GPM:%
    0     12     43      -      0      1      0      0      0      0   7001    682          -          -          -
    0     12     43      -      0      1      0      0      0      0    810    532          -          -          -
    0     11     43      -      2      7      0      0      0      0    810    495          1          1          0
    0      9     43      -      1      7      0      0      0      0    810    502          9          8          0

The full list of GPM metrics can be found in the output of nvidia-smi dmon --help.

$ nvidia-smi dmon --help

GPU statistics are displayed in scrolling format with one line
per sampling interval. Metrics to be monitored can be adjusted
based on the width of terminal window. Monitoring is limited to
a maximum of 16 devices. If no devices are specified, then up to
first 16 supported devices under natural enumeration (starting
with GPU index 0) are used for monitoring purpose.
It is supported on Tesla, GRID, Quadro and limited GeForce products
for Kepler or newer GPUs under x64 and ppc64 bare metal Linux.
Note: On MIG-enabled GPUs, querying the utilization of encoder,
decoder, jpeg, ofa, gpu, and memory is not currently supported.

Usage: nvidia-smi dmon [options]

Options include:
[-i | --id]: Comma separated Enumeration index, PCI bus ID or UUID
[-d | --delay]: Collection delay/interval in seconds [default=1sec]
[-c | --count]: Collect specified number of samples and exit
[-s | --select]: One or more metrics [default=puc]
Can be any of the following:
p - Power Usage and Temperature
u - Utilization
c - Proc and Mem Clocks
v - Power and Thermal Violations
m - FB, Bar1 and CC Protected Memory
e - ECC Errors and PCIe Replay errors
t - PCIe Rx and Tx Throughput
[N/A | --gpm-metrics]: Comma-separated list of GPM metrics (no space in between) to watch
Available metrics:
Graphics Activity = 1
SM Activity = 2
SM Occupancy = 3
Integer Activity = 4
Tensor Activity = 5
DFMA Tensor Activity = 6
HMMA Tensor Activity = 7
IMMA Tensor Activity = 9
DRAM Activity = 10
FP64 Activity = 11
FP32 Activity = 12
FP16 Activity = 13
PCIe TX = 20
PCIe RX = 21
NVDEC 0-7 Activity = 30-37
NVJPG 0-7 Activity = 40-47
NVOFA 0 Activity = 50
NVLink Total RX = 60
NVLink Total TX = 61
NVLink L0-17 RX = 62,64,66,...,96
NVLink L0-17 TX = 63,65,67,...,97
C2C TOTAL TX = 100
C2C TOTAL RX = 101
C2C DATA TX = 102
C2C DATA RX = 103
C2C LINK0-13 TOTAL TX = 104,108,112,...,156
C2C LINK0-13 TOTAL RX = 105,109,113,...,157
C2C LINK0-13 DATA TX = 106,110,114,...,158
C2C LINK0-13 DATA RX = 107,111,115,...,159
HOSTMEM CACHE HIT = 160
HOSTMEM CACHE MISS = 161
PEERMEM CACHE HIT = 162
PEERMEM CACHE MISS = 163
DRAM CACHE HIT = 164
DRAM CACHE MISS = 165
NVENC 0-3 Activity = 166-169
GR0-7 CTXSW CYCLES ELAPSED = 170,175,180,...,205
GR0-7 CTXSW CYCLES ACTIVE = 171,176,181,...,206
GR0-7 CTXSW REQUESTS = 172,177,182,...,207
GR0-7 CTXSW ACTIVE AVERAGE = 173,178,183,...,208
GR0-7 CTXSW ACTIVE PERCENT = 174,179,184,...,209

[N/A | --gpm-options]: options of which level of GPM metrics to monitor:
d - Display Device level GPM Metrics only
m - Display MIG level GPM Metrics only
dm - Display both Device and MIG level GPM Metrics only
md - Display both Device and MIG level GPM Metrics only
[-o | --options]: One or more from the following:
D - Include Date (YYYYMMDD) in scrolling output
T - Include Time (HH:MM:SS) in scrolling output
[-f | --filename]: Log to a specified file, rather than to stdout
[-h | --help]: Display help information
[N/A | --format]: Output format specifiers:
csv - Format dmon output as a CSV
nounit - Remove units line from dmon output
noheader - Remove heading line from dmon output

GPU Stats Using NVIDIA NVML

It turns out that we can query the basic GPU statistics, including sm, mem, enc, dec, jpg, and ofa, using the nvmlDeviceGetProcessesUtilizationInfo API, and the additional GPM statistics, including every GPM metric listed in nvidia-smi dmon --help, using the nvmlGpmMetricsGet API. All the GPM metric IDs can be found in the nvmlGpmMetricId_t definition. For example, the GPM metric ID for gract is NVML_GPM_METRIC_GRAPHICS_UTIL = 1, for smutil it is NVML_GPM_METRIC_SM_UTIL = 2, and for fp16 it is NVML_GPM_METRIC_FP16_UTIL = 13.
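
Because GPM is only supported on Hopper and later GPUs, before collecting GPM samples we can check whether a device supports GPM using the nvmlGpmQueryDeviceSupport API. The following is a minimal sketch, assuming device index 0 and omitting most error handling for brevity.

#include <iostream>

#include <nvml.h>

int main()
{
    nvmlInit();

    nvmlDevice_t device{};
    nvmlDeviceGetHandleByIndex(0, &device);

    // Query whether the device supports GPM.
    nvmlGpmSupport_t gpmSupport{};
    gpmSupport.version = NVML_GPM_SUPPORT_VERSION;
    nvmlReturn_t const result{nvmlGpmQueryDeviceSupport(device, &gpmSupport)};
    if (result == NVML_SUCCESS && gpmSupport.isSupportedDevice)
    {
        std::cout << "GPM is supported on this device." << std::endl;
    }
    else
    {
        std::cout << "GPM is not supported on this device." << std::endl;
    }

    nvmlShutdown();
    return 0;
}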

The following gpu_stats program demonstrates how to use the NVIDIA NVML APIs mentioned above and produces the same output as nvidia-smi dmon. The source code is also available in the “NVIDIA NVML GPU Statistics” repository on GitHub.

gpu_stats.cpp
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

#include <nvml.h>

// GPM Metric Name Mapping
std::map<int, std::string> gpmMetricNames = {
{1, "gract"}, {2, "smutil"}, {3, "smoccu"}, {4, "intact"}, {5, "tenact"},
{6, "dfmact"}, {7, "hmmact"}, {9, "immact"}, {10, "dramac"}, {11, "fp64"},
{12, "fp32"}, {13, "fp16"}, {20, "pcitx"}, {21, "pcirx"}, {30, "nvd0"},
{31, "nvd1"}, {32, "nvd2"}, {33, "nvd3"}, {34, "nvd4"}, {35, "nvd5"},
{36, "nvd6"}, {37, "nvd7"}, {40, "nvj0"}, {41, "nvj1"}, {42, "nvj2"},
{43, "nvj3"}, {44, "nvj4"}, {45, "nvj5"}, {46, "nvj6"}, {47, "nvj7"},
{50, "ofa0"}, {60, "nvlrx"}, {61, "nvltx"}};

struct GPUStats
{
unsigned int power; // Power in watts
unsigned int gpuTemp; // GPU temperature in Celsius
int memTemp; // Memory temperature in Celsius (-1 if not available)
unsigned int smUtil; // SM utilization %
unsigned int memUtil; // Memory utilization %
unsigned int encUtil; // Encoder utilization %
unsigned int decUtil; // Decoder utilization %
unsigned int jpgUtil; // JPEG decoder utilization %
unsigned int ofaUtil; // OFA utilization %
unsigned int memClock; // Memory clock in MHz
unsigned int smClock; // SM clock in MHz
std::map<int, double> gpmMetrics; // GPM metrics
};

void printError(char const* func, nvmlReturn_t const result)
{
std::cerr << "Error in " << func << ": " << nvmlErrorString(result)
<< std::endl;
}

bool getUtilization(nvmlDevice_t const device, GPUStats& stats)
{
// Initialize all utilization values to 0
stats.smUtil = 0;
stats.memUtil = 0;
stats.encUtil = 0;
stats.decUtil = 0;
stats.jpgUtil = 0;
stats.ofaUtil = 0;

// Try to use nvmlDeviceGetProcessesUtilizationInfo to get all metrics at
// once
nvmlProcessesUtilizationInfo_t procUtilInfo{};
memset(&procUtilInfo, 0, sizeof(procUtilInfo));
procUtilInfo.version = nvmlProcessesUtilizationInfo_v1;
procUtilInfo.lastSeenTimeStamp = 0;

// First call to determine the buffer size needed
nvmlReturn_t result{
nvmlDeviceGetProcessesUtilizationInfo(device, &procUtilInfo)};

if (result == NVML_ERROR_INSUFFICIENT_SIZE &&
procUtilInfo.processSamplesCount > 0)
{
// Allocate buffer for process utilization samples
std::vector<nvmlProcessUtilizationInfo_v1_t> procUtilArray(
procUtilInfo.processSamplesCount);
procUtilInfo.procUtilArray = procUtilArray.data();

result = nvmlDeviceGetProcessesUtilizationInfo(device, &procUtilInfo);

if (result == NVML_SUCCESS)
{
// Aggregate utilization across all processes (take maximum)
for (unsigned int i{0}; i < procUtilInfo.processSamplesCount; ++i)
{
stats.smUtil = std::max(stats.smUtil, procUtilArray[i].smUtil);
stats.memUtil =
std::max(stats.memUtil, procUtilArray[i].memUtil);
stats.encUtil =
std::max(stats.encUtil, procUtilArray[i].encUtil);
stats.decUtil =
std::max(stats.decUtil, procUtilArray[i].decUtil);
stats.jpgUtil =
std::max(stats.jpgUtil, procUtilArray[i].jpgUtil);
stats.ofaUtil =
std::max(stats.ofaUtil, procUtilArray[i].ofaUtil);
}
return true;
}
}

// Fallback to individual API calls if nvmlDeviceGetProcessesUtilizationInfo
// not available
nvmlUtilization_t utilization{};
result = nvmlDeviceGetUtilizationRates(device, &utilization);
if (result == NVML_SUCCESS)
{
stats.smUtil = utilization.gpu;
stats.memUtil = utilization.memory;
}

unsigned int encoderUtil{}, encoderSamplingPeriod{};
result = nvmlDeviceGetEncoderUtilization(device, &encoderUtil,
&encoderSamplingPeriod);
if (result == NVML_SUCCESS)
{
stats.encUtil = encoderUtil;
}

unsigned int decoderUtil{}, decoderSamplingPeriod{};
result = nvmlDeviceGetDecoderUtilization(device, &decoderUtil,
&decoderSamplingPeriod);
if (result == NVML_SUCCESS)
{
stats.decUtil = decoderUtil;
}

return true;
}

bool getGPUStats(nvmlDevice_t const device, GPUStats& stats)
{
nvmlReturn_t result{};

// Get power
result = nvmlDeviceGetPowerUsage(device, &stats.power);
if (result != NVML_SUCCESS)
{
stats.power = 0;
}
else
{
stats.power /= 1000; // Convert from milliwatts to watts
}

// Get GPU temperature
result =
nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &stats.gpuTemp);
if (result != NVML_SUCCESS)
{
stats.gpuTemp = 0;
}

// Memory temperature is not available via standard NVML API
stats.memTemp = -1;

// Get utilization
getUtilization(device, stats);

// Get clocks
result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_MEM, &stats.memClock);
if (result != NVML_SUCCESS)
{
stats.memClock = 0;
}

result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &stats.smClock);
if (result != NVML_SUCCESS)
{
stats.smClock = 0;
}

return true;
}

bool getGPMMetrics(nvmlDevice_t const device, std::vector<int> const& metricIds,
GPUStats& stats)
{
if (metricIds.empty())
{
return true;
}

// Custom deleter for GPM samples
auto gpmSampleDeleter = [](nvmlGpmSample_t* sample)
{
if (sample && *sample)
{
nvmlGpmSampleFree(*sample);
}
delete sample;
};

// Allocate GPM samples with RAII
std::unique_ptr<nvmlGpmSample_t, decltype(gpmSampleDeleter)> sample1(
new nvmlGpmSample_t{}, gpmSampleDeleter);
nvmlReturn_t result{nvmlGpmSampleAlloc(sample1.get())};
if (result != NVML_SUCCESS)
{
// GPM not supported
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

std::unique_ptr<nvmlGpmSample_t, decltype(gpmSampleDeleter)> sample2(
new nvmlGpmSample_t{}, gpmSampleDeleter);
result = nvmlGpmSampleAlloc(sample2.get());
if (result != NVML_SUCCESS)
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

// Get first sample
result = nvmlGpmSampleGet(device, *sample1);
if (result != NVML_SUCCESS)
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

// Wait for at least 100ms
std::this_thread::sleep_for(std::chrono::milliseconds(100));

// Get second sample
result = nvmlGpmSampleGet(device, *sample2);
if (result != NVML_SUCCESS)
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

    // Prepare metrics get structure
    nvmlGpmMetricsGet_t metricsGet{};
    memset(&metricsGet, 0, sizeof(metricsGet));
    metricsGet.version = NVML_GPM_METRICS_GET_VERSION;
    // The metrics member is a fixed-size array of NVML_GPM_METRIC_MAX
    // entries, so clamp the number of requested metrics to its capacity.
    metricsGet.numMetrics = static_cast<unsigned int>(
        std::min<size_t>(metricIds.size(), NVML_GPM_METRIC_MAX));
    metricsGet.sample1 = *sample1;
    metricsGet.sample2 = *sample2;

    // Fill the metrics array with the requested metric IDs
    for (size_t i{0}; i < metricsGet.numMetrics; ++i)
    {
        metricsGet.metrics[i].metricId =
            static_cast<nvmlGpmMetricId_t>(metricIds[i]);
    }

// Get metrics
result = nvmlGpmMetricsGet(&metricsGet);
if (result == NVML_SUCCESS)
{
        for (size_t i{0}; i < metricsGet.numMetrics; ++i)
{
stats.gpmMetrics[metricIds[i]] = metricsGet.metrics[i].value;
}
}
else
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
}

return result == NVML_SUCCESS;
}

void printHeader(std::vector<int> const& gpmMetricIds)
{
std::cout << "# gpu pwr gtemp mtemp sm mem enc dec "
"jpg ofa mclk pclk";
for (int id : gpmMetricIds)
{
if (gpmMetricNames.find(id) != gpmMetricNames.end())
{
std::cout << std::setw(11) << gpmMetricNames[id];
}
}
std::cout << std::endl;

std::cout << "# Idx W C C % % % % "
"% % MHz MHz";
for (size_t i{0}; i < gpmMetricIds.size(); ++i)
{
std::cout << " GPM:%";
}
std::cout << std::endl;
}

void printStats(unsigned int const deviceId, GPUStats const& stats,
std::vector<int> const& gpmMetricIds)
{
std::cout << std::setw(5) << deviceId;
std::cout << std::setw(7) << stats.power;
std::cout << std::setw(7) << stats.gpuTemp;

if (stats.memTemp >= 0)
{
std::cout << std::setw(7) << stats.memTemp;
}
else
{
std::cout << std::setw(7) << "-";
}

std::cout << std::setw(7) << stats.smUtil;
std::cout << std::setw(7) << stats.memUtil;
std::cout << std::setw(7) << stats.encUtil;
std::cout << std::setw(7) << stats.decUtil;
std::cout << std::setw(7) << stats.jpgUtil;
std::cout << std::setw(7) << stats.ofaUtil;
std::cout << std::setw(7) << stats.memClock;
std::cout << std::setw(7) << stats.smClock;

for (int id : gpmMetricIds)
{
if (stats.gpmMetrics.find(id) != stats.gpmMetrics.end())
{
double const value{stats.gpmMetrics.at(id)};
if (value < 0)
{
std::cout << std::setw(11) << "-";
}
else
{
std::cout << std::setw(11) << static_cast<int>(value);
}
}
else
{
std::cout << std::setw(11) << "-";
}
}

std::cout << std::endl;
}

std::vector<int> parseGpmMetrics(std::string const& str)
{
std::vector<int> metrics{};
std::stringstream ss{str};
std::string token{};

while (std::getline(ss, token, ','))
{
try
{
metrics.push_back(std::stoi(token));
}
catch (...)
{
std::cerr << "Invalid GPM metric ID: " << token << std::endl;
}
}

return metrics;
}

int main(int argc, char* argv[])
{
std::vector<int> gpmMetricIds{};
int delay{1}; // Default 1 second
int count{-1}; // Infinite by default

// Parse command line arguments
for (int i{1}; i < argc; ++i)
{
std::string const arg{argv[i]};
if (arg == "--gpm-metrics" && i + 1 < argc)
{
gpmMetricIds = parseGpmMetrics(argv[++i]);
}
else if ((arg == "-d" || arg == "--delay") && i + 1 < argc)
{
delay = std::atoi(argv[++i]);
}
else if ((arg == "-c" || arg == "--count") && i + 1 < argc)
{
count = std::atoi(argv[++i]);
}
else if (arg == "-h" || arg == "--help")
{
std::cout << "Usage: " << argv[0] << " [options]" << std::endl
<< "Options:" << std::endl
<< " --gpm-metrics <ids> Comma-separated list of GPM "
"metric IDs"
<< std::endl
<< " -d, --delay <sec> Collection delay/interval in "
"seconds [default=1]"
<< std::endl
<< " -c, --count <n> Collect specified number of "
"samples and exit"
<< std::endl
<< " -h, --help Display this help"
<< std::endl;
return 0;
}
}

// Initialize NVML
nvmlReturn_t result{nvmlInit()};
if (result != NVML_SUCCESS)
{
printError("nvmlInit", result);
return 1;
}

// Get device count
unsigned int deviceCount{};
result = nvmlDeviceGetCount(&deviceCount);
if (result != NVML_SUCCESS)
{
printError("nvmlDeviceGetCount", result);
nvmlShutdown();
return 1;
}

if (deviceCount == 0)
{
std::cerr << "No NVIDIA GPUs found" << std::endl;
nvmlShutdown();
return 1;
}

// Get device handles
std::vector<nvmlDevice_t> devices(deviceCount);
for (unsigned int i{0}; i < deviceCount; ++i)
{
result = nvmlDeviceGetHandleByIndex(i, &devices[i]);
if (result != NVML_SUCCESS)
{
printError("nvmlDeviceGetHandleByIndex", result);
nvmlShutdown();
return 1;
}
}

// Print header
printHeader(gpmMetricIds);

// Main monitoring loop
int iteration{0};
while (count < 0 || iteration < count)
{
for (unsigned int i{0}; i < deviceCount; ++i)
{
GPUStats stats{};
getGPUStats(devices[i], stats);

// Get GPM metrics if requested
if (!gpmMetricIds.empty())
{
getGPMMetrics(devices[i], gpmMetricIds, stats);
}

printStats(i, stats, gpmMetricIds);
}

iteration++;
if (count < 0 || iteration < count)
{
std::this_thread::sleep_for(std::chrono::seconds(delay));
}
}

// Shutdown NVML
nvmlShutdown();

return 0;
}

The gpu_stats program can be built and run using the following commands.

# Build the application
$ g++ -o gpu_stats gpu_stats.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lnvidia-ml
# Mimic nvidia-smi dmon --gpm-metrics 1,2,13
$ ./gpu_stats --gpm-metrics 1,2,13
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk      gract     smutil       fp16
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz      GPM:%      GPM:%      GPM:%
    0     19     34      -      3      2      0      0      0      0   7001    712         13         11          0
    0     19     34      -      3      2      0      0      0      0   7001    637         11          8          0
    0     18     34      -      3      2      0      0      0      0   7001    667          9         10          0
    0     19     34      -      3      2      0      0      0      0   7001    577          8         10          0
    0     18     34      -      3      2      0      0      0      0   7001    615          7          8          0
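
Note that libnvidia-ml.so is installed by the NVIDIA driver and, depending on the system, may not be found under /usr/local/cuda/lib64 at link time. In that case, assuming a default CUDA Toolkit installation, we can link against the NVML stub library shipped with the CUDA Toolkit instead; the real libnvidia-ml.so from the driver will be loaded at runtime.

# Link against the NVML stub library shipped with the CUDA Toolkit
$ g++ -o gpu_stats gpu_stats.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64/stubs -lnvidia-ml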
