NVIDIA NVML GPU Statistics

Introduction

The official NVIDIA utility nvidia-smi provides a lot of useful information about the GPU. It is built on top of the NVIDIA Management Library (NVML), which provides a set of APIs for monitoring NVIDIA GPU statistics. In practice, we sometimes would like to monitor GPU statistics in our own custom applications.

In this blog post, I would like to discuss how to use the NVIDIA NVML library to monitor GPU statistics and replicate nvidia-smi dmon in a custom C++ application.

NVIDIA NVML GPU Statistics

NVIDIA-SMI DMON

nvidia-smi dmon will display basic GPU statistics, including power (pwr), GPU temperature (gtemp), memory temperature (mtemp), GPU utilization (sm) (the percentage of time that at least one SM is being used), memory utilization (mem), encoder utilization (enc), decoder utilization (dec), JPEG utilization (jpg), OFA utilization (ofa), memory clock (mclk), and graphics clock (pclk). The following is an example of nvidia-smi dmon output:

$ nvidia-smi dmon
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
    0      8     42      -      1     11      0      0      0      0    405    502
    0     14     43      -      0      1      0      0      0      0   7001   1492
    0     15     43      -      0      1      0      0      0      0   7001   1492

In addition to the basic statistics, nvidia-smi dmon can also display GPU Performance Metrics (GPM) on Hopper and later GPUs. The following example shows how to display the GPM metrics for graphics activity (gract) (the same as the sm metric), SM utilization (smutil) (the percentage of SMs that are actively being used), and FP16 activity (fp16).

$ nvidia-smi dmon --gpm-metrics 1,2,13
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk      gract     smutil       fp16
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz      GPM:%      GPM:%      GPM:%
    0     12     43      -      0      1      0      0      0      0   7001    682          -          -          -
    0     12     43      -      0      1      0      0      0      0    810    532          -          -          -
    0     11     43      -      2      7      0      0      0      0    810    495          1          1          0
    0      9     43      -      1      7      0      0      0      0    810    502          9          8          0

The full list of GPM metrics can be found in the output of nvidia-smi dmon --help.

$ nvidia-smi dmon --help

GPU statistics are displayed in scrolling format with one line
per sampling interval. Metrics to be monitored can be adjusted
based on the width of terminal window. Monitoring is limited to
a maximum of 16 devices. If no devices are specified, then up to
first 16 supported devices under natural enumeration (starting
with GPU index 0) are used for monitoring purpose.
It is supported on Tesla, GRID, Quadro and limited GeForce products
for Kepler or newer GPUs under x64 and ppc64 bare metal Linux.
Note: On MIG-enabled GPUs, querying the utilization of encoder,
decoder, jpeg, ofa, gpu, and memory is not currently supported.

Usage: nvidia-smi dmon [options]

Options include:
[-i | --id]: Comma separated Enumeration index, PCI bus ID or UUID
[-d | --delay]: Collection delay/interval in seconds [default=1sec]
[-c | --count]: Collect specified number of samples and exit
[-s | --select]: One or more metrics [default=puc]
Can be any of the following:
p - Power Usage and Temperature
u - Utilization
c - Proc and Mem Clocks
v - Power and Thermal Violations
m - FB, Bar1 and CC Protected Memory
e - ECC Errors and PCIe Replay errors
t - PCIe Rx and Tx Throughput
[N/A | --gpm-metrics]: Comma-separated list of GPM metrics (no space in between) to watch
Available metrics:
Graphics Activity = 1
SM Activity = 2
SM Occupancy = 3
Integer Activity = 4
Tensor Activity = 5
DFMA Tensor Activity = 6
HMMA Tensor Activity = 7
IMMA Tensor Activity = 9
DRAM Activity = 10
FP64 Activity = 11
FP32 Activity = 12
FP16 Activity = 13
PCIe TX = 20
PCIe RX = 21
NVDEC 0-7 Activity = 30-37
NVJPG 0-7 Activity = 40-47
NVOFA 0 Activity = 50
NVLink Total RX = 60
NVLink Total TX = 61
NVLink L0-17 RX = 62,64,66,...,96
NVLink L0-17 TX = 63,65,67,...,97
C2C TOTAL TX = 100
C2C TOTAL RX = 101
C2C DATA TX = 102
C2C DATA RX = 103
C2C LINK0-13 TOTAL TX = 104,108,112,...,156
C2C LINK0-13 TOTAL RX = 105,109,113,...,157
C2C LINK0-13 DATA TX = 106,110,114,...,158
C2C LINK0-13 DATA RX = 107,111,115,...,159
HOSTMEM CACHE HIT = 160
HOSTMEM CACHE MISS = 161
PEERMEM CACHE HIT = 162
PEERMEM CACHE MISS = 163
DRAM CACHE HIT = 164
DRAM CACHE MISS = 165
NVENC 0-3 Activity = 166-169
GR0-7 CTXSW CYCLES ELAPSED = 170,175,180,...,205
GR0-7 CTXSW CYCLES ACTIVE = 171,176,181,...,206
GR0-7 CTXSW REQUESTS = 172,177,182,...,207
GR0-7 CTXSW ACTIVE AVERAGE = 173,178,183,...,208
GR0-7 CTXSW ACTIVE PERCENT = 174,179,184,...,209

[N/A | --gpm-options]: options of which level of GPM metrics to monitor:
d - Display Device level GPM Metrics only
m - Display MIG level GPM Metrics only
dm - Display both Device and MIG level GPM Metrics only
md - Display both Device and MIG level GPM Metrics only
[-o | --options]: One or more from the following:
D - Include Date (YYYYMMDD) in scrolling output
T - Include Time (HH:MM:SS) in scrolling output
[-f | --filename]: Log to a specified file, rather than to stdout
[-h | --help]: Display help information
[N/A | --format]: Output format specifiers:
csv - Format dmon output as a CSV
nounit - Remove units line from dmon output
noheader - Remove heading line from dmon output

GPU Stats Using NVIDIA NVML

It turns out that we can query the basic GPU statistics, including sm, mem, enc, dec, jpg, and ofa, using the nvmlDeviceGetProcessesUtilizationInfo API, and the additional GPM statistics, including every GPM metric listed in nvidia-smi dmon --help, using the nvmlGpmMetricsGet API. All the GPM metric IDs can be found in the nvmlGpmMetricId_t definition. For example, the GPM metric ID for gract is NVML_GPM_METRIC_GRAPHICS_UTIL = 1, for smutil it is NVML_GPM_METRIC_SM_UTIL = 2, and for fp16 it is NVML_GPM_METRIC_FP16_UTIL = 13.
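
Because GPM is only supported on Hopper and later GPUs, before collecting GPM samples we can check whether a device supports GPM using the nvmlGpmQueryDeviceSupport API. The following is a minimal sketch, assuming device index 0 and omitting most error handling for brevity.

#include <iostream>

#include <nvml.h>

int main()
{
    nvmlInit();

    nvmlDevice_t device{};
    nvmlDeviceGetHandleByIndex(0, &device);

    // Query whether the device supports GPM.
    nvmlGpmSupport_t gpmSupport{};
    gpmSupport.version = NVML_GPM_SUPPORT_VERSION;
    nvmlReturn_t const result{nvmlGpmQueryDeviceSupport(device, &gpmSupport)};
    if (result == NVML_SUCCESS && gpmSupport.isSupportedDevice)
    {
        std::cout << "GPM is supported on this device." << std::endl;
    }
    else
    {
        std::cout << "GPM is not supported on this device." << std::endl;
    }

    nvmlShutdown();
    return 0;
}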

The following gpu_stats program demonstrates how to use the NVIDIA NVML APIs mentioned above and produces the same output as nvidia-smi dmon. The source code is also available in the “NVIDIA NVML GPU Statistics” repository on GitHub.

gpu_stats.cpp
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

#include <nvml.h>

// GPM Metric Name Mapping
std::map<int, std::string> gpmMetricNames = {
{1, "gract"}, {2, "smutil"}, {3, "smoccu"}, {4, "intact"}, {5, "tenact"},
{6, "dfmact"}, {7, "hmmact"}, {9, "immact"}, {10, "dramac"}, {11, "fp64"},
{12, "fp32"}, {13, "fp16"}, {20, "pcitx"}, {21, "pcirx"}, {30, "nvd0"},
{31, "nvd1"}, {32, "nvd2"}, {33, "nvd3"}, {34, "nvd4"}, {35, "nvd5"},
{36, "nvd6"}, {37, "nvd7"}, {40, "nvj0"}, {41, "nvj1"}, {42, "nvj2"},
{43, "nvj3"}, {44, "nvj4"}, {45, "nvj5"}, {46, "nvj6"}, {47, "nvj7"},
{50, "ofa0"}, {60, "nvlrx"}, {61, "nvltx"}};

struct GPUStats
{
unsigned int power; // Power in watts
unsigned int gpuTemp; // GPU temperature in Celsius
int memTemp; // Memory temperature in Celsius (-1 if not available)
unsigned int smUtil; // SM utilization %
unsigned int memUtil; // Memory utilization %
unsigned int encUtil; // Encoder utilization %
unsigned int decUtil; // Decoder utilization %
unsigned int jpgUtil; // JPEG decoder utilization %
unsigned int ofaUtil; // OFA utilization %
unsigned int memClock; // Memory clock in MHz
unsigned int smClock; // SM clock in MHz
std::map<int, double> gpmMetrics; // GPM metrics
};

void printError(char const* func, nvmlReturn_t const result)
{
std::cerr << "Error in " << func << ": " << nvmlErrorString(result)
<< std::endl;
}

bool getUtilization(nvmlDevice_t const device, GPUStats& stats)
{
// Initialize all utilization values to 0
stats.smUtil = 0;
stats.memUtil = 0;
stats.encUtil = 0;
stats.decUtil = 0;
stats.jpgUtil = 0;
stats.ofaUtil = 0;

// Try to use nvmlDeviceGetProcessesUtilizationInfo to get all metrics at
// once
nvmlProcessesUtilizationInfo_t procUtilInfo{};
memset(&procUtilInfo, 0, sizeof(procUtilInfo));
procUtilInfo.version = nvmlProcessesUtilizationInfo_v1;
procUtilInfo.lastSeenTimeStamp = 0;

// First call to determine the buffer size needed
nvmlReturn_t result{
nvmlDeviceGetProcessesUtilizationInfo(device, &procUtilInfo)};

if (result == NVML_ERROR_INSUFFICIENT_SIZE &&
procUtilInfo.processSamplesCount > 0)
{
// Allocate buffer for process utilization samples
std::vector<nvmlProcessUtilizationInfo_v1_t> procUtilArray(
procUtilInfo.processSamplesCount);
procUtilInfo.procUtilArray = procUtilArray.data();

result = nvmlDeviceGetProcessesUtilizationInfo(device, &procUtilInfo);

if (result == NVML_SUCCESS)
{
// Aggregate utilization across all processes (take maximum)
for (unsigned int i{0}; i < procUtilInfo.processSamplesCount; ++i)
{
stats.smUtil = std::max(stats.smUtil, procUtilArray[i].smUtil);
stats.memUtil =
std::max(stats.memUtil, procUtilArray[i].memUtil);
stats.encUtil =
std::max(stats.encUtil, procUtilArray[i].encUtil);
stats.decUtil =
std::max(stats.decUtil, procUtilArray[i].decUtil);
stats.jpgUtil =
std::max(stats.jpgUtil, procUtilArray[i].jpgUtil);
stats.ofaUtil =
std::max(stats.ofaUtil, procUtilArray[i].ofaUtil);
}
return true;
}
}

// Fallback to individual API calls if nvmlDeviceGetProcessesUtilizationInfo
// not available
nvmlUtilization_t utilization{};
result = nvmlDeviceGetUtilizationRates(device, &utilization);
if (result == NVML_SUCCESS)
{
stats.smUtil = utilization.gpu;
stats.memUtil = utilization.memory;
}

unsigned int encoderUtil{}, encoderSamplingPeriod{};
result = nvmlDeviceGetEncoderUtilization(device, &encoderUtil,
&encoderSamplingPeriod);
if (result == NVML_SUCCESS)
{
stats.encUtil = encoderUtil;
}

unsigned int decoderUtil{}, decoderSamplingPeriod{};
result = nvmlDeviceGetDecoderUtilization(device, &decoderUtil,
&decoderSamplingPeriod);
if (result == NVML_SUCCESS)
{
stats.decUtil = decoderUtil;
}

return true;
}

bool getGPUStats(nvmlDevice_t const device, GPUStats& stats)
{
nvmlReturn_t result{};

// Get power
result = nvmlDeviceGetPowerUsage(device, &stats.power);
if (result != NVML_SUCCESS)
{
stats.power = 0;
}
else
{
stats.power /= 1000; // Convert from milliwatts to watts
}

// Get GPU temperature
result =
nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &stats.gpuTemp);
if (result != NVML_SUCCESS)
{
stats.gpuTemp = 0;
}

// Memory temperature is not available via standard NVML API
stats.memTemp = -1;

// Get utilization
getUtilization(device, stats);

// Get clocks
result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_MEM, &stats.memClock);
if (result != NVML_SUCCESS)
{
stats.memClock = 0;
}

result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &stats.smClock);
if (result != NVML_SUCCESS)
{
stats.smClock = 0;
}

return true;
}

bool getGPMMetrics(nvmlDevice_t const device, std::vector<int> const& metricIds,
GPUStats& stats)
{
if (metricIds.empty())
{
return true;
}

// Custom deleter for GPM samples
auto gpmSampleDeleter = [](nvmlGpmSample_t* sample)
{
if (sample && *sample)
{
nvmlGpmSampleFree(*sample);
}
delete sample;
};

// Allocate GPM samples with RAII
std::unique_ptr<nvmlGpmSample_t, decltype(gpmSampleDeleter)> sample1(
new nvmlGpmSample_t{}, gpmSampleDeleter);
nvmlReturn_t result{nvmlGpmSampleAlloc(sample1.get())};
if (result != NVML_SUCCESS)
{
// GPM not supported
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

std::unique_ptr<nvmlGpmSample_t, decltype(gpmSampleDeleter)> sample2(
new nvmlGpmSample_t{}, gpmSampleDeleter);
result = nvmlGpmSampleAlloc(sample2.get());
if (result != NVML_SUCCESS)
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

// Get first sample
result = nvmlGpmSampleGet(device, *sample1);
if (result != NVML_SUCCESS)
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

// Wait for at least 100ms
std::this_thread::sleep_for(std::chrono::milliseconds(100));

// Get second sample
result = nvmlGpmSampleGet(device, *sample2);
if (result != NVML_SUCCESS)
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
return false;
}

    // Prepare metrics get structure
    nvmlGpmMetricsGet_t metricsGet{};
    memset(&metricsGet, 0, sizeof(metricsGet));
    metricsGet.version = NVML_GPM_METRICS_GET_VERSION;
    // The metrics member is a fixed-size array of NVML_GPM_METRIC_MAX
    // entries, so clamp the number of requested metrics to its capacity.
    metricsGet.numMetrics = static_cast<unsigned int>(
        std::min<size_t>(metricIds.size(), NVML_GPM_METRIC_MAX));
    metricsGet.sample1 = *sample1;
    metricsGet.sample2 = *sample2;

    // Fill the metrics array with the requested metric IDs
    for (size_t i{0}; i < metricsGet.numMetrics; ++i)
    {
        metricsGet.metrics[i].metricId =
            static_cast<nvmlGpmMetricId_t>(metricIds[i]);
    }

// Get metrics
result = nvmlGpmMetricsGet(&metricsGet);
if (result == NVML_SUCCESS)
{
        for (size_t i{0}; i < metricsGet.numMetrics; ++i)
{
stats.gpmMetrics[metricIds[i]] = metricsGet.metrics[i].value;
}
}
else
{
for (int id : metricIds)
{
stats.gpmMetrics[id] = -1.0;
}
}

return result == NVML_SUCCESS;
}

void printHeader(std::vector<int> const& gpmMetricIds)
{
std::cout << "# gpu pwr gtemp mtemp sm mem enc dec "
"jpg ofa mclk pclk";
for (int id : gpmMetricIds)
{
if (gpmMetricNames.find(id) != gpmMetricNames.end())
{
std::cout << std::setw(11) << gpmMetricNames[id];
}
}
std::cout << std::endl;

std::cout << "# Idx W C C % % % % "
"% % MHz MHz";
for (size_t i{0}; i < gpmMetricIds.size(); ++i)
{
std::cout << " GPM:%";
}
std::cout << std::endl;
}

void printStats(unsigned int const deviceId, GPUStats const& stats,
std::vector<int> const& gpmMetricIds)
{
std::cout << std::setw(5) << deviceId;
std::cout << std::setw(7) << stats.power;
std::cout << std::setw(7) << stats.gpuTemp;

if (stats.memTemp >= 0)
{
std::cout << std::setw(7) << stats.memTemp;
}
else
{
std::cout << std::setw(7) << "-";
}

std::cout << std::setw(7) << stats.smUtil;
std::cout << std::setw(7) << stats.memUtil;
std::cout << std::setw(7) << stats.encUtil;
std::cout << std::setw(7) << stats.decUtil;
std::cout << std::setw(7) << stats.jpgUtil;
std::cout << std::setw(7) << stats.ofaUtil;
std::cout << std::setw(7) << stats.memClock;
std::cout << std::setw(7) << stats.smClock;

for (int id : gpmMetricIds)
{
if (stats.gpmMetrics.find(id) != stats.gpmMetrics.end())
{
double const value{stats.gpmMetrics.at(id)};
if (value < 0)
{
std::cout << std::setw(11) << "-";
}
else
{
std::cout << std::setw(11) << static_cast<int>(value);
}
}
else
{
std::cout << std::setw(11) << "-";
}
}

std::cout << std::endl;
}

std::vector<int> parseGpmMetrics(std::string const& str)
{
std::vector<int> metrics{};
std::stringstream ss{str};
std::string token{};

while (std::getline(ss, token, ','))
{
try
{
metrics.push_back(std::stoi(token));
}
catch (...)
{
std::cerr << "Invalid GPM metric ID: " << token << std::endl;
}
}

return metrics;
}

int main(int argc, char* argv[])
{
std::vector<int> gpmMetricIds{};
int delay{1}; // Default 1 second
int count{-1}; // Infinite by default

// Parse command line arguments
for (int i{1}; i < argc; ++i)
{
std::string const arg{argv[i]};
if (arg == "--gpm-metrics" && i + 1 < argc)
{
gpmMetricIds = parseGpmMetrics(argv[++i]);
}
else if ((arg == "-d" || arg == "--delay") && i + 1 < argc)
{
delay = std::atoi(argv[++i]);
}
else if ((arg == "-c" || arg == "--count") && i + 1 < argc)
{
count = std::atoi(argv[++i]);
}
else if (arg == "-h" || arg == "--help")
{
std::cout << "Usage: " << argv[0] << " [options]" << std::endl
<< "Options:" << std::endl
<< " --gpm-metrics <ids> Comma-separated list of GPM "
"metric IDs"
<< std::endl
<< " -d, --delay <sec> Collection delay/interval in "
"seconds [default=1]"
<< std::endl
<< " -c, --count <n> Collect specified number of "
"samples and exit"
<< std::endl
<< " -h, --help Display this help"
<< std::endl;
return 0;
}
}

// Initialize NVML
nvmlReturn_t result{nvmlInit()};
if (result != NVML_SUCCESS)
{
printError("nvmlInit", result);
return 1;
}

// Get device count
unsigned int deviceCount{};
result = nvmlDeviceGetCount(&deviceCount);
if (result != NVML_SUCCESS)
{
printError("nvmlDeviceGetCount", result);
nvmlShutdown();
return 1;
}

if (deviceCount == 0)
{
std::cerr << "No NVIDIA GPUs found" << std::endl;
nvmlShutdown();
return 1;
}

// Get device handles
std::vector<nvmlDevice_t> devices(deviceCount);
for (unsigned int i{0}; i < deviceCount; ++i)
{
result = nvmlDeviceGetHandleByIndex(i, &devices[i]);
if (result != NVML_SUCCESS)
{
printError("nvmlDeviceGetHandleByIndex", result);
nvmlShutdown();
return 1;
}
}

// Print header
printHeader(gpmMetricIds);

// Main monitoring loop
int iteration{0};
while (count < 0 || iteration < count)
{
for (unsigned int i{0}; i < deviceCount; ++i)
{
GPUStats stats{};
getGPUStats(devices[i], stats);

// Get GPM metrics if requested
if (!gpmMetricIds.empty())
{
getGPMMetrics(devices[i], gpmMetricIds, stats);
}

printStats(i, stats, gpmMetricIds);
}

iteration++;
if (count < 0 || iteration < count)
{
std::this_thread::sleep_for(std::chrono::seconds(delay));
}
}

// Shutdown NVML
nvmlShutdown();

return 0;
}

The gpu_stats program can be built and run using the following commands.

# Build the application
$ g++ -o gpu_stats gpu_stats.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lnvidia-ml
# Mimic nvidia-smi dmon --gpm-metrics 1,2,13
$ ./gpu_stats --gpm-metrics 1,2,13
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk      gract     smutil       fp16
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz      GPM:%      GPM:%      GPM:%
    0     19     34      -      3      2      0      0      0      0   7001    712         13         11          0
    0     19     34      -      3      2      0      0      0      0   7001    637         11          8          0
    0     18     34      -      3      2      0      0      0      0   7001    667          9         10          0
    0     19     34      -      3      2      0      0      0      0   7001    577          8         10          0
    0     18     34      -      3      2      0      0      0      0   7001    615          7          8          0
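
Note that libnvidia-ml.so is installed by the NVIDIA driver and, depending on the system, may not be found under /usr/local/cuda/lib64 at link time. In that case, assuming a default CUDA Toolkit installation, we can link against the NVML stub library shipped with the CUDA Toolkit instead; the real libnvidia-ml.so from the driver will be loaded at runtime.

# Link against the NVML stub library shipped with the CUDA Toolkit
$ g++ -o gpu_stats gpu_stats.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64/stubs -lnvidia-ml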
