CUDA Performance Hot VS Cold Measurement

Introduction

To measure the performance of a CUDA kernel, the user usually runs the kernel multiple times and takes the average of the execution times. However, the performance of a CUDA kernel can be affected by caching effects, which can cause the measured performance to differ from the performance observed in the actual application.

For example, during the performance measurement, each CUDA kernel call accesses the same input data, so the data is read from the L2 cache without touching DRAM, whereas in the actual application the input data might be different in each kernel call, forcing the kernel to read from DRAM. To remove such caching effects from the performance measurement for these use cases, the user can flush the GPU L2 cache before every kernel run, so that the kernel is always run in a “cold” state.

In this blog post, I will discuss how to measure the performance of a CUDA kernel in a “hot” state and a “cold” state.

CUDA Performance Hot VS Cold Measurement

In my previous blog post “Function Binding and Performance Measurement”, I discussed how to measure the performance of a CUDA kernel using function binding. That implementation, however, only measures the performance of a CUDA kernel in a “hot” state. In order to measure the performance of a CUDA kernel in a “cold” state, we can modify the implementation slightly so that the L2 cache is flushed every time before the kernel is run.
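As a quick recap of that idea, the kernel launch and its arguments are wrapped into a std::function<cudaError_t(cudaStream_t)> so that the measurement routine only needs a callable that takes a CUDA stream. The following is a minimal, self-contained sketch of the binding using a hypothetical fill kernel that is not part of the example below; error checking is omitted for brevity.

#include <functional>

#include <cuda_runtime.h>

// A trivial fill kernel, used here only to illustrate the function binding;
// the full example below uses a copy kernel instead.
__global__ void fill(float* data, float value, size_t n)
{
    size_t const idx{blockDim.x * blockIdx.x + threadIdx.x};
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{idx}; i < n; i += stride)
    {
        data[i] = value;
    }
}

cudaError_t launch_fill(float* data, float value, size_t n, cudaStream_t stream)
{
    fill<<<32, 1024, 0, stream>>>(data, value, n);
    return cudaGetLastError();
}

int main()
{
    size_t const n{1024U * 1024U};
    float* d_data{nullptr};
    cudaStream_t stream;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaStreamCreate(&stream);
    // Bind the kernel arguments, leaving only the stream as a placeholder, so
    // that the measurement routine only needs a callable taking a stream.
    std::function<cudaError_t(cudaStream_t)> bound_function{
        std::bind(launch_fill, d_data, 1.0f, n, std::placeholders::_1)};
    bound_function(stream);
    cudaStreamSynchronize(stream);
    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return 0;
}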

L2 Cache Flush

There is no CUDA API to flush the GPU L2 cache directly. However, we can allocate a buffer in the GPU memory that is the same size as the L2 cache and write some values to it. This causes all the previously cached values in the L2 cache to be evicted. The following example shows how to measure the performance of a CUDA kernel in a “hot” state and a “cold” state.

measure_performance.cu
#include <functional>
#include <iomanip>
#include <iostream>

#include <cuda_runtime.h>

#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, char const* func, char const* file, int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

template <class T>
float measure_performance(std::function<T(cudaStream_t)> bound_function,
                          cudaStream_t stream, size_t num_repeats = 100,
                          size_t num_warmups = 100, bool flush_l2_cache = false)
{
    int device_id{0};
    int l2_cache_size{0};
    CHECK_CUDA_ERROR(cudaGetDevice(&device_id));
    CHECK_CUDA_ERROR(cudaDeviceGetAttribute(&l2_cache_size,
                                            cudaDevAttrL2CacheSize, device_id));

    // Allocate a buffer of the same size as the L2 cache. Writing to this
    // buffer evicts all the previously cached values from the L2 cache.
    void* l2_flush_buffer{nullptr};
    CHECK_CUDA_ERROR(
        cudaMalloc(&l2_flush_buffer, static_cast<size_t>(l2_cache_size)));

    cudaEvent_t start, stop;
    float time{0.0f};
    float call_time{0.0f};

    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    // Warm up before the timed runs.
    for (size_t i{0}; i < num_warmups; ++i)
    {
        bound_function(stream);
    }

    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));

    for (size_t i{0}; i < num_repeats; ++i)
    {
        if (flush_l2_cache)
        {
            // Flush the L2 cache before each timed run so that the kernel is
            // measured in a "cold" state.
            CHECK_CUDA_ERROR(cudaMemsetAsync(l2_flush_buffer, 0,
                                             static_cast<size_t>(l2_cache_size),
                                             stream));
            CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
        }
        CHECK_CUDA_ERROR(cudaEventRecord(start, stream));
        CHECK_CUDA_ERROR(bound_function(stream));
        CHECK_CUDA_ERROR(cudaEventRecord(stop, stream));
        CHECK_CUDA_ERROR(cudaEventSynchronize(stop));
        CHECK_CUDA_ERROR(cudaEventElapsedTime(&call_time, start, stop));
        time += call_time;
    }
    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));

    CHECK_CUDA_ERROR(cudaFree(l2_flush_buffer));

    float const latency{time / num_repeats};

    return latency;
}

// Copy n floats from input to output using a grid-stride loop.
__global__ void copy(float* output, float const* input, size_t n)
{
    size_t const idx{blockDim.x * blockIdx.x + threadIdx.x};
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{idx}; i < n; i += stride)
    {
        output[i] = input[i];
    }
}

cudaError_t launch_copy(float* output, float const* input, size_t n,
                        cudaStream_t stream)
{
    dim3 const threads_per_block{1024};
    dim3 const blocks_per_grid{32};
    copy<<<blocks_per_grid, threads_per_block, 0, stream>>>(output, input, n);
    return cudaGetLastError();
}

int main()
{
    int device_id{0};
    CHECK_CUDA_ERROR(cudaGetDevice(&device_id));
    cudaDeviceProp device_prop;
    CHECK_CUDA_ERROR(cudaGetDeviceProperties(&device_prop, device_id));
    std::cout << "Device Name: " << device_prop.name << std::endl;
    float const memory_size{static_cast<float>(device_prop.totalGlobalMem) /
                            (1 << 30)};
    std::cout << "DRAM Size: " << memory_size << " GB" << std::endl;
    float const peak_bandwidth{
        static_cast<float>(2.0f * device_prop.memoryClockRate *
                           (device_prop.memoryBusWidth / 8) / 1.0e6)};
    std::cout << "DRAM Peak Bandwidth: " << peak_bandwidth << " GB/s"
              << std::endl;
    int const l2_cache_size{device_prop.l2CacheSize};
    float const l2_cache_size_mb{static_cast<float>(l2_cache_size) / (1 << 20)};
    std::cout << "L2 Cache Size: " << l2_cache_size_mb << " MB" << std::endl;

    constexpr size_t num_repeats{10000};
    constexpr size_t num_warmups{1000};

    // Size the input and output buffers so that together they are the size of
    // the L2 cache, which makes the caching effects pronounced.
    size_t const n{l2_cache_size / 2 / sizeof(float)};
    cudaStream_t stream;

    float *d_input, *d_output;

    CHECK_CUDA_ERROR(cudaMalloc(&d_input, n * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMalloc(&d_output, n * sizeof(float)));

    CHECK_CUDA_ERROR(cudaStreamCreate(&stream));

    // Bind the kernel arguments, leaving only the stream as a placeholder.
    std::function<cudaError_t(cudaStream_t)> function{
        std::bind(launch_copy, d_output, d_input, n, std::placeholders::_1)};

    float const hot_latency{
        measure_performance(function, stream, num_repeats, num_warmups, false)};
    std::cout << std::fixed << std::setprecision(4)
              << "Hot Latency: " << hot_latency << " ms" << std::endl;

    float const cold_latency{
        measure_performance(function, stream, num_repeats, num_warmups, true)};
    std::cout << std::fixed << std::setprecision(4)
              << "Cold Latency: " << cold_latency << " ms" << std::endl;

    CHECK_CUDA_ERROR(cudaFree(d_input));
    CHECK_CUDA_ERROR(cudaFree(d_output));
    CHECK_CUDA_ERROR(cudaStreamDestroy(stream));
}

To build and run the example, please run the following commands.

$ nvcc measure_performance.cu -o measure_performance -std=c++14
$ ./measure_performance
Device Name: NVIDIA GeForce RTX 3090
DRAM Size: 23.4365 GB
DRAM Peak Bandwidth: 936.096 GB/s
L2 Cache Size: 6 MB
Hot Latency: 0.0095 ms
Cold Latency: 0.0141 ms

We can see that there is a performance difference between the “hot” state and the “cold” state, and that the difference is due to the caching effects. However, if the kernel is not memory-bound, or if the cache is too small relative to the working set to be beneficial, the performance difference between the “hot” state and the “cold” state might be negligible.
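To put these numbers in perspective, we can estimate the effective memory bandwidth achieved by the copy kernel from the measured latencies. In this example, the kernel reads about 3 MB of input and writes about 3 MB of output. The following back-of-envelope computation is only a sketch based on the measurements above; the exact values will vary from run to run and from GPU to GPU, and for such a short kernel the launch overhead and cache write-back behavior also affect the measured latency.

#include <iostream>

int main()
{
    // Latencies measured above; these are example values that will differ
    // between runs and GPUs.
    double const hot_latency_s{0.0095e-3};
    double const cold_latency_s{0.0141e-3};
    // The copy kernel moves roughly 3 MB of input and 3 MB of output.
    double const num_bytes{2.0 * 3.0 * 1024.0 * 1024.0};
    // Roughly 660 GB/s in the "hot" state, served largely from the L2 cache.
    std::cout << "Hot effective bandwidth: "
              << num_bytes / hot_latency_s / 1.0e9 << " GB/s" << std::endl;
    // Roughly 450 GB/s in the "cold" state, limited by DRAM traffic.
    std::cout << "Cold effective bandwidth: "
              << num_bytes / cold_latency_s / 1.0e9 << " GB/s" << std::endl;
    return 0;
}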

Nsight Compute

It is also quite common to measure the performance of a CUDA kernel using NVIDIA Nsight Compute.

In order to make hardware performance counter values more deterministic, NVIDIA Nsight Compute by default flushes all GPU caches before each replay pass (--cache-control all). As a result, in each pass the kernel accesses clean caches, and the behavior is as if the kernel were executed in complete isolation.

This behavior might be undesirable for performance analysis, especially if the measurement focuses on a kernel within a larger application execution, and if the collected data targets cache-centric metrics. In this case, you can use --cache-control none to disable flushing of any hardware cache by the tool.

$ ncu --help
--cache-control arg (=all) Control the behavior of the GPU caches during profiling. Allowed values:
all
none
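
For example, assuming the measure_performance executable built above, a profiling run that leaves the caches untouched between replay passes could look like the following; the exact set of sections and metrics collected depends on the Nsight Compute version and the additional options used.

$ ncu --cache-control none ./measure_performance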
