To measure the performance of a CUDA kernel, the user usually runs the kernel multiple times and takes the average of the execution times. However, the performance of a CUDA kernel can be affected by caching effects, causing the measured performance to differ from the performance observed in the actual application.
For example, during the performance measurement, each CUDA kernel call might access the same input data, so the data is read from the L2 cache without accessing the DRAM, whereas in the actual application the input data might be different in each kernel call, forcing the kernel to read from the DRAM. To remove such caching effects from the performance measurement for these use cases, the user can flush the GPU L2 cache every time before running the kernel. Consequently, the kernel will always run in a “cold” state.
In this blog post, I will discuss how to measure the performance of a CUDA kernel in a “hot” state and a “cold” state.
CUDA Performance Hot VS Cold Measurement
In my previous blog post “Function Binding and Performance Measurement”, I discussed how to measure the performance of a CUDA kernel using function binding. That performance measurement implementation can only measure the performance of a CUDA kernel in a “hot” state. To measure the performance of a CUDA kernel in a “cold” state, we can modify the implementation slightly so that the L2 cache is flushed every time before the kernel runs.
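As a reference, here is a minimal sketch of such a hot-state measurement utility, assuming the kernel launch is wrapped in a std::function and timed with CUDA events. The helper names and parameters below are illustrative and are not necessarily identical to the code in the previous post.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <functional>
#include <iostream>

#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, char const* func, char const* file, int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line << "\n"
                  << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

// Measure the mean latency of a bound kernel launch in a "hot" state:
// the kernel is launched repeatedly without touching the caches in between,
// so later launches benefit from data left in the L2 cache by earlier ones.
float measure_performance(std::function<void(cudaStream_t)> const& bound_function,
                          cudaStream_t stream, size_t num_repeats = 100,
                          size_t num_warmups = 10)
{
    cudaEvent_t start, stop;
    float time{0.0f};

    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    // Warm up the kernel and the caches.
    for (size_t i{0}; i < num_warmups; ++i)
    {
        bound_function(stream);
    }
    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));

    // Time all the repeated launches back to back and average the latency.
    CHECK_CUDA_ERROR(cudaEventRecord(start, stream));
    for (size_t i{0}; i < num_repeats; ++i)
    {
        bound_function(stream);
    }
    CHECK_CUDA_ERROR(cudaEventRecord(stop, stream));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&time, start, stop));

    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));

    return time / num_repeats;
}
```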
L2 Cache Flush
CUDA has no API for flushing the GPU L2 cache directly. However, we can allocate a buffer in the GPU memory that is the same size as the L2 cache and write some values to it, which evicts all the previously cached values from the L2 cache. The following example shows how to measure the performance of a CUDA kernel in a “hot” state and a “cold” state.
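The full measure_performance.cu source is not reproduced here, but a minimal sketch of the cold-state variant could look like the following. It reuses the CHECK_CUDA_ERROR macro and std::function binding from the sketch above, queries the L2 cache size via cudaGetDeviceProperties, and overwrites a buffer of that size before every timed launch. The function name measure_performance_cold and its parameters are illustrative.

```cuda
// A sketch of a "cold" measurement: before every timed launch, overwrite a
// device buffer as large as the L2 cache so that previously cached data is evicted.
float measure_performance_cold(std::function<void(cudaStream_t)> const& bound_function,
                               cudaStream_t stream, size_t num_repeats = 100)
{
    cudaDeviceProp device_prop;
    CHECK_CUDA_ERROR(cudaGetDeviceProperties(&device_prop, 0));
    size_t const l2_cache_size{static_cast<size_t>(device_prop.l2CacheSize)};

    // Buffer used only to thrash the L2 cache.
    void* l2_flush_buffer{nullptr};
    CHECK_CUDA_ERROR(cudaMalloc(&l2_flush_buffer, l2_cache_size));

    cudaEvent_t start, stop;
    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    float total_time{0.0f};
    for (size_t i{0}; i < num_repeats; ++i)
    {
        // Evict the L2 cache contents so the kernel starts from a cold cache.
        CHECK_CUDA_ERROR(cudaMemsetAsync(l2_flush_buffer, 0, l2_cache_size, stream));

        // Time each launch individually, since every launch starts cold.
        CHECK_CUDA_ERROR(cudaEventRecord(start, stream));
        bound_function(stream);
        CHECK_CUDA_ERROR(cudaEventRecord(stop, stream));
        CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

        float time{0.0f};
        CHECK_CUDA_ERROR(cudaEventElapsedTime(&time, start, stop));
        total_time += time;
    }

    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));
    CHECK_CUDA_ERROR(cudaFree(l2_flush_buffer));

    return total_time / num_repeats;
}
```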
To build and run the example, please run the following commands.
```
$ nvcc measure_performance.cu -o measure_performance -std=c++14
$ ./measure_performance
Device Name: NVIDIA GeForce RTX 3090
DRAM Size: 23.4365 GB
DRAM Peak Bandwitdh: 936.096 GB/s
L2 Cache Size: 6 MB
Hot Latency: 0.0095 ms
Cold Latency: 0.0141 ms
```
We can see that there is a performance difference between the “hot” state and the “cold” state, and the difference is due to the caching effects. However, if the kernel is not memory-bound, or the cache size is too small to be beneficial, the performance difference between the “hot” state and the “cold” state might be negligible.
Nsight Compute
It is also quite common to measure the performance of a CUDA kernel using NVIDIA Nsight Compute.
In order to make hardware performance counter values more deterministic, NVIDIA Nsight Compute by default flushes all GPU caches before each replay pass using --cache-control all. As a result, in each pass the kernel accesses a clean cache, and it behaves as if it were executed in complete isolation.
This behavior might be undesirable for performance analysis, especially if the measurement focuses on a kernel within a larger application execution, and if the collected data targets cache-centric metrics. In this case, you can use --cache-control none to disable flushing of any hardware cache by the tool.
```
$ ncu --help
  --cache-control arg (=all)    Control the behavior of the GPU caches during
                                profiling. Allowed values:
                                  all
                                  none
```
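For example, to profile the example binary built above while leaving the GPU caches untouched by the tool, the invocation could look like the following (the binary name is just the one from the example).

```
$ ncu --cache-control none ./measure_performance
```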