Starting with CUDA 11.0, devices of compute capability 8.0 and above can influence the persistence of data in the L2 cache. Because the L2 cache is on-chip, it potentially provides higher-bandwidth and lower-latency accesses to global memory.
In this blog post, I created a CUDA example to demonstrate how to use the L2 persistent cache to accelerate data traffic.
CUDA L2 Persistent Cache
In this example, I have a small constant buffer whose values are used to reset a large streaming buffer. For example, if the constant buffer is of size 4 with values [5, 2, 1, 4] and the large streaming buffer to be reset is of size 100, then after resetting, the large streaming buffer will have values [5, 2, 1, 4, 5, 2, 1, 4, ...], namely the values of the constant buffer repeated.
Because the streaming buffer is much larger than the constant buffer, each element of the constant buffer is accessed far more often than each element of the streaming buffer. Accessing data from global memory is expensive. If we could cache the frequently accessed constant buffer in the L2 cache, accesses to it could be accelerated.
CUDA Data Resetting
For the data resetting CUDA kernel, I created a baseline that launches the kernel without using the persistent L2 cache, a variant that launches the kernel using 3 MB of persistent L2 cache but suffers from data thrashing when the constant buffer size exceeds 3 MB, and an optimized variant that launches the kernel using 3 MB of persistent L2 cache with the data thrashing eliminated.
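The body of the reset_data kernel is not shown in the snippet below; it could be implemented as a simple grid-stride loop, as in this minimal sketch (my own reconstruction, not necessarily the original implementation):

__global__ void reset_data(int* data_streaming, int const* lut_persistent,
                           size_t data_streaming_size,
                           size_t lut_persistent_size)
{
    // Grid-stride loop: each thread resets a strided subset of data_streaming.
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{blockDim.x * blockIdx.x + threadIdx.x};
         i < data_streaming_size; i += stride)
    {
        // Every write to data_streaming triggers a read from lut_persistent,
        // so lut_persistent is the frequently accessed buffer worth caching.
        data_streaming[i] = lut_persistent[i % lut_persistent_size];
    }
}

The kernel is then launched as follows: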
/**
 * @brief Reset the data_streaming using lut_persistent so that
 * data_streaming is lut_persistent repeated.
 *
 * @param data_streaming The data to reset.
 * @param lut_persistent The values for resetting data_streaming.
 * @param data_streaming_size The size of data_streaming.
 * @param lut_persistent_size The size of lut_persistent.
 * @param stream The CUDA stream.
 */
void launch_reset_data(int* data_streaming, int const* lut_persistent,
                       size_t data_streaming_size,
                       size_t lut_persistent_size, cudaStream_t stream)
{
    dim3 const threads_per_block{1024};
    dim3 const blocks_per_grid{32};
    reset_data<<<blocks_per_grid, threads_per_block, 0, stream>>>(
        data_streaming, lut_persistent, data_streaming_size,
        lut_persistent_size);
    CHECK_LAST_CUDA_ERROR();
}
bool verify_data(int* data, int n, size_t size)
{
    for (size_t i{0}; i < size; ++i)
    {
        if (data[i] != i % n)
        {
            return false;
        }
    }
    return true;
}
To avoid data thrashing, the product of accessPolicyWindow.hitRatio and accessPolicyWindow.num_bytes should be less than or equal to the persistent L2 cache size set via cudaLimitPersistingL2CacheSize. The accessPolicyWindow.hitRatio parameter specifies the fraction of accesses in the window that receive the accessPolicyWindow.hitProp property, which is usually cudaAccessPropertyPersisting. The accessPolicyWindow.num_bytes parameter specifies the number of bytes the access policy window covers, which is usually the size of the persistent data.
In practice, we could set accessPolicyWindow.hitRatio to the ratio of the persistent L2 cache size to the persistent data size. For example, if the persistent L2 cache size is 3 MB and the persistent data size is 4 MB, we could set accessPolicyWindow.hitRatio to 3 / 4 = 0.75.
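Concretely, the persistent L2 cache budget and the access policy window could be configured with the CUDA runtime APIs cudaDeviceSetLimit and cudaStreamSetAttribute. The following is a minimal sketch, assuming lut_persistent is an int* device pointer to the persistent buffer and stream is the CUDA stream used for the kernel launch:

// Reserve a portion of the L2 cache for persisting accesses.
size_t const persistent_cache_size{3 * 1024 * 1024}; // 3 MB budget
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, persistent_cache_size);

size_t const num_bytes{lut_persistent_size * sizeof(int)};
// Cap hitRatio so that hitRatio * num_bytes <= persistent_cache_size,
// which prevents thrashing when the persistent data exceeds the budget.
float const hit_ratio{num_bytes <= persistent_cache_size
                          ? 1.0f
                          : static_cast<float>(persistent_cache_size) /
                                num_bytes};

cudaStreamAttrValue stream_attribute{};
stream_attribute.accessPolicyWindow.base_ptr =
    reinterpret_cast<void*>(lut_persistent);
// num_bytes must not exceed cudaDevAttrMaxAccessPolicyWindowSize.
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;
stream_attribute.accessPolicyWindow.hitRatio = hit_ratio;
stream_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
stream_attribute.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                       &stream_attribute);

With hitRatio capped this way, on average only the fraction of the window that fits in the reserved cache is marked persisting, which is exactly the non-thrashing strategy described above.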
Run CUDA Data Resetting
We could build and run the example on an NVIDIA Ampere GPU. In my case, I used an NVIDIA RTX 3090 GPU.
$ nvcc l2-persistent.cu -o l2-persistent -std=c++14 --gpu-architecture=compute_80
$ ./l2-persistent
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 3 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 3.071 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 2.436 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 2.443 ms
We could see that when the persistent data size is 3 MB and the persistent L2 cache is 3 MB, the performance of the application is improved by roughly 20%.
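For reference, the L2 cache size and the maximum persistent L2 cache size reported by the program could be queried with cudaDeviceGetAttribute, as in this sketch:

// Query the total L2 cache size and the maximum portion of it
// that may be set aside for persisting accesses.
int device_id{0};
int l2_cache_size{0};
int max_persistent_l2_cache_size{0};
cudaDeviceGetAttribute(&l2_cache_size, cudaDevAttrL2CacheSize, device_id);
cudaDeviceGetAttribute(&max_persistent_l2_cache_size,
                       cudaDevAttrMaxPersistingL2CacheSize, device_id);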
Benchmarking
We could also run a mini benchmark by varying the persistent data size.
$ ./l2-persistent 1
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 1 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 1.754 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 1.685 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 1.674 ms
$ ./l2-persistent 2
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 2 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 2.158 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 1.997 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 2.002 ms
$ ./l2-persistent 3
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 3 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 3.095 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 2.510 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 2.533 ms
$ ./l2-persistent 4
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 4 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 3.906 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 3.632 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 3.706 ms
$ ./l2-persistent 5
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 5 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 4.120 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 4.554 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 3.920 ms
$ ./l2-persistent 6
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 6 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 4.194 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 4.583 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 4.255 ms
We could see that even when the persistent data size is larger than the persistent L2 cache, the non-thrashing variant usually performs no worse than the baseline.
FAQ
Persistent Cache vs. Shared Memory?
The persistent cache is different from shared memory. The persistent cache is visible to all threads on the GPU, whereas shared memory is visible only to the threads in the same block.
For small, frequently accessed data, we could also use shared memory to accelerate data access. However, shared memory is limited to roughly 48 to 96 KB per thread block, depending on the GPU, whereas the persistent cache can be a few MB per GPU.
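For illustration, here is a hypothetical shared-memory variant of the reset kernel (reset_data_shared is my own name, not from the original example), assuming the constant buffer fits within the per-block shared memory limit:

__global__ void reset_data_shared(int* data_streaming,
                                  int const* lut_persistent,
                                  size_t data_streaming_size,
                                  size_t lut_persistent_size)
{
    // Dynamically sized shared memory, allocated at launch time.
    extern __shared__ int lut_shared[];
    // Cooperatively stage the small constant buffer into shared memory.
    for (size_t i{threadIdx.x}; i < lut_persistent_size; i += blockDim.x)
    {
        lut_shared[i] = lut_persistent[i];
    }
    __syncthreads();
    // Grid-stride loop, now reading the constant values from shared memory.
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{blockDim.x * blockIdx.x + threadIdx.x};
         i < data_streaming_size; i += stride)
    {
        data_streaming[i] = lut_shared[i % lut_persistent_size];
    }
}

// Launch with the dynamic shared memory size as the third launch parameter:
// reset_data_shared<<<blocks_per_grid, threads_per_block,
//                     lut_persistent_size * sizeof(int), stream>>>(
//     data_streaming, lut_persistent, data_streaming_size,
//     lut_persistent_size);

Unlike the persistent L2 cache, this approach requires each block to stage its own copy of the data and only works when the data fits in shared memory.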