Starting with CUDA 11.0, devices of compute capability 8.0 and above can influence the persistence of data in the L2 cache. Because the L2 cache is on-chip, it potentially provides higher-bandwidth and lower-latency accesses to global memory.
In this blog post, I created a CUDA example to demonstrate how to use the L2 persistent cache to accelerate data traffic.
CUDA L2 Persistent Cache
In this example, I have a small constant buffer whose values are used to reset a large streaming buffer. For example, if the constant buffer is of size 4 with values [5, 2, 1, 4] and the large streaming buffer to be reset is of size 100, then after resetting, the large streaming buffer will have values [5, 2, 1, 4, 5, 2, 1, 4, ...], namely the values of the constant buffer repeated.
Because the streaming buffer is much larger than the constant buffer, each element of the constant buffer is accessed far more often than each element of the streaming buffer. Accessing data from global memory is expensive. If we could cache the frequently accessed constant buffer in the L2 cache, accesses to it could be accelerated.
CUDA Data Resetting
For the data resetting CUDA kernel, I created a baseline that launches the kernel without using the persistent L2 cache, a variant that launches the kernel using 3 MB of persistent L2 cache but suffers from data thrashing when the constant buffer size exceeds 3 MB, and an optimized variant that launches the kernel using 3 MB of persistent L2 cache with the data thrashing eliminated.
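The body of the reset_data kernel is not shown in the snippet below; it could be implemented as a simple grid-stride loop, as in this minimal sketch (my own reconstruction, not necessarily the original implementation):

__global__ void reset_data(int* data_streaming, int const* lut_persistent,
                           size_t data_streaming_size,
                           size_t lut_persistent_size)
{
    // Grid-stride loop: each thread resets a strided subset of data_streaming.
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{blockDim.x * blockIdx.x + threadIdx.x};
         i < data_streaming_size; i += stride)
    {
        // Every write to data_streaming triggers a read from lut_persistent,
        // so lut_persistent is the frequently accessed buffer worth caching.
        data_streaming[i] = lut_persistent[i % lut_persistent_size];
    }
}

The kernel is then launched as follows: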
/**
 * @brief Reset the data_streaming using lut_persistent so that
 * data_streaming is lut_persistent repeated.
 *
 * @param data_streaming The data to reset.
 * @param lut_persistent The values for resetting data_streaming.
 * @param data_streaming_size The size of data_streaming.
 * @param lut_persistent_size The size of lut_persistent.
 * @param stream The CUDA stream.
 */
void launch_reset_data(int* data_streaming, int const* lut_persistent,
                       size_t data_streaming_size,
                       size_t lut_persistent_size, cudaStream_t stream)
{
    dim3 const threads_per_block{1024};
    dim3 const blocks_per_grid{32};
    reset_data<<<blocks_per_grid, threads_per_block, 0, stream>>>(
        data_streaming, lut_persistent, data_streaming_size,
        lut_persistent_size);
    CHECK_LAST_CUDA_ERROR();
}
bool verify_data(int* data, int n, size_t size)
{
    for (size_t i{0}; i < size; ++i)
    {
        if (data[i] != i % n)
        {
            return false;
        }
    }
    return true;
}
To avoid data thrashing, the product of accessPolicyWindow.hitRatio and accessPolicyWindow.num_bytes should be less than or equal to the persistent L2 cache size set via cudaLimitPersistingL2CacheSize. The accessPolicyWindow.hitRatio parameter specifies the fraction of accesses in the window that receive the accessPolicyWindow.hitProp property, which is usually cudaAccessPropertyPersisting. The accessPolicyWindow.num_bytes parameter specifies the number of bytes the access policy window covers, which is usually the size of the persistent data.
In practice, we could set accessPolicyWindow.hitRatio to the ratio of the persistent L2 cache size to the persistent data size. For example, if the persistent L2 cache size is 3 MB and the persistent data size is 4 MB, we could set accessPolicyWindow.hitRatio to 3 / 4 = 0.75.
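Concretely, the persistent L2 cache budget and the access policy window could be configured with the CUDA runtime APIs cudaDeviceSetLimit and cudaStreamSetAttribute. The following is a minimal sketch, assuming lut_persistent is an int* device pointer to the persistent buffer and stream is the CUDA stream used for the kernel launch:

// Reserve a portion of the L2 cache for persisting accesses.
size_t const persistent_cache_size{3 * 1024 * 1024}; // 3 MB budget
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, persistent_cache_size);

size_t const num_bytes{lut_persistent_size * sizeof(int)};
// Cap hitRatio so that hitRatio * num_bytes <= persistent_cache_size,
// which prevents thrashing when the persistent data exceeds the budget.
float const hit_ratio{num_bytes <= persistent_cache_size
                          ? 1.0f
                          : static_cast<float>(persistent_cache_size) /
                                num_bytes};

cudaStreamAttrValue stream_attribute{};
stream_attribute.accessPolicyWindow.base_ptr =
    reinterpret_cast<void*>(lut_persistent);
// num_bytes must not exceed cudaDevAttrMaxAccessPolicyWindowSize.
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;
stream_attribute.accessPolicyWindow.hitRatio = hit_ratio;
stream_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
stream_attribute.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                       &stream_attribute);

With hitRatio capped this way, on average only the fraction of the window that fits in the reserved cache is marked persisting, which is exactly the non-thrashing strategy described above.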
Run CUDA Data Resetting
We could build and run the example on an NVIDIA Ampere GPU. In my case, I used an NVIDIA RTX 3090 GPU.
$ nvcc l2-persistent.cu -o l2-persistent -std=c++14 --gpu-architecture=compute_80
$ ./l2-persistent
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 3 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 3.071 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 2.436 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 2.443 ms
We could see that when the persistent data size is 3 MB and the persistent L2 cache is 3 MB, the performance of the application is improved by roughly 20%.
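For reference, the L2 cache size and the maximum persistent L2 cache size reported by the program could be queried with cudaDeviceGetAttribute, as in this sketch:

// Query the total L2 cache size and the maximum portion of it
// that may be set aside for persisting accesses.
int device_id{0};
int l2_cache_size{0};
int max_persistent_l2_cache_size{0};
cudaDeviceGetAttribute(&l2_cache_size, cudaDevAttrL2CacheSize, device_id);
cudaDeviceGetAttribute(&max_persistent_l2_cache_size,
                       cudaDevAttrMaxPersistingL2CacheSize, device_id);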
Benchmarking
We could also run a mini benchmark by varying the persistent data size.
$ ./l2-persistent 1
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 1 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 1.754 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 1.685 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 1.674 ms
$ ./l2-persistent 2
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 2 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 2.158 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 1.997 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 2.002 ms
$ ./l2-persistent 3
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 3 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 3.095 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 2.510 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 2.533 ms
$ ./l2-persistent 4
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 4 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 3.906 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 3.632 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 3.706 ms
$ ./l2-persistent 5
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 5 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 4.120 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 4.554 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 3.920 ms
$ ./l2-persistent 6
GPU: NVIDIA GeForce RTX 3090
L2 Cache Size: 6 MB
Max Persistent L2 Cache Size: 4 MB
Persistent Data Size: 6 MB
Streaming Data Size: 1024 MB
Latency Without Using Persistent L2 Cache: 4.194 ms
Latency With Using 3 MB Persistent L2 Cache (Potentially Thrashing): 4.583 ms
Latency With Using 3 MB Persistent L2 Cache (Non-Thrashing): 4.255 ms
We could see that even when the persistent data size is larger than the persistent L2 cache, the non-thrashing variant usually performs no worse than the baseline.
FAQ
Persistent Cache vs. Shared Memory?
The persistent cache is different from shared memory. The persistent cache is visible to all threads on the GPU, whereas shared memory is visible only to the threads in the same block.
For small, frequently accessed data, we could also use shared memory to accelerate data access. However, shared memory is limited to roughly 48 to 96 KB per thread block, depending on the GPU, whereas the persistent cache can be a few MB per GPU.
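For illustration, here is a hypothetical shared-memory variant of the reset kernel (reset_data_shared is my own name, not from the original example), assuming the constant buffer fits within the per-block shared memory limit:

__global__ void reset_data_shared(int* data_streaming,
                                  int const* lut_persistent,
                                  size_t data_streaming_size,
                                  size_t lut_persistent_size)
{
    // Dynamically sized shared memory, allocated at launch time.
    extern __shared__ int lut_shared[];
    // Cooperatively stage the small constant buffer into shared memory.
    for (size_t i{threadIdx.x}; i < lut_persistent_size; i += blockDim.x)
    {
        lut_shared[i] = lut_persistent[i];
    }
    __syncthreads();
    // Grid-stride loop, now reading the constant values from shared memory.
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{blockDim.x * blockIdx.x + threadIdx.x};
         i < data_streaming_size; i += stride)
    {
        data_streaming[i] = lut_shared[i % lut_persistent_size];
    }
}

// Launch with the dynamic shared memory size as the third launch parameter:
// reset_data_shared<<<blocks_per_grid, threads_per_block,
//                     lut_persistent_size * sizeof(int), stream>>>(
//     data_streaming, lut_persistent, data_streaming_size,
//     lut_persistent_size);

Unlike the persistent L2 cache, this approach requires each block to stage its own copy of the data and only works when the data fits in shared memory.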