CUDA shared memory is an extremely powerful feature for CUDA kernel implementation and optimization. Because CUDA shared memory is located on chip, its memory bandwidth is much larger than that of global memory, which is located off chip. Therefore, optimizing a CUDA kernel by caching memory accesses in shared memory can improve the performance of some operations significantly, especially memory-bound operations.
However, CUDA shared memory has a size limit for each thread block, which is 48 KB by default. Sometimes, we would like to use a little bit more shared memory for our implementations. In this blog post, I would like to discuss how to allocate static shared memory and dynamic shared memory, and how to request more than 48 KB of dynamic shared memory.
Stencil Kernel
We have implemented a stencil kernel to demonstrate the allocation of CUDA shared memory. A stencil is almost mathematically equivalent to a special case of convolution whose weights are exactly 1, with valid padding.
For example, given a 1D array of $\{1, 1, 1, 1, 1, 1, 1\}$ and a stencil kernel with a radius of $2$, we will have the output 1D array $\{1, 1, 5, 5, 5, 1, 1\}$: each interior element becomes the sum of the $2 \times 2 + 1 = 5$ values in its neighborhood, while the border elements, which do not have a full neighborhood, keep their original values.
The stencil operation has many redundant memory reads from the input tensor and thus is a memory-bound operation. If the memory reads are not cached and the program reads from global memory, the performance will be poor. Therefore, we will take advantage of shared memory, which is on chip, to cache the memory reads and improve the performance.
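To make the operation concrete, here is a straightforward host-side reference implementation of the 1D stencil. This is an illustrative sketch; the function name and signature are assumptions rather than code from the actual implementation.

void stencil_1d_cpu(int const* in, int* out, int valid_array_size, int radius)
{
    // Border elements do not have a full neighborhood and keep their
    // original values.
    for (int i{0}; i < valid_array_size; ++i)
    {
        out[i] = in[i];
    }
    // Every position with a full neighborhood becomes the sum of the
    // 2 * radius + 1 values around it (all stencil weights are 1).
    for (int i{radius}; i < valid_array_size - radius; ++i)
    {
        int sum{0};
        for (int j{-radius}; j <= radius; ++j)
        {
            sum += in[i + j];
        }
        out[i] = sum;
    }
}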
Static Shared Memory
In this implementation, we allocate static shared memory whose size must be known at compile time. The implementation also supports arbitrary “valid” array sizes, radii, and CUDA thread block sizes. Also notice that when we implement the kernel, we have to pay special attention to the scenarios where the radius is larger than the CUDA thread block size and where the “valid” array size is not divisible by the CUDA thread block size, as they are not easy to implement correctly.
template <int BLOCK_SIZE = 1024, int RADIUS = 5>
__global__ void stencil_1d_kernel(int const* d_in, int* d_out,
                                  int valid_array_size)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];

    // This has to be int because we will use negative indices.
    int const gindex{static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x)};
    int const lindex{static_cast<int>(threadIdx.x) + RADIUS};
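    // What follows is a simplified sketch of the remaining kernel body,
    // not the original implementation: it assumes RADIUS <= BLOCK_SIZE,
    // whereas the full implementation also handles RADIUS > BLOCK_SIZE.
    // Load the block's elements and the left/right halos into shared memory.
    if (gindex < valid_array_size)
    {
        temp[lindex] = d_in[gindex];
        if (threadIdx.x < RADIUS)
        {
            // Halo elements, with global indices clamped to the array bounds.
            temp[lindex - RADIUS] = d_in[max(gindex - RADIUS, 0)];
            temp[lindex + BLOCK_SIZE] =
                d_in[min(gindex + BLOCK_SIZE, valid_array_size - 1)];
        }
    }
    __syncthreads();
    if (gindex >= RADIUS && gindex < valid_array_size - RADIUS)
    {
        // Positions with a full neighborhood: sum 2 * RADIUS + 1 values
        // from shared memory instead of global memory.
        int sum{0};
        for (int offset{-RADIUS}; offset <= RADIUS; ++offset)
        {
            sum += temp[lindex + offset];
        }
        d_out[gindex] = sum;
    }
    else if (gindex < valid_array_size)
    {
        // Border positions keep their input values, matching the example
        // output above.
        d_out[gindex] = d_in[gindex];
    }
}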
If we increase the radius from 1025 to a larger value such as 6000, we will get the following compilation error.
$ nvcc stencil_static_shared_memory.cu -o stencil_static_shared_memory
ptxas error   : Entry function '_Z17stencil_1d_kernelILi1024ELi6000EEvPKiPii' uses too much shared data (0xcb80 bytes, 0xc000 max)
This is because the user could only allocate CUDA static shared memory up to 48 KB. In our use case, BLOCK_SIZE + 2 * RADIUS = $1024 + 2 \times 6000 = 13024$ and the size of an int is $4$ bytes. Therefore, the shared memory required is $13024 \times 4 / 1024 = 50.875$ KB (the 0xcb80 = 52096 bytes reported by ptxas), which is larger than the maximum static shared memory we could have.
Dynamic Shared Memory
To use shared memory larger than 48 KB, we will have to use dynamic shared memory, and it is architecture specific. Specifically, the CUDA Runtime API cudaFuncSetAttribute has to be called to set the kernel's cudaFuncAttributeMaxDynamicSharedMemorySize attribute, in addition to specifying the dynamic shared memory size we want to request as the third argument in <<<...>>> for the CUDA kernel launch, and we should always check the return value of cudaFuncSetAttribute because it can fail at runtime on certain architectures. A sketch of this host-side sequence is shown after the kernel below.
The platform GPU is an NVIDIA RTX 2080 Ti. According to the CUDA C Programming Guide, devices of compute capability 7.5 (Turing) allow a single thread block to dynamically allocate shared memory up to 64 KB. So we could run the stencil program with a radius of 6000 on the NVIDIA RTX 2080 Ti.
This implementation with dynamic shared memory is almost the same as the one with static shared memory.
template <int BLOCK_SIZE = 1024, int RADIUS = 5>
__global__ void stencil_1d_kernel(int const* d_in, int* d_out,
                                  int valid_array_size)
{
    extern __shared__ int temp[];

    // This has to be int because we will use negative indices.
    int const gindex{static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x)};
    int const lindex{static_cast<int>(threadIdx.x) + RADIUS};
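On the host side, the kernel attribute has to be set before the launch, and the dynamic shared memory size in bytes is passed as the third launch configuration argument. Below is a minimal sketch of such a launcher, assuming the kernel above is in scope and <cuda_runtime.h> and <iostream> are included; the function name and structure are illustrative rather than the exact code from the implementation.

template <int BLOCK_SIZE, int RADIUS>
void launch_stencil_1d(int const* d_in, int* d_out, int valid_array_size)
{
    size_t const shared_memory_size{(BLOCK_SIZE + 2 * RADIUS) * sizeof(int)};

    // Step 1: opt in to dynamic shared memory larger than 48 KB. This can
    // fail at runtime if the device does not support the requested size,
    // so the return value should always be checked.
    cudaError_t const status{cudaFuncSetAttribute(
        stencil_1d_kernel<BLOCK_SIZE, RADIUS>,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        static_cast<int>(shared_memory_size))};
    if (status != cudaSuccess)
    {
        std::cerr << "cudaFuncSetAttribute failed: "
                  << cudaGetErrorString(status) << std::endl;
        return;
    }

    // Step 2: pass the dynamic shared memory size in bytes as the third
    // launch configuration argument.
    int const grid_size{(valid_array_size + BLOCK_SIZE - 1) / BLOCK_SIZE};
    stencil_1d_kernel<BLOCK_SIZE, RADIUS>
        <<<grid_size, BLOCK_SIZE, shared_memory_size>>>(d_in, d_out,
                                                        valid_array_size);
}

For the example above, this could be instantiated as launch_stencil_1d<1024, 6000>(d_in, d_out, valid_array_size), which requests roughly 50.875 KB of dynamic shared memory.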
The reason why large shared memory can only be allocated as dynamic shared memory is that not all GPU architectures support shared memory larger than 48 KB. If static shared memory larger than 48 KB were allowed, the CUDA program would compile but fail on some GPU architectures, which is not desired. Therefore, shared memory larger than 48 KB has to be requested via dynamic shared memory at runtime. If the GPU architecture does not support shared memory of the requested size, a CUDA runtime error will be returned.
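To check in advance how much dynamic shared memory a thread block on the current device can opt in to, we can query the cudaDevAttrMaxSharedMemoryPerBlockOptin device attribute. The following small program is an illustrative example, not part of the original implementation.

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    int device{0};
    cudaGetDevice(&device);

    // Maximum dynamic shared memory per thread block that a kernel may opt
    // in to via cudaFuncAttributeMaxDynamicSharedMemorySize.
    int max_shared_memory_per_block_optin{0};
    cudaDeviceGetAttribute(&max_shared_memory_per_block_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
    std::cout << "Maximum opt-in shared memory per block: "
              << max_shared_memory_per_block_optin << " bytes" << std::endl;

    return 0;
}

On a Turing GPU such as the NVIDIA RTX 2080 Ti, this should report the 64 KB limit mentioned above.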