CUDA shared memory is an extremely powerful feature for CUDA kernel implementation and optimization. Because CUDA shared memory is located on chip, its memory bandwidth is much larger than that of global memory, which is located off chip. Therefore, optimizing a CUDA kernel by caching memory accesses in shared memory can improve the performance of some operations significantly, especially memory-bound operations.
However, CUDA shared memory has a size limit for each thread block, which is 48 KB by default. Sometimes, we would like to use a little bit more shared memory for our implementations. In this blog post, I would like to discuss how to allocate static shared memory and dynamic shared memory, and how to request more than 48 KB of dynamic shared memory.
Stencil Kernel
We have implemented a stencil kernel to demonstrate the allocation of CUDA shared memory. A stencil is almost mathematically equivalent to a special case of convolution whose weights are exactly 1, with valid padding.
For example, given a 1D array of $\{1, 1, 1, 1, 1, 1, 1\}$ and a stencil kernel with a radius of $2$, we will have the output 1D array $\{1, 1, 5, 5, 5, 1, 1\}$: each interior element becomes the sum of the $2 \times 2 + 1 = 5$ values in its neighborhood, while the border elements, which do not have a full neighborhood, keep their original values.
The stencil operation has many redundant memory reads from the input tensor and thus is a memory-bound operation. If the memory reads are not cached and the program reads from global memory, the performance will be poor. Therefore, we will take advantage of shared memory, which is on chip, to cache the memory reads and improve the performance.
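To make the operation concrete, here is a straightforward host-side reference implementation of the 1D stencil. This is an illustrative sketch; the function name and signature are assumptions rather than code from the actual implementation.

void stencil_1d_cpu(int const* in, int* out, int valid_array_size, int radius)
{
    // Border elements do not have a full neighborhood and keep their
    // original values.
    for (int i{0}; i < valid_array_size; ++i)
    {
        out[i] = in[i];
    }
    // Every position with a full neighborhood becomes the sum of the
    // 2 * radius + 1 values around it (all stencil weights are 1).
    for (int i{radius}; i < valid_array_size - radius; ++i)
    {
        int sum{0};
        for (int j{-radius}; j <= radius; ++j)
        {
            sum += in[i + j];
        }
        out[i] = sum;
    }
}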
Static Shared Memory
In this implementation, we allocate static shared memory whose size must be known at compile time. The implementation also supports arbitrary “valid” array sizes, radii, and CUDA thread block sizes. Also notice that when we implement the kernel, we have to pay special attention to the scenarios where the radius is larger than the CUDA thread block size and where the “valid” array size is not divisible by the CUDA thread block size, as they are not easy to implement correctly.
template <int BLOCK_SIZE = 1024, int RADIUS = 5>
__global__ void stencil_1d_kernel(int const* d_in, int* d_out,
                                  int valid_array_size)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];

    // This has to be int because we will use negative indices.
    int const gindex{static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x)};
    int const lindex{static_cast<int>(threadIdx.x) + RADIUS};
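    // What follows is a simplified sketch of the remaining kernel body,
    // not the original implementation: it assumes RADIUS <= BLOCK_SIZE,
    // whereas the full implementation also handles RADIUS > BLOCK_SIZE.
    // Load the block's elements and the left/right halos into shared memory.
    if (gindex < valid_array_size)
    {
        temp[lindex] = d_in[gindex];
        if (threadIdx.x < RADIUS)
        {
            // Halo elements, with global indices clamped to the array bounds.
            temp[lindex - RADIUS] = d_in[max(gindex - RADIUS, 0)];
            temp[lindex + BLOCK_SIZE] =
                d_in[min(gindex + BLOCK_SIZE, valid_array_size - 1)];
        }
    }
    __syncthreads();
    if (gindex >= RADIUS && gindex < valid_array_size - RADIUS)
    {
        // Positions with a full neighborhood: sum 2 * RADIUS + 1 values
        // from shared memory instead of global memory.
        int sum{0};
        for (int offset{-RADIUS}; offset <= RADIUS; ++offset)
        {
            sum += temp[lindex + offset];
        }
        d_out[gindex] = sum;
    }
    else if (gindex < valid_array_size)
    {
        // Border positions keep their input values, matching the example
        // output above.
        d_out[gindex] = d_in[gindex];
    }
}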
If we increase the radius from 1025 to a larger value such as 6000, we will get the following compilation error.
$ nvcc stencil_static_shared_memory.cu -o stencil_static_shared_memory
ptxas error   : Entry function '_Z17stencil_1d_kernelILi1024ELi6000EEvPKiPii' uses too much shared data (0xcb80 bytes, 0xc000 max)
This is because the user could only allocate CUDA static shared memory up to 48 KB. In our use case, BLOCK_SIZE + 2 * RADIUS = $1024 + 2 \times 6000 = 13024$ and the size of an int is $4$ bytes. Therefore, the shared memory required is $13024 \times 4 / 1024 = 50.875$ KB (the 0xcb80 = 52096 bytes reported by ptxas), which is larger than the maximum static shared memory we could have.
Dynamic Shared Memory
To use shared memory larger than 48 KB, we will have to use dynamic shared memory, and it is architecture specific. Specifically, the CUDA Runtime API cudaFuncSetAttribute has to be called to set the kernel's cudaFuncAttributeMaxDynamicSharedMemorySize attribute, in addition to specifying the dynamic shared memory size we want to request as the third argument in <<<...>>> for the CUDA kernel launch, and we should always check the return value of cudaFuncSetAttribute because it can fail at runtime on certain architectures. A sketch of this host-side sequence is shown after the kernel below.
The platform GPU is an NVIDIA RTX 2080 Ti. According to the CUDA C Programming Guide, devices of compute capability 7.5 (Turing) allow a single thread block to dynamically allocate shared memory up to 64 KB. So we could run the stencil program with a radius of 6000 on the NVIDIA RTX 2080 Ti.
This implementation with dynamic shared memory is almost the same as the one with static shared memory.
template <int BLOCK_SIZE = 1024, int RADIUS = 5>
__global__ void stencil_1d_kernel(int const* d_in, int* d_out,
                                  int valid_array_size)
{
    extern __shared__ int temp[];

    // This has to be int because we will use negative indices.
    int const gindex{static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x)};
    int const lindex{static_cast<int>(threadIdx.x) + RADIUS};
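On the host side, the kernel attribute has to be set before the launch, and the dynamic shared memory size in bytes is passed as the third launch configuration argument. Below is a minimal sketch of such a launcher, assuming the kernel above is in scope and <cuda_runtime.h> and <iostream> are included; the function name and structure are illustrative rather than the exact code from the implementation.

template <int BLOCK_SIZE, int RADIUS>
void launch_stencil_1d(int const* d_in, int* d_out, int valid_array_size)
{
    size_t const shared_memory_size{(BLOCK_SIZE + 2 * RADIUS) * sizeof(int)};

    // Step 1: opt in to dynamic shared memory larger than 48 KB. This can
    // fail at runtime if the device does not support the requested size,
    // so the return value should always be checked.
    cudaError_t const status{cudaFuncSetAttribute(
        stencil_1d_kernel<BLOCK_SIZE, RADIUS>,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        static_cast<int>(shared_memory_size))};
    if (status != cudaSuccess)
    {
        std::cerr << "cudaFuncSetAttribute failed: "
                  << cudaGetErrorString(status) << std::endl;
        return;
    }

    // Step 2: pass the dynamic shared memory size in bytes as the third
    // launch configuration argument.
    int const grid_size{(valid_array_size + BLOCK_SIZE - 1) / BLOCK_SIZE};
    stencil_1d_kernel<BLOCK_SIZE, RADIUS>
        <<<grid_size, BLOCK_SIZE, shared_memory_size>>>(d_in, d_out,
                                                        valid_array_size);
}

For the example above, this could be instantiated as launch_stencil_1d<1024, 6000>(d_in, d_out, valid_array_size), which requests roughly 50.875 KB of dynamic shared memory.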
The reason why large shared memory can only be allocated as dynamic shared memory is that not all GPU architectures support shared memory larger than 48 KB. If static shared memory larger than 48 KB were allowed, the CUDA program would compile but fail on some GPU architectures, which is not desired. Therefore, shared memory larger than 48 KB has to be requested via dynamic shared memory at runtime. If the GPU architecture does not support shared memory of the requested size, a CUDA runtime error will be returned.
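To check in advance how much dynamic shared memory a thread block on the current device can opt in to, we can query the cudaDevAttrMaxSharedMemoryPerBlockOptin device attribute. The following small program is an illustrative example, not part of the original implementation.

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    int device{0};
    cudaGetDevice(&device);

    // Maximum dynamic shared memory per thread block that a kernel may opt
    // in to via cudaFuncAttributeMaxDynamicSharedMemorySize.
    int max_shared_memory_per_block_optin{0};
    cudaDeviceGetAttribute(&max_shared_memory_per_block_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
    std::cout << "Maximum opt-in shared memory per block: "
              << max_shared_memory_per_block_optin << " bytes" << std::endl;

    return 0;
}

On a Turing GPU such as the NVIDIA RTX 2080 Ti, this should report the 64 KB limit mentioned above.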