In CUDA programming, local memory is private storage for an executing thread, and is not visible outside of that thread. The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing.
An automatic variable declared without the __device__, __shared__, or __constant__ memory space specifiers is placed by the compiler either in registers or in local memory. It is likely to be placed in local memory if it is one of the following:
Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known as register spilling).
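As a hypothetical sketch (the kernel and variable names are mine, and the actual placement decision always belongs to the compiler and the target architecture), the three cases could look like this:

```cuda
// Hypothetical sketch of the three situations in which an automatic variable
// is likely to end up in local memory (illustrative only; the compiler makes
// the final placement decision).
__global__ void local_memory_candidates(float const* input, float* output,
                                        int n)
{
    int const i{static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};
    if (i >= n)
    {
        return;
    }
    // 1. An array indexed with a quantity the compiler cannot resolve at
    //    compile time: likely placed in local memory.
    float dynamically_indexed[4]{};
    dynamically_indexed[i % 4] = input[i];
    // 2. A large array that would consume too much register space: likely
    //    placed in local memory.
    float large_array[1024]{};
    large_array[i % 1024] = input[i];
    // 3. Under high register pressure, any variable may be spilled to local
    //    memory (register spilling); nothing in the source code marks it.
    output[i] = dynamically_indexed[i % 4] + large_array[i % 1024];
}
```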
The second and third points are straightforward to understand. The first point, however, is a little tricky: it implies that even very small arrays can be placed in local memory rather than in registers, and most of the time we would like those small arrays to be placed in registers for better performance.
In this blog post, I would like to show an example of how the compiler decides to place an array in local memory rather than in registers and discuss the general rules that a user can follow to avoid small arrays being placed in local memory.
CUDA Local Memory
In the following example, I created two CUDA kernels that compute the running mean of an input array given a fixed window size. Both kernels declare a local array window whose size is known at compile time. The implementations of the two kernels are almost exactly the same, except that the first kernel uses straightforward indexing to access the window array, while the second kernel uses an index that is less trivial.
```cuda
template <int WindowSize>
__global__ void running_mean_register_array(float const* input, float* output,
                                            int n)
{
    float window[WindowSize];
    int const thread_idx{
        static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};
    int const stride{static_cast<int>(blockDim.x * gridDim.x)};
    for (int i{thread_idx}; i < n; i += stride)
    {
        // Read data into the window.
        for (int j{0}; j < WindowSize; ++j)
        {
            int const idx{i - WindowSize / 2 + j};
            window[j] = (idx < 0 || idx >= n) ? 0 : input[idx];
        }
        // Compute the mean from the window.
        float sum{0};
        for (int j{0}; j < WindowSize; ++j)
        {
            sum += window[j];
        }
        float const mean{sum / WindowSize};
        // Write the mean to the output.
        output[i] = mean;
    }
}
```
```cuda
template <int WindowSize>
__global__ void running_mean_local_memory_array(float const* input,
                                                float* output, int n)
{
    float window[WindowSize];
    int const thread_idx{
        static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};
    int const stride{static_cast<int>(blockDim.x * gridDim.x)};
    for (int i{thread_idx}; i < n; i += stride)
    {
        // Read data into the window.
        for (int j{0}; j < WindowSize; ++j)
        {
            int const idx{i - WindowSize / 2 + j};
            window[j] = (idx < 0 || idx >= n) ? 0 : input[idx];
        }
        // Compute the mean from the window.
        float sum{0};
        for (int j{0}; j < WindowSize; ++j)
        {
            // This index into the window array cannot be resolved at compile
            // time by the compiler, even though such indexing does not affect
            // the correctness of the kernel. The consequence is that the
            // compiler will place the window array in local memory rather
            // than in the register file.
            int const idx{(j + n) % WindowSize};
            sum += window[idx];
        }
        float const mean{sum / WindowSize};
        // Write the mean to the output.
        output[i] = mean;
    }
}
```
```cuda
// Verify the correctness of the kernel given a window size and a launch
// function.
template <int WindowSize>
void verify_running_mean(int n,
                         cudaError_t (*launch_func)(float const*, float*, int,
                                                    cudaStream_t))
{
    std::vector<float> h_input_vec(n, 0.f);
    std::vector<float> h_output_vec(n, 1.f);
    std::vector<float> h_output_vec_ref(n, 2.f);
    // Fill the input vector with values.
    for (int i{0}; i < n; ++i)
    {
        h_input_vec[i] = static_cast<float>(i);
    }
    // Compute the reference output vector.
    for (int i{0}; i < n; ++i)
    {
        float sum{0};
        for (int j{0}; j < WindowSize; ++j)
        {
            int const idx{i - WindowSize / 2 + j};
            float const val{(idx < 0 || idx >= n) ? 0 : h_input_vec[idx]};
            sum += val;
        }
        h_output_vec_ref[i] = sum / WindowSize;
    }
    // Allocate device memory.
    float* d_input;
    float* d_output;
    CHECK_CUDA_ERROR(cudaMalloc(&d_input, n * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMalloc(&d_output, n * sizeof(float)));
    // Copy data to the device.
    CHECK_CUDA_ERROR(cudaMemcpy(d_input, h_input_vec.data(), n * sizeof(float),
                                cudaMemcpyHostToDevice));
    CHECK_CUDA_ERROR(cudaMemcpy(d_output, h_output_vec.data(),
                                n * sizeof(float), cudaMemcpyHostToDevice));
    // Launch the kernel.
    cudaStream_t stream;
    CHECK_CUDA_ERROR(cudaStreamCreate(&stream));
    CHECK_CUDA_ERROR(launch_func(d_input, d_output, n, stream));
    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
    // Copy the result back to the host.
    CHECK_CUDA_ERROR(cudaMemcpy(h_output_vec.data(), d_output,
                                n * sizeof(float), cudaMemcpyDeviceToHost));
    // Check the result.
    for (int i{0}; i < n; ++i)
    {
        if (h_output_vec.at(i) != h_output_vec_ref.at(i))
        {
            std::cerr << "Mismatch at index " << i << ": "
                      << h_output_vec.at(i)
                      << " != " << h_output_vec_ref.at(i) << std::endl;
            std::exit(EXIT_FAILURE);
        }
    }
    // Free device memory.
    CHECK_CUDA_ERROR(cudaFree(d_input));
    CHECK_CUDA_ERROR(cudaFree(d_output));
    CHECK_CUDA_ERROR(cudaStreamDestroy(stream));
}
```
```cuda
int main()
{
    // Try different window sizes from small to large.
    constexpr int WindowSize{32};
    int const n{8192};
    verify_running_mean<WindowSize>(
        n, launch_running_mean_register_array<WindowSize>);
    verify_running_mean<WindowSize>(
        n, launch_running_mean_local_memory_array<WindowSize>);
    return 0;
}
```
To build and run the example, please run the following commands. There should be no error messages when running the example.
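As a sketch of what those commands could look like (the source file name, output names, and target architecture are my assumptions, not from the original; adjust them to your setup):

```shell
# Compile the example (file name and -arch value are assumptions).
nvcc running_mean.cu -o running_mean -arch=sm_80 --std=c++14
# Run it; it should exit silently if all verifications pass.
./running_mean
# Optionally emit PTX so the local memory usage can be inspected.
nvcc running_mean.cu -ptx -o running_mean.ptx -arch=sm_80 --std=c++14
grep -F ".local" running_mean.ptx
```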
In the PTX of the two kernels, we can see that the first kernel has nothing declared with the .local directive, while the second kernel has a local array __local_depot1 declared with the .local directive. This confirms that the first kernel has the array window placed in registers, while the second kernel has the array window placed in local memory. Even though the local arrays declared in the two kernels are of the same size, the array in the second kernel is placed in local memory because the compiler cannot determine that it is indexed with constant quantities.
To avoid small arrays being placed in local memory, we should avoid indexing so complex that the compiler cannot determine whether the indices are constant quantities. But how do we know whether the compiler can make that determination?
It turns out that registers cannot be indexed, and neither can an array placed in registers. Consequently, if a small array can be placed in registers, an equivalent program that accesses the array only with constant indices can also be written.
For example, the following implementation from the first kernel running_mean_register_array,
```cuda
constexpr int WindowSize{4};
float window[WindowSize];
float sum{0};
for (int j{0}; j < WindowSize; ++j)
{
    sum += window[j];
}
```
has an equivalent form in which the declaration of the array window is unnecessary,
```cuda
float window0, window1, window2, window3;
float sum{0};
sum += window0;
sum += window1;
sum += window2;
sum += window3;
```
whereas the corresponding loop from the second kernel running_mean_local_memory_array, which accesses the window array with the index (j + n) % WindowSize, has no such equivalent form: the declaration of the array window is necessary because the value of n is only known at run time.
Mathematically, it is also equivalent to the following form, because for j = 0, ..., WindowSize - 1 the index (j + n) % WindowSize simply visits 0 through WindowSize - 1 in a rotated order, but figuring this out is a non-trivial task for the compiler.
```cuda
float window0, window1, window2, window3;
float sum{0};
sum += window0;
sum += window1;
sum += window2;
sum += window3;
```
In fact, this is also the case for CUDA Tensor Core MMA PTX, because Tensor Core MMA needs to read its data from registers for the best performance. For example, in the CUTLASS implementation of the SM80_16x8x8_F16F16F16F16_TN MMA, the MMA PTX only ever accesses registers, even though the operand buffers are declared as arrays.
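A sketch of that CUTLASS atom, along the lines of cute/arch/mma_sm80.hpp (exact details may differ across CUTLASS versions, and the real code uses a CUTLASS host/device macro instead of plain __device__): every operand is an individual uint32_t register variable, and the "r" inline-assembly constraints require each one to live in a register.

```cuda
// Sketch of CUTLASS's SM80_16x8x8_F16F16F16F16_TN MMA atom. The operand
// buffers are declared as arrays via the *Registers typedefs, but the fma
// function takes each element as a separate register variable, so the PTX
// never indexes into memory.
struct SM80_16x8x8_F16F16F16F16_TN
{
    using DRegisters = uint32_t[2];
    using ARegisters = uint32_t[2];
    using BRegisters = uint32_t[1];
    using CRegisters = uint32_t[2];

    __device__ static void fma(uint32_t& d0, uint32_t& d1,
                               uint32_t const& a0, uint32_t const& a1,
                               uint32_t const& b0,
                               uint32_t const& c0, uint32_t const& c1)
    {
        asm volatile(
            "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 "
            "{%0, %1},"
            "{%2, %3},"
            "{%4},"
            "{%5, %6};\n"
            : "=r"(d0), "=r"(d1)
            : "r"(a0), "r"(a1), "r"(b0), "r"(c0), "r"(c1));
    }
};
```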