CUDA Shared Memory Bank Conflict-Free Vectorized Access

Introduction

In CUDA programming, it is common to use vectorized data access to move data from global memory to shared memory and vice versa, because vectorized access uses fewer instructions and usually results in higher memory throughput. For example, LDG.128 (a 128-bit vectorized global load) is significantly faster than four LDG.32 loads because it saturates the memory bus more effectively. However, when using vectorized access to read or write shared memory, a natural question is whether shared memory bank conflicts will occur. For example, can LDS.128 result in a 4-way bank conflict?
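As a minimal sketch, the hypothetical copy_vectorized kernel below (for illustration only, not part of the benchmark that follows) shows how a 128-bit access is usually expressed in CUDA C++: a float pointer is reinterpreted as a float4 pointer, which the compiler typically lowers to LDG.128 and STG.128 when the address is 16-byte aligned.

#include <cuda_runtime.h>

// Minimal sketch: copy data with one 128-bit load and store per thread.
// Assumes n is a multiple of 4 and the pointers are 16-byte aligned,
// which cudaMalloc guarantees.
__global__ void copy_vectorized(float const* __restrict__ input,
                                float* __restrict__ output, int n)
{
    int const idx{static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};
    if (idx * 4 < n)
    {
        // One LDG.128 instead of four LDG.32, and one STG.128 on the store.
        float4 const val{*reinterpret_cast<float4 const*>(&input[idx * 4])};
        *reinterpret_cast<float4*>(&output[idx * 4]) = val;
    }
}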

In this blog post, I will implement a CUDA kernel to compare the performance of vectorized access and scalar access to shared memory, and discuss how vectorized access achieves bank conflict-free access to shared memory.

CUDA Shared Memory Vectorized Access and Bank Conflict

I implemented two versions of a CUDA kernel that repeatedly perform the same computation on data in shared memory: one version uses vectorized access to read and write shared memory, while the other uses scalar access. If vectorized access caused shared memory bank conflicts, the vectorized version should show a significant performance degradation relative to the scalar version.

shared_memory_vectorization.cu
// nvcc shared_memory_vectorization.cu -o shared_memory_vectorization
// ncu --set full -o ncu_report_full -f ./shared_memory_vectorization
// --num_iterations 1 --num_warmups 0

#include <cuda_runtime.h>

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

// Error checking function and macro
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, char const* func, char const* file, int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

// Template variables for kernel configuration
template <int BlockSize, int ComputeIters>
struct KernelConfig
{
    static constexpr int BLOCK_SIZE = BlockSize;
    static constexpr int COMPUTE_ITERS = ComputeIters;
    static constexpr int ELEMENTS_PER_THREAD = 4;
    static constexpr int SMEM_SIZE = BLOCK_SIZE * ELEMENTS_PER_THREAD;
};

// Device function to perform computation on a single float value
__device__ __forceinline__ void process_smem_value(float& val)
{
    val = val * 1.01f + 0.01f;
}

// Device function for scalar shared memory computation
// Bank-conflict-free: strided access pattern
template <typename Config>
__device__ __forceinline__ void compute_scalar(float* smem, unsigned int tid)
{
#pragma unroll
    for (int iter{0}; iter < Config::COMPUTE_ITERS; iter++)
    {
        process_smem_value(smem[tid]);
        process_smem_value(smem[tid + Config::BLOCK_SIZE]);
        process_smem_value(smem[tid + 2 * Config::BLOCK_SIZE]);
        process_smem_value(smem[tid + 3 * Config::BLOCK_SIZE]);
    }
}

// Device function for vectorized shared memory computation
template <typename Config>
__device__ __forceinline__ void compute_vectorized(float* smem,
                                                   unsigned int tid)
{
#pragma unroll
    for (int iter{0}; iter < Config::COMPUTE_ITERS; iter++)
    {
        float4 val = *reinterpret_cast<float4*>(&smem[tid * 4]);
        process_smem_value(val.x);
        process_smem_value(val.y);
        process_smem_value(val.z);
        process_smem_value(val.w);
        *reinterpret_cast<float4*>(&smem[tid * 4]) = val;
    }
}

// Unified kernel template
// Assumes: n is a multiple of Config::SMEM_SIZE
template <typename Config, bool UseVectorized>
__global__ void kernel_shared_memory(const float* __restrict__ input,
                                     float* __restrict__ output, int n)
{
    __shared__ float smem[Config::SMEM_SIZE];

    unsigned int const tid{threadIdx.x};
    unsigned int const idx{blockIdx.x * blockDim.x + threadIdx.x};

    // Load from global memory using vectorized access
    float4 const global_data_input{
        *reinterpret_cast<const float4*>(&input[idx * 4])};

    // Store to shared memory without bank conflicts (strided access)
    smem[tid] = global_data_input.x;
    smem[tid + Config::BLOCK_SIZE] = global_data_input.y;
    smem[tid + 2 * Config::BLOCK_SIZE] = global_data_input.z;
    smem[tid + 3 * Config::BLOCK_SIZE] = global_data_input.w;
    __syncthreads();

    // Perform computation - scalar or vectorized based on template parameter
    if constexpr (UseVectorized)
    {
        compute_vectorized<Config>(smem, tid);
    }
    else
    {
        compute_scalar<Config>(smem, tid);
    }
    __syncthreads();

    // Load from shared memory using strided scalar access (bank conflict-free)
    float4 const global_data_output{smem[tid], smem[tid + Config::BLOCK_SIZE],
                                    smem[tid + 2 * Config::BLOCK_SIZE],
                                    smem[tid + 3 * Config::BLOCK_SIZE]};

    // Write back to global memory using vectorized access
    *reinterpret_cast<float4*>(&output[idx * 4]) = global_data_output;
}

// Default configuration
using DefaultConfig = KernelConfig<256, 100>;

void run_benchmark(int n, int num_iterations, int num_warmups)
{
    // Round up n to be a multiple of DefaultConfig::SMEM_SIZE
    int const aligned_n{
        ((n + DefaultConfig::SMEM_SIZE - 1) / DefaultConfig::SMEM_SIZE) *
        DefaultConfig::SMEM_SIZE};
    size_t const bytes{aligned_n * sizeof(float)};

    // Allocate host memory
    std::vector<float> h_input(aligned_n);
    std::vector<float> h_output_ordinary(aligned_n);
    std::vector<float> h_output_vectorized(aligned_n);

    // Initialize input data
    for (int i{0}; i < aligned_n; ++i)
    {
        h_input[i] = static_cast<float>(i);
    }

    // Allocate device memory
    float *d_input, *d_output;
    CHECK_CUDA_ERROR(cudaMalloc(&d_input, bytes));
    CHECK_CUDA_ERROR(cudaMalloc(&d_output, bytes));

    // Copy input to device
    CHECK_CUDA_ERROR(
        cudaMemcpy(d_input, h_input.data(), bytes, cudaMemcpyHostToDevice));

    // Setup kernel launch parameters
    int const threads_per_block{DefaultConfig::BLOCK_SIZE};

    // Both kernels use the same grid/block configuration (each thread handles 4
    // elements)
    int const num_blocks{
        aligned_n / (threads_per_block * DefaultConfig::ELEMENTS_PER_THREAD)};

    // Warm up
    for (int i{0}; i < num_warmups; ++i)
    {
        kernel_shared_memory<DefaultConfig, false>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
        kernel_shared_memory<DefaultConfig, true>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
    }
    CHECK_CUDA_ERROR(cudaDeviceSynchronize());

    // Benchmark ordinary kernel
    cudaEvent_t start, stop;
    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    CHECK_CUDA_ERROR(cudaEventRecord(start));
    for (int i{0}; i < num_iterations; ++i)
    {
        kernel_shared_memory<DefaultConfig, false>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
    }
    CHECK_CUDA_ERROR(cudaEventRecord(stop));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

    float time_ordinary{0.0f};
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&time_ordinary, start, stop));

    // Copy result
    CHECK_CUDA_ERROR(cudaMemcpy(h_output_ordinary.data(), d_output, bytes,
                                cudaMemcpyDeviceToHost));

    // Benchmark vectorized kernel
    CHECK_CUDA_ERROR(cudaEventRecord(start));
    for (int i{0}; i < num_iterations; ++i)
    {
        kernel_shared_memory<DefaultConfig, true>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
    }
    CHECK_CUDA_ERROR(cudaEventRecord(stop));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

    float time_vectorized{0.0f};
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&time_vectorized, start, stop));

    // Copy result
    CHECK_CUDA_ERROR(cudaMemcpy(h_output_vectorized.data(), d_output, bytes,
                                cudaMemcpyDeviceToHost));

    // Verify results match
    bool results_match{true};
    for (int i{0}; i < aligned_n; ++i)
    {
        if (std::fabs(h_output_ordinary[i] - h_output_vectorized[i]) > 1e-5)
        {
            std::cout << "Mismatch at index " << i
                      << ": ordinary = " << h_output_ordinary[i]
                      << ", vectorized = " << h_output_vectorized[i]
                      << std::endl;
            results_match = false;
            break;
        }
    }

    // Print results
    std::cout << std::endl;
    std::cout << "========================================" << std::endl;
    std::cout << "Array size: " << aligned_n << " elements (" << std::fixed
              << std::setprecision(2) << bytes / 1024.0 / 1024.0 << " MB)"
              << std::endl;
    std::cout << "Iterations: " << num_iterations << std::endl;
    std::cout << "========================================" << std::endl;
    std::cout << "Ordinary access time:   " << std::fixed
              << std::setprecision(4) << time_ordinary
              << " ms (avg: " << time_ordinary / num_iterations << " ms)"
              << std::endl;
    std::cout << "Vectorized access time: " << std::fixed
              << std::setprecision(4) << time_vectorized
              << " ms (avg: " << time_vectorized / num_iterations << " ms)"
              << std::endl;
    std::cout << "Speedup: " << std::fixed << std::setprecision(2)
              << time_ordinary / time_vectorized << "x" << std::endl;
    std::cout << "Results match: " << (results_match ? "YES" : "NO")
              << std::endl;
    std::cout << "========================================" << std::endl;

    // Cleanup
    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));
    CHECK_CUDA_ERROR(cudaFree(d_input));
    CHECK_CUDA_ERROR(cudaFree(d_output));
}

int main(int argc, char* argv[])
{
    // Print device information
    int device;
    cudaDeviceProp prop;
    CHECK_CUDA_ERROR(cudaGetDevice(&device));
    CHECK_CUDA_ERROR(cudaGetDeviceProperties(&prop, device));

    std::cout << "Device: " << prop.name << std::endl;
    std::cout << "Shared Memory per Block: " << prop.sharedMemPerBlock / 1024
              << " KB" << std::endl;
    std::cout << "========================================" << std::endl;

    // Parse command line arguments
    int num_iterations{1000};
    int num_warmups{10};

    for (int i{1}; i < argc; ++i)
    {
        std::string arg{argv[i]};
        if (arg == "--num_iterations" && i + 1 < argc)
        {
            num_iterations = std::atoi(argv[++i]);
        }
        else if (arg == "--num_warmups" && i + 1 < argc)
        {
            num_warmups = std::atoi(argv[++i]);
        }
        else if (arg == "--help" || arg == "-h")
        {
            std::cout << "Usage: " << argv[0]
                      << " [--num_iterations N] [--num_warmups N]" << std::endl;
            std::cout << "  --num_iterations N  Number of benchmark iterations "
                         "(default: 1000)"
                      << std::endl;
            std::cout << "  --num_warmups N     Number of warmup iterations "
                         "(default: 10)"
                      << std::endl;
            return 0;
        }
    }

    std::cout << "Configuration: " << num_iterations << " iterations, "
              << num_warmups << " warmups" << std::endl;

    // Run benchmarks with different sizes
    std::cout << std::endl << "Test 1: Small array" << std::endl;
    run_benchmark(1024 * 256, num_iterations, num_warmups); // 1 MB

    std::cout << std::endl << "Test 2: Medium array" << std::endl;
    run_benchmark(1024 * 1024 * 16, num_iterations, num_warmups); // 64 MB

    std::cout << std::endl << "Test 3: Large array" << std::endl;
    run_benchmark(1024 * 1024 * 64, num_iterations, num_warmups); // 256 MB

    return 0;
}

The benchmark shows that the performance of the vectorized access version is almost identical to that of the scalar access version. Therefore, there should be no shared memory bank conflicts when using vectorized access to read or write shared memory; otherwise, the vectorized access version could not be on par with the scalar access version, which is definitely free of shared memory bank conflicts.

$ nvcc shared_memory_vectorization.cu -o shared_memory_vectorization
$ ./shared_memory_vectorization
Device: NVIDIA GeForce RTX 5080
Shared Memory per Block: 48 KB
========================================
Configuration: 1000 iterations, 10 warmups

Test 1: Small array

========================================
Array size: 262144 elements (1.00 MB)
Iterations: 1000
========================================
Ordinary access time: 4.0968 ms (avg: 0.0041 ms)
Vectorized access time: 4.0987 ms (avg: 0.0041 ms)
Speedup: 1.00x
Results match: YES
========================================

Test 2: Medium array

========================================
Array size: 16777216 elements (64.00 MB)
Iterations: 1000
========================================
Ordinary access time: 165.9853 ms (avg: 0.1660 ms)
Vectorized access time: 168.8335 ms (avg: 0.1688 ms)
Speedup: 0.98x
Results match: YES
========================================

Test 3: Large array

========================================
Array size: 67108864 elements (256.00 MB)
Iterations: 1000
========================================
Ordinary access time: 682.5273 ms (avg: 0.6825 ms)
Vectorized access time: 683.1362 ms (avg: 0.6831 ms)
Speedup: 1.00x
Results match: YES
========================================

Nsight Compute analysis also confirmed that LDS.128 (vectorized shared memory load) and LDS.32 (scalar shared memory load) instructions appear in the SASS of the vectorized access version and the scalar access version, respectively. In addition, l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum seems to indicate that the vectorized access version has very high shared memory bank conflicts, whereas the scalar access version has zero or almost zero shared memory bank conflicts. This is, however, misleading. If we look at l1tex__data_pipe_lsu_wavefronts_mem_shared.sum, which indicates the total number of wavefronts that access shared memory, the ratio of l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum to l1tex__data_pipe_lsu_wavefronts_mem_shared.sum is almost negligible for both implementations. For example, for the vectorized access version, l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum and l1tex__data_pipe_lsu_wavefronts_mem_shared.sum are 25,292 and 8,904,760, respectively, which means that even if there are shared memory bank conflicts, they only occur in about 0.28% of the shared memory accesses, which is not significant enough to cause a noticeable performance degradation.
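For reference, rather than collecting the full metric set, one could collect just these two counters with a command along the following lines (a sketch only; flag behavior may differ across Nsight Compute versions):

$ ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared.sum ./shared_memory_vectorization --num_iterations 1 --num_warmups 0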

Nsight Compute Analysis

The shared memory has 32 banks, and each bank is 32 bits, i.e., 4 bytes, wide. The hardware can serve at most 32 unique 4-byte requests, i.e., 128 bytes of data, to shared memory per cycle. If a warp of 32 threads accesses shared memory using 128-bit vectorized accesses, the warp requests 512 bytes in total, so at least 4 cycles are needed to serve all the requests. To make the access bank conflict-free, the instruction is split into 4 phases. In phase 1, the first 8 threads, lanes 0 to 7, access the banks. In phase 2, the next 8 threads, lanes 8 to 15, access the banks. In phase 3, the next 8 threads, lanes 16 to 23, access the banks. In phase 4, the last 8 threads, lanes 24 to 31, access the banks. As long as the accesses within each group of 8 lanes map to distinct banks, which is the case when consecutive lanes access consecutive, aligned float4 elements, each phase is bank conflict-free. Therefore, vectorized access to shared memory can still be bank conflict-free and reach the highest efficiency the hardware offers.
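To make the phase argument concrete, the small host-side sketch below (not part of the benchmark) enumerates the bank that each 4-byte word of a warp-wide LDS.128 access maps to, assuming lane t reads the aligned float4 at element index 4 * t: within each group of 8 lanes, the 32 words cover all 32 banks exactly once.

#include <algorithm>
#include <iostream>

// Sketch: bank mapping of a warp-wide 128-bit shared memory access.
// Assumption: lane t accesses float elements 4 * t to 4 * t + 3, and the
// bank of the 4-byte word at element index e is e % 32.
int main()
{
    for (int phase{0}; phase < 4; ++phase)
    {
        int bank_requests[32]{};
        for (int lane{phase * 8}; lane < (phase + 1) * 8; ++lane)
        {
            for (int word{0}; word < 4; ++word)
            {
                ++bank_requests[(lane * 4 + word) % 32];
            }
        }
        int const max_requests{
            *std::max_element(bank_requests, bank_requests + 32)};
        // Prints 1 for every phase: each bank is accessed exactly once.
        std::cout << "Phase " << phase
                  << ": max requests per bank = " << max_requests << std::endl;
    }
    return 0;
}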

In fact, Nsight Compute analysis shows that derived__memory_l1_wavefronts_shared_excessive, which indicates the number of excess wavefronts spent on shared memory accesses, is exactly zero for both the vectorized access version and the scalar access version. This means both implementations are shared memory bank conflict-free and achieve the highest memory efficiency the hardware offers. The advantage of using vectorized access to shared memory is that it reduces the number of instructions and thus the instruction overhead, which can lead to better performance in some cases.
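To see the instruction-count effect alongside the excessive-wavefront metric, a command along these lines could be used. Here smsp__inst_executed.sum is assumed as the executed-instruction counter, and whether derived__ metrics can be collected directly with --metrics depends on the Nsight Compute version, so treat this as a sketch:

$ ncu --metrics derived__memory_l1_wavefronts_shared_excessive,smsp__inst_executed.sum ./shared_memory_vectorization --num_iterations 1 --num_warmups 0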

Conclusions

In conclusion, the GPU is natively designed to support vectorized access to shared memory without causing bank conflicts. With careful design of the shared memory access pattern, it is therefore possible to use vectorized access to read and write shared memory without bank conflicts. The performance of the vectorized access version can be on par with that of the scalar access version, and both versions can achieve the highest efficiency the hardware offers without excessive shared memory bank conflicts.
