In CUDA programming, it is common to use vectorized data access to move data from global memory to shared memory and vice versa, because vectorized access uses fewer instructions and can usually result in higher memory throughput. For example, one LDG.128 (128-bit vectorized global load) is significantly faster than four LDG.32 (32-bit scalar global load) instructions because it saturates the memory bus more effectively. However, when using vectorized access to read or write shared memory, a natural question to ask is whether shared memory bank conflicts will occur. For example, can LDS.128 result in a 4-way bank conflict?
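As a quick illustration of vectorized global memory access, each thread can load a float4 (16 bytes) instead of a single float, which the compiler typically lowers to LDG.128. The following is a minimal sketch; the kernel names and the assumption that the pointers are 16-byte aligned are mine, not from the benchmark below:

```cpp
// Scalar copy: one LDG.32 and one STG.32 per element.
__global__ void copy_scalar(float const* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int const idx{static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};
    if (idx < n)
    {
        out[idx] = in[idx];
    }
}

// Vectorized copy: one LDG.128 and one STG.128 move four floats at once.
// Assumes in and out are 16-byte aligned and n is a multiple of 4.
__global__ void copy_vectorized(float const* __restrict__ in,
                                float* __restrict__ out, int n)
{
    int const idx{static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};
    if (idx * 4 + 3 < n)
    {
        float4 const val{*reinterpret_cast<float4 const*>(&in[idx * 4])};
        *reinterpret_cast<float4*>(&out[idx * 4]) = val;
    }
}
```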
In this blog post, I will implement a CUDA kernel to compare the performance of vectorized access and scalar access to shared memory, and discuss how vectorized access achieves bank conflict-free access to shared memory.
CUDA Shared Memory Vectorized Access and Bank Conflict
I implemented two versions of a CUDA kernel that performs the same computation on the data in shared memory repeatedly. One version uses vectorized access to read and write shared memory, while the other uses scalar access. If the vectorized access version suffered from bank conflicts, we would see a significant performance difference between the two versions.
```cpp
// Device function to perform computation on a single float value
__device__ __forceinline__ void process_smem_value(float& val)
{
    val = val * 1.01f + 0.01f;
}
```

```cpp
    // Write back to global memory using vectorized access
    *reinterpret_cast<float4*>(&output[idx * 4]) = global_data_output;
}
```

```cpp
// Default configuration
using DefaultConfig = KernelConfig<256, 100>;

void run_benchmark(int n, int num_iterations, int num_warmups)
{
    // Round up n to be a multiple of DefaultConfig::SMEM_SIZE
    int const aligned_n{((n + DefaultConfig::SMEM_SIZE - 1) /
                         DefaultConfig::SMEM_SIZE) *
                        DefaultConfig::SMEM_SIZE};
    size_t const bytes{aligned_n * sizeof(float)};
```
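The listing above omits the kernel body. The core loop of the vectorized version looks roughly like the following sketch; the template parameters mirror KernelConfig<256, 100>, but the exact names (smem, val, NUM_ITERATIONS) are my assumptions and may differ from the actual implementation. The scalar version is identical except that every float4 access is replaced with four float accesses.

```cpp
template <int BLOCK_SIZE, int NUM_ITERATIONS>
__global__ void smem_kernel_vectorized(float const* input, float* output)
{
    // Each thread owns 4 consecutive floats in shared memory.
    // alignas(16) guarantees the float4 reinterpret casts below are legal.
    __shared__ alignas(16) float smem[BLOCK_SIZE * 4];

    int const idx{static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};

    // Load 4 floats per thread from global memory with one LDG.128.
    float4 val{*reinterpret_cast<float4 const*>(&input[idx * 4])};

    // Write them to shared memory with one STS.128.
    *reinterpret_cast<float4*>(&smem[threadIdx.x * 4]) = val;
    __syncthreads();

    for (int i{0}; i < NUM_ITERATIONS; ++i)
    {
        // Read 4 floats with one LDS.128, process, write back with one STS.128.
        val = *reinterpret_cast<float4*>(&smem[threadIdx.x * 4]);
        process_smem_value(val.x);
        process_smem_value(val.y);
        process_smem_value(val.z);
        process_smem_value(val.w);
        *reinterpret_cast<float4*>(&smem[threadIdx.x * 4]) = val;
    }

    float4 const global_data_output{
        *reinterpret_cast<float4*>(&smem[threadIdx.x * 4])};
    // Write back to global memory using vectorized access
    *reinterpret_cast<float4*>(&output[idx * 4]) = global_data_output;
}
```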
The benchmark shows that the performance of the vectorized access version is almost the same as that of the scalar access version. Therefore, there should be no shared memory bank conflicts when using vectorized access to read or write shared memory. Otherwise, the performance of the vectorized access version could not be on par with the scalar access version, which definitely has no shared memory bank conflicts.
Nsight Compute analysis also confirmed that LDS.128 (vectorized shared memory load) and LDS.32 (scalar shared memory load) appear in the SASS of the vectorized access version and the scalar access version, respectively. In addition, l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum seems to indicate that the vectorized access version has very high shared memory bank conflicts, whereas the scalar access version has zero or almost zero shared memory bank conflicts. This is, however, misleading. If we look at l1tex__data_pipe_lsu_wavefronts_mem_shared.sum, which indicates the total number of wavefronts that access shared memory, the ratio of l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum to l1tex__data_pipe_lsu_wavefronts_mem_shared.sum is almost negligible for both implementations. For example, for the vectorized access version, l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum and l1tex__data_pipe_lsu_wavefronts_mem_shared.sum are 25,292 and 8,904,760, respectively. Even if there are shared memory bank conflicts, they only happen in 0.28% of the shared memory accesses, which is not significant enough to cause a noticeable performance degradation.
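For reference, these counters can be collected with the Nsight Compute CLI; the binary name ./benchmark here is a placeholder for the actual executable:

```bash
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared.sum ./benchmark
```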
Shared memory has 32 banks, and each bank is 32 bits, i.e., 4 bytes, wide. The hardware can serve at most 32 unique requests, i.e., 128 bytes of data, from shared memory per cycle. If a warp of 32 threads accesses shared memory using 128-bit vectorized access, the warp requests 512 bytes in total, so at least 4 cycles are needed to serve all the requests. To make the access bank conflict-free, the instruction is split into 4 phases. In phase 1, the first 8 threads, lanes 0 to 7, access the banks. In phase 2, the next 8 threads, lanes 8 to 15, access the banks. In phase 3, lanes 16 to 23 access the banks, and in phase 4, lanes 24 to 31 access the banks. In each phase, 8 threads each access 4 consecutive 32-bit words, touching each of the 32 banks exactly once, so each phase is bank conflict-free. Therefore, vectorized access to shared memory can still be bank conflict-free and reach the highest efficiency on hardware.
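The phase decomposition can be verified with a small host-side sketch (plain C++, my own illustration, not part of the benchmark) that models the natural access pattern where lane i reads the 4 consecutive 32-bit words starting at word 4 * i. In every phase, each of the 32 banks is touched exactly once:

```cpp
#include <cassert>
#include <cstdio>

int main()
{
    int constexpr num_banks{32};
    // A 128-bit access by a warp is split into 4 phases of 8 lanes each.
    for (int phase{0}; phase < 4; ++phase)
    {
        int bank_access_count[num_banks]{};
        for (int lane{phase * 8}; lane < (phase + 1) * 8; ++lane)
        {
            // Lane i touches 32-bit words 4 * i to 4 * i + 3.
            for (int word{0}; word < 4; ++word)
            {
                int const bank{(lane * 4 + word) % num_banks};
                ++bank_access_count[bank];
            }
        }
        // Every bank is accessed exactly once per phase: no conflict.
        for (int bank{0}; bank < num_banks; ++bank)
        {
            assert(bank_access_count[bank] == 1);
        }
    }
    std::printf("All 4 phases are bank conflict-free.\n");
    return 0;
}
```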
In fact, Nsight Compute analysis shows that derived__memory_l1_wavefronts_shared_excessive, which indicates the number of wavefronts spent excessively on shared memory accesses, is exactly zero for both the vectorized access version and the scalar access version. This means both implementations are shared memory bank conflict-free and achieve the highest memory efficiency on hardware. The advantage of vectorized access to shared memory is that it reduces the number of instructions and thus the instruction overhead, which can lead to better performance in some cases.
Conclusions
In conclusion, GPUs are natively designed to support vectorized access to shared memory without causing bank conflicts. With careful design of the shared memory access pattern, it is possible to use vectorized access to read and write shared memory without bank conflicts. The performance of the vectorized access version can be on par with the scalar access version, and both versions can achieve the highest efficiency on hardware without excessive shared memory bank conflicts.