CUDA Shared Memory Bank Conflict-Free Vectorized Access

Introduction

In CUDA programming, it is common to use vectorized data access to move data from global memory to shared memory and vice versa, because vectorized access uses fewer instructions and usually results in higher memory throughput. For example, LDG.128 (a 128-bit vectorized global load) is significantly faster than four LDG.32 loads because it saturates the memory bus more effectively. However, when using vectorized access to read or write shared memory, a natural question is whether shared memory bank conflicts will occur. For example, can LDS.128 result in a 4-way bank conflict?
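As a minimal sketch, the hypothetical copy_vectorized kernel below (for illustration only, not part of the benchmark that follows) shows how a 128-bit access is usually expressed in CUDA C++: a float pointer is reinterpreted as a float4 pointer, which the compiler typically lowers to LDG.128 and STG.128 when the address is 16-byte aligned.

#include <cuda_runtime.h>

// Minimal sketch: copy data with one 128-bit load and store per thread.
// Assumes n is a multiple of 4 and the pointers are 16-byte aligned,
// which cudaMalloc guarantees.
__global__ void copy_vectorized(float const* __restrict__ input,
                                float* __restrict__ output, int n)
{
    int const idx{static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x)};
    if (idx * 4 < n)
    {
        // One LDG.128 instead of four LDG.32, and one STG.128 on the store.
        float4 const val{*reinterpret_cast<float4 const*>(&input[idx * 4])};
        *reinterpret_cast<float4*>(&output[idx * 4]) = val;
    }
}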

In this blog post, I will implement a CUDA kernel to compare the performance of vectorized access and scalar access to shared memory, and discuss how vectorized access achieves bank conflict-free access to shared memory.

CUDA Shared Memory Vectorized Access and Bank Conflict

I implemented two versions of a CUDA kernel that repeatedly perform the same computation on data in shared memory: one version uses vectorized access to read and write shared memory, while the other uses scalar access. If vectorized access caused shared memory bank conflicts, the vectorized version should show a significant performance degradation relative to the scalar version.

shared_memory_vectorization.cu
// nvcc shared_memory_vectorization.cu -o shared_memory_vectorization
// ncu --set full -o ncu_report_full -f ./shared_memory_vectorization
// --num_iterations 1 --num_warmups 0

#include <cuda_runtime.h>

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

// Error checking function and macro
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, char const* func, char const* file, int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

// Template variables for kernel configuration
template <int BlockSize, int ComputeIters>
struct KernelConfig
{
    static constexpr int BLOCK_SIZE = BlockSize;
    static constexpr int COMPUTE_ITERS = ComputeIters;
    static constexpr int ELEMENTS_PER_THREAD = 4;
    static constexpr int SMEM_SIZE = BLOCK_SIZE * ELEMENTS_PER_THREAD;
};

// Device function to perform computation on a single float value
__device__ __forceinline__ void process_smem_value(float& val)
{
    val = val * 1.01f + 0.01f;
}

// Device function for scalar shared memory computation
// Bank-conflict-free: strided access pattern
template <typename Config>
__device__ __forceinline__ void compute_scalar(float* smem, unsigned int tid)
{
#pragma unroll
    for (int iter{0}; iter < Config::COMPUTE_ITERS; iter++)
    {
        process_smem_value(smem[tid]);
        process_smem_value(smem[tid + Config::BLOCK_SIZE]);
        process_smem_value(smem[tid + 2 * Config::BLOCK_SIZE]);
        process_smem_value(smem[tid + 3 * Config::BLOCK_SIZE]);
    }
}

// Device function for vectorized shared memory computation
template <typename Config>
__device__ __forceinline__ void compute_vectorized(float* smem,
                                                   unsigned int tid)
{
#pragma unroll
    for (int iter{0}; iter < Config::COMPUTE_ITERS; iter++)
    {
        float4 val = *reinterpret_cast<float4*>(&smem[tid * 4]);
        process_smem_value(val.x);
        process_smem_value(val.y);
        process_smem_value(val.z);
        process_smem_value(val.w);
        *reinterpret_cast<float4*>(&smem[tid * 4]) = val;
    }
}

// Unified kernel template
// Assumes: n is a multiple of Config::SMEM_SIZE
template <typename Config, bool UseVectorized>
__global__ void kernel_shared_memory(const float* __restrict__ input,
                                     float* __restrict__ output, int n)
{
    __shared__ float smem[Config::SMEM_SIZE];

    unsigned int const tid{threadIdx.x};
    unsigned int const idx{blockIdx.x * blockDim.x + threadIdx.x};

    // Load from global memory using vectorized access
    float4 const global_data_input{
        *reinterpret_cast<const float4*>(&input[idx * 4])};

    // Store to shared memory without bank conflicts (strided access)
    smem[tid] = global_data_input.x;
    smem[tid + Config::BLOCK_SIZE] = global_data_input.y;
    smem[tid + 2 * Config::BLOCK_SIZE] = global_data_input.z;
    smem[tid + 3 * Config::BLOCK_SIZE] = global_data_input.w;
    __syncthreads();

    // Perform computation - scalar or vectorized based on template parameter
    if constexpr (UseVectorized)
    {
        compute_vectorized<Config>(smem, tid);
    }
    else
    {
        compute_scalar<Config>(smem, tid);
    }
    __syncthreads();

    // Load from shared memory using strided scalar access (bank conflict-free)
    float4 const global_data_output{smem[tid], smem[tid + Config::BLOCK_SIZE],
                                    smem[tid + 2 * Config::BLOCK_SIZE],
                                    smem[tid + 3 * Config::BLOCK_SIZE]};

    // Write back to global memory using vectorized access
    *reinterpret_cast<float4*>(&output[idx * 4]) = global_data_output;
}

// Default configuration
using DefaultConfig = KernelConfig<256, 100>;

void run_benchmark(int n, int num_iterations, int num_warmups)
{
    // Round up n to be a multiple of DefaultConfig::SMEM_SIZE
    int const aligned_n{
        ((n + DefaultConfig::SMEM_SIZE - 1) / DefaultConfig::SMEM_SIZE) *
        DefaultConfig::SMEM_SIZE};
    size_t const bytes{aligned_n * sizeof(float)};

    // Allocate host memory
    std::vector<float> h_input(aligned_n);
    std::vector<float> h_output_ordinary(aligned_n);
    std::vector<float> h_output_vectorized(aligned_n);

    // Initialize input data
    for (int i{0}; i < aligned_n; ++i)
    {
        h_input[i] = static_cast<float>(i);
    }

    // Allocate device memory
    float *d_input, *d_output;
    CHECK_CUDA_ERROR(cudaMalloc(&d_input, bytes));
    CHECK_CUDA_ERROR(cudaMalloc(&d_output, bytes));

    // Copy input to device
    CHECK_CUDA_ERROR(
        cudaMemcpy(d_input, h_input.data(), bytes, cudaMemcpyHostToDevice));

    // Setup kernel launch parameters
    int const threads_per_block{DefaultConfig::BLOCK_SIZE};

    // Both kernels use the same grid/block configuration (each thread handles 4
    // elements)
    int const num_blocks{
        aligned_n / (threads_per_block * DefaultConfig::ELEMENTS_PER_THREAD)};

    // Warm up
    for (int i{0}; i < num_warmups; ++i)
    {
        kernel_shared_memory<DefaultConfig, false>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
        kernel_shared_memory<DefaultConfig, true>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
    }
    CHECK_CUDA_ERROR(cudaDeviceSynchronize());

    // Benchmark ordinary kernel
    cudaEvent_t start, stop;
    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    CHECK_CUDA_ERROR(cudaEventRecord(start));
    for (int i{0}; i < num_iterations; ++i)
    {
        kernel_shared_memory<DefaultConfig, false>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
    }
    CHECK_CUDA_ERROR(cudaEventRecord(stop));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

    float time_ordinary{0.0f};
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&time_ordinary, start, stop));

    // Copy result
    CHECK_CUDA_ERROR(cudaMemcpy(h_output_ordinary.data(), d_output, bytes,
                                cudaMemcpyDeviceToHost));

    // Benchmark vectorized kernel
    CHECK_CUDA_ERROR(cudaEventRecord(start));
    for (int i{0}; i < num_iterations; ++i)
    {
        kernel_shared_memory<DefaultConfig, true>
            <<<num_blocks, threads_per_block>>>(d_input, d_output, aligned_n);
    }
    CHECK_CUDA_ERROR(cudaEventRecord(stop));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

    float time_vectorized{0.0f};
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&time_vectorized, start, stop));

    // Copy result
    CHECK_CUDA_ERROR(cudaMemcpy(h_output_vectorized.data(), d_output, bytes,
                                cudaMemcpyDeviceToHost));

    // Verify results match
    bool results_match{true};
    for (int i{0}; i < aligned_n; ++i)
    {
        if (std::fabs(h_output_ordinary[i] - h_output_vectorized[i]) > 1e-5)
        {
            std::cout << "Mismatch at index " << i
                      << ": ordinary = " << h_output_ordinary[i]
                      << ", vectorized = " << h_output_vectorized[i]
                      << std::endl;
            results_match = false;
            break;
        }
    }

    // Print results
    std::cout << std::endl;
    std::cout << "========================================" << std::endl;
    std::cout << "Array size: " << aligned_n << " elements (" << std::fixed
              << std::setprecision(2) << bytes / 1024.0 / 1024.0 << " MB)"
              << std::endl;
    std::cout << "Iterations: " << num_iterations << std::endl;
    std::cout << "========================================" << std::endl;
    std::cout << "Ordinary access time:   " << std::fixed
              << std::setprecision(4) << time_ordinary
              << " ms (avg: " << time_ordinary / num_iterations << " ms)"
              << std::endl;
    std::cout << "Vectorized access time: " << std::fixed
              << std::setprecision(4) << time_vectorized
              << " ms (avg: " << time_vectorized / num_iterations << " ms)"
              << std::endl;
    std::cout << "Speedup: " << std::fixed << std::setprecision(2)
              << time_ordinary / time_vectorized << "x" << std::endl;
    std::cout << "Results match: " << (results_match ? "YES" : "NO")
              << std::endl;
    std::cout << "========================================" << std::endl;

    // Cleanup
    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));
    CHECK_CUDA_ERROR(cudaFree(d_input));
    CHECK_CUDA_ERROR(cudaFree(d_output));
}

int main(int argc, char* argv[])
{
    // Print device information
    int device;
    cudaDeviceProp prop;
    CHECK_CUDA_ERROR(cudaGetDevice(&device));
    CHECK_CUDA_ERROR(cudaGetDeviceProperties(&prop, device));

    std::cout << "Device: " << prop.name << std::endl;
    std::cout << "Shared Memory per Block: " << prop.sharedMemPerBlock / 1024
              << " KB" << std::endl;
    std::cout << "========================================" << std::endl;

    // Parse command line arguments
    int num_iterations{1000};
    int num_warmups{10};

    for (int i{1}; i < argc; ++i)
    {
        std::string arg{argv[i]};
        if (arg == "--num_iterations" && i + 1 < argc)
        {
            num_iterations = std::atoi(argv[++i]);
        }
        else if (arg == "--num_warmups" && i + 1 < argc)
        {
            num_warmups = std::atoi(argv[++i]);
        }
        else if (arg == "--help" || arg == "-h")
        {
            std::cout << "Usage: " << argv[0]
                      << " [--num_iterations N] [--num_warmups N]" << std::endl;
            std::cout << "  --num_iterations N  Number of benchmark iterations "
                         "(default: 1000)"
                      << std::endl;
            std::cout << "  --num_warmups N     Number of warmup iterations "
                         "(default: 10)"
                      << std::endl;
            return 0;
        }
    }

    std::cout << "Configuration: " << num_iterations << " iterations, "
              << num_warmups << " warmups" << std::endl;

    // Run benchmarks with different sizes
    std::cout << std::endl << "Test 1: Small array" << std::endl;
    run_benchmark(1024 * 256, num_iterations, num_warmups); // 1 MB

    std::cout << std::endl << "Test 2: Medium array" << std::endl;
    run_benchmark(1024 * 1024 * 16, num_iterations, num_warmups); // 64 MB

    std::cout << std::endl << "Test 3: Large array" << std::endl;
    run_benchmark(1024 * 1024 * 64, num_iterations, num_warmups); // 256 MB

    return 0;
}

The benchmark shows that the performance of the vectorized access version is almost identical to that of the scalar access version. Therefore, there should be no shared memory bank conflicts when using vectorized access to read or write shared memory; otherwise, the vectorized access version could not be on par with the scalar access version, which is definitely free of shared memory bank conflicts.

$ nvcc shared_memory_vectorization.cu -o shared_memory_vectorization
$ ./shared_memory_vectorization
Device: NVIDIA GeForce RTX 5080
Shared Memory per Block: 48 KB
========================================
Configuration: 1000 iterations, 10 warmups

Test 1: Small array

========================================
Array size: 262144 elements (1.00 MB)
Iterations: 1000
========================================
Ordinary access time: 4.0968 ms (avg: 0.0041 ms)
Vectorized access time: 4.0987 ms (avg: 0.0041 ms)
Speedup: 1.00x
Results match: YES
========================================

Test 2: Medium array

========================================
Array size: 16777216 elements (64.00 MB)
Iterations: 1000
========================================
Ordinary access time: 165.9853 ms (avg: 0.1660 ms)
Vectorized access time: 168.8335 ms (avg: 0.1688 ms)
Speedup: 0.98x
Results match: YES
========================================

Test 3: Large array

========================================
Array size: 67108864 elements (256.00 MB)
Iterations: 1000
========================================
Ordinary access time: 682.5273 ms (avg: 0.6825 ms)
Vectorized access time: 683.1362 ms (avg: 0.6831 ms)
Speedup: 1.00x
Results match: YES
========================================

Nsight Compute analysis also confirmed that LDS.128 (vectorized shared memory load) and LDS.32 (scalar shared memory load) instructions appear in the SASS of the vectorized access version and the scalar access version, respectively. In addition, l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum seems to indicate that the vectorized access version has very high shared memory bank conflicts, whereas the scalar access version has zero or almost zero shared memory bank conflicts. This is, however, misleading. If we look at l1tex__data_pipe_lsu_wavefronts_mem_shared.sum, which indicates the total number of wavefronts that access shared memory, the ratio of l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum to l1tex__data_pipe_lsu_wavefronts_mem_shared.sum is almost negligible for both implementations. For example, for the vectorized access version, l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum and l1tex__data_pipe_lsu_wavefronts_mem_shared.sum are 25,292 and 8,904,760, respectively, which means that even if there are shared memory bank conflicts, they only occur in about 0.28% of the shared memory accesses, which is not significant enough to cause a noticeable performance degradation.
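For reference, rather than collecting the full metric set, one could collect just these two counters with a command along the following lines (a sketch only; flag behavior may differ across Nsight Compute versions):

$ ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared.sum ./shared_memory_vectorization --num_iterations 1 --num_warmups 0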

Nsight Compute Analysis

The shared memory has 32 banks, and each bank is 32 bits, i.e., 4 bytes, wide. The hardware can serve at most 32 unique 4-byte requests, i.e., 128 bytes of data, to shared memory per cycle. If a warp of 32 threads accesses shared memory using 128-bit vectorized accesses, the warp requests 512 bytes in total, so at least 4 cycles are needed to serve all the requests. To make the access bank conflict-free, the instruction is split into 4 phases. In phase 1, the first 8 threads, lanes 0 to 7, access the banks. In phase 2, the next 8 threads, lanes 8 to 15, access the banks. In phase 3, the next 8 threads, lanes 16 to 23, access the banks. In phase 4, the last 8 threads, lanes 24 to 31, access the banks. As long as the accesses within each group of 8 lanes map to distinct banks, which is the case when consecutive lanes access consecutive, aligned float4 elements, each phase is bank conflict-free. Therefore, vectorized access to shared memory can still be bank conflict-free and reach the highest efficiency the hardware offers.
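To make the phase argument concrete, the small host-side sketch below (not part of the benchmark) enumerates the bank that each 4-byte word of a warp-wide LDS.128 access maps to, assuming lane t reads the aligned float4 at element index 4 * t: within each group of 8 lanes, the 32 words cover all 32 banks exactly once.

#include <algorithm>
#include <iostream>

// Sketch: bank mapping of a warp-wide 128-bit shared memory access.
// Assumption: lane t accesses float elements 4 * t to 4 * t + 3, and the
// bank of the 4-byte word at element index e is e % 32.
int main()
{
    for (int phase{0}; phase < 4; ++phase)
    {
        int bank_requests[32]{};
        for (int lane{phase * 8}; lane < (phase + 1) * 8; ++lane)
        {
            for (int word{0}; word < 4; ++word)
            {
                ++bank_requests[(lane * 4 + word) % 32];
            }
        }
        int const max_requests{
            *std::max_element(bank_requests, bank_requests + 32)};
        // Prints 1 for every phase: each bank is accessed exactly once.
        std::cout << "Phase " << phase
                  << ": max requests per bank = " << max_requests << std::endl;
    }
    return 0;
}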

In fact, Nsight Compute analysis shows that derived__memory_l1_wavefronts_shared_excessive, which indicates the number of excess wavefronts spent on shared memory accesses, is exactly zero for both the vectorized access version and the scalar access version. This means both implementations are shared memory bank conflict-free and achieve the highest memory efficiency the hardware offers. The advantage of using vectorized access to shared memory is that it reduces the number of instructions and thus the instruction overhead, which can lead to better performance in some cases.
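To see the instruction-count effect alongside the excessive-wavefront metric, a command along these lines could be used. Here smsp__inst_executed.sum is assumed as the executed-instruction counter, and whether derived__ metrics can be collected directly with --metrics depends on the Nsight Compute version, so treat this as a sketch:

$ ncu --metrics derived__memory_l1_wavefronts_shared_excessive,smsp__inst_executed.sum ./shared_memory_vectorization --num_iterations 1 --num_warmups 0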

Conclusions

In conclusion, the GPU is natively designed to support vectorized access to shared memory without causing bank conflicts. With careful design of the shared memory access pattern, it is therefore possible to use vectorized access to read and write shared memory without bank conflicts. The performance of the vectorized access version can be on par with that of the scalar access version, and both versions can achieve the highest efficiency the hardware offers without excessive shared memory bank conflicts.
