CUDA Kernel Execution Overlap

Introduction

In my previous blog post “CUDA Stream”, I have discussed about how CUDA streams helps the CUDA program achieves concurrency. At the end of the article, I also mentioned that in addition to memory transfer and kernel execution overlap, execution overlap between different kernels is also allowed. However, many CUDA programmers wondered why they have not encountered kernel execution overlap before.

In this blog post, I would like to discuss about the CUDA kernel execution overlap and why we could or could not see them in practice.

CUDA Kernel Execution Overlap

Computation Resources

CUDA kernel executions can overlap if there are sufficient computation resource to parallelize multiple kernel executions.

In the following example, by changing the value of blocks_per_grid from small to large, we could see that the kernel executions from different CUDA streams changes from full-parallelization, to partial-parallelization, and finally to almost no-parallelization. This is because, when the computation resource allocated for one CUDA kernel becomes larger, the computation resource for additional CUDA kernels becomes smaller.

overlap.cu
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
#include <cuda_runtime.h>
#include <iostream>
#include <vector>

#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
template <typename T>
void check(T err, const char* const func, const char* const file,
const int line)
{
if (err != cudaSuccess)
{
std::cerr << "CUDA Runtime Error at: " << file << ":" << line
<< std::endl;
std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
std::exit(EXIT_FAILURE);
}
}

#define CHECK_LAST_CUDA_ERROR() checkLast(__FILE__, __LINE__)
void checkLast(const char* const file, const int line)
{
cudaError_t err{cudaGetLastError()};
if (err != cudaSuccess)
{
std::cerr << "CUDA Runtime Error at: " << file << ":" << line
<< std::endl;
std::cerr << cudaGetErrorString(err) << std::endl;
std::exit(EXIT_FAILURE);
}
}

__global__ void float_add_one(float* buffer, uint32_t n)
{
uint32_t const idx{blockDim.x * blockIdx.x + threadIdx.x};
uint32_t const stride{blockDim.x * gridDim.x};

for (uint32_t i{idx}; i < n; i += stride)
{
buffer[i] += 1.0F;
}
}

void launch_float_add_one(float* buffer, uint32_t n,
dim3 const& threads_per_block,
dim3 const& blocks_per_grid, cudaStream_t stream)
{
float_add_one<<<blocks_per_grid, threads_per_block, 0, stream>>>(buffer, n);
CHECK_LAST_CUDA_ERROR();
}

int main(int argc, char** argv)
{
size_t const buffer_size{1024 * 10240};
size_t const num_streams{5};

dim3 const threads_per_block{1024};
// Try different values for blocks_per_grid
// 1, 2, 4, 8, 16, 32, 1024, 2048
dim3 const blocks_per_grid{32};

std::vector<float*> d_buffers(num_streams);
std::vector<cudaStream_t> streams(num_streams);

for (auto& d_buffer : d_buffers)
{
CHECK_CUDA_ERROR(cudaMalloc(&d_buffer, buffer_size * sizeof(float)));
}

for (auto& stream : streams)
{
CHECK_CUDA_ERROR(cudaStreamCreate(&stream));
}

for (size_t i = 0; i < num_streams; ++i)
{
launch_float_add_one(d_buffers[i], buffer_size, threads_per_block,
blocks_per_grid, streams[i]);
}

for (auto& stream : streams)
{
CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
}

for (auto& d_buffer : d_buffers)
{
CHECK_CUDA_ERROR(cudaFree(d_buffer));
}

for (auto& stream : streams)
{
CHECK_CUDA_ERROR(cudaStreamDestroy(stream));
}

return 0;
}
1
2
$ nvcc overlap.cu -o overlap    
$ ./overlap

We observed full-parallelization for blocks_per_grid = 1. However, we could also see that the time spent for finishing all the kernels was long because the GPU was not fully utilized.

blocks_per_grid = 1

When we set blocks_per_grid = 32, only some of the kernel executions were parallelized. However, the GPU was fully utilized and the time spent for finishing all the kernels was much less compared to the blocks_per_grid = 1.

blocks_per_grid = 32

Same as blocks_per_grid = 32, when we set blocks_per_grid = 5120, there was almost no kernel executions parallelized. However, the GPU was still fully utilized and the time spent for finishing all the kernels was much less compared to the blocks_per_grid = 1.

blocks_per_grid = 5120

Implicit Synchronization

It is also possible that there is no kernel execution overlap even if there are sufficient computation resources. It could be due to that there are CUDA commands issued by the host thread to the default stream between other CUDA commands from other different streams causing implicit synchronization.

In my opinion, this rarely happens in the single-threaded CUDA programs due to the way CUDA programmers usually writes CUDA programs. However, it will definitely happen for the multi-threaded CUDA programs. To overcome this situation, since CUDA 7, a per-thread default stream compile mode has been created. The user would just have to specify --default-stream per-thread in the NVCC compiler building flags without having to change the existing CUDA program to disable implicit synchronization. To see more details about how to simplify CUDA concurrency using per-thread default stream, please read Mark Harris’s blog post.

As of CUDA 11.4, the default building argument is still legacy. The user would have to manually change it to per-thread in order to use the per-thread default stream. From the CUDA 11.4 NVCC help:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
--default-stream {legacy|null|per-thread}       (-default-stream)               
Specify the stream that CUDA commands from the compiled program will be sent
to by default.

legacy
The CUDA legacy stream (per context, implicitly synchronizes with
other streams).

per-thread
A normal CUDA stream (per thread, does not implicitly
synchronize with other streams).

'null' is a deprecated alias for 'legacy'.

Allowed values for this option: 'legacy','null','per-thread'.
Default value: 'legacy'.

Conclusions

If there is no implicit synchronization from the default CUDA stream, partial or no CUDA kernel execution parallelization usually indicate high GPU utilization, and full CUDA kernel execution parallelization usually indicate GPU might have not been fully utilized.

If the no CUDA kernel execution overlap was due to the implicit synchronization from the default CUDA stream, we should probably think of disabling it by enabling the per-thread default stream.

References

Author

Lei Mao

Posted on

06-10-2022

Updated on

06-10-2022

Licensed under


Comments