Multi-Thread Single-Stream VS Single-Thread Multi-Stream CUDA
Introduction
In CUDA programming, to maximize GPU utilization, we often use multiple CUDA streams in the implementation. This raises a question: should we implement the CUDA program in a multi-thread fashion where each thread uses one CUDA stream, or in a single-thread fashion where the thread uses multiple CUDA streams?
In this blog post, I implemented a high-performance addition program and compared the performance between multi-thread single-stream CUDA and single-thread multi-stream CUDA.
High-Performance Addition
In this example, I implemented the array addition using the CPU and CUDA. We could adjust the array size, the number of additions to perform, the number of threads, and the number of CUDA streams per thread, and measure the latency.
All the tests were performed on an x86-64 Ubuntu 20.04 LTS desktop with an Intel i9-9900K CPU and an NVIDIA RTX 2080 Ti GPU.
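The full program listing did not survive extraction. The core of the CUDA implementation might look something like the following sketch; the kernel name, signature, and launch configuration are my assumptions, not the author's actual code.

```cuda
#include <cuda_runtime.h>

// Element-wise addition, repeated num_additions times per element.
// Repeating the addition raises the arithmetic intensity, which is the
// knob this program uses to move between memory-bound and math-bound.
__global__ void add_kernel(float* c, const float* a, const float* b,
                           size_t n, size_t num_additions)
{
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride)
    {
        float sum = 0.0f;
        for (size_t j = 0; j < num_additions; ++j)
        {
            sum += a[i] + b[i];
        }
        c[i] = sum;
    }
}

// Launch the kernel asynchronously on a given CUDA stream.
void launch_add(float* c, const float* a, const float* b, size_t n,
                size_t num_additions, cudaStream_t stream)
{
    const unsigned int threads_per_block = 256;
    const unsigned int blocks_per_grid =
        static_cast<unsigned int>((n + threads_per_block - 1) / threads_per_block);
    add_kernel<<<blocks_per_grid, threads_per_block, 0, stream>>>(
        c, a, b, n, num_additions);
}
```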
To build the application, please run the following command in the terminal to start a CUDA Docker container.

$ docker run -it --rm --gpus all --privileged -v $(pwd):/mnt -w /mnt nvcr.io/nvidia/cuda:11.4.1-cudnn8-devel-ubuntu20.04
The application takes $4$ arguments: the array size, the number of additions to perform, the number of threads, and the number of CUDA streams per thread.
For example, ./add 100 10 8 1 runs the application on an array of size 100, performing the addition 10 times, distributed across 8 threads, with each thread using 1 CUDA stream.

$ ./add 100 10 8 1
Similarly, ./add 100 10 8 0 runs the application on an array of size 100, performing the addition 10 times, distributed across 8 threads, using the CPU only (a stream count of 0 selects the CPU implementation).

$ ./add 100 10 8 0
Math-Bound VS Memory-Bound
In my previous blog post “Math-Bound VS Memory-Bound Operations”, we discussed math-bound and memory-bound operations. In this particular program, we could make the operation math-bound or memory-bound by adjusting the number of additions.
From the measured performance, we could see that the GPU is extremely good at performing math-bound operations, whereas for memory-bound operations the GPU did not show a significant advantage.
$ ./add 10000000 100 16 0
In fact, even performing the addition $100$ times does not make the operation math-bound on the GPU. The time spent on executing the kernel is only $1.48\%$, whereas the rest of the time was spent on memory copy.
$ nvprof ./add 10000000 100 16 1
If we increase the number of additions to $1000000$, which the CPU can hardly handle, the GPU could still perform extremely well. The operation has also become math-bound, since the time spent on executing the kernel is now $97.43\%$.
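A rough arithmetic-intensity estimate shows why the number of additions controls which regime we are in. These are my back-of-the-envelope numbers, assuming float (4-byte) arrays of $N$ elements, $n_{\text{add}}$ repeated additions, and each array traversing memory once:

```latex
% FLOPs performed by the kernel vs. bytes moved for the three arrays:
\text{FLOPs} \approx n_{\text{add}} \cdot N,
\qquad
\text{Bytes} \approx 3 \cdot 4 \cdot N
\quad \text{(read } a \text{, read } b \text{, write } c \text{)}

% Arithmetic intensity therefore grows linearly with the number of additions:
I = \frac{\text{FLOPs}}{\text{Bytes}}
  \approx \frac{n_{\text{add}}}{12} \ \text{FLOP/byte}
```

With $n_{\text{add}} = 1$ the intensity is far below what any GPU needs to be compute-limited, while very large $n_{\text{add}}$ pushes the kernel firmly into the math-bound regime.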
$ nvprof ./add 10000000 1000000 16 1
Multi-Thread Single-Stream VS Single-Thread Multi-Stream
Here we compared the performance between multi-thread single-stream CUDA and single-thread multi-stream CUDA. Concretely, we compared the addition performance for the following two cases:
- 1 thread and the thread has 16 CUDA streams.
- 16 threads and each thread has 1 CUDA stream.
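The two configurations can be sketched as follows. This is a simplified sketch of the dispatch structure only; run_addition stands in for the actual kernel launch, and the names are mine, not the author's.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical per-stream work item: launches the addition kernel
// asynchronously on the given stream. Definition omitted.
void run_addition(cudaStream_t stream);

// Case 1: 1 thread, 16 streams. A single host thread launches work on
// every stream; kernels and copies on different streams may overlap.
void single_thread_multi_stream(std::vector<cudaStream_t> const& streams)
{
    for (cudaStream_t stream : streams)
    {
        run_addition(stream);  // asynchronous launch
    }
    for (cudaStream_t stream : streams)
    {
        cudaStreamSynchronize(stream);
    }
}

// Case 2: 16 threads, 1 stream each. Each host thread owns one stream;
// all launches still go through the same CUDA context and driver.
void multi_thread_single_stream(std::vector<cudaStream_t> const& streams)
{
    std::vector<std::thread> threads;
    for (cudaStream_t stream : streams)
    {
        threads.emplace_back([stream]() {
            run_addition(stream);
            cudaStreamSynchronize(stream);
        });
    }
    for (std::thread& t : threads)
    {
        t.join();
    }
}
```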
From the latencies we measured, we could see that for the math-bound operations there is no significant performance difference between the two cases, whereas for the memory-bound operations the single-thread multi-stream implementation was faster.
$ ./add 100000000 1 16 1
Summary
All the experiment results are summarized below.
Array Size | Number of Additions | Number of Threads | Number of Streams Per Thread | Average Latency (ms)
---|---|---|---|---
10000000 | 1 | 16 | 0 | 2.90
10000000 | 1 | 16 | 1 | 12.20
10000000 | 1 | 1 | 16 | 9.08
10000000 | 100 | 16 | 0 | 176.47
10000000 | 100 | 16 | 1 | 10.93
10000000 | 1000000 | 16 | 1 | 242.83
10000000 | 1000000 | 1 | 16 | 250.37
100000000 | 1 | 16 | 1 | 64.67
100000000 | 1 | 1 | 16 | 70.82
Conclusion
The latency difference between multi-thread single-stream CUDA and single-thread multi-stream CUDA is small.
Multi-Thread Single-Stream VS Single-Thread Multi-Stream CUDA
https://leimao.github.io/blog/Multi-Thread-Single-Stream-VS-Single-Thread-Multi-Stream-CUDA/