## Introduction

In CUDA programming, to maximize GPU utilization, we often use multiple CUDA streams in the implementation. This raises a question: should we implement the CUDA program in a multi-thread fashion where each thread uses one CUDA stream, or in a single-thread fashion where the thread uses multiple CUDA streams?

In this blog post, I implemented a high-performance array addition program and compared the performance of multi-thread single-stream CUDA against single-thread multi-stream CUDA.

In this example, I implemented array addition on both CPU and CUDA. We could adjust the array size, the number of additions to perform, the number of threads, and the number of CUDA streams per thread, and measure the latency.
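The kernel at the heart of the program can be sketched as follows. This is a minimal sketch rather than the exact implementation; the kernel name `add_kernel` and the grid-stride loop are assumptions.

```cuda
#include <cuda_runtime.h>

// Grid-stride array addition kernel (a sketch, not the exact
// implementation). Repeating the addition num_additions times
// adjusts the math intensity: a small count keeps the operation
// memory-bound, a large count makes it math-bound.
__global__ void add_kernel(float* output, float const* input_1,
                           float const* input_2, size_t n,
                           size_t num_additions)
{
    size_t const stride{blockDim.x * gridDim.x};
    for (size_t i{blockIdx.x * blockDim.x + threadIdx.x}; i < n;
         i += stride)
    {
        float sum{0.0f};
        for (size_t j{0}; j < num_additions; ++j)
        {
            sum = input_1[i] + input_2[i];
        }
        output[i] = sum;
    }
}
```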

All the tests were performed on an x86-64 Ubuntu 20.04 LTS desktop with Intel i9-9900K CPU and NVIDIA RTX 2080 TI GPU.

To build the application, please run the following command in the terminal.
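The exact build command was not preserved here; assuming the source file is named `add.cu`, a typical `nvcc` invocation would look like the following.

```bash
$ nvcc add.cu -o add -std=c++14
```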

The application consumes $4$ arguments: the array size, the number of additions to perform, the number of threads, and the number of CUDA streams per thread.

For example, `./add 100 10 8 1` runs the application on an array of size 100, performing the addition 10 times, distributed across 8 threads, with each thread using 1 CUDA stream.

Similarly, `./add 100 10 8 0` runs the application on an array of size 100, performing the addition 10 times, distributed across 8 threads using the CPU only.

## Math-Bound VS Memory-Bound

In my previous blog post "Math-Bound VS Memory-Bound Operations", we discussed math-bound and memory-bound operations. In this particular program, we could make the operation math-bound or memory-bound by adjusting the number of additions.

From the measured performance, we could see that the GPU is extremely good at performing math-bound operations, whereas for memory-bound operations the GPU did not show a significant advantage.

In fact, even performing the addition $100$ times does not make the operation math-bound on the GPU. The time spent executing the kernel is only $1.48\%$, whereas the rest of the time was spent on memory copies.

If we increase the number of additions to $1000000$, which the CPU can hardly handle, the GPU could still perform extremely well. The operation has also become math-bound, since the time spent executing the kernel is now $97.43\%$.

Here we compared the performance of multi-thread single-stream CUDA against single-thread multi-stream CUDA. Concretely, we compared the addition performance for the following two cases: 16 threads with 1 CUDA stream per thread, and 1 thread with 16 CUDA streams.
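The two launch strategies can be sketched as follows. This is a hedged sketch of the structure rather than the exact benchmark code; the helper `launch_work`, which would enqueue one share of the additions (async copies and kernel launch) on a given stream, is an assumed name.

```cuda
#include <cuda_runtime.h>

#include <thread>
#include <vector>

// Hypothetical helper that enqueues one share of the work
// (async H2D copy, kernel launch, async D2H copy) on the
// given stream. The name and signature are assumptions.
void launch_work(cudaStream_t stream);

// Case 1: multi-thread single-stream. Each host thread owns
// one CUDA stream and enqueues its share of the work on it.
void multi_thread_single_stream(size_t num_threads)
{
    std::vector<std::thread> threads{};
    for (size_t i{0}; i < num_threads; ++i)
    {
        threads.emplace_back([]() {
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            launch_work(stream);
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
        });
    }
    for (std::thread& thread : threads)
    {
        thread.join();
    }
}

// Case 2: single-thread multi-stream. One host thread owns
// all the CUDA streams and enqueues work on each of them.
void single_thread_multi_stream(size_t num_streams)
{
    std::vector<cudaStream_t> streams(num_streams);
    for (cudaStream_t& stream : streams)
    {
        cudaStreamCreate(&stream);
    }
    for (cudaStream_t& stream : streams)
    {
        launch_work(stream);
    }
    for (cudaStream_t& stream : streams)
    {
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }
}
```

In both cases, work enqueued on different streams may overlap on the GPU; the difference is only in how the host side issues the commands.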

From the latency we measured, we could see that for the math-bound operations there is no significant performance difference between the two cases, whereas for the memory-bound operations the single-thread multi-stream implementation is faster.

## Summary

All the experiment results are summarized below.

| Array Size | Number of Additions | Number of Threads | Number of Streams per Thread | Latency |
|-----------:|--------------------:|------------------:|-----------------------------:|--------:|
|   10000000 |                   1 |                16 |                    0 (CPU)   |    2.90 |
|   10000000 |                   1 |                16 |                            1 |   12.20 |
|   10000000 |                   1 |                 1 |                           16 |    9.08 |
|   10000000 |                 100 |                16 |                    0 (CPU)   |  176.47 |
|   10000000 |                 100 |                16 |                            1 |   10.93 |
|   10000000 |             1000000 |                16 |                            1 |  242.83 |
|   10000000 |             1000000 |                 1 |                           16 |  250.37 |
|  100000000 |                   1 |                16 |                            1 |   64.67 |
|  100000000 |                   1 |                 1 |                           16 |   70.82 |

## Conclusion

The latency difference between multi-thread single-stream CUDA and single-thread multi-stream CUDA is small.

Lei Mao

10-18-2021

05-12-2022