The CUDA default stream can have different synchronization behaviors depending on how it is configured. Sometimes, its implicit synchronization even helps a program run correctly when we have made mistakes in assigning CUDA streams to kernels.
In this blog post, I would like to introduce the two types of CUDA default stream, the legacy default stream and the per-thread default stream, and discuss their synchronization behaviors in different scenarios.
Default Stream and Non-Default Blocking Stream
In the following example, I created a non-default blocking stream using cudaStreamCreate. For a series of CUDA kernels that are supposed to run in sequence on the same non-default blocking CUDA stream, I made a mistake and accidentally used the default stream for one of the kernels.
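The examples below reference a CHECK_CUDA_ERROR macro and a launch_add_val_in_place helper whose definitions are not shown. The following is a minimal sketch of what they might look like; the grid and block sizes are illustrative assumptions, not values from the original code.

```cuda
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error.
#define CHECK_CUDA_ERROR(call)                                         \
    do                                                                 \
    {                                                                  \
        cudaError_t const err{(call)};                                 \
        if (err != cudaSuccess)                                        \
        {                                                              \
            std::fprintf(stderr, "CUDA error: %s\n",                   \
                         cudaGetErrorString(err));                     \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Grid-stride loop so that any grid size covers all n elements.
__global__ void add_val_in_place(int32_t* data, int32_t val, uint32_t n)
{
    uint32_t const stride{blockDim.x * gridDim.x};
    for (uint32_t i{blockIdx.x * blockDim.x + threadIdx.x}; i < n;
         i += stride)
    {
        data[i] += val;
    }
}

void launch_add_val_in_place(int32_t* data, int32_t val, uint32_t n,
                             cudaStream_t stream)
{
    dim3 const threads_per_block{256};
    dim3 const blocks_per_grid{(n + 255) / 256};
    add_val_in_place<<<blocks_per_grid, threads_per_block, 0, stream>>>(
        data, val, n);
}
```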
If the default stream is the legacy default stream, when an action is taken in the legacy stream, such as a kernel launch or cudaStreamWaitEvent(), the legacy stream first waits on all blocking streams, the action is queued in the legacy stream, and then all blocking streams wait on the legacy stream. Therefore, even though I made a mistake, the CUDA kernels still run in sequence and the correctness of the application is not affected.
If the default stream is the per-thread default stream, it is non-blocking and will not synchronize with other CUDA streams. Therefore, my mistake will cause the application to run incorrectly.
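Which default stream is used is decided at compile time. As a sketch, following the CUDA runtime documentation, the per-thread default stream can be requested either with an nvcc flag or with a macro, and CUDA also exposes special stream handles that request a specific behavior explicitly:

```cuda
// Compile with: nvcc --default-stream per-thread app.cu
// or define this macro before including any CUDA header:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

// Alternatively, regardless of the compilation mode, the special handles
// cudaStreamLegacy and cudaStreamPerThread can be passed wherever a
// cudaStream_t is expected:
// cudaStreamLegacy    - legacy default stream (blocking semantics)
// cudaStreamPerThread - per-thread default stream (non-blocking)
```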
```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

bool check_array_value(int32_t const* data, uint32_t n, int32_t val)
{
    for (uint32_t i{0}; i < n; ++i)
    {
        if (data[i] != val)
        {
            return false;
        }
    }
    return true;
}

int main()
{
    constexpr uint32_t const n{1000000};
    constexpr int32_t const val_1{1};
    constexpr int32_t const val_2{2};
    constexpr int32_t const val_3{3};

    // Create a multi-stream application.
    cudaStream_t stream_1{0};
    cudaStream_t stream_2{0};
    // stream_1 is a non-default blocking stream.
    CHECK_CUDA_ERROR(cudaStreamCreate(&stream_1));

    std::vector<int32_t> vec(n, 0);
    int32_t* d_data{nullptr};
    CHECK_CUDA_ERROR(cudaMalloc(&d_data, n * sizeof(int32_t)));
    CHECK_CUDA_ERROR(cudaMemcpy(d_data, vec.data(), n * sizeof(int32_t),
                                cudaMemcpyHostToDevice));

    // Run a sequence of CUDA kernels in order on the same CUDA stream.
    launch_add_val_in_place(d_data, val_1, n, stream_1);
    // The second kernel launch is supposed to run on stream_1.
    // However, the implementation has a typo such that the kernel
    // is launched on the default stream_2 instead.
    launch_add_val_in_place(d_data, val_2, n, stream_2);
    launch_add_val_in_place(d_data, val_3, n, stream_1);

    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream_1));
    CHECK_CUDA_ERROR(cudaMemcpy(vec.data(), d_data, n * sizeof(int32_t),
                                cudaMemcpyDeviceToHost));

    // Check the correctness of the application.
    // The result is still correct if the default stream_2 is the
    // legacy default stream.
    assert(check_array_value(vec.data(), n, val_1 + val_2 + val_3));

    CHECK_CUDA_ERROR(cudaFree(d_data));
    CHECK_CUDA_ERROR(cudaStreamDestroy(stream_1));

    return 0;
}
```
Depending on the use case, this kind of mistake may still hurt application performance, because the implicit synchronization serializes work across streams. It can usually be identified using CUDA profiling tools, such as Nsight Systems.
However, if the default stream is the per-thread default stream, the result is no longer correct, because the kernels no longer execute in sequence.
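If the misplaced launch cannot be moved off the default stream immediately, one way to restore the ordering regardless of the default-stream mode is to chain the streams explicitly with CUDA events. The following is a sketch, reusing the variables from the example above; the event names are my own:

```cuda
cudaEvent_t event_1{};
cudaEvent_t event_2{};
CHECK_CUDA_ERROR(cudaEventCreate(&event_1));
CHECK_CUDA_ERROR(cudaEventCreate(&event_2));

launch_add_val_in_place(d_data, val_1, n, stream_1);
// Make the (misplaced) launch on stream_2 wait for the first kernel.
CHECK_CUDA_ERROR(cudaEventRecord(event_1, stream_1));
CHECK_CUDA_ERROR(cudaStreamWaitEvent(stream_2, event_1, 0));
launch_add_val_in_place(d_data, val_2, n, stream_2);
// Make the third kernel on stream_1 wait for the second kernel.
CHECK_CUDA_ERROR(cudaEventRecord(event_2, stream_2));
CHECK_CUDA_ERROR(cudaStreamWaitEvent(stream_1, event_2, 0));
launch_add_val_in_place(d_data, val_3, n, stream_1);

CHECK_CUDA_ERROR(cudaStreamSynchronize(stream_1));
CHECK_CUDA_ERROR(cudaEventDestroy(event_1));
CHECK_CUDA_ERROR(cudaEventDestroy(event_2));
```

The cleaner fix is, of course, simply to launch all three kernels on stream_1; the event chaining is useful when the launches live in different components that cannot share a stream handle.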
Default Stream and Non-Default Non-Blocking Stream
In some applications, the non-default stream is created using cudaStreamCreateWithFlags with the cudaStreamNonBlocking flag, which makes it non-blocking. In this case, the default stream, even if it is the legacy default stream, does not synchronize with the non-default non-blocking stream. Therefore, my mistake will cause the application to run incorrectly, regardless of whether the default stream is the legacy or the per-thread default stream.
```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

bool check_array_value(int32_t const* data, uint32_t n, int32_t val)
{
    for (uint32_t i{0}; i < n; ++i)
    {
        if (data[i] != val)
        {
            return false;
        }
    }
    return true;
}

int main()
{
    constexpr uint32_t const n{1000000};
    constexpr int32_t const val_1{1};
    constexpr int32_t const val_2{2};
    constexpr int32_t const val_3{3};

    // Create a multi-stream application.
    cudaStream_t stream_1{0};
    cudaStream_t stream_2{0};
    // stream_1 is a non-default non-blocking stream.
    CHECK_CUDA_ERROR(cudaStreamCreateWithFlags(&stream_1, cudaStreamNonBlocking));

    std::vector<int32_t> vec(n, 0);
    int32_t* d_data{nullptr};
    CHECK_CUDA_ERROR(cudaMalloc(&d_data, n * sizeof(int32_t)));
    CHECK_CUDA_ERROR(cudaMemcpy(d_data, vec.data(), n * sizeof(int32_t),
                                cudaMemcpyHostToDevice));

    // Run a sequence of CUDA kernels in order on the same CUDA stream.
    launch_add_val_in_place(d_data, val_1, n, stream_1);
    // The second kernel launch is supposed to run on stream_1.
    // However, the implementation has a typo such that the kernel
    // is launched on the default stream_2 instead.
    launch_add_val_in_place(d_data, val_2, n, stream_2);
    launch_add_val_in_place(d_data, val_3, n, stream_1);

    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream_1));
    CHECK_CUDA_ERROR(cudaMemcpy(vec.data(), d_data, n * sizeof(int32_t),
                                cudaMemcpyDeviceToHost));

    // Check the correctness of the application.
    // Because stream_1 is non-blocking, the result is no longer correct,
    // even if the default stream_2 is the legacy default stream, and the
    // assertion below can fail.
    assert(check_array_value(vec.data(), n, val_1 + val_2 + val_3));

    CHECK_CUDA_ERROR(cudaFree(d_data));
    CHECK_CUDA_ERROR(cudaStreamDestroy(stream_1));

    return 0;
}
```