CUDA_LAUNCH_BLOCKING=1

Introduction

When we run CUDA programs on GPUs, we sometimes encounter asynchronous errors that are only reported after a later synchronization. To localize where exactly the error occurred, there are usually two approaches:

  1. In a relatively simple application or system with few kernel launches, we can put a cudaDeviceSynchronize() or cudaStreamSynchronize(stream) after every kernel launch to force synchronization and error checking. We then rebuild the application or system and rerun it to see where exactly the error occurred.
  2. In a more complex application, we can instead set the environment variable CUDA_LAUNCH_BLOCKING=1 to force all kernel launches to be synchronous. There is no need to rebuild the application or system; we just rerun it to see where exactly the error occurred.
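A minimal sketch of the first approach, assuming a hypothetical error-checking macro CHECK_CUDA and made-up kernel names:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for illustration.
#define CHECK_CUDA(call)                                               \
    do                                                                 \
    {                                                                  \
        cudaError_t const err{(call)};                                 \
        if (err != cudaSuccess)                                        \
        {                                                              \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                         cudaGetErrorString(err), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

__global__ void kernel_a(float* data) { /* ... */ }
__global__ void kernel_b(float* data) { /* ... */ }

void run(float* data, cudaStream_t stream)
{
    kernel_a<<<1, 256, 0, stream>>>(data);
    // Force the launch to complete so that an asynchronous error
    // surfaces here rather than at some later synchronization point.
    CHECK_CUDA(cudaStreamSynchronize(stream));

    kernel_b<<<1, 256, 0, stream>>>(data);
    CHECK_CUDA(cudaStreamSynchronize(stream));
}
```

Every such synchronization has to be added in the source code, which is why this approach requires rebuilding the application.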

In this blog post, I would like to quickly discuss why the second approach, CUDA_LAUNCH_BLOCKING=1, is favored over the first approach for debugging CUDA programs.

CUDA_LAUNCH_BLOCKING=1 Is Favored

A common analogy for CUDA_LAUNCH_BLOCKING=1 says that it is equivalent to putting a cudaDeviceSynchronize() or cudaStreamSynchronize(stream) after every kernel launch, i.e., that the two approaches mentioned above are equivalent. However, this is incorrect. There are scenarios where CUDA_LAUNCH_BLOCKING=1 can reveal the correct error location, while per-launch synchronization cannot.

Suppose we have two CPU threads. On CPU thread 1, we launch kernel A on the CUDA stream stream_1, followed by a cudaStreamSynchronize(stream_1). On CPU thread 2, we launch kernel B on the CUDA stream stream_2, followed by a cudaStreamSynchronize(stream_2). This approach cannot guarantee that the execution of kernel A has no overlap with the execution of kernel B. For example, if the CPU instruction issue order is: launch A on stream_1, launch B on stream_2, synchronize stream_1, synchronize stream_2, then the execution of kernel A and kernel B can overlap.

For some simple asynchronous errors, such as an illegal memory access, this approach might still reveal the correct error location, since the synchronization is stream-specific. However, for debugging more complex problems, such as a race condition caused by kernel A and kernel B writing to the same memory location, this approach will likely fail to help us root-cause the problem.

I once encountered a CUDA kernel that ran perfectly fine in a single-CPU-thread, single-CUDA-stream environment. However, when I ran it in a multi-CPU-thread, multi-CUDA-stream environment, it would sometimes produce incorrect results. It turned out that the kernel overwrote a global __device__ variable. Launching the kernel from multiple CPU threads on multiple CUDA streams caused race conditions on that global __device__ variable, hence the incorrect results. In this case, if I could make CUDA kernel launches truly synchronous, I would be able to observe that the program produces correct results, confirming my debugging hypothesis.
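The race described above can be sketched roughly as follows; the kernel, variable, and function names are made up for illustration. Each launch overwrites the same global __device__ variable, so two launches running concurrently on different streams can read each other's values, and the per-stream synchronization does nothing to prevent the overlap:

```cuda
#include <cuda_runtime.h>

// A single global device variable shared by every launch of the kernel.
__device__ float g_scale;

__global__ void scale_kernel(float* data, float scale, size_t n)
{
    // Every launch overwrites the same global variable...
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        g_scale = scale;
    }
    size_t const idx{blockIdx.x * blockDim.x + threadIdx.x};
    if (idx < n)
    {
        // ...so a concurrent launch on another stream may have already
        // overwritten g_scale with its own value before this read.
        data[idx] *= g_scale;
    }
}

// Called from two different CPU threads with two different streams.
void worker(float* data, float scale, size_t n, cudaStream_t stream)
{
    unsigned int const grid{static_cast<unsigned int>((n + 255) / 256)};
    scale_kernel<<<grid, 256, 0, stream>>>(data, scale, n);
    // Stream-specific synchronization only makes this CPU thread wait;
    // it does not prevent the two kernels from overlapping on the GPU.
    cudaStreamSynchronize(stream);
}
```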

Using CUDA_LAUNCH_BLOCKING=1, the execution of kernel A and kernel B will never overlap. The mental model of CUDA_LAUNCH_BLOCKING=1 is that the GPU executes only one kernel at a time, i.e., as if there were only one CUDA stream available, and the kernel launch call on the CPU thread does not return until the kernel finishes execution. Because of this, no matter how many CPU threads and CUDA streams we have, CUDA kernel launches will always be synchronous.
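One consequence of this is that, when the application is run with CUDA_LAUNCH_BLOCKING=1 in the environment, a kernel execution error can be caught right at the launch site with cudaGetLastError(), without any explicit synchronization call in the code. A minimal sketch, with a hypothetical kernel name:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void faulty_kernel(float* data);

void launch(float* data)
{
    faulty_kernel<<<1, 256>>>(data);
    // With CUDA_LAUNCH_BLOCKING=1, the launch does not return until the
    // kernel has finished, so any execution error has already been
    // recorded and is reported here, at the exact launch site.
    cudaError_t const err{cudaGetLastError()};
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "faulty_kernel failed: %s\n",
                     cudaGetErrorString(err));
    }
}
```

Without CUDA_LAUNCH_BLOCKING=1, the same cudaGetLastError() call would typically only catch launch-configuration errors, because the kernel would still be running asynchronously at that point.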

Therefore, CUDA_LAUNCH_BLOCKING=1 should be the preferred approach for debugging asynchronous errors in CUDA applications, and it is also easier to use than the other approach.

Author

Lei Mao

Posted on

03-20-2026

Updated on

03-20-2026
