Proper CUDA Error Checking
Proper CUDA error checking is critical for making the CUDA program development smooth and successful. Missing or incorrectly identifying CUDA errors could cause problems in production or waste lots of time in debugging.
In this blog post, I would like to quickly discuss proper CUDA error checking.
CUDA errors could be separated into synchronous and asynchronous errors, or sticky and non-sticky errors.
CUDA kernel launch is asynchronous, meaning when the host thread reaches the code for kernel launch, say
kernel<<<...>>>, the host thread issues an request to execute the kernel on GPU, then the host thread that launches the kernel continues, without waiting for the kernel to complete. The kernel might not begin to execute right away either.
There could be two types of error for CUDA kernel launch, synchronous error and asynchronous error.
Synchronous error happens when the host thread knows the kernel is illegal or invalid. For example, when the thread block size or grid size is too large, a synchronous error is resulted immediately after the kernel launch call, and this error could be captured by CUDA runtime error capturing API calls, such as
cudaGetLastError, right after the kernel launch call.
Asynchronous error happens during kernel execution or CUDA runtime asynchronous API execution on GPU. It might take a while to encounter the error and send the error to host thread. For example, For example, it might encounter accessing invalid memory address in the late stage of kernel execution or CUDA runtime asynchronous API
cudaMemcpyAsync execution, it will abort the execution and then send the error back to thread. Even if there are CUDA runtime error capturing API calls, such as
cudaGetLastError, right after the kernel launch call, at the time when the error reaches host, those CUDA runtime error capturing API calls have been executed and they found no error. It is possible to capture the asynchronous error by explicitly synchronizing using the CUDA kernel launch using CUDA runtime API calls, such as
cudaEventSynchronize, and checking the returned error from those CUDA kernel launch using CUDA runtime API calls or capturing the error using CUDA runtime error capturing API calls, such as
cudaGetLastError. However, explicitly synchronization usually affects performance and therefore is not recommended for using in production unless it is extremely necessary.
CUDA runtime API returns non-sticky error if there is any, whereas CUDA kernel execution resulted in sticky error if there is any.
A non-sticky error is recoverable, meaning subsequent CUDA runtime API calls could behave normally. Therefore, the CUDA context is not corrupted. For example, when we allocate memory using
cudaMalloc, it will return a non-sticky error if the GPU memory is insufficient.
A sticky error is not recoverable, meaning subsequent CUDA runtime API calls will always return the same error. Therefore, the CUDA context is corrupted, unless the application host process is terminated. For example, when the kernel tries to access invalid memory address during kernel execution, it will result in a sticky error which will be captured and returned by all the subsequent CUDA runtime API calls.
In a CUDA program implementation, both development and production code, always check the return value of each CUDA runtime synchronous or asynchronous API call to see if there is any CUDA synchronous error, always run CUDA runtime error capturing API calls, such as
cudaGetLastError, after kernel launch calls to see if there is any CUDA synchronous error. Check CUDA asynchronous error in development by synchronization and error checking after kernel launch calls and disable it in production.
There is a question on the NVIDIA developer forum. Let’s use it as a quiz. Basically, the user has the following code. All calculations are done on the default stream and one thread. The
cudaSuccess, but the
cudaGetLastError call returns an invalid device function error. How would this happen?
// do some stuff, launch kernels, etc
cudaGetLastError returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to
cudaDeviceSynchronize is a CUDA runtime API call and it got no error. This means the kernel launch got no asynchronous error. However, there could be errors from CUDA runtime API calls prior to launching the kernel or the kernel launching encountered synchronous error which have not been properly error-checked. The last error that produced by those would not be reset until the
cudaGetLastError call, even though before the reset there were
cudaSuccess from other CUDA runtime API calls.
$ nvcc last_error.cu -o last_error
Fundamentally, it was due to that the CUDA program error checking was not following the best practice mentioned previously.
Proper CUDA Error Checking