Proper CUDA Error Checking

Introduction

Proper CUDA error checking is critical for smooth and successful CUDA program development. Missing or misidentifying CUDA errors can cause problems in production or waste a lot of time in debugging.

In this blog post, I would like to quickly discuss proper CUDA error checking.

CUDA Error Types

CUDA errors can be categorized along two axes: synchronous versus asynchronous errors, and sticky versus non-sticky errors.

Synchronous Error VS Asynchronous Error

CUDA kernel launch is asynchronous, meaning that when the host thread reaches the code for a kernel launch, say kernel<<<...>>>, it issues a request to execute the kernel on the GPU and then continues, without waiting for the kernel to complete. The kernel might not begin to execute right away either.

A CUDA kernel launch can therefore produce two types of errors: synchronous errors and asynchronous errors.

Synchronous error happens when the host thread can tell at launch time that the kernel launch is illegal or invalid. For example, when the thread block size or grid size is too large, a synchronous error occurs immediately after the kernel launch call, and it can be captured by CUDA runtime error-capturing API calls, such as cudaGetLastError, placed right after the kernel launch call.
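
For instance, the following minimal sketch (the kernel name dummy_kernel is made up for illustration) launches a kernel with 2048 threads per block, which exceeds the 1024-thread limit of current GPUs, and captures the resulting synchronous error right after the launch:

#include <cuda_runtime.h>
#include <iostream>

__global__ void dummy_kernel() {}

int main()
{
    // 2048 threads per block exceeds the 1024-thread limit,
    // so the launch fails synchronously.
    dummy_kernel<<<1, 2048>>>();
    // Prints "invalid configuration argument".
    std::cerr << cudaGetErrorString(cudaGetLastError()) << std::endl;
}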

Asynchronous error happens during kernel execution or asynchronous CUDA runtime API execution on the GPU. It might take a while before the error is encountered and reported back to the host thread. For example, a kernel, or the asynchronous CUDA runtime API cudaMemcpyAsync, might access an invalid memory address at a late stage of its execution; the GPU will abort the execution and then send the error back to the host thread. Even if there are CUDA runtime error-capturing API calls, such as cudaGetLastError, right after the kernel launch call, by the time the error reaches the host those calls have already executed and found no error. The asynchronous error can be captured by explicitly synchronizing with the kernel using CUDA runtime API calls, such as cudaDeviceSynchronize, cudaStreamSynchronize, or cudaEventSynchronize, and then checking either the error returned from those synchronization calls or the error captured by error-capturing API calls such as cudaGetLastError. However, explicit synchronization usually hurts performance and is therefore not recommended in production unless absolutely necessary.
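
For instance, in the following minimal sketch (the kernel name illegal_access_kernel is made up for illustration), the launch configuration itself is valid, so cudaGetLastError right after the launch typically reports no error; the illegal memory access only surfaces once we synchronize:

#include <cuda_runtime.h>
#include <iostream>

__global__ void illegal_access_kernel(int* p)
{
    // Dereferences a null pointer on the device.
    *p = 0;
}

int main()
{
    illegal_access_kernel<<<1, 1>>>(nullptr);
    // The launch configuration is valid, so this typically prints "no error".
    std::cerr << cudaGetErrorString(cudaGetLastError()) << std::endl;
    // The asynchronous error surfaces only after synchronization.
    // Prints "an illegal memory access was encountered".
    std::cerr << cudaGetErrorString(cudaDeviceSynchronize()) << std::endl;
}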

Sticky VS Non-Sticky Error

Errors returned by CUDA runtime API calls are non-sticky, whereas errors that occur during CUDA kernel execution are sticky.

A non-sticky error is recoverable, meaning subsequent CUDA runtime API calls can behave normally, and the CUDA context is not corrupted. For example, when we allocate memory using cudaMalloc and the GPU memory is insufficient, cudaMalloc returns a non-sticky out-of-memory error.

A sticky error is not recoverable, meaning subsequent CUDA runtime API calls will keep returning the same error; the CUDA context is corrupted and can only be recovered by terminating the application host process. For example, when a kernel tries to access an invalid memory address during kernel execution, it results in a sticky error that will be captured and returned by all subsequent CUDA runtime API calls.
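
The following minimal sketch (the kernel name illegal_access_kernel is made up for illustration) demonstrates both behaviors: the out-of-memory error from cudaMalloc is non-sticky and the context recovers, whereas the illegal memory access during kernel execution is sticky and poisons every subsequent runtime call:

#include <cuda_runtime.h>
#include <iostream>

__global__ void illegal_access_kernel(int* p) { *p = 0; }

int main()
{
    float* p;
    // Non-sticky: prints "out of memory", but the context is still usable.
    std::cerr << cudaGetErrorString(cudaMalloc(&p, 1000000000000000ULL))
              << std::endl;
    // Prints "no error": the context recovered from the non-sticky error.
    std::cerr << cudaGetErrorString(cudaMalloc(&p, 10 * sizeof(float)))
              << std::endl;
    // Sticky: the kernel performs an illegal memory access.
    illegal_access_kernel<<<1, 1>>>(nullptr);
    cudaDeviceSynchronize();
    // Prints "an illegal memory access was encountered": the context is
    // corrupted, and every subsequent runtime call returns the same error.
    std::cerr << cudaGetErrorString(cudaFree(p)) << std::endl;
}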

CUDA Error Checking Best Practice

In a CUDA program, in both development and production code, always check the return value of each CUDA runtime API call, synchronous or asynchronous, to catch CUDA synchronous errors, and always run a CUDA runtime error-capturing API call, such as cudaGetLastError, after each kernel launch to catch CUDA synchronous errors from the launch. Check for CUDA asynchronous errors in development by synchronizing and error checking after kernel launches, and disable this synchronization in production, as shown in the sketch below.
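
A minimal sketch of this practice follows, using an error-checking macro in the same spirit as the one defined in the example later in this post; guarding the synchronization with NDEBUG is one common convention (an assumption of this sketch, not the only option) for disabling the debug-only check in release builds:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Same spirit as the CHECK_CUDA_ERROR macro defined later in this post.
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, const char* const func, const char* const file,
           const int line)
{
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "CUDA Runtime Error at: %s:%d: %s %s\n", file,
                     line, cudaGetErrorString(err), func);
        std::exit(EXIT_FAILURE);
    }
}

__global__ void my_kernel() {}

int main()
{
    my_kernel<<<1, 1>>>();
    // Synchronous launch errors: always check, in development and production.
    CHECK_CUDA_ERROR(cudaGetLastError());
#ifndef NDEBUG
    // Asynchronous errors: synchronize and check in debug builds only,
    // since the synchronization hurts performance.
    CHECK_CUDA_ERROR(cudaDeviceSynchronize());
#endif
}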

Quiz

There is a question on the NVIDIA developer forum. Let’s use it as a quiz. Basically, the user has the following code. All calculations are done on the default stream and in one host thread. The cudaDeviceSynchronize call returns cudaSuccess, but the cudaGetLastError call returns an invalid device function error. How could this happen?

// do some stuff, launch kernels, etc

res = cudaDeviceSynchronize();
// check res

res = cudaGetLastError();
// check res

cudaGetLastError returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess. cudaDeviceSynchronize is itself a CUDA runtime API call, and it returned no error, which means the kernel execution produced no asynchronous error. However, there could have been errors from CUDA runtime API calls made before the kernel launch, or the kernel launch itself could have encountered a synchronous error, such as invalid device function, that was never properly checked. The last error produced by those calls is not reset until the cudaGetLastError call, even if other CUDA runtime API calls returned cudaSuccess in between.

For example,

last_error.cu
#include <cuda_runtime.h>
#include <iostream>

#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, const char* const func, const char* const file,
           const int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        // We don't exit when we encounter CUDA errors in this example.
        // std::exit(EXIT_FAILURE);
    }
}

#define CHECK_LAST_CUDA_ERROR() checkLast(__FILE__, __LINE__)
void checkLast(const char* const file, const int line)
{
    cudaError_t const err{cudaGetLastError()};
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << std::endl;
        // We don't exit when we encounter CUDA errors in this example.
        // std::exit(EXIT_FAILURE);
    }
}

int main()
{
    float* p;
    // This will produce an error.
    CHECK_CUDA_ERROR(cudaMalloc(&p, 1000000000000000 * sizeof(float)));
    // This will be successful.
    CHECK_CUDA_ERROR(cudaMalloc(&p, 10 * sizeof(float)));
    // This will be successful.
    CHECK_CUDA_ERROR(cudaFree(p));
    // The last error still has not been reset here.
    // This will produce the same error as
    // cudaMalloc(&p, 1000000000000000 * sizeof(float))
    CHECK_LAST_CUDA_ERROR();
    // The last error has been reset here.
    CHECK_LAST_CUDA_ERROR();
}
$ nvcc last_error.cu -o last_error
$ ./last_error
CUDA Runtime Error at: last_error.cu:37
out of memory cudaMalloc(&p, 1000000000000000 * sizeof(float))
CUDA Runtime Error at: last_error.cu:45
out of memory

Fundamentally, the confusion arose because the program's CUDA error checking did not follow the best practice mentioned previously.
