CUDA Compilation Architecture Macro
Introduction
In C++, macros are often used to control which code gets compiled for different use cases. Similarly, in CUDA, it is often necessary to compile the same source file for different GPU architectures.
In this blog post, I would like to quickly discuss how to use the NVCC compilation architecture macro to control compilation for different GPU architectures.
Half Addition Example
According to the CUDA arithmetic instructions documentation, the FP16 add instruction can only be performed on GPUs with compute capability >= 5.3.
In this example, the architecture macro is used to switch between different FP16 add implementations for different virtual GPU architectures.
No Architecture Macro
Without the architecture macro, we cannot select different device-side implementations for different virtual GPU architectures.
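The following is a minimal sketch of what half_addition_no_macro.cu might look like; the kernel name, the CHECK_CUDA helper, the array size, and the input values are illustrative. The kernel always uses the __hadd intrinsic, regardless of the architecture it is compiled for, and the host code performs a runtime sanity check on the results.

```cuda
// half_addition_no_macro.cu
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>
#include <vector>

// Illustrative error-checking helper.
#define CHECK_CUDA(call)                                               \
    do                                                                 \
    {                                                                  \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess)                                        \
        {                                                              \
            std::cerr << "CUDA error: " << cudaGetErrorString(err)     \
                      << std::endl;                                    \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// FP16 element-wise addition that always uses the __hadd intrinsic.
__global__ void half_addition(__half const* input_1, __half const* input_2,
                              __half* output, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        output[i] = __hadd(input_1[i], input_2[i]);
    }
}

int main()
{
    int const n = 1024;
    std::vector<__half> h_input_1(n, __float2half(1.0f));
    std::vector<__half> h_input_2(n, __float2half(2.0f));
    std::vector<__half> h_output(n, __float2half(0.0f));

    __half *d_input_1, *d_input_2, *d_output;
    CHECK_CUDA(cudaMalloc(&d_input_1, n * sizeof(__half)));
    CHECK_CUDA(cudaMalloc(&d_input_2, n * sizeof(__half)));
    CHECK_CUDA(cudaMalloc(&d_output, n * sizeof(__half)));
    CHECK_CUDA(cudaMemcpy(d_input_1, h_input_1.data(), n * sizeof(__half),
                          cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_input_2, h_input_2.data(), n * sizeof(__half),
                          cudaMemcpyHostToDevice));

    half_addition<<<(n + 255) / 256, 256>>>(d_input_1, d_input_2, d_output, n);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaDeviceSynchronize());

    CHECK_CUDA(cudaMemcpy(h_output.data(), d_output, n * sizeof(__half),
                          cudaMemcpyDeviceToHost));

    // Runtime sanity check: every element should be 1.0 + 2.0 = 3.0.
    for (int i = 0; i < n; ++i)
    {
        if (__half2float(h_output[i]) != 3.0f)
        {
            std::cerr << "Sanity check failed." << std::endl;
            return EXIT_FAILURE;
        }
    }
    std::cout << "Sanity check passed." << std::endl;

    CHECK_CUDA(cudaFree(d_input_1));
    CHECK_CUDA(cudaFree(d_input_2));
    CHECK_CUDA(cudaFree(d_output));
    return EXIT_SUCCESS;
}
```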
Although compiling the FP16 addition program against the virtual GPU architecture compute_52 did not produce a compilation error, the runtime sanity check shows that the CUDA kernel has issues. Compiling the same program against the virtual GPU architecture compute_53 works fine. This is expected because the FP16 add instruction __hadd can only be performed on virtual GPU architectures >= 5.3.
```
$ nvcc half_addition_no_macro.cu -o half_addition_no_macro --gpu-architecture=compute_52
```
With Architecture Macro
For virtual GPU architectures < 5.3, if performance is less of a concern, we can still do the FP16 addition by converting the FP16 values to FP32, performing the FP32 addition, and converting the FP32 sum back to FP16. __CUDA_ARCH__ is the architecture macro that represents the virtual GPU architecture currently being compiled for.
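Below is a minimal sketch of the kernel in half_addition_with_macro.cu, assuming the host code and kernel name stay the same as in the previous sketch; only the device-side branch on __CUDA_ARCH__ differs.

```cuda
// half_addition_with_macro.cu (kernel only; host code as in the previous sketch).
__global__ void half_addition(__half const* input_1, __half const* input_2,
                              __half* output, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
#if __CUDA_ARCH__ >= 530
        // Compute capability >= 5.3: use the native FP16 add intrinsic.
        output[i] = __hadd(input_1[i], input_2[i]);
#else
        // Older architectures: fall back to FP32 addition and convert back.
        output[i] = __float2half(__half2float(input_1[i]) +
                                 __half2float(input_2[i]));
#endif
    }
}
```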
In this implementation, when the virtual GPU architecture is compute_52, __float2half(__half2float(input_1[i]) + __half2float(input_2[i])) will be used for compilation; when the virtual GPU architecture is compute_53, __hadd(input_1[i], input_2[i]) will be used for compilation.
```
$ nvcc half_addition_with_macro.cu -o half_addition_with_macro --gpu-architecture=compute_52
```
Caveats
This macro can be used in the implementation of GPU functions to determine the virtual architecture for which the code is currently being compiled. The host code (the non-GPU code) must not depend on it. This means the __CUDA_ARCH__ macro should only be used inside functions decorated with __device__ or __global__.
In the following example, we can see that the __CUDA_ARCH__ macro is useless inside a host function.
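Below is a minimal sketch of what host.cu might look like; the printed messages are illustrative. Since __CUDA_ARCH__ is not defined during the host compilation pass, the #else branch is what ends up in the host executable, no matter which --gpu-architecture is passed to nvcc.

```cuda
// host.cu
#include <iostream>

int main()
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
    // This branch never ends up in the host executable, because
    // __CUDA_ARCH__ is only defined during the device compilation pass.
    std::cout << "Compiled for virtual architecture >= compute_53." << std::endl;
#else
    std::cout << "__CUDA_ARCH__ is not defined in host code." << std::endl;
#endif
    return 0;
}
```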
```
$ nvcc host.cu -o host --gpu-architecture=compute_52
```
References
CUDA Compilation Architecture Macro
https://leimao.github.io/blog/CUDA-Compilation-Architecture-Macro/