CUDA Occupancy Calculation
Introduction
Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.
Higher occupancy does not always equate to higher performance - there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.
In this blog post, I would like to discuss the CUDA occupancy calculation.
CUDA Occupancy Calculation
Excel Occupancy Calculator
Although the Excel-based Occupancy Calculator is deprecated, we can still use it to calculate the occupancy for compute capabilities up to 8.6.
Physical Limits for GPU Compute Capability
From the “Physical Limits for GPU Compute Capability” section in the Excel sheet, we can see that, for example, on devices of compute capability 7.0, each multiprocessor has 65,536 32-bit registers and can have a maximum of 2048 simultaneous threads resident (64 warps x 32 threads per warp). Register allocations are rounded up to the nearest 256 registers per warp on devices with compute capability 7.0. The warp allocation granularity is 4.
These are the key factors for computing the occupancy.
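To keep these limits in one place for the calculations that follow, they can be collected in a small structure. This is a minimal Python sketch; the class name `SMLimits` and its field names are hypothetical, and only the compute capability 7.0 values quoted above are filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits (hypothetical container, values for CC 7.0)."""
    registers_per_sm: int = 65536     # 32-bit registers per multiprocessor
    max_warps_per_sm: int = 64        # maximum resident warps
    threads_per_warp: int = 32
    register_alloc_unit: int = 256    # register allocation unit size
    warp_alloc_granularity: int = 4   # warp allocation granularity


CC_7_0 = SMLimits()

# 64 warps x 32 threads per warp = 2048 resident threads
max_threads = CC_7_0.max_warps_per_sm * CC_7_0.threads_per_warp
print(max_threads)  # 2048
```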
Occupancy Calculation Example
Let’s calculate the occupancies of some examples manually.
For example, on a device of compute capability 7.0, consider a kernel with 128-thread blocks using 37 registers per thread.
We know from the “Physical Limits for GPU Compute Capability” that the maximum number of possible active warps is 64 for compute capability 7.0.
The number of registers required for one warp is
$$
\left\lceil \frac{37 \times 32}{256} \right\rceil \times 256 = 1280
$$
where $32$ is the number of threads per warp and $256$ is the register allocation unit size.
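The rounding step above can be sketched in Python as follows. The function name `registers_per_warp` and its parameter names are assumptions made for illustration; the default values are the compute capability 7.0 numbers used in this post.

```python
def registers_per_warp(regs_per_thread, threads_per_warp=32, alloc_unit=256):
    """Registers allocated to one warp, rounded up to the allocation unit."""
    raw = regs_per_thread * threads_per_warp
    # Integer ceiling division, then scale back to a multiple of alloc_unit.
    return -(-raw // alloc_unit) * alloc_unit


print(registers_per_warp(37))  # 1280
```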
The maximum number of active warps per multiprocessor that the register file can hold, rounded down to the warp allocation granularity, is
$$
\left\lfloor \frac{65536 / 1280}{4} \right\rfloor \times 4 = 48
$$
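The same register-limited warp count can be sketched in Python with integer arithmetic. The function name `max_warps_by_registers` and its parameters are hypothetical; the defaults again assume compute capability 7.0.

```python
def max_warps_by_registers(regs_per_warp, registers_per_sm=65536, granularity=4):
    """Warps that fit in the register file, rounded down to the granularity."""
    warps = registers_per_sm // regs_per_warp
    return (warps // granularity) * granularity


print(max_warps_by_registers(1280))  # 48
```

Flooring twice gives the same result as the single floor in the formula, because dividing a floored quotient by an integer and flooring again equals flooring the combined quotient.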
Because a 128-thread block consists of $128 / 32 = 4$ warps, we can run at most $48 / 4 = 12$ thread blocks. Therefore, the occupancy is $12 \times 4 / 64 = 75\%$.
Consider another example, on a device of compute capability 7.0, a kernel with 320-thread blocks using 37 registers per thread.
Because a 320-thread block consists of $320 / 32 = 10$ warps, we can run at most $\lfloor 48 / 10 \rfloor = 4$ thread blocks. Therefore, the occupancy is $10 \times 4 / 64 = 62.5\% \approx 63\%$.
These manually calculated occupancies can be verified using the Excel-based occupancy calculator.
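As a further cross-check, the whole calculation can be put into one short Python function. This is only a sketch under the assumptions of this post: it considers the register limit and the maximum warp count for compute capability 7.0, and it ignores other limits such as shared memory and the maximum number of resident blocks per multiprocessor. The function name `occupancy` and its parameters are hypothetical.

```python
def occupancy(threads_per_block, regs_per_thread, *,
              registers_per_sm=65536, max_warps_per_sm=64,
              threads_per_warp=32, alloc_unit=256, granularity=4):
    """Return (resident blocks, occupancy), limited by registers and warps only."""
    # Warps per block, rounded up to whole warps.
    warps_per_block = -(-threads_per_block // threads_per_warp)
    # Registers per warp, rounded up to the allocation unit.
    regs_per_warp = -(-(regs_per_thread * threads_per_warp) // alloc_unit) * alloc_unit
    # Warps that fit in the register file, rounded down to the granularity.
    warp_limit = (registers_per_sm // regs_per_warp) // granularity * granularity
    warp_limit = min(warp_limit, max_warps_per_sm)
    blocks = warp_limit // warps_per_block
    return blocks, blocks * warps_per_block / max_warps_per_sm


for block_size in (128, 320):
    blocks, occ = occupancy(block_size, 37)
    print(block_size, blocks, f"{occ:.1%}")
# 128 12 75.0%
# 320 4 62.5%
```

The 320-thread case shows why the block size matters: the register file allows 48 warps, but only 4 blocks of 10 warps fit, so 8 of those warps stay unused.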
Number of Registers
Finally, we could get the number of registers used per thread for each kernel using the --ptxas-options=-v option of nvcc. For example,
$ wget https://raw.githubusercontent.com/NVIDIA-developer-blog/code-samples/master/series/cuda-cpp/overlap-data-transfers/async.cu
Conclusion
Calculating occupancy manually is sometimes tedious and brain-twisting. However, we can use existing tools, such as the Excel-based occupancy calculator, to do the calculation.