CUDA Occupancy Calculation

Introduction

Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.
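Written as a formula, using only the terms from the definition above:

$$
\text{occupancy} = \frac{\text{active warps per multiprocessor}}{\text{maximum active warps per multiprocessor}}
$$

As an illustrative case (the numbers are made up for the example), 32 active warps on a multiprocessor that supports at most 64 gives an occupancy of $32 / 64 = 50\%$.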

Higher occupancy does not always equate to higher performance; there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.

In this blog post, I would like to discuss the CUDA occupancy calculation.

CUDA Occupancy Calculation

Excel Occupancy Calculator

Although the Excel-based Occupancy Calculator is deprecated, we can still use it to calculate the occupancy for compute capabilities up to 8.6.

Physical Limits for GPU Compute Capability

From the “Physical Limits for GPU Compute Capability” section in the Excel sheet, we can see that, for example, on devices of compute capability 7.0, each multiprocessor has 65,536 32-bit registers and can have a maximum of 2048 simultaneous threads resident (64 warps x 32 threads per warp). Register allocations are rounded up to the nearest 256 registers per warp on devices with compute capability 7.0. The warp allocation granularity is 4.

These are the key factors for computing the occupancy.
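To keep these limits in one place for the calculations that follow, they could be collected in a small record. This is only a sketch: the `SMLimits` class and its field names are hypothetical, and the values are the compute capability 7.0 numbers quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits for one compute capability (hypothetical helper)."""
    registers_per_sm: int        # 32-bit registers per multiprocessor
    max_threads_per_sm: int      # maximum resident threads
    max_warps_per_sm: int        # maximum resident warps
    threads_per_warp: int
    register_alloc_unit: int     # register allocation unit size
    warp_alloc_granularity: int


# Values for compute capability 7.0, as quoted in the text above.
CC_7_0 = SMLimits(
    registers_per_sm=65536,
    max_threads_per_sm=2048,
    max_warps_per_sm=64,
    threads_per_warp=32,
    register_alloc_unit=256,
    warp_alloc_granularity=4,
)

# Consistency check: 64 warps x 32 threads per warp = 2048 resident threads.
assert CC_7_0.max_warps_per_sm * CC_7_0.threads_per_warp == CC_7_0.max_threads_per_sm
```

Other compute capabilities would need their own values taken from the same section of the Excel sheet.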

Occupancy Calculation Example

Let’s calculate the occupancies of some examples manually.

For example, on a device of compute capability 7.0, consider a kernel with 128-thread blocks using 37 registers per thread.

We know from the “Physical Limits for GPU Compute Capability” that the maximum number of possible active warps is 64 for compute capability 7.0.

The number of registers required for one warp is

$$
\left\lceil \frac{37 \times 32}{256} \right\rceil \times 256 = 1280
$$

where $32$ is the number of threads per warp and $256$ is the register allocation unit size.
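The rounding step above can be sketched in a few lines of Python. The function name `registers_per_warp` is hypothetical, and the constants are the compute capability 7.0 values used in this post.

```python
THREADS_PER_WARP = 32
REGISTER_ALLOC_UNIT = 256  # compute capability 7.0


def registers_per_warp(registers_per_thread: int) -> int:
    """Registers reserved for one warp, rounded up to the allocation unit."""
    raw = registers_per_thread * THREADS_PER_WARP
    # Integer ceiling division, then scale back up to a multiple of the unit.
    units = (raw + REGISTER_ALLOC_UNIT - 1) // REGISTER_ALLOC_UNIT
    return units * REGISTER_ALLOC_UNIT


print(registers_per_warp(37))  # 37 * 32 = 1184, rounded up to 1280
```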

The maximum number of active warps per multiprocessor that the register file allows, given the warp allocation granularity, is

$$
\left\lfloor \frac{65536 / 1280}{4} \right\rfloor \times 4 = 48
$$
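The same rounding-down step can be written as a short Python sketch. The function name `max_warps_by_registers` is hypothetical. The cap at 64 warps is not part of the formula above; it is added here as an assumption so that kernels with very small register counts do not exceed the per-multiprocessor warp limit.

```python
REGISTERS_PER_SM = 65536
MAX_WARPS_PER_SM = 64
WARP_ALLOC_GRANULARITY = 4


def max_warps_by_registers(registers_per_warp: int) -> int:
    """Warps that fit in the register file, rounded down to the granularity."""
    warps = REGISTERS_PER_SM // registers_per_warp  # e.g. 65536 // 1280 = 51
    warps = warps // WARP_ALLOC_GRANULARITY * WARP_ALLOC_GRANULARITY
    # Never report more warps than the multiprocessor can hold.
    return min(warps, MAX_WARPS_PER_SM)


print(max_warps_by_registers(1280))  # 48
```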

Because a 128-thread block consists of $128 / 32 = 4$ warps, we can run at most $48 / 4 = 12$ thread blocks. Therefore, the occupancy is $12 \times 4 / 64 = 75\%$.

As another example, consider a kernel with 320-thread blocks using 37 registers per thread, again on a device of compute capability 7.0.

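The register usage per thread is unchanged, so the register file still allows 48 active warps. Because a 320-thread block consists of $320 / 32 = 10$ warps, we can run at most $\lfloor 48 / 10 \rfloor = 4$ thread blocks. Therefore, the occupancy is $10 \times 4 / 64 = 62.5\%$, which rounds to 63%.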

These manually calculated occupancies can be verified using the Excel-based occupancy calculator.
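The two worked examples can also be reproduced with a short script. This is a minimal sketch, not a replacement for the calculator: the function names are hypothetical, the constants are the compute capability 7.0 values from this post, and only the register and warp limits are modeled. Other limits, such as shared memory usage and the maximum number of resident blocks per multiprocessor, are ignored.

```python
# Limits for compute capability 7.0, taken from the text above.
REGISTERS_PER_SM = 65536
MAX_WARPS_PER_SM = 64
THREADS_PER_WARP = 32
REGISTER_ALLOC_UNIT = 256
WARP_ALLOC_GRANULARITY = 4


def round_up(value: int, unit: int) -> int:
    """Round value up to the nearest multiple of unit."""
    return (value + unit - 1) // unit * unit


def round_down(value: int, unit: int) -> int:
    """Round value down to the nearest multiple of unit."""
    return value // unit * unit


def occupancy(threads_per_block: int, registers_per_thread: int) -> float:
    """Register-limited occupancy (shared memory and block limits ignored)."""
    warps_per_block = round_up(threads_per_block, THREADS_PER_WARP) // THREADS_PER_WARP
    regs_per_warp = round_up(registers_per_thread * THREADS_PER_WARP, REGISTER_ALLOC_UNIT)
    warps_by_registers = round_down(REGISTERS_PER_SM // regs_per_warp, WARP_ALLOC_GRANULARITY)
    warp_limit = min(warps_by_registers, MAX_WARPS_PER_SM)
    blocks_per_sm = warp_limit // warps_per_block
    return blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM


print(f"{occupancy(128, 37):.1%}")  # 75.0%
print(f"{occupancy(320, 37):.1%}")  # 62.5%
```

The second result is the unrounded value of the 63% quoted above.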

Number of Registers

Finally, we can get the number of registers used per thread for each kernel using the --ptxas-options=-v option of nvcc. For example,

$ wget https://raw.githubusercontent.com/NVIDIA-developer-blog/code-samples/master/series/cuda-cpp/overlap-data-transfers/async.cu
$ nvcc async.cu -o async --ptxas-options=-v
ptxas info : 24 bytes gmem
ptxas info : Compiling entry function '_Z6kernelPfi' for 'sm_52'
ptxas info : Function properties for _Z6kernelPfi
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 332 bytes cmem[0], 48 bytes cmem[2]
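When a file contains many kernels, reading this output by hand gets tedious. The following is a rough sketch of how the register counts could be extracted from the log text. The function name `registers_per_kernel` is hypothetical, and the regular expressions assume the line format shown in the sample output above, which may differ between CUDA versions.

```python
import re

# Sample log copied from the nvcc output shown above.
PTXAS_OUTPUT = """\
ptxas info : 24 bytes gmem
ptxas info : Compiling entry function '_Z6kernelPfi' for 'sm_52'
ptxas info : Function properties for _Z6kernelPfi
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 332 bytes cmem[0], 48 bytes cmem[2]
"""


def registers_per_kernel(ptxas_log: str) -> dict:
    """Map each (mangled) kernel name to its 'Used N registers' count."""
    result = {}
    current = None  # kernel named by the most recent 'Compiling entry function' line
    for line in ptxas_log.splitlines():
        entry = re.search(r"Compiling entry function '([^']+)'", line)
        if entry:
            current = entry.group(1)
            continue
        used = re.search(r"Used (\d+) registers", line)
        if used and current is not None:
            result[current] = int(used.group(1))
    return result


print(registers_per_kernel(PTXAS_OUTPUT))  # {'_Z6kernelPfi': 22}
```

The key `_Z6kernelPfi` is the mangled name of `kernel(float*, int)`. The resulting register count is the per-thread value that goes into the occupancy calculation.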

Conclusion

Calculating occupancy manually can be tedious and error-prone. However, we can use existing tools, such as the Excel-based occupancy calculator, to do the calculation.

Author

Lei Mao

Posted on

06-25-2022

Updated on

06-25-2022
