CUDA Occupancy Calculation

Posted on 06-25-2022, updated on 12-16-2024

Introduction

Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.

Higher occupancy does not always equate to higher performance; there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.

In this blog post, I would like to discuss the CUDA occupancy calculation.

CUDA Occupancy Calculation

Excel Occupancy Calculator

Although the Excel-based Occupancy Calculator is deprecated, we can still use it to calculate the occupancy for compute capabilities up to 8.6.

Physical Limits for GPU Compute Capability

From the “Physical Limits for GPU Compute Capability” section in the Excel sheet, we can see that, for example, on devices of compute capability 7.0, each multiprocessor has 65,536 32-bit registers and can have a maximum of 2048 simultaneous threads resident (64 warps x 32 threads per warp). Register allocations are rounded up to the nearest 256 registers per warp on devices with compute capability 7.0. The warp allocation granularity is 4.

These are the key factors for computing the occupancy.
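To keep these numbers in one place for the calculations that follow, here is a minimal Python sketch. The `SmLimits` class and its field names are hypothetical names chosen for illustration; the values are only the ones quoted above for compute capability 7.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    """Per-multiprocessor limits (hypothetical container, illustration only)."""

    registers_per_sm: int        # 32-bit registers per multiprocessor
    max_warps_per_sm: int        # maximum resident warps per multiprocessor
    threads_per_warp: int        # warp size
    register_alloc_unit: int     # registers are allocated per warp in this unit
    warp_alloc_granularity: int  # warps are allocated in multiples of this

    @property
    def max_threads_per_sm(self) -> int:
        # Maximum resident threads = warps x threads per warp.
        return self.max_warps_per_sm * self.threads_per_warp


# Values for compute capability 7.0, as quoted from the Excel sheet.
CC_7_0 = SmLimits(
    registers_per_sm=65536,
    max_warps_per_sm=64,
    threads_per_warp=32,
    register_alloc_unit=256,
    warp_alloc_granularity=4,
)

print(CC_7_0.max_threads_per_sm)  # 2048
```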

Occupancy Calculation Example

Let’s calculate the occupancies of some examples manually.

For example, on a device of compute capability 7.0, consider a kernel with 128-thread blocks using 37 registers per thread.

We know from the “Physical Limits for GPU Compute Capability” that the maximum number of possible active warps is 64 for compute capability 7.0.

The number of registers required for one warp is

$$
\left\lceil \frac{37 \times 32}{256} \right\rceil \times 256 = 1280
$$

where $32$ is the number of threads per warp and $256$ is the register allocation unit size.
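The rounding step can be sketched in a few lines of Python. The function name `registers_per_warp` and its default arguments are assumptions made for this illustration, using the compute capability 7.0 values from above.

```python
def registers_per_warp(regs_per_thread, threads_per_warp=32, alloc_unit=256):
    """Registers reserved for one warp, rounded up to the allocation unit."""
    raw = regs_per_thread * threads_per_warp
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive ints.
    units = -(-raw // alloc_unit)
    return units * alloc_unit


print(registers_per_warp(37))  # 1280
```

Note that 37 registers per thread only needs 1184 registers per warp, but the allocation unit of 256 pushes the reservation up to 1280.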

The maximum number of active warps per multiprocessor that the register file can hold, rounded down to the warp allocation granularity of 4, is

$$
\left\lfloor \frac{65536 / 1280}{4} \right\rfloor \times 4 = 48
$$
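The same step can be written as a small Python sketch. The function name `warps_limited_by_registers` and its defaults are hypothetical and use the compute capability 7.0 values from above. It only models the register limit, so the result still has to be capped by the maximum of 64 warps per multiprocessor.

```python
def warps_limited_by_registers(regs_per_warp, registers_per_sm=65536, granularity=4):
    """Warps that fit in the register file, rounded down to the granularity."""
    # How many whole warps fit in the register file.
    fit = registers_per_sm // regs_per_warp
    # Round down to a multiple of the warp allocation granularity.
    return fit // granularity * granularity


print(warps_limited_by_registers(1280))  # 48
```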

Because a 128-thread block consists of $128 / 32 = 4$ warps, we can run at most $48 / 4 = 12$ thread blocks per multiprocessor. Therefore, the occupancy is $12 \times 4 / 64 = 75\%$.

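As another example, consider a kernel with 320-thread blocks using 37 registers per thread on a device of compute capability 7.0. The register usage per thread is unchanged, so the register-limited maximum is still 48 active warps per multiprocessor.

Because a 320-thread block consists of $320 / 32 = 10$ warps, we can run at most $\lfloor 48 / 10 \rfloor = 4$ thread blocks. Therefore, the occupancy is $4 \times 10 / 64 = 62.5\% \approx 63\%$.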

These manually calculated occupancies can be verified using the Excel-based occupancy calculator.
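As a cross-check that does not depend on the spreadsheet, the whole manual procedure can be put into one Python function. This is a sketch under stated assumptions, not a replacement for the calculator. The function name `occupancy` and its parameters are hypothetical, the defaults are the compute capability 7.0 values from this post, and only the register limit and the maximum warp count are modeled. Other limits, such as shared memory usage and the maximum number of blocks per multiprocessor, are deliberately left out.

```python
def occupancy(
    threads_per_block,
    regs_per_thread,
    *,
    registers_per_sm=65536,
    max_warps_per_sm=64,
    threads_per_warp=32,
    register_alloc_unit=256,
    warp_alloc_granularity=4,
):
    """Register-limited occupancy (shared memory and block limits not modeled)."""
    # Warps per block, rounding up a partially filled last warp.
    warps_per_block = -(-threads_per_block // threads_per_warp)

    # Registers reserved per warp, rounded up to the allocation unit.
    raw = regs_per_thread * threads_per_warp
    regs_per_warp = -(-raw // register_alloc_unit) * register_alloc_unit

    # Warps that fit in the register file, rounded down to the granularity.
    warps_by_regs = registers_per_sm // regs_per_warp
    warps_by_regs = warps_by_regs // warp_alloc_granularity * warp_alloc_granularity

    # The hardware cap on resident warps still applies.
    warp_budget = min(warps_by_regs, max_warps_per_sm)

    # Only whole thread blocks can be resident.
    blocks = warp_budget // warps_per_block
    return blocks * warps_per_block / max_warps_per_sm


print(occupancy(128, 37))  # 0.75
print(occupancy(320, 37))  # 0.625
```

The second result shows why the block size matters even when the register usage is unchanged: 48 warps cannot be split into whole 10-warp blocks, so 8 of the warp slots allowed by the register file stay idle.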

Number of Registers

Finally, we can obtain the number of registers used per thread for each kernel using the --ptxas-options=-v option of nvcc. For example,

$ wget https://raw.githubusercontent.com/NVIDIA-developer-blog/code-samples/master/series/cuda-cpp/overlap-data-transfers/async.cu
$ nvcc async.cu -o async --ptxas-options=-v
ptxas info    : 24 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPfi' for 'sm_52'
ptxas info    : Function properties for _Z6kernelPfi
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 22 registers, 332 bytes cmem[0], 48 bytes cmem[2]
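When a file contains many kernels, reading these numbers by eye gets error-prone. The following Python sketch pulls the register count per kernel out of a captured log. The helper name `registers_per_kernel` is hypothetical, and the regular expressions assume the log looks like the sample above; other nvcc versions may format it differently.

```python
import re

# Sample log copied from the nvcc output shown above.
PTXAS_LOG = """\
ptxas info    : 24 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelPfi' for 'sm_52'
ptxas info    : Function properties for _Z6kernelPfi
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 22 registers, 332 bytes cmem[0], 48 bytes cmem[2]
"""


def registers_per_kernel(log):
    """Map each (mangled) kernel name to its reported register count."""
    result = {}
    current = None
    for line in log.splitlines():
        # A "Compiling entry function" line starts a new kernel section.
        m = re.search(r"Compiling entry function '([^']+)'", line)
        if m:
            current = m.group(1)
            continue
        # The "Used N registers" line belongs to the most recent kernel.
        m = re.search(r"Used (\d+) registers", line)
        if m and current is not None:
            result[current] = int(m.group(1))
    return result


print(registers_per_kernel(PTXAS_LOG))  # {'_Z6kernelPfi': 22}
```

The extracted value, 22 registers per thread here, is the per-thread register count that goes into the occupancy calculation.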

Conclusion

Calculating occupancy manually is sometimes tedious and brain-twisting. However, we can use existing tools, such as the Excel-based occupancy calculator, to do the calculation.

References

CUDA Occupancy Calculation

https://leimao.github.io/blog/CUDA-Occupancy-Calculation/

Author

Lei Mao

Posted on

06-25-2022

Updated on

12-16-2024
