Benchmarking NVIDIA Tensor Core MMA Instruction Peak Performances
Introduction
Whenever NVIDIA releases a new GPU architecture or a new GPU SKU, it advertises the peak AI performance of the GPU in operations per second (OPS), usually in units of TFLOPS (tera floating-point operations per second) or TOPS (tera operations per second). These peak performances, especially those for low-precision data types such as TF32, BF16, FP16, INT8, FP8, INT4, and FP4, come from the Tensor Core MMA (Matrix Multiply-Accumulate) instructions on the GPU.
Reproducing the advertised peak performances using HPC software or benchmarking tools does not always work, because those software packages or benchmarking tools may not fully support the architecture features of the GPU being tested. Even software from NVIDIA itself, such as cuBLAS, might not yet be fully optimized for the latest GPU architecture or SKU by the time the GPU is delivered to end users.
Instead of relying on third-party software or benchmarking tools, a more reliable way to measure the peak performances of the Tensor Core MMA instructions is to write custom micro-benchmarks that directly invoke those MMA instructions. However, there are many MMA instructions on NVIDIA GPUs, and the performance of each instruction can vary significantly depending on the data types, matrix sizes, and other factors. Therefore, even with custom micro-benchmarks, some key MMA instructions might be missed and the advertised peak AI performances still might not be reproduced.
In this blog post, I would like to demonstrate how to measure the peak performances of NVIDIA Tensor Core MMA instructions using CUTLASS and CuTe. Using this method, the advertised peak AI performances can be fully reproduced on my NVIDIA RTX 5080 GPU. In addition, the results can serve as a good reference for picking the right, performant Tensor Core MMA instructions for AI software development on NVIDIA GPUs.
CuTe MMA Performance Benchmark
CuTe MMA Atoms
Most of the widely used and performant NVIDIA Tensor Core MMA instructions are wrapped as MMA Atoms in CuTe, located in the include/cute/arch directory and organized by GPU architecture. For example, the SM80 FP16 Tensor Core instruction mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 is wrapped as the SM80_16x8x16_F16F16F16F16_TN MMA Atom in CuTe, as shown below.
```cpp
// MMA 16x8x16 TN
```
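The atom publishes the per-thread register fragments of each operand and issues the PTX instruction through inline assembly. The condensed sketch below is adapted from the CUTLASS header include/cute/arch/mma_sm80.hpp; the exact code may differ slightly between CUTLASS versions, and the namespace and architecture guard macros are omitted here.

```cpp
#include <cstdint>

#include <cute/config.hpp>  // for CUTE_HOST_DEVICE

// Condensed sketch of the SM80_16x8x16_F16F16F16F16_TN MMA Atom.
// Each of the 32 threads in a warp holds a fragment of the operands:
// 4 x 32-bit registers for A (16x16 FP16), 2 for B (16x8 FP16), and
// 2 for the FP16 accumulators C and D (16x8 FP16).
struct SM80_16x8x16_F16F16F16F16_TN
{
  using DRegisters = uint32_t[2];
  using ARegisters = uint32_t[4];
  using BRegisters = uint32_t[2];
  using CRegisters = uint32_t[2];

  CUTE_HOST_DEVICE static void
  fma(uint32_t      & d0, uint32_t      & d1,
      uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
      uint32_t const& b0, uint32_t const& b1,
      uint32_t const& c0, uint32_t const& c1)
  {
    // Issue the warp-wide Tensor Core instruction via inline PTX.
    asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0, %1},"
      "{%2, %3, %4, %5},"
      "{%6, %7},"
      "{%8, %9};\n"
      : "=r"(d0), "=r"(d1)
      :  "r"(a0),  "r"(a1),  "r"(a2),  "r"(a3),
         "r"(b0),  "r"(b1),
         "r"(c0),  "r"(c1));
  }
};
```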
Dummy MMA Kernel for Performance Benchmark
Implementing a high-performance MMA kernel that achieves the peak performances is usually not a trivial task. However, for the purpose of measuring the peak performances of the MMA instructions, we can implement a dummy MMA kernel that simply invokes the MMA instructions in a loop without having to load or store any data between the processor and the global memory. In this way, we can isolate the performance of the MMA instructions themselves without being affected by memory bandwidth or other factors. For example, to execute the SM80 MMA instructions in a loop, we can implement a generic MMA benchmark kernel using template expansion as shown below.
```cpp
// Template helpers to determine array sizes
```
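The benchmark kernel template below is a minimal sketch rather than the exact code in the repository: the template helpers determine the register array sizes published by an MMA Atom and expand the arrays into the argument list of the atom's static fma() function, and dummy_mma_kernel then issues the atom in a loop. The helper and kernel names (mma_fma_impl, mma_fma, dummy_mma_kernel) are my own, and the host-side launch and timing code is sketched further below.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>

#include <cute/arch/mma_sm80.hpp>

// Template helpers to determine the register array sizes of an MMA Atom and
// expand the arrays into the argument list of the atom's static fma()
// function. Each atom publishes its register fragments as C-array typedefs,
// e.g. uint32_t[4] for the A operand of SM80_16x8x16_F16F16F16F16_TN.
template <class MMAAtom, std::size_t... Id, std::size_t... Ia,
          std::size_t... Ib, std::size_t... Ic>
__device__ __forceinline__ void
mma_fma_impl(typename MMAAtom::DRegisters& d,
             typename MMAAtom::ARegisters const& a,
             typename MMAAtom::BRegisters const& b,
             typename MMAAtom::CRegisters const& c,
             std::index_sequence<Id...>, std::index_sequence<Ia...>,
             std::index_sequence<Ib...>, std::index_sequence<Ic...>)
{
    MMAAtom::fma(d[Id]..., a[Ia]..., b[Ib]..., c[Ic]...);
}

template <class MMAAtom>
__device__ __forceinline__ void
mma_fma(typename MMAAtom::DRegisters& d, typename MMAAtom::ARegisters const& a,
        typename MMAAtom::BRegisters const& b,
        typename MMAAtom::CRegisters const& c)
{
    mma_fma_impl<MMAAtom>(
        d, a, b, c,
        std::make_index_sequence<std::extent_v<typename MMAAtom::DRegisters>>{},
        std::make_index_sequence<std::extent_v<typename MMAAtom::ARegisters>>{},
        std::make_index_sequence<std::extent_v<typename MMAAtom::BRegisters>>{},
        std::make_index_sequence<std::extent_v<typename MMAAtom::CRegisters>>{});
}

// Dummy benchmark kernel: every thread keeps its operand fragments in
// registers and issues the warp-wide MMA instruction in a loop. No data is
// loaded from or stored to global memory inside the loop, so the measurement
// isolates the Tensor Core throughput. With enough resident warps per SM, the
// dependent accumulation chains of different warps overlap and keep the
// Tensor Cores busy.
template <class MMAAtom>
__global__ void dummy_mma_kernel(uint32_t* out, int num_iterations)
{
    typename MMAAtom::DRegisters d = {};
    typename MMAAtom::ARegisters a = {};
    typename MMAAtom::BRegisters b = {};

    // Seed the operands from the thread and block indices so that different
    // threads do not hold identical values.
    a[0] = threadIdx.x + 1;
    b[0] = blockIdx.x + 1;

    for (int i = 0; i < num_iterations; ++i)
    {
        mma_fma<MMAAtom>(d, a, b, d);  // D = A * B + D, accumulating in place
    }

    // Write one value back so the computation is observable.
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        out[0] = static_cast<uint32_t>(d[0]);
    }
}
```

Switching to a different MMA instruction is then just a matter of instantiating the kernel with a different atom, for example dummy_mma_kernel<cute::SM80_16x8x16_F16F16F16F16_TN>. Note that block-scaled atoms, such as the SM120 *_VS variants benchmarked later in this post, take additional scale-factor operands in their fma() functions, so they need extra handling that this sketch omits.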
The benchmark implementations have been created in the cute_mma_benchmark directory in the CUTLASS Examples GitHub repository. To build the benchmark, please follow the instructions in the README.
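For reference, the reported throughput is derived from the number of MMA instructions issued: one warp-wide m16n8k16 MMA performs 2 x 16 x 8 x 16 = 4096 floating-point operations. The host-side timing sketch below assumes the dummy_mma_kernel template sketched above is defined in the same translation unit; it is not the repository's exact driver code.

```cpp
#include <cstdint>
#include <cstdio>

#include <cuda_runtime.h>
#include <cute/arch/mma_sm80.hpp>

// Hypothetical host-side driver: times the dummy MMA kernel with CUDA events
// and converts the number of issued MMA instructions into TFLOPS.
int main()
{
    constexpr int kM = 16, kN = 8, kK = 16;   // shape of the MMA instruction
    constexpr int kNumBlocks = 16384;
    constexpr int kNumThreadsPerBlock = 256;  // 8 warps per thread block
    constexpr int kNumIterations = 16384;     // one warp-wide MMA per iteration
    constexpr long long kNumWarps =
        static_cast<long long>(kNumBlocks) * kNumThreadsPerBlock / 32;

    uint32_t* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(uint32_t));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up once, then time a second launch.
    dummy_mma_kernel<cute::SM80_16x8x16_F16F16F16F16_TN>
        <<<kNumBlocks, kNumThreadsPerBlock>>>(d_out, kNumIterations);
    cudaEventRecord(start);
    dummy_mma_kernel<cute::SM80_16x8x16_F16F16F16F16_TN>
        <<<kNumBlocks, kNumThreadsPerBlock>>>(d_out, kNumIterations);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);

    // Each MMA instruction performs 2 * M * N * K floating-point operations,
    // and each warp issues one MMA instruction per loop iteration.
    double const flops = 2.0 * kM * kN * kK *
                         static_cast<double>(kNumWarps) * kNumIterations;
    double const tflops = flops / (elapsed_ms * 1.0e-3) / 1.0e12;
    std::printf("Performance: %.3f TFLOPS\n", tflops);

    cudaFree(d_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```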
CuTe MMA Performance Benchmark Results
I am using an NVIDIA RTX 5080 GPU, whose compute capability is SM 12.0, to run the CuTe MMA performance benchmark. The GPU has the following specifications according to the NVIDIA RTX Blackwell GPU Architecture Whitepaper.
| Data Type | Accumulate Type | Peak Performance, Dense / Sparse (TFLOPS or TOPS) |
|---|---|---|
| FP4 Tensor | FP32 | 900.4 / 1801 |
| FP8 Tensor | FP16 | 450.2 / 900.4 |
| FP8 Tensor | FP32 | 225.1 / 450.2 |
| FP16 Tensor | FP16 | 225.1 / 450.2 |
| FP16 Tensor | FP32 | 112.6 / 225.1 |
| BF16 Tensor | FP32 | 112.6 / 225.1 |
| TF32 Tensor | - | 56.3 / 112.6 |
| INT8 Tensor | - | 450.2 / 900.4 |
Not every MMA instruction that matches the data type and accumulate type combination listed in the table above can achieve the advertised peak performances. However, by exhaustively benchmarking all the available MMA Atoms in CuTe that match the data type and accumulate type combinations, I was able to find several MMA instructions that can achieve the advertised peak performances.
For example, we have the following dense MMA FP4 performance measured for a few MMA instructions. Only SM120_16x8x64_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 32> and SM120_16x8x64_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue4m3_t, 32> can achieve the advertised dense peak performance of 900.4 TOPS for FP4 Tensor with FP32 Accumulate Type on the NVIDIA RTX 5080 GPU.
| MMA Instruction | Performance (TFLOPS) |
|---|---|
| SM120_16x8x32_TN<float_e2m1_t, float_e2m1_t, float> | 238.536 |
| SM120_16x8x32_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 32> | 238.528 |
| SM120_16x8x64_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 32> | 933.118 |
| SM120_16x8x64_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue4m3_t, 32> | 923.104 |
Similarly, we have the following sparse MMA FP4 performance from a few MMA instructions. SM120_SPARSE_16x8x128_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 64>, SM120_SPARSE_16x8x128_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 32>, and SM120_SPARSE_16x8x128_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue4m3_t, 32> can achieve the advertised peak performance of 1801 TOPS for sparse FP4 Tensor with FP32 Accumulate Type on the NVIDIA RTX 5080 GPU.
| MMA Instruction | Performance (TFLOPS) |
|---|---|
| SM120_SPARSE_16x8x64_TN<float_e2m1_t, float_e2m1_t, float> | 475.522 |
| SM120_SPARSE_16x8x64_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 64> | 475.769 |
| SM120_SPARSE_16x8x128_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 64> | 1872.202 |
| SM120_SPARSE_16x8x128_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue8m0_t, 32> | 1866.996 |
| SM120_SPARSE_16x8x128_TN_VS<float_e2m1_t, float_e2m1_t, float, float_ue4m3_t, 32> | 1866.996 |
Conclusions
Not every MMA instruction that matches the data type and accumulate type combination can achieve the advertised peak performances on NVIDIA GPUs. When it comes to implementing high-performance CUDA kernels that leverage Tensor Core MMA instructions, it is important to pick the right MMA instructions that can achieve the best performance for the target GPU architecture and SKU. The CuTe MMA performance benchmark presented in this blog post can serve as a useful reference for selecting the appropriate MMA instructions for high-performance CUDA kernel development on NVIDIA GPUs.