Roofline Performance Model
Introduction
The roofline performance model offers an intuitive and insightful way to compare application performance against machine capabilities, track progress towards optimality, and identify bottlenecks, inefficiencies, and limitations in software implementations and architecture designs.
In this blog post, I would like to quickly discuss the roofline performance model and how it can be used to understand the performance gap between the theoretical peak performance and the actual performance of an application.
Roofline Performance Model
In a typical roofline model, the x-axis represents the arithmetic intensity, which is the ratio of the number of floating-point operations to the number of bytes accessed from memory, and it is measured in FLOP/byte. The y-axis represents the computational performance, which is measured in FLOP/s.
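To make the x-axis concrete, here is a minimal sketch that estimates arithmetic intensity theoretically for two common kernels. The function names are hypothetical, and the byte counts assume an idealized case in which every element moves between memory and the processor exactly once; real traffic depends on caching and the implementation.

```python
def arithmetic_intensity(flops: float, bytes_accessed: float) -> float:
    """Return arithmetic intensity in FLOP/byte."""
    if bytes_accessed <= 0:
        raise ValueError("bytes_accessed must be positive")
    return flops / bytes_accessed


def saxpy_intensity(n: int, bytes_per_element: int = 4) -> float:
    """Idealized intensity of y[i] = a * x[i] + y[i] over n elements."""
    # One multiply and one add per element.
    flops = 2 * n
    # Read x[i], read y[i], write y[i] (assumption: no reuse, no extra traffic).
    bytes_accessed = 3 * n * bytes_per_element
    return arithmetic_intensity(flops, bytes_accessed)


def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """Idealized intensity of C = A @ B for n x n matrices."""
    # n multiplies and n adds for each of the n * n output elements.
    flops = 2 * n ** 3
    # Assumption: A and B are each read once and C is written once.
    bytes_accessed = 3 * n * n * bytes_per_element
    return arithmetic_intensity(flops, bytes_accessed)


if __name__ == "__main__":
    print("SAXPY  :", saxpy_intensity(1 << 20), "FLOP/byte")
    print("MatMul :", matmul_intensity(4096), "FLOP/byte")
```

Under these assumptions, the intensity of SAXPY is a small constant regardless of the problem size, whereas the intensity of matrix multiplication grows linearly with the matrix dimension.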
The roofline consists of two lines: the memory bandwidth boundary and the peak performance boundary. The memory bandwidth boundary is the sloped line whose slope is equal to the memory bandwidth of the system. The peak performance boundary is the horizontal line whose height is equal to the peak performance of the system. Each point on the roofline represents the maximum performance that an application can possibly achieve given its arithmetic intensity.
To understand this, we have to assume an ideal implementation in which the data transfer between the memory and the processor is perfectly overlapped with the computation. In this case, the attainable performance is the lower of two limits: the memory bandwidth multiplied by the arithmetic intensity, and the peak performance of the processor. When the arithmetic intensity of the application is low, fetching the data from memory takes longer than computing on it. The application is then memory-bound, and its performance is linearly proportional to its arithmetic intensity. When the arithmetic intensity is high, computing on the data takes longer than fetching it from memory, because the peak performance of the processor has been reached. The application is then compute-bound, and its performance is a constant equal to the peak performance of the processor, independent of its arithmetic intensity.
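The reasoning above boils down to taking the minimum of two limits. The following is a minimal sketch of that calculation; the function names and the machine numbers (10 TFLOP/s peak, 500 GB/s bandwidth) are hypothetical and do not describe any particular device.

```python
def attainable_performance(
    intensity: float, peak_flops: float, mem_bandwidth: float
) -> float:
    """Roofline value in FLOP/s for an arithmetic intensity in FLOP/byte."""
    # Memory-bound limit: bandwidth (byte/s) times intensity (FLOP/byte).
    memory_limit = mem_bandwidth * intensity
    # The roofline is whichever of the two limits is lower.
    return min(peak_flops, memory_limit)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity at which the sloped and horizontal lines meet."""
    return peak_flops / mem_bandwidth


def classify(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Label an application as memory-bound or compute-bound."""
    if intensity < ridge_point(peak_flops, mem_bandwidth):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    PEAK = 10e12       # hypothetical peak performance in FLOP/s
    BANDWIDTH = 500e9  # hypothetical memory bandwidth in byte/s
    for ai in (0.5, 5.0, 50.0, 500.0):
        perf = attainable_performance(ai, PEAK, BANDWIDTH)
        print(f"AI={ai:7.1f} FLOP/byte  roof={perf:.3e} FLOP/s  "
              f"{classify(ai, PEAK, BANDWIDTH)}")
```

The ridge point is the arithmetic intensity at which an ideal application switches from memory-bound to compute-bound; with the hypothetical numbers above it is 20 FLOP/byte.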
The roofline can be drawn entirely from the system's specifications, namely the peak performance of the processor and the memory bandwidth of the system. The arithmetic intensity of an application can be derived theoretically or measured experimentally, whereas its performance can only be measured experimentally. The point given by the measured performance and arithmetic intensity of a real application, sometimes referred to as the achieved value, cannot lie above the roofline and usually falls below it, because the utilization of the memory bandwidth and/or the peak performance of the processor is usually suboptimal, especially for complex applications. Given an estimate of the underutilized memory bandwidth and/or peak performance, we can draw a new roofline that is lower than the original one, which is called the discounted roofline. The achieved value of a real application will lie on the discounted roofline if the underutilization is estimated correctly.
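One simple way to model the discounted roofline is to scale each limit by a utilization factor between 0 and 1. This is only a sketch under that assumption; the function name, the utilization parameters, and the numbers are hypothetical.

```python
def discounted_roofline(
    intensity: float,
    peak_flops: float,
    mem_bandwidth: float,
    compute_utilization: float = 1.0,
    bandwidth_utilization: float = 1.0,
) -> float:
    """Roofline value in FLOP/s after discounting each limit.

    Assumption: underutilization is modeled as a constant factor in (0, 1]
    applied to the peak performance and to the memory bandwidth.
    """
    for name, value in (
        ("compute_utilization", compute_utilization),
        ("bandwidth_utilization", bandwidth_utilization),
    ):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1]")
    # Effective limits after discounting.
    effective_peak = peak_flops * compute_utilization
    effective_bandwidth = mem_bandwidth * bandwidth_utilization
    return min(effective_peak, effective_bandwidth * intensity)


if __name__ == "__main__":
    PEAK = 10e12       # hypothetical peak performance in FLOP/s
    BANDWIDTH = 500e9  # hypothetical memory bandwidth in byte/s
    for ai in (1.0, 100.0):
        original = discounted_roofline(ai, PEAK, BANDWIDTH)
        discounted = discounted_roofline(ai, PEAK, BANDWIDTH, 0.5, 0.8)
        print(f"AI={ai:6.1f}  original={original:.3e}  "
              f"discounted={discounted:.3e}")
```

With both factors set to 1.0 the function reduces to the original roofline, so the discounted roofline never lies above it.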
On the CUDA platform, there are many reasons why the memory bandwidth and/or the peak performance of the processor are underutilized. If the memory access is not coalesced, the effective memory bandwidth is discounted. If the hardware unit intended for a specific numerical data type, such as the FP8 Tensor Core, is not used, the corresponding peak performance will never be achieved.
Given the achieved value of a real application under the roofline, it is often impossible to determine, just by looking at the roofline chart, why the application did not reach the highest performance possible on the roofline. The cause can be underutilized memory bandwidth, underutilized peak performance of the processor, or both.