Since I have limited knowledge of computer hardware and systems, as a new NVIDIA employee I would like to catch up quickly. So the first thing I need to learn about is graphics cards, which I have been using in my daily work without fully understanding how they work. The bottom line is to at least understand the specifications of a graphics card and know what you can do with it.
This blog post might contain stupid errors since I am also learning. So please correct me if I made any mistakes.
NVIDIA RTX 2080 Ti and AMD VEGA 64, which are currently the best consumer-focused graphics cards from the NVIDIA and AMD gaming platforms respectively, were used for the analysis and comparison in this blog post.
The specifications were pulled from the official NVIDIA and AMD websites and from TechPowerUp.
| Specs | NVIDIA RTX 2080 Ti | AMD VEGA 64 |
|---|---|---|
| Cores / Processors | 4352 | 4096 |
| Base Clock | 1350 MHz | 1247 MHz |
| Boost Clock | 1635 MHz (OC) | 1546 MHz |
| Process Size | 12 nm | 14 nm |
| Memory Bandwidth | 616 GB/s | 484 GB/s |
| Memory Size | 11 GB | 8 GB |
| Texture Mapping Units | 272 | 256 |
| Render Output Processors | 88 | 64 |
| Ray Tracing Cores | 68 | None |
| Single Precision Performance | 13.4 TFLOPS | 12.5 TFLOPS |
| Texture Rate | 420.2 GT/s | 393.2 GT/s |
| Pixel Rate | 136.0 GP/s | 98.30 GP/s |
CUDA Cores / Stream Processors
CUDA (Compute Unified Device Architecture) cores on NVIDIA GPUs, which correspond to “Stream Processors” on AMD GPUs, are the processing units of the GPU. Multiple CUDA cores work in parallel on a task. For GPUs of the same generation or architecture, more CUDA cores usually mean higher compute performance. This relationship does not hold across different GPU generations or architectures, because the internal implementation of a CUDA core can differ. For the same reason, the number of CUDA cores on an NVIDIA GPU is not directly comparable to the number of stream processors on an AMD GPU.
Base Clock / Boost Clock
The concept of base clock, or base frequency, is similar to CPU frequency: the higher the frequency, the faster the GPU processes a task. Likewise, boost clock is similar to the turbo frequency on Intel CPUs. When the GPU is under heavy load and has power and thermal headroom, it automatically raises its frequency in order to process the task faster.
Process Size
When we look at CPUs, we find that the process size becomes smaller and smaller with each new generation, and the same is true of GPUs. A smaller process size means that more transistors can fit in a given area. While the basic functionality is preserved, energy consumption and production cost also drop. It should be noted that we ignore the quantum tunnelling effect in this discussion.
Memory
GPU memory is also a critical part of a graphics card, because the GPU communicates with it directly.
NVIDIA uses GDDR6 memory for RTX 2080 Ti, while AMD uses HBM2 memory for VEGA 64.
GDDR stands for “Graphics Double Data Rate”, and HBM stands for “High Bandwidth Memory”. GDDR is usually cheaper and easier to manufacture than HBM, while HBM offers higher maximum bandwidth and lower power consumption. One GDDR6 chip usually has a 32-bit bus. One HBM2 stack consists of at most 8 stacked DRAM dies and exposes a 1024-bit interface, organized as eight 128-bit channels.
It should be noted that NVIDIA does use HBM2 memory in its high-end data center cards such as Tesla V100.
The width of the memory interface (memory bus) essentially determines how many memory chips can connect to the GPU. If the memory interface is 352 bits wide and each memory chip has a 32-bit interface, the GPU can connect to at most 11 memory chips. So I infer that the RTX 2080 Ti uses eleven 1 GB GDDR6 memory chips.
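The arithmetic above can be sketched in a few lines of Python. The helper names are made up for illustration, and the 14 Gbps per-pin data rate for the card's GDDR6 is my assumption rather than something listed in the spec table, although the result it gives does agree with the 616 GB/s in the table.

```python
def max_memory_chips(bus_width_bits: int, chip_width_bits: int = 32) -> int:
    """Number of memory chips that fit on a memory bus of the given width."""
    return bus_width_bits // chip_width_bits


def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width (bits) x per-pin data rate (Gbit/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8


# 352-bit bus populated with 32-bit GDDR6 chips
print(max_memory_chips(352))           # 11

# Assumed 14 Gbps per pin (not taken from the spec table)
print(peak_bandwidth_gb_s(352, 14.0))  # 616.0
```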
While HBM does provide a much higher maximum bandwidth (bandwidth ceiling) than GDDR, the actual memory bandwidth per GB of memory on the AMD VEGA 64 (484 / 8 = 60.5 GB/s per GB) is not much better than that of the RTX 2080 Ti (616 / 11 = 56 GB/s per GB). So I think using expensive HBM2 memory on the AMD VEGA 64 might be overkill, and the bandwidth HBM2 actually delivers really depends on how it is implemented.
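For reference, the per-GB comparison is just the memory bandwidth divided by the memory size, using the numbers from the spec table. A minimal sketch (the function and variable names are only illustrative):

```python
# (memory bandwidth in GB/s, memory size in GB), taken from the spec table
cards = {
    "NVIDIA RTX 2080 Ti": (616.0, 11.0),
    "AMD VEGA 64": (484.0, 8.0),
}


def bandwidth_per_gb(bandwidth_gb_s: float, size_gb: float) -> float:
    """Memory bandwidth available per GB of on-board memory."""
    return bandwidth_gb_s / size_gb


for name, (bandwidth, size) in cards.items():
    print(f"{name}: {bandwidth_per_gb(bandwidth, size):.1f} GB/s per GB")
```

This is a crude metric: it says nothing about latency or about how well a given workload can actually use the bandwidth.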
Texture Mapping Units
A texture mapping unit (TMU) is able to rotate, resize, and distort a bitmap image (that is, to perform texture sampling). It is reasonable to assume that a card with more TMUs will be faster at processing texture information.
Render Output Processors
The render output processors (ROPs), also known as raster operation processors, are responsible for writing pixel data to memory. The speed at which this is done is known as the fill rate. While the job of the ROPs is important, it is no longer the performance bottleneck it once was, and the ROP count is not a very useful indicator of relative performance today.
Tensor Cores
Tensor cores, invented by NVIDIA, are basically programmable matrix-multiply-and-accumulate units that accelerate deep learning training and inference, providing up to 500 trillion tensor operations per second. Tensor cores used to be available only on the high-end NVIDIA Tesla V100 and Titan V. However, with the Turing architecture, all of the GeForce RTX 20 series graphics cards have tensor cores, including even the RTX 2060.
AMD graphics cards do not have such units or anything similar, to the best of my knowledge.
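To make "matrix-multiply-and-accumulate" concrete, here is a minimal pure-Python sketch of the operation D = A × B + C. The 4x4 size and the input values are assumptions chosen for illustration. A real tensor core performs this in hardware on small tiles with reduced-precision inputs, which this sketch does not model.

```python
def matmul_accumulate(a, b, c):
    """Return D = A x B + C for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical 4x4 inputs, chosen only for illustration
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d = matmul_accumulate(identity, b, c)
print(d[0])  # [1.0, 2.0, 3.0, 4.0]
```

Deep learning workloads spend most of their time in exactly this kind of operation, which is why a dedicated unit for it pays off.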
Ray Tracing Cores
The most exciting feature of the NVIDIA RTX 20 series graphics cards is ray tracing. The ray tracing cores are, of course, used to accelerate the computation of real-time ray tracing algorithms.
Single Precision Performance
Single precision performance is an indication of how fast a graphics card can perform single precision (float32) operations. When I use TensorFlow, the data types I use for tensors are usually single precision.
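A rough way to sanity-check such a figure is to multiply the number of cores by the clock and by the number of operations each core completes per cycle. The sketch below assumes 2 floating-point operations per core per cycle (one fused multiply-add). That assumption is mine and is not stated in the spec sources, and the function name is made up. With the boost clocks from the spec table, the results land in the same ballpark as the listed TFLOPS values but not exactly on them, presumably because the listed values were computed at slightly different clocks.

```python
def peak_fp32_tflops(cores: int, clock_mhz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    Assumes each core retires `flops_per_core_per_cycle` operations per clock
    (2 for a fused multiply-add); this is an assumption, not a vendor figure.
    """
    return cores * clock_mhz * 1e6 * flops_per_core_per_cycle / 1e12


# Core counts and boost clocks from the spec table
print(f"RTX 2080 Ti: {peak_fp32_tflops(4352, 1635):.2f} TFLOPS")  # 14.23
print(f"VEGA 64:     {peak_fp32_tflops(4096, 1546):.2f} TFLOPS")  # 12.66
```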
Texture Rate
Texture rate is the maximum number of texture map elements (texels) that can be processed per second. The higher the texture rate, the more fluently the card renders demanding games. I think this is an indication of whether the graphics card can support high refresh rate monitors or not.
Pixel Rate
Pixel rate is the maximum number of pixels the GPU can write to local memory in one second. The higher the pixel rate, the higher the screen resolution the GPU can support. I think this is an indication of whether the graphics card can support high-resolution monitors or not.
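Both rates are commonly derived as the number of units times the clock: TMUs × clock for texture rate and ROPs × clock for pixel rate. That relationship is my assumption here, since the spec sources do not spell it out, and the function names are hypothetical. Running it backwards on the table values gives the clock each listed rate implies, which comes out at roughly 1545 MHz for the RTX 2080 Ti and roughly 1536 MHz for the VEGA 64, slightly different from the boost clocks in the table.

```python
def fill_rate_g_per_s(units: int, clock_mhz: float) -> float:
    """Texture rate (GT/s) or pixel rate (GP/s), assuming one element per unit per cycle."""
    return units * clock_mhz / 1000


def implied_clock_mhz(rate_g_per_s: float, units: int) -> float:
    """Clock in MHz that a listed rate implies under the same assumption."""
    return rate_g_per_s * 1000 / units


# (texture rate GT/s, TMUs, pixel rate GP/s, ROPs) from the spec table
cards = {
    "NVIDIA RTX 2080 Ti": (420.2, 272, 136.0, 88),
    "AMD VEGA 64": (393.2, 256, 98.30, 64),
}

for name, (texture_rate, tmus, pixel_rate, rops) in cards.items():
    print(
        f"{name}: texture rate implies {implied_clock_mhz(texture_rate, tmus):.0f} MHz, "
        f"pixel rate implies {implied_clock_mhz(pixel_rate, rops):.0f} MHz"
    )
```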