
Lei Mao


Introduction

I have limited knowledge of computer hardware and systems, and as a new NVIDIA employee I would like to catch up quickly. The first thing I need to learn about is graphics cards, which I have been using in my daily work without fully understanding how they work. The minimum goal is to understand the specifications of a graphics card and to know what can be done with it.


This blog post might contain errors since I am still learning, so please correct me if I have made any mistakes.

Graphics Cards

NVIDIA RTX 2080 Ti and AMD VEGA 64, currently the best consumer-focused graphics cards from the NVIDIA and AMD gaming platforms respectively, are used for the analysis and comparison in this blog post.

AMD vs NVIDIA

Specifications

The specifications were pulled from the official NVIDIA and AMD websites and from TechPowerUp.

| Specs | NVIDIA RTX 2080 Ti | AMD VEGA 64 |
|---|---|---|
| Architecture | Turing | Vega |
| Release Date | 9/20/2018 | 8/7/2017 |
| Price | ~$1200 | ~$500 |
| Cores / Processors | 4352 | 4096 |
| Base Clock | 1350 MHz | 1247 MHz |
| Boost Clock | 1635 MHz (OC) | 1546 MHz |
| Process Size | 12 nm | 14 nm |
| Memory Type | GDDR6 | HBM2 |
| Memory Interface | 352-bit | 2048-bit |
| Memory Bandwidth | 616 GB/s | 484 GB/s |
| Memory Size | 11 GB | 8 GB |
| Texture Mapping Units | 272 | 256 |
| Render Output Processors | 88 | 64 |
| Tensor Cores | 544 | None |
| Ray Tracing Cores | 68 | None |
| Single Precision Performance | 13.4 TFLOPS | 12.5 TFLOPS |
| Texture Rate | 420.2 GT/s | 393.2 GT/s |
| Pixel Rate | 136.0 GP/s | 98.30 GP/s |

GPU Engines

CUDA Cores / Stream Processors

CUDA (Compute Unified Device Architecture) cores on NVIDIA GPUs, which correspond to "Stream Processors" on AMD GPUs, are the processing units of the GPU. Multiple CUDA cores work on a task in parallel. For GPUs from the same generation or architecture, more CUDA cores usually means higher compute performance. This relationship does not hold across GPU generations or architectures, because the internal implementation of a CUDA core can differ. For the same reason, the number of CUDA cores on an NVIDIA GPU is not directly comparable to the number of stream processors on an AMD GPU.

Base Clock / Boost Clock

The base clock, or base frequency, of a GPU is similar to the base frequency of a CPU: the higher the frequency, the faster the GPU processes a task. The boost clock is similar to the turbo frequency on Intel CPUs. When the GPU is under heavy load and has enough power and thermal headroom, it automatically raises its frequency in order to process the task faster.

Process Size

Looking at CPUs, the process size has become smaller and smaller with each new generation. A smaller process size means that more transistors fit in a given area. The basic functionality is preserved, while energy consumption and production cost per transistor generally drop. Note that this discussion ignores the quantum tunnelling effect.

GPU Memory

GPU memory is also a critical part of a graphics card, because the GPU communicates with it directly.

Memory Type

NVIDIA uses GDDR6 memory for RTX 2080 Ti, while AMD uses HBM2 memory for VEGA 64.


GDDR stands for "Graphics Double Data Rate", and HBM stands for "High Bandwidth Memory". GDDR is usually cheaper and easier to manufacture than HBM, while HBM offers a higher maximum bandwidth and lower power consumption. One GDDR6 chip usually has a 32-bit bus width. One HBM2 stack consists of at most 8 stacked DRAM dies and exposes a 1024-bit interface, organized as eight 128-bit channels, so the 2048-bit interface of VEGA 64 corresponds to two HBM2 stacks.


It should be noted that NVIDIA does use HBM2 memory in its high-end professional graphics cards such as the Tesla V100.

Memory Interface

The memory interface (memory bus) essentially determines how many memory chips can be connected to the GPU. If the memory interface is 352-bit and each memory chip has a 32-bit interface, the GPU can connect to at most 352 / 32 = 11 memory chips. So I infer that RTX 2080 Ti uses eleven 1 GB GDDR6 memory chips.
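The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name `max_memory_chips` is made up for this post, and the 1 GB per chip figure is the inference from the paragraph above, not a vendor specification.

```python
def max_memory_chips(bus_width_bits: int, chip_width_bits: int = 32) -> int:
    """Return how many chips fit on the bus, assuming one chip per 32-bit slice."""
    if bus_width_bits % chip_width_bits != 0:
        raise ValueError("bus width is not a whole multiple of the chip width")
    return bus_width_bits // chip_width_bits


# 352-bit is the RTX 2080 Ti memory interface from the specification table.
chips = max_memory_chips(352)

# Assumption: every chip holds 1 GB, as inferred in the text.
capacity_gb = chips * 1

print(f"{chips} chips, {capacity_gb} GB in total")
```

The result agrees with the 11 GB memory size listed in the specification table.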

Memory Bandwidth

While HBM does provide a much higher maximum bandwidth (bandwidth ceiling) than GDDR, the actual memory bandwidth per GB of memory on AMD VEGA 64 (484 / 8 = 60.5 GB/s per GB) is not much better than that on RTX 2080 Ti (616 / 11 = 56 GB/s per GB), and the total bandwidth of RTX 2080 Ti is in fact higher. So I think using expensive HBM2 memory on AMD VEGA 64 might be overkill, and the bandwidth that HBM2 actually delivers depends on how it is implemented.
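As a rough cross-check, peak memory bandwidth is commonly estimated as bus width in bytes multiplied by the data rate per pin. The sketch below applies that formula. The per-pin data rates (14 Gbps for GDDR6 and 1.89 Gbps for HBM2) are assumptions that do not appear in the table above, and the function names are made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bytes moved per transfer times transfers per pin."""
    return bus_width_bits / 8 * data_rate_gbps


def bandwidth_per_gb(bandwidth_gb_s: float, memory_gb: float) -> float:
    """Bandwidth normalized by capacity, in GB/s per GB of memory."""
    return bandwidth_gb_s / memory_gb


# Bus widths and memory sizes come from the specification table.
# The per-pin data rates are assumptions, not values from the table.
rtx_2080_ti = peak_bandwidth_gb_s(352, 14.0)
vega_64 = peak_bandwidth_gb_s(2048, 1.89)

print(f"RTX 2080 Ti: {rtx_2080_ti:.1f} GB/s, "
      f"{bandwidth_per_gb(rtx_2080_ti, 11):.1f} GB/s per GB")
print(f"VEGA 64:     {vega_64:.1f} GB/s, "
      f"{bandwidth_per_gb(vega_64, 8):.1f} GB/s per GB")
```

Under these assumed data rates the estimates land on or very close to the 616 GB/s and 484 GB/s listed in the table, which suggests the wide HBM2 bus is offset by a much lower per-pin rate.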

Other Modules

Texture Mapping Units

A texture mapping unit (TMU) is able to rotate, resize, and distort a bitmap image (that is, to perform texture sampling). It is reasonable to assume that a card with more TMUs will be faster at processing texture information.

Render Output Processors

The render output processors (ROPs), also known as raster operation processors, are responsible for writing pixel data to memory. The speed at which this is done is known as the fill rate. While the job of the ROPs is important, it is no longer the performance bottleneck it once was, and the ROP count is not a very useful indicator of relative performance today.

Tensor Cores

Tensor cores, introduced by NVIDIA, are programmable matrix-multiply-and-accumulate units that accelerate deep learning training and inference. According to NVIDIA, they provide up to 500 trillion tensor operations per second. Tensor cores used to be available only on NVIDIA's high-end Tesla V100 and Titan V. With the Turing architecture, however, all GeForce RTX 20 series graphics cards have tensor cores, including the RTX 2060.
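To make "matrix-multiply-and-accumulate" concrete, here is a minimal pure-Python sketch of the operation D = A x B + C. It assumes the commonly described mixed-precision setup (FP16 inputs for A and B, FP32 accumulation) on small 4x4 tiles. Those details and all function names are assumptions for illustration; the sketch only emulates the arithmetic and says nothing about how the hardware is built or programmed.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma(a, b, c):
    """Return D = A x B + C with FP16 inputs A, B and FP32 accumulation."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = to_fp32(c[i][j])  # start from the accumulator tile C
            for k in range(inner):
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + product)  # keep the running sum in FP32
            d[i][j] = acc
    return d


# Example tiles: A is the identity, so D should equal B + C.
a = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]
d = mma(a, b, c)
print(d[0])
```

With the identity matrix as A, every entry of D is the matching entry of B plus 0.5, which makes the result easy to check by hand.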


To the best of my knowledge, AMD graphics cards do not have such modules or similar ones.

Ray Tracing Cores

The most exciting feature of the NVIDIA RTX 20 series graphics cards is ray tracing. The ray tracing cores are used to accelerate the computation of real-time ray tracing algorithms.

Evaluation Metrics

Single Precision Performance

Single precision performance indicates how fast a graphics card can perform single precision (32-bit floating point) operations. When I use TensorFlow, the data types I use for tensors are usually single precision.
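Theoretical single precision throughput is usually estimated as the number of cores times the floating point operations per core per cycle times the clock. The sketch below assumes each core completes one fused multiply-add (counted as 2 operations) per cycle. The 1545 MHz clock used for RTX 2080 Ti is an assumed reference boost clock rather than the overclocked 1635 MHz from the table, and the function name is made up for illustration.

```python
def peak_fp32_tflops(cores: int, clock_mhz: float,
                     flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical FP32 throughput in TFLOPS (2 FLOPs per cycle = one FMA)."""
    return cores * flops_per_core_per_cycle * clock_mhz * 1e6 / 1e12


# Core counts come from the specification table.
# 1545 MHz is an assumed reference boost clock for RTX 2080 Ti.
rtx_2080_ti = peak_fp32_tflops(4352, 1545)
# 1546 MHz is the VEGA 64 boost clock from the table.
vega_64 = peak_fp32_tflops(4096, 1546)

print(f"RTX 2080 Ti: {rtx_2080_ti:.2f} TFLOPS")
print(f"VEGA 64:     {vega_64:.2f} TFLOPS")
```

The estimates come out close to, but not exactly at, the figures in the table, since the published numbers depend on which clock the source used.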

Texture Rate

Texture rate is the maximum number of texture map elements (texels) that can be processed per second. The higher the texture rate, the more fluently the card can render demanding games. I think this is an indication of whether the graphics card can support high refresh rate monitors.

Pixel Rate

Pixel rate is the maximum number of pixels the GPU can write to local memory in one second. The higher the pixel rate, the higher the screen resolution the GPU can support. I think this is an indication of whether the graphics card can support high resolution monitors.
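Both texture rate and pixel rate are commonly estimated as the number of units (TMUs or ROPs) multiplied by the clock, assuming each unit handles one element per cycle. The sketch below applies that estimate. The 1545 MHz clock for RTX 2080 Ti is an assumption that is not in the table, and the function name is made up for illustration.

```python
def fill_rate_giga_per_s(units: int, clock_mhz: float) -> float:
    """Elements per second in billions, assuming one element per unit per cycle."""
    return units * clock_mhz / 1000.0


# Unit counts come from the specification table.
# 1545 MHz is an assumed reference boost clock for RTX 2080 Ti.
texture_rate_2080_ti = fill_rate_giga_per_s(272, 1545)  # TMUs -> GT/s
pixel_rate_2080_ti = fill_rate_giga_per_s(88, 1545)     # ROPs -> GP/s

# 1546 MHz is the VEGA 64 boost clock from the table.
texture_rate_vega_64 = fill_rate_giga_per_s(256, 1546)
pixel_rate_vega_64 = fill_rate_giga_per_s(64, 1546)

print(f"RTX 2080 Ti: {texture_rate_2080_ti:.1f} GT/s, {pixel_rate_2080_ti:.1f} GP/s")
print(f"VEGA 64:     {texture_rate_vega_64:.1f} GT/s, {pixel_rate_vega_64:.1f} GP/s")
```

The RTX 2080 Ti estimates round to the table values under the assumed clock, while the VEGA 64 estimates come out slightly above the table, which suggests the source used a slightly lower clock for that card.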

References

  • https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/
  • https://www.amd.com/en/products/graphics/radeon-rx-vega-64
  • https://www.gamersnexus.net/dictionary/2-cuda-cores
  • https://www.maketecheasier.com/processors-process-size/
  • https://www.techpowerup.com/forums/threads/explain-to-me-how-memory-width-128-192-256-bit-etc-is-related-to-memory-amount.170588/
  • https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
  • https://www.reddit.com/r/Amd/comments/97w8l3/how_will_amd_compete_with_nvidia_rtx/
  • https://www.anandtech.com/show/13249/nvidia-announces-geforce-rtx-20-series-rtx-2080-ti-2080-2070
  • https://www.tomshardware.com/reviews/graphics-beginners-2,1292-5.html
  • https://www.gamersnexus.net/guides/1747-what-is-texture-fill-rate-defined
  • https://stackoverflow.com/questions/15224095/gpu-pixel-and-texel-write-speed