# CUDA Matrix Multiplication

## Introduction

CUDA is a parallel computing platform and programming language that allows software to use certain types of graphics processing unit (GPU) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). It could significantly enhance the performance of programs that could be computed with massive parallelism.

Matrix multiplication is a typical application that could be computed with massive parallelism. In this blog post, I would like to present a “hello-world” CUDA example of matrix multiplications and its preliminary optimizations.

## Matrix Multiplication

There are two common matrix multiplication forms. The ordinary matrix multiplication mm and the batched matrix multiplication bmm.

\begin{align} \mathbf{C}^{n \times p} &= \mathbf{A}^{n \times m} \mathbf{B}^{m \times p} \\ \mathbf{C}^{b \times n \times p} &= \mathbf{A}^{b \times n \times m} \mathbf{B}^{b \times m \times p} \\ \end{align}

The reader could find the specifications of mm and bmm from PyTorch documentation torch.mm and torch.bmm.

In the following example, we first implemented the mm and bmm using C++. Then we implemented the mm using CUDA and naturally extended the mm implementation to the bmm implementation. Finally, we verified the correctness of the mm and bmm CUDA implementations.

### Naive Implementation

This is the single source code file that contains the CPU and CUDA implementations for the matrix multiplication mm and the batched matrix multiplication bmm.

### Run Naive Example

Building and running the example requires an NVIDIA GPU. We also used NVIDIA official Docker container to set up the building environment.

To start the Docker container, please run the following command on the host computer.

To build and run the application, please run the following command in the Docker container.

We should expect no assertion error or any other kind of error for build and execution. The latencies were measured on a NVIDIA RTX 3090 GPU.

## Matrix Multiplication Optimizations

The CUDA kernel optimization is usually all about how to accelerate the data traffic without affecting the number of math operations. To get the CUDA kernel fully optimized for GPU, the user would have to be very experienced with low-level GPU features and specifications and CUDA programming. But this does not prevent us from doing some preliminary optimization based on some shallow understandings of GPU.

### Make Matrix Multiplication More Math-Bound

GPU is very friendly with math-bound operations. According to my previous blog post “Math-Bound VS Memory-Bound Operations”, if the number of operations remains the same and the number of memory IO bytes gets reduced, the operation will become more math-bound. That is to say, we want

$$\begin{gather} \frac{N_{\text{op}}}{N_{\text{byte}}} > \frac{\text{BW}_{\text{math}}}{\text{BW}_{\text{mem}}} \end{gather}$$

In our matrix multiplication naive CUDA implementation,

\begin{align} \mathbf{C}^{n \times p} &= \mathbf{A}^{n \times m} \mathbf{B}^{m \times p} \\ \end{align}

We have to do $mnp$ multiplication and additions, $2mnp$ reads from memory, and $mp$ writes to memory. We could ignore the $mp$ writes from memory IO because the $2mnp$ reads is usually much more than the $mp$ writes.

Suppose we are doing FP32 matrix multiplication,

\begin{align} \frac{N_{\text{op}}}{N_{\text{byte}}} &= \frac{2 \times mnp}{2mnp \times 4} \\ &= \frac{1}{4} \\ \end{align}

For a modern GPU such as NVIDIA RTX 3090, for FP32 math,

\begin{align} \frac{\text{BW}_{\text{math}}}{\text{BW}_{\text{mem}}} &= \frac{35.58}{0.936} \\ &= 38.0 \\ \end{align}

We could see that the naive CUDA matrix multiplication implementation does not get even close to math-bound. Since $N_{\text{op}}$ should be a constant in matrix multiplication, let’s see if we could reduce $N_{\text{byte}}$ by caching.

Ideally, if we could cache the two full operand matrices $\mathbf{A}^{n \times m}$ and $\mathbf{B}^{m \times p}$, we could make the matrix multiplication most math-bound. However, since the caching size is limited and the implementation is supposed to support matrix multiplications with all different sizes, caching the full matrices is not technically possible.

### Matrix Multiplication Decomposition

It is possible to decompose matrix multiplication mm into smaller matrix multiplications.

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{1,1}^{d \times d} & \mathbf{A}_{1,2}^{d \times d} & \cdots & \mathbf{A}_{1,n/d}^{d \times d} \\ \mathbf{A}_{2,1}^{d \times d} & \mathbf{A}_{2,2}^{d \times d} & \cdots & \mathbf{A}_{2,n/d}^{d \times d} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{A}_{m/d,1}^{d \times d} & \mathbf{A}_{m/d,2}^{d \times d} & \cdots & \mathbf{A}_{m/d,n/d}^{d \times d} \\ \end{bmatrix}$$

$$\mathbf{B} = \begin{bmatrix} \mathbf{B}_{1,1}^{d \times d} & \mathbf{B}_{1,2}^{d \times d} & \cdots & \mathbf{B}_{1,p/d}^{d \times d} \\ \mathbf{B}_{2,1}^{d \times d} & \mathbf{B}_{2,2}^{d \times d} & \cdots & \mathbf{B}_{2,p/d}^{d \times d} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{B}_{n/d,1}^{d \times d} & \mathbf{B}_{n/d,2}^{d \times d} & \cdots & \mathbf{B}_{n/d,p/d}^{d \times d} \\ \end{bmatrix}$$

$$\mathbf{C} = \begin{bmatrix} \mathbf{C}_{1,1}^{d \times d} & \mathbf{C}_{1,2}^{d \times d} & \cdots & \mathbf{C}_{1,p/d}^{d \times d} \\ \mathbf{C}_{2,1}^{d \times d} & \mathbf{C}_{2,2}^{d \times d} & \cdots & \mathbf{C}_{2,p/d}^{d \times d} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{C}_{m/d,1}^{d \times d} & \mathbf{C}_{m/d,2}^{d \times d} & \cdots & \mathbf{C}_{m/d,p/d}^{d \times d} \\ \end{bmatrix}$$

$$\mathbf{C}_{i,j}^{d \times d} = \sum_{k=1}^{n/d} \mathbf{A}_{i,k}^{d \times d} \mathbf{B}_{k,j}^{d \times d}$$

The decomposition does not alter the number of operations $N_{\text{op}}$.

\begin{align} N_{\text{op}} &= 2d^3 \left( \frac{n}{d} \right) \left( \frac{m}{d} \frac{p}{d}\right) \\ &= 2mnp \\ \end{align}

Because small matrices $\mathbf{A}_{i,k}^{d \times d}$ and $\mathbf{B}_{k,j}^{d \times d}$ could be cached, the memory IO bytes could be reduced, and the overall matrix multiplication could become more math bound. Let’s calculate how much memory IO bytes is needed in this case.

\begin{align} N_{\text{byte}} &= 2d^2 \times 4 \times \left( \frac{n}{d} \right) \left( \frac{m}{d} \frac{p}{d}\right) \\ &= \frac{8mnp}{d} \\ \end{align}

Therefore,

\begin{align} \frac{N_{\text{op}}}{N_{\text{byte}}} &= \frac{2mnp}{\frac{8mnp}{d}} \\ &= \frac{d}{4} \\ \end{align}

Notice that when $d=1$, the matrix multiplication falls back to the naive matrix multiplication. When $d$ becomes larger, the implementation becomes more math-bound.

### Optimized Implementation

The following implementation decomposed the matrix multiplication into multiple small matrix multiplications. The source code could be found on GitHub.

### Run Optimized Example

In the same Docker container, build and run the following application. We could see that the latency of INT32 and FP32 matrix multiplication got improved too different degrees.

## Miscellaneous

There are more subtle factors affecting the performance and there are more optimization opportunities to further optimize the matrix multiplication implementation. But those requires more thorough understanding of GPU and CUDA.

Lei Mao

03-21-2022

03-04-2023