CUDA Matrix Multiplication
Introduction
CUDA is a parallel computing platform and programming language that allows software to use certain types of graphics processing unit (GPU) for general purpose processing, an approach called generalpurpose computing on GPUs (GPGPU). It could significantly enhance the performance of programs that could be computed with massive parallelism.
Matrix multiplication is a typical application that could be computed with massive parallelism. In this blog post, I would like to present a “helloworld” CUDA example of matrix multiplications and its preliminary optimizations.
Matrix Multiplication
There are two common matrix multiplication forms. The ordinary matrix multiplication mm
and the batched matrix multiplication bmm
.
$$
\begin{align}
\mathbf{C}^{n \times p} &= \mathbf{A}^{n \times m} \mathbf{B}^{m \times p} \\
\mathbf{C}^{b \times n \times p} &= \mathbf{A}^{b \times n \times m} \mathbf{B}^{b \times m \times p} \\
\end{align}
$$
The reader could find the specifications of mm
and bmm
from PyTorch documentation torch.mm
and torch.bmm
.
In the following example, we first implemented the mm
and bmm
using C++. Then we implemented the mm
using CUDA and naturally extended the mm
implementation to the bmm
implementation. Finally, we verified the correctness of the mm
and bmm
CUDA implementations.
Naive Implementation
This is the single source code file that contains the CPU and CUDA implementations for the matrix multiplication mm
and the batched matrix multiplication bmm
.
1 

Run Naive Example
Building and running the example requires an NVIDIA GPU. We also used NVIDIA official Docker container to set up the building environment.
To start the Docker container, please run the following command on the host computer.
1  $ docker run it rm gpus all capadd=SYS_PTRACE securityopt seccomp=unconfined v $(pwd):/mnt nvcr.io/nvidia/cuda:11.7.1develubuntu22.04 
To build and run the application, please run the following command in the Docker container.
1  $ cd /mnt/ 
We should expect no assertion error or any other kind of error for build and execution. The latencies were measured on a NVIDIA RTX 3090 GPU.
Matrix Multiplication Optimizations
The CUDA kernel optimization is usually all about how to accelerate the data traffic without affecting the number of math operations. To get the CUDA kernel fully optimized for GPU, the user would have to be very experienced with lowlevel GPU features and specifications and CUDA programming. But this does not prevent us from doing some preliminary optimization based on some shallow understandings of GPU.
Make Matrix Multiplication More MathBound
GPU is very friendly with mathbound operations. According to my previous blog post “MathBound VS MemoryBound Operations”, if the number of operations remains the same and the number of memory IO bytes gets reduced, the operation will become more mathbound. That is to say, we want
$$
\begin{gather}
\frac{N_{\text{op}}}{N_{\text{byte}}} > \frac{\text{BW}_{\text{math}}}{\text{BW}_{\text{mem}}}
\end{gather}
$$
In our matrix multiplication naive CUDA implementation,
$$
\begin{align}
\mathbf{C}^{n \times p} &= \mathbf{A}^{n \times m} \mathbf{B}^{m \times p} \\
\end{align}
$$
We have to do $mnp$ multiplication and additions, $2mnp$ reads from memory, and $mp$ writes to memory. We could ignore the $mp$ writes from memory IO because the $2mnp$ reads is usually much more than the $mp$ writes.
Suppose we are doing FP32 matrix multiplication,
$$
\begin{align}
\frac{N_{\text{op}}}{N_{\text{byte}}}
&= \frac{2 \times mnp}{2mnp \times 4} \\
&= \frac{1}{4} \\
\end{align}
$$
For a modern GPU such as NVIDIA RTX 3090, for FP32 math,
$$
\begin{align}
\frac{\text{BW}_{\text{math}}}{\text{BW}_{\text{mem}}} &= \frac{35.58}{0.936} \\
&= 38.0 \\
\end{align}
$$
We could see that the naive CUDA matrix multiplication implementation does not get even close to mathbound. Since $N_{\text{op}}$ should be a constant in matrix multiplication, let’s see if we could reduce $N_{\text{byte}}$ by caching.
Ideally, if we could cache the two full operand matrices $\mathbf{A}^{n \times m}$ and $\mathbf{B}^{m \times p}$, we could make the matrix multiplication most mathbound. However, since the caching size is limited and the implementation is supposed to support matrix multiplications with all different sizes, caching the full matrices is not technically possible.
Matrix Multiplication Decomposition
It is possible to decompose matrix multiplication mm
into smaller matrix multiplications.
$$
\mathbf{A} =
\begin{bmatrix}
\mathbf{A}_{1,1}^{d \times d} & \mathbf{A}_{1,2}^{d \times d} & \cdots & \mathbf{A}_{1,n/d}^{d \times d} \\
\mathbf{A}_{2,1}^{d \times d} & \mathbf{A}_{2,2}^{d \times d} & \cdots & \mathbf{A}_{2,n/d}^{d \times d} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{A}_{m/d,1}^{d \times d} & \mathbf{A}_{m/d,2}^{d \times d} & \cdots & \mathbf{A}_{m/d,n/d}^{d \times d} \\
\end{bmatrix}
$$
$$
\mathbf{B} =
\begin{bmatrix}
\mathbf{B}_{1,1}^{d \times d} & \mathbf{B}_{1,2}^{d \times d} & \cdots & \mathbf{B}_{1,p/d}^{d \times d} \\
\mathbf{B}_{2,1}^{d \times d} & \mathbf{B}_{2,2}^{d \times d} & \cdots & \mathbf{B}_{2,p/d}^{d \times d} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{B}_{n/d,1}^{d \times d} & \mathbf{B}_{n/d,2}^{d \times d} & \cdots & \mathbf{B}_{n/d,p/d}^{d \times d} \\
\end{bmatrix}
$$
$$
\mathbf{C} =
\begin{bmatrix}
\mathbf{C}_{1,1}^{d \times d} & \mathbf{C}_{1,2}^{d \times d} & \cdots & \mathbf{C}_{1,p/d}^{d \times d} \\
\mathbf{C}_{2,1}^{d \times d} & \mathbf{C}_{2,2}^{d \times d} & \cdots & \mathbf{C}_{2,p/d}^{d \times d} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{C}_{m/d,1}^{d \times d} & \mathbf{C}_{m/d,2}^{d \times d} & \cdots & \mathbf{C}_{m/d,p/d}^{d \times d} \\
\end{bmatrix}
$$
$$
\mathbf{C}_{i,j}^{d \times d} = \sum_{k=1}^{n/d} \mathbf{A}_{i,k}^{d \times d} \mathbf{B}_{k,j}^{d \times d}
$$
The decomposition does not alter the number of operations $N_{\text{op}}$.
$$
\begin{align}
N_{\text{op}} &= 2d^3 \left( \frac{n}{d} \right) \left( \frac{m}{d} \frac{p}{d}\right) \\
&= 2mnp \\
\end{align}
$$
Because small matrices $\mathbf{A}_{i,k}^{d \times d}$ and $\mathbf{B}_{k,j}^{d \times d}$ could be cached, the memory IO bytes could be reduced, and the overall matrix multiplication could become more math bound. Let’s calculate how much memory IO bytes is needed in this case.
$$
\begin{align}
N_{\text{byte}} &= 2d^2 \times 4 \times \left( \frac{n}{d} \right) \left( \frac{m}{d} \frac{p}{d}\right) \\
&= \frac{8mnp}{d} \\
\end{align}
$$
Therefore,
$$
\begin{align}
\frac{N_{\text{op}}}{N_{\text{byte}}}
&= \frac{2mnp}{\frac{8mnp}{d}} \\
&= \frac{d}{4} \\
\end{align}
$$
Notice that when $d=1$, the matrix multiplication falls back to the naive matrix multiplication. When $d$ becomes larger, the implementation becomes more mathbound.
Optimized Implementation
The following implementation decomposed the matrix multiplication into multiple small matrix multiplications. The source code could be found on GitHub.
1 

Run Optimized Example
In the same Docker container, build and run the following application. We could see that the latency of INT32 and FP32 matrix multiplication got improved too different degrees.
1  $ nvcc mm_optimization.cu o mm_optimization std=c++14 
Miscellaneous
There are more subtle factors affecting the performance and there are more optimization opportunities to further optimize the matrix multiplication implementation. But those requires more thorough understanding of GPU and CUDA.
References
CUDA Matrix Multiplication