Build and Develop CUTLASS CUDA Kernels
Introduction
CUTLASS is a header-only library that consists of a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA.
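Before diving into the build, it helps to recall what a GEMM computes: D = alpha * A * B + beta * C. The following naive CPU reference is an illustration of that math only, not CUTLASS code.

```cpp
#include <cstddef>
#include <vector>

// Naive reference GEMM: D = alpha * A * B + beta * C,
// with row-major M x K (A), K x N (B), and M x N (C, D) matrices.
void reference_gemm(std::size_t M, std::size_t N, std::size_t K, float alpha,
                    const std::vector<float>& A, const std::vector<float>& B,
                    float beta, const std::vector<float>& C,
                    std::vector<float>& D) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            D[m * N + n] = alpha * acc + beta * C[m * N + n];
        }
    }
}
```

CUTLASS implements this same computation with tiled, pipelined CUDA kernels tuned for each GPU architecture.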
In this blog post, we will build CUTLASS and CuTe CUDA kernels using CMake in a CUDA Docker container.
CUDA Docker Container
When creating a CUDA Docker container for CUTLASS kernel development, we face a choice: either git clone the CUTLASS header-only library inside the Docker container, or make the CUTLASS header-only library part of the CUDA kernel source code.
In the beginning, I cloned the CUTLASS header-only library inside the Docker container. However, this made it inconvenient to inspect the header-only library implementation from inside the container. Even though inspection remains possible if the container is a VS Code Dev Container, this setup becomes unfriendly once I want to modify and contribute to the CUTLASS header-only library. Therefore, I decided to treat the CUTLASS header-only library as part of the CUDA kernel source code.
Build Docker Image
The following CUDA Dockerfile will be used for CUTLASS kernel development. It can also be found in my CUTLASS Examples GitHub repository.
```dockerfile
FROM nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
```
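A fuller Dockerfile for this purpose typically also installs the build tools; the package list below is an assumption for illustration, not the repository's actual file.

```dockerfile
FROM nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04

# Assumed build dependencies for CMake-based CUTLASS kernel development.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git && \
    rm -rf /var/lib/apt/lists/*
```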
To build the CUTLASS Docker image locally, please run the following command.
```shell
$ docker build -f docker/cuda.Dockerfile --no-cache --tag cuda:12.4.1 .
```
Run Docker Container
To run the CUTLASS Docker container, please run the following command.
```shell
$ docker run -it --rm --gpus device=0 -v $(pwd):/mnt -w /mnt cuda:12.4.1
```
CUTLASS Examples
To verify that CUTLASS works inside the Docker container, we will build and run two CUTLASS C++ examples copied, without any modification, from the CUTLASS GitHub repository.
CUTLASS is header-only. Each CUTLASS build target must include two key header directories: cutlass/include and cutlass/tools/util/include.
```cmake
cmake_minimum_required(VERSION 3.28)
```
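A sketch of how a CMakeLists.txt might wire in the two include directories; the project, target, and path names below are illustrative assumptions, not the repository's actual file.

```cmake
cmake_minimum_required(VERSION 3.28)
project(cutlass-examples LANGUAGES CXX CUDA)

# CUTLASS is vendored as part of the kernel source tree (path assumed).
set(CUTLASS_DIR ${CMAKE_SOURCE_DIR}/cutlass)

add_executable(CUTLASS-GEMM-API-V2 examples/gemm_api_v2/main.cu)

# The two key CUTLASS header directories for each build target.
target_include_directories(CUTLASS-GEMM-API-V2 PRIVATE
    ${CUTLASS_DIR}/include
    ${CUTLASS_DIR}/tools/util/include)
```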
For each build target, the experimental NVCC flag --expt-relaxed-constexpr is needed so that constexpr functions and variables defined in host code can be used in device code.
```cmake
cmake_minimum_required(VERSION 3.28)
```
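One way to pass this flag per target in CMake is via a generator expression so it only applies when compiling CUDA sources; the target name here is an assumption for illustration.

```cmake
# Apply the NVCC-specific flag only to CUDA translation units.
target_compile_options(CUTLASS-GEMM-API-V2 PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)
```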
Build Examples
To build the CUTLASS examples using CMake, please run the following command.
```shell
$ cmake -B build
$ cmake --build build --parallel
```
Run Examples
To run the CUTLASS examples, please run the following commands.
```shell
$ ./build/examples/gemm_api_v2/CUTLASS-GEMM-API-V2
$ ./build/examples/gemm_api_v3/CUTLASS-GEMM-API-V3
```
References
Build and Develop CUTLASS CUDA Kernels
https://leimao.github.io/blog/Build-Develop-CUTLASS-CUDA-Kernels/