CUTLASS is a header-only library that consists of a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA.
In this blog post, we will build CUTLASS and CuTe CUDA kernels using CMake in a CUDA Docker container.
CUDA Docker Container
When creating a CUDA Docker container for CUTLASS kernel development, we face a choice: either git clone the CUTLASS header-only library inside the Docker container, or make the CUTLASS header-only library part of the CUDA kernel source code.
At first, I cloned the CUTLASS header-only library inside the Docker container. However, it became cumbersome when I tried to inspect the header-only library implementation from the Docker container. If the Docker container is a VS Code Dev Container, I can still browse the CUTLASS implementation from it, but that setup is unfriendly when I want to modify and contribute to the CUTLASS header-only library. Therefore, I decided to treat the CUTLASS header-only library as part of the CUDA kernel source code.
Build Docker Image
The following CUDA Dockerfile will be used for CUTLASS kernel development. It can also be found in my CUTLASS Examples GitHub repository.
To show that the CUTLASS we installed works inside the Docker container, we will build and run two CUTLASS C++ examples copied from the CUTLASS GitHub repository without any modification.
CUTLASS is header-only. Each CUTLASS build target needs two header directories on its include path: cutlass/include and cutlass/tools/util/include.
CMakeLists.txt
cmake_minimum_required(VERSION 3.28)
project(CUTLASS-Examples VERSION 0.0.1 LANGUAGES CXX CUDA)
# Find CUDA Toolkit
find_package(CUDAToolkit REQUIRED)

# Set CUTLASS include directories
find_path(CUTLASS_INCLUDE_DIR cutlass/cutlass.h HINTS cutlass/include)
find_path(CUTLASS_UTILS_INCLUDE_DIR cutlass/util/host_tensor.h HINTS cutlass/tools/util/include)

add_subdirectory(examples)
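If the cutlass directory is missing, find_path does not stop the configuration; it sets the variable to a value ending in -NOTFOUND, and the problem only surfaces later as a missing header at compile time. A minimal sketch of an early check, meant to sit before add_subdirectory(examples), is shown below. The assumption that cutlass lives next to the top-level CMakeLists.txt (for example as a git submodule) and the wording of the error message are mine, not part of the original project.

```cmake
# Hypothetical guard, placed after the two find_path calls.
# A variable whose value ends in -NOTFOUND evaluates to false in if().
if(NOT CUTLASS_INCLUDE_DIR OR NOT CUTLASS_UTILS_INCLUDE_DIR)
    message(FATAL_ERROR
        "CUTLASS headers not found. Expected cutlass/include and "
        "cutlass/tools/util/include under ${CMAKE_CURRENT_SOURCE_DIR}.")
endif()
```

Since the project already requires CMake 3.28, adding the REQUIRED keyword to each find_path call should have the same effect with less code.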
For each build target, the experimental NVCC flag --expt-relaxed-constexpr is needed so that device code can call constexpr functions that are defined in host code.
CMakeLists.txt
cmake_minimum_required(VERSION 3.28)
project(CUTLASS-GEMM-API-V3 VERSION 0.0.1 LANGUAGES CXX CUDA)

# Set the CUDA architecture to compile the code for
# https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html
add_executable(${PROJECT_NAME} main.cu)
target_include_directories(${PROJECT_NAME} PRIVATE ${CUTLASS_INCLUDE_DIR} ${CUTLASS_UTILS_INCLUDE_DIR})
set_target_properties(${PROJECT_NAME} PROPERTIES CUDA_ARCHITECTURES native)
target_compile_options(${PROJECT_NAME} PRIVATE --expt-relaxed-constexpr)
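The target above contains only main.cu, so passing --expt-relaxed-constexpr unconditionally is fine. If a target ever mixes .cu and .cpp sources (a hypothetical case, not one of the examples here), the host C++ compiler would also receive the flag and would likely reject it. One possible variant, sketched below, uses a generator expression to restrict the flag to CUDA compilation.

```cmake
# Variant of the target_compile_options call: apply the flag only to
# sources compiled as CUDA, so a host C++ compiler never sees it.
target_compile_options(${PROJECT_NAME} PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)
```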
Build Examples
To build the CUTLASS examples using CMake, please run the following command.