NVIDIA Docker CUDA Compatibility
Introduction
NVIDIA NGC CUDA Docker containers are extremely convenient tools for developing and deploying CUDA applications. They allow us to run almost any version of the CUDA Runtime library and CUDA applications inside a Docker container on any platform that has Docker installed, making the code portable and reproducible. Because of this, I have been using NVIDIA NGC CUDA Docker containers for all my CUDA development and deployment work. On my personal computers, I only install the NVIDIA CUDA Driver and Docker. When I need the CUDA Runtime library, I simply pull an NVIDIA NGC CUDA Docker container and run it; I have never installed the CUDA Runtime library directly on my personal computers.
However, recently I encountered some weird issues when using NVIDIA NGC CUDA Docker containers. After some investigation, I found that the issues were caused by the incompatibility between the CUDA Runtime library version inside the Docker container and the CUDA Driver version on the host. In this blog, I will share my experience and explain why we should try to use the same CUDA Driver version on the host as the CUDA Runtime library version inside the Docker container.
Weird Issues Caused by NVIDIA Docker Compatibility
Recently, I was implementing some CUDA kernels on Ubuntu using NVIDIA NGC CUDA Docker containers, and I encountered some issues which I could not explain.
For example, for some for loops, if I used #pragma unroll to unroll the loop body, the CUDA kernel would not produce correct results on one machine that has a GV100 GPU installed. However, the same code using #pragma unroll worked fine on another machine that has an RTX 3090 GPU installed. Both machines ran the same version of Ubuntu and used the same NVIDIA NGC CUDA Docker container, nvcr.io/nvidia/cuda:12.0.1-devel-ubuntu22.04.
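To make the symptom concrete, below is a minimal sketch of the kind of kernel I mean. The kernel and its names are purely illustrative assumptions, not the original code from my project; it just accumulates values in a small fixed-trip-count loop whose body is unrolled with #pragma unroll.

__global__ void accumulate_kernel(float const* input, float* output, size_t n)
{
    size_t const idx{blockIdx.x * blockDim.x + threadIdx.x};
    if (idx < n)
    {
        float sum{0.0f};
        // Hypothetical example: a small fixed-trip-count loop unrolled with
        // #pragma unroll. On the problematic machine, kernels containing such
        // unrolled loops produced incorrect results; removing the pragma made
        // them correct again.
#pragma unroll
        for (size_t i{0}; i < 4U; ++i)
        {
            sum += input[idx * 4U + i];
        }
        output[idx] = sum;
    }
}

Each thread here reads four consecutive elements and writes their sum, so input is assumed to hold 4 * n elements.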
Moreover, the following two if statements are completely equivalent. However, the CUDA kernel using the first if statement produced correct results on both machines, while the CUDA kernel using the second if statement produced incorrect results on the machine with the GV100 GPU installed.
size_t m, n, C_row_idx, C_col_idx, i;
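Only the declarations of the original listing survive here, so below is a hedged reconstruction of the kind of equivalence I am describing, reusing the variables declared above; the exact expressions are assumptions for illustration, not the original code. The first form performs a single combined bounds check, and the second performs the same check as two nested if statements.

// First form: one combined bounds check (illustrative reconstruction).
if (C_row_idx < m && C_col_idx < n)
{
    // ... compute and write the output element at (C_row_idx, C_col_idx) ...
}

// Second form: logically equivalent nested bounds checks.
if (C_row_idx < m)
{
    if (C_col_idx < n)
    {
        // ... compute and write the output element at (C_row_idx, C_col_idx) ...
    }
}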
These issues seemed to indicate that the CUDA compiler has severe problems. However, I did not believe the CUDA compiler would have such naive bugs.
After some investigation, I found that even though the two machines ran the same version of Ubuntu and used the same NVIDIA NGC CUDA Docker container, the CUDA Driver versions on the hosts were different. The machine with the GV100 GPU installed had Driver Version: 470.223.02, CUDA Version: 11.4 on the host, while the machine with the RTX 3090 GPU installed had Driver Version: 525.147.05, CUDA Version: 12.0 on the host. So, on the machine with the GV100 GPU installed, we were running the NVIDIA NGC CUDA Docker container nvcr.io/nvidia/cuda:12.0.1-devel-ubuntu22.04, which has the CUDA Runtime library of version 12.0.1 installed, on top of a CUDA Driver of version 11.4. This is the reason why some of the CUDA kernels behaved incorrectly on the machine with the GV100 GPU installed.
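One way to catch this kind of mismatch from inside the container is to query both versions programmatically with the CUDA Runtime API. The following is a minimal sketch: cudaDriverGetVersion reports the CUDA version supported by the host driver, and cudaRuntimeGetVersion reports the CUDA Runtime library version linked into the program.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driver_version{0};
    int runtime_version{0};
    // CUDA version supported by the host driver, encoded as
    // 1000 * major + 10 * minor, e.g. 11040 for CUDA 11.4.
    cudaDriverGetVersion(&driver_version);
    // CUDA Runtime library version inside the container, e.g. 12000 for CUDA 12.0.
    cudaRuntimeGetVersion(&runtime_version);
    std::printf("Driver supports CUDA %d.%d, Runtime is CUDA %d.%d\n",
                driver_version / 1000, (driver_version % 1000) / 10,
                runtime_version / 1000, (runtime_version % 1000) / 10);
    // If the Runtime version is newer than the driver-supported version, the
    // setup relies on forward compatibility rather than on matched versions.
    return 0;
}

Compiling this with nvcc inside the nvcr.io/nvidia/cuda:12.0.1-devel-ubuntu22.04 container on the GV100 machine would have reported a driver-supported CUDA version of 11.4 against a Runtime version of 12.0, which is exactly the mismatch described above.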
Therefore, I used the NVIDIA NGC CUDA Docker container nvcr.io/nvidia/cuda:11.4.3-devel-ubuntu20.04 on the machine with the GV100 GPU installed instead. This time, the CUDA kernels that had issues previously all worked perfectly fine.
Conclusions
The CUDA backward and forward compatibilities are great features that allow us to run almost any CUDA Runtime library and CUDA applications inside a Docker container. However, that does not mean we can assume an NVIDIA NGC CUDA Docker container will always work correctly without carefully checking the CUDA Driver version on the host. We should try to use a CUDA Driver version on the host that matches the CUDA Runtime library version inside the Docker container. Otherwise, we might encounter some weird issues that are hard to explain.