Build and Develop PyTorch
Introduction
If we would like to contribute code to PyTorch, we have to build the PyTorch main branch from source, apply our changes, and pass all the unit tests. Although we can rely on the PyTorch CI on GitHub once we create a pull request, the test suite that CI runs is large and comprehensive, so it takes a very long time to finish. Building, developing, and testing locally is therefore preferable.
In this blog post, I would like to discuss how to build, develop, and contribute code to PyTorch using a Docker container.
PyTorch Development Docker Container
Create Dockerfile
The major component of the Docker container for PyTorch development is CMake.
FROM nvcr.io/nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04
Build Docker Image
We could build the Docker image using different CMake versions.
$ CMAKE_VERSION=3.25.1
Build PyTorch
Clone PyTorch Repositories
The PyTorch repository has to be cloned from GitHub. The torchvision and torchaudio libraries are optional, but they are sometimes necessary for unit testing.
$ git clone --recursive https://github.com/pytorch/pytorch.git
Run Docker Container
Mount the working directory that contains the PyTorch repository into the Docker container.
$ docker run -it --rm --gpus all -v $(pwd):/mnt torch-build:0.0.1
Build PyTorch for Development
Build PyTorch from source in development mode.
$ cd /mnt/pytorch
With python setup.py develop, when we change the PyTorch files, we don’t have to rebuild the entire PyTorch library.
When MAX_JOBS is not set or is too large, we are likely to encounter the following error during the PyTorch build. The error message is confusing and misleading.
internal compiler error: Segmentation fault
If the build fails because of this, we can kill the build and re-run the build command. The files that have already been built are cached, so we will not have to start from the beginning again.
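Because a re-run picks up where the failed build stopped, the kill-and-retry step can be wrapped in a small helper. The following is a minimal sketch and not part of PyTorch: `retry_build` is a hypothetical function name, and the attempt count and `MAX_JOBS=8` in the usage comment are example values only.

```shell
# retry_build ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND until it succeeds or ATTEMPTS runs have failed.
retry_build() {
    rb_attempts=$1
    shift
    rb_try=1
    while [ "$rb_try" -le "$rb_attempts" ]; do
        echo "build attempt $rb_try of $rb_attempts" >&2
        # Stop as soon as one attempt succeeds.
        if "$@"; then
            return 0
        fi
        rb_try=$((rb_try + 1))
    done
    # Every attempt failed.
    return 1
}

# Example usage inside the container (values are illustrative):
#   cd /mnt/pytorch
#   retry_build 3 env MAX_JOBS=8 python setup.py develop
```

Retrying only helps with the intermittent failure described here; a genuine compile error will fail on every attempt, which is why the number of attempts is bounded.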
Although I have not investigated this, I suspect that it is a compiler implementation bug. When the compilation runs in multiple threads, memory is allocated in each thread, and some of those allocations may fail and return null pointers. If those null pointers are then used without being checked for validity, a segmentation fault occurs.
In my case, I have an Intel Core i9-9900K and 32 GB of memory. This error is very likely to happen when MAX_JOBS is large, say 16, and I am doing other things on the computer, such as watching videos, while building PyTorch. The chance of encountering this error is reduced if MAX_JOBS is small or if a computer with much more memory, say 128 GB, is used.
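One way to pick a conservative MAX_JOBS is to cap it by both the CPU count and the memory available. The sketch below assumes a fixed memory budget per compile job. The function name `suggest_max_jobs` and the 4 GB-per-job figure are my assumptions for illustration, not numbers from PyTorch, so they should be tuned for the machine at hand.

```shell
# suggest_max_jobs CPUS MEM_GB GB_PER_JOB
# Hypothetical helper: print the smaller of the CPU count and the number of
# jobs that fit in memory, never less than 1.
suggest_max_jobs() {
    smj_cpus=$1
    # Integer division: how many jobs fit in the memory budget.
    smj_by_mem=$(( $2 / $3 ))
    smj_jobs=$smj_cpus
    if [ "$smj_by_mem" -lt "$smj_jobs" ]; then
        smj_jobs=$smj_by_mem
    fi
    if [ "$smj_jobs" -lt 1 ]; then
        smj_jobs=1
    fi
    echo "$smj_jobs"
}

# 16 hardware threads, 32 GB of memory, assumed 4 GB per job
suggest_max_jobs 16 32 4
```

For the 16-thread, 32 GB machine above this prints 8. The result can then be passed to the build through the MAX_JOBS environment variable.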
Run Unit Test
Install Dependencies
Some unit tests require torchvision and torchaudio, so we will build torchvision and torchaudio if necessary.
$ cd /mnt/vision
$ cd /mnt/audio
There might be some other Python dependencies required for the unit tests, but they should be easy to install via pip.
Run Unit Test for Development
To run unit tests, simply go to pytorch/test and run the selected Python unit test files. Sometimes, if a unit test file is too large, we could either comment out some of the unit tests or use pytest to run unit tests more selectively.
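As a sketch of the selective approach, pytest's `-k` option runs only the tests whose names match an expression. The wrapper below is hypothetical (`run_selected_tests` and the `PYTEST` override are names made up for this example), and `test_nn.py` with the keyword `conv` is only an illustrative choice.

```shell
# run_selected_tests TEST_FILE KEYWORD
# Hypothetical wrapper: run only the tests in TEST_FILE whose names match KEYWORD.
# PYTEST can be overridden; it defaults to invoking pytest through Python.
run_selected_tests() {
    # $PYTEST is intentionally unquoted so "python -m pytest" splits into words.
    ${PYTEST:-python -m pytest} "$1" -k "$2" -v
}

# Example usage from /mnt/pytorch/test (file and keyword are illustrative):
#   run_selected_tests test_nn.py conv
```

Compared with commenting tests out, this leaves the test file untouched, so nothing has to be reverted before committing.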
Run Code Formatting
PyTorch uses lintrunner for code formatting. Follow the official PyTorch lintrunner instructions to install and set it up. Before submitting the code to PyTorch in a pull request, run lintrunner using
$ cd /mnt/pytorch