PyTorch Distributed Training
Introduction
PyTorch has a relatively simple interface for distributed training: the model just has to be wrapped with DistributedDataParallel, and the training script just has to be launched with torch.distributed.launch. Although PyTorch offers a series of tutorials on distributed training, I found them either insufficient or overwhelming for beginners who want to do state-of-the-art PyTorch distributed training. Some key details were missing, and the usage of Docker containers in distributed training was not mentioned at all.
In this blog post, I would like to present a simple implementation of PyTorch distributed training on CIFAR-10 classification using DistributedDataParallel-wrapped ResNet models. The usage of Docker containers for distributed training and how to start distributed training with torch.distributed.launch will also be covered.
Examples
Source Code
The entire training script consists of about a hundred lines of code, and most of it should be easy to understand.
import torch
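For orientation, below is a minimal sketch of the core pieces of such a DistributedDataParallel training script. The hyperparameters, the checkpoint path, and the helper names are illustrative assumptions rather than the full script itself, and checkpoint resuming via --resume is omitted for brevity.

import argparse
import random

import numpy as np
import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision


def set_random_seeds(seed=0):
    # Identical seeds keep model initialization and data augmentation reproducible.
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each spawned process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    set_random_seeds(0)
    # NCCL is the recommended backend for multi-GPU training; the master address,
    # port, world size, and rank are read from the environment variables that
    # torch.distributed.launch sets.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda:{}".format(args.local_rank))

    model = torchvision.models.resnet18(num_classes=10).to(device)
    ddp_model = nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank)

    train_set = torchvision.datasets.CIFAR10(
        root="data", train=True, download=False,
        transform=torchvision.transforms.ToTensor())
    # DistributedSampler gives each process a distinct shard of the training data.
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=128, sampler=train_sampler, num_workers=4)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(100):
        # Reshuffle differently each epoch while keeping the shards disjoint.
        train_sampler.set_epoch(epoch)
        ddp_model.train()
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(ddp_model(inputs), labels)
            loss.backward()
            optimizer.step()
        # Only one process per node needs to write the checkpoint to local disk.
        if args.local_rank == 0:
            torch.save(ddp_model.state_dict(), "saved_models/resnet_ddp.pth")


if __name__ == "__main__":
    main()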
Caveats
The caveats are as follows:
- Use --local_rank for argparse if we are going to use torch.distributed.launch to launch distributed training.
- Set the random seed to make sure that the models initialized in different processes are the same. (Update on 3/19/2021: PyTorch DistributedDataParallel now makes sure the model's initial states are the same across different processes, so the purpose of setting the random seed becomes making the distributed training reproducible.)
- Use DistributedDataParallel to wrap the model for distributed training.
- Use DistributedSampler for the training data loader.
- To save models, each node saves a copy of the checkpoint file to its local hard drive.
- Downloading the dataset and creating directories should be avoided in the distributed training program, as they are not multi-process safe, unless we use some sort of barrier, such as torch.distributed.barrier (see the sketch right after this list).
- The node communication bandwidth is extremely important for multi-node distributed training. Instead of randomly picking two computers on the network, try to use nodes from a specialized computing cluster, since the communication between those nodes is highly optimized.
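As an illustration of the barrier caveat, here is a minimal sketch in which only rank 0 downloads the dataset while the other processes wait. download_cifar10_once and data_root are hypothetical names, and the sketch assumes the process group has already been initialized.

import torch.distributed as dist
import torchvision


def download_cifar10_once(data_root="data"):
    # Only the process with global rank 0 performs the (non-multi-process-safe) download.
    if dist.get_rank() == 0:
        torchvision.datasets.CIFAR10(root=data_root, download=True)
    # Every process must reach the barrier before any process proceeds.
    dist.barrier()
    # Now it is safe for all ranks to read the downloaded files.
    return torchvision.datasets.CIFAR10(root=data_root, download=False)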
Launching Distributed Training
In this particular experiment, I tested the program using two nodes. Each node has 8 GPUs, and each GPU launches one process. We also used Docker containers to make sure that the environments are exactly the same and reproducible.
Docker Container
To start the Docker container, we have to copy the training script to each node in the distributed system and run the following command in the terminal of each node.
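As a sketch, a command along these lines should work, assuming the training script sits in the current directory and is mounted into the container at /mnt (--gpus all also assumes the NVIDIA Container Toolkit is installed on the host):

$ docker run -it --rm --gpus all --network host -v $(pwd):/mnt -w /mnt pytorch:1.5.0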
Here, pytorch:1.5.0 is a Docker image that has PyTorch 1.5.0 installed (we could also use NVIDIA's PyTorch NGC image), and --network host makes sure that the distributed network communication between nodes is not blocked by Docker containerization.
Preparations
Download the dataset on each node before starting distributed training.
$ mkdir -p data
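One way to fetch CIFAR-10 into the data/ directory ahead of time, assuming the training script reads it via torchvision, is:

$ python -c "import torchvision; torchvision.datasets.CIFAR10(root='data', download=True)"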
Also create the directory for saving models on each node before starting distributed training.
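For example, assuming the script writes checkpoints to a saved_models/ directory:

$ mkdir -p saved_models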
Training From Scratch
In the Docker terminal of the first node, we run the following command.
$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="192.168.0.1" --master_port=1234 resnet_ddp.py
Here, 192.168.0.1 is the IP address of the first node.
In the Docker terminal of the second node, we run the following command.
$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="192.168.0.1" --master_port=1234 resnet_ddp.py
Note that the only difference between the two commands is --node_rank.
There will always be some delay between executing the two commands on the two nodes, but don't worry: the first node will wait for the second node, and they will start training together.
The following messages are expected from the two nodes.
*****************************************
Sometimes, even if the hosts have NCCL installed, distributed training can freeze when NCCL communication has problems. To troubleshoot, run distributed training on a single node to see whether training works there. For example, if a host has 8 GPUs, we could run two Docker containers on that host, each using 4 GPUs for training.
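As a sketch of that single-host test, assuming the two containers share the host network and split the GPUs via CUDA_VISIBLE_DEVICES:

# In the first container, using GPUs 0-3:
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=1234 resnet_ddp.py

# In the second container, using GPUs 4-7:
$ CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=1234 resnet_ddp.py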
Resume Training From Checkpoint
Sometimes, training gets interrupted for some reason. To resume training, we added --resume as an argument to the program.
In the Docker terminal of the first node, we run the following command.
$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="192.168.0.1" --master_port=1234 resnet_ddp.py --resume
In the Docker terminal of the second node, we run the following command.
$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="192.168.0.1" --master_port=1234 resnet_ddp.py --resume
Kill Distributed Training
I have talked about how to kill PyTorch distributed training in "Kill PyTorch Distributed Training Processes", so I am not going to elaborate on it here.
Miscellaneous
If the evaluation dataset is also very large, please consider using PyTorch distributed evaluation.
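For reference, a minimal sketch of distributed evaluation, assuming the evaluation loader also uses a DistributedSampler so that each process scores its own shard; distributed_accuracy is a hypothetical helper, and note that DistributedSampler may pad the dataset, so the counts can be slightly off.

import torch
import torch.distributed as dist


def distributed_accuracy(ddp_model, eval_loader, device):
    # Each process evaluates only its own shard of the evaluation set.
    ddp_model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    with torch.no_grad():
        for inputs, labels in eval_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            predictions = ddp_model(inputs).argmax(dim=1)
            correct += (predictions == labels).sum()
            total += labels.size(0)
    # Sum the per-shard counts across processes so every rank gets the global accuracy.
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return (correct / total).item()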
References
PyTorch Distributed Training