Kill PyTorch Distributed Training Processes

Introduction

PyTorch DistributedDataParallel is a convenient wrapper for distributed data parallel training. It is also compatible with distributed model parallel training. The major difference between PyTorch DistributedDataParallel and PyTorch DataParallel is that DistributedDataParallel uses multiple processes while DataParallel uses a single process with multiple threads. Usually, with DistributedDataParallel, each process on each node drives one GPU, whereas with DataParallel, each thread in the single process drives one GPU.
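
Each process launched by torch.distributed.launch typically initializes a process group and wraps its own model replica. The following is a minimal sketch rather than a complete training script; the linear model is only a placeholder, and it assumes the launcher passes --local_rank to each process, which is the default behavior of torch.distributed.launch:

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# torch.distributed.launch passes --local_rank to every process it spawns.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Each process joins the process group and drives exactly one GPU.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

# Placeholder model; a real script would build its own network here.
model = torch.nn.Linear(10, 10).cuda(args.local_rank)
ddp_model = DistributedDataParallel(model, device_ids=[args.local_rank])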

To start PyTorch multi-node distributed training, we usually have to run the python -m torch.distributed.launch command on each node. For example, suppose we want to start a two-node distributed training job whose master node has address 192.168.1.1 and uses port 1234.

On node one, we run the following command:

$ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
      --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
      --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
      and all other arguments of your training script)

On node two, we run the following command:

$ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
      --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
      --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
      and all other arguments of your training script)

Zombie Processes

Sometimes, we kill the multi-process program using Ctrl + C. When we check the GPU usage using nvidia-smi after the kill, we usually see that the GPU utilization on the node has dropped to zero. However, when we restart the program, we sometimes see the following error message:

Traceback (most recent call last):
  File "ddp.py", line 103, in <module>
    torch.distributed.init_process_group(backend='nccl')
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

This means that the address and the port are occupied, so we cannot start the distributed training using the previous address and port. Why does this happen? When we hit Ctrl + C, only one process is killed and the remaining processes on the node are not, so they still occupy the address and port. We can confirm this by running top.

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1750 root      20   0 39.210g 2.464g 538372 R 100.3  0.5   8:23.22 python
1751 root      20   0 38.835g 2.407g 511792 R 100.3  0.5   8:14.84 python
1753 root      20   0 39.013g 2.515g 519984 R 100.3  0.5   8:14.99 python
1755 root      20   0 39.029g 2.507g 511508 R 100.3  0.5   8:14.73 python
1752 root      20   0 38.837g 2.409g 511724 R 100.0  0.5   8:14.91 python
1754 root      20   0 38.838g 2.409g 510980 R 100.0  0.5   8:14.91 python
1756 root      20   0 38.845g 2.418g 512096 R 100.0  0.5   8:14.71 python
1757 root      20   0 38.851g 2.423g 511780 R 100.0  0.5   8:14.90 python

In this case, we have 8 GPUs on one node and thus 8 processes after the program starts. After hitting Ctrl + C, one process is killed and 7 processes are left running.
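
These leftover processes are exactly what keep the master address and port bound. To make the error message less mysterious, here is a minimal sketch using only Python's standard socket module (no PyTorch involved) that reproduces the same operating-system error; the address and port are just the example values from above:

import socket

# The first socket plays the role of a leftover process that still holds
# the master port.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 1234))
s1.listen()

# Binding a second socket to the same address and port fails with an
# "Address already in use" OSError (errno 98 on Linux), the same OS error
# that TCPStore surfaces when the training job is restarted.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", 1234))
except OSError as e:
    print(e)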

In order to release these resources and free the address and port, we could write down the PIDs of these processes and kill each of them by hand. However, this looks clumsy. Is there a smarter way to do it? To capture the PIDs automatically, we use ps instead of top.

$ kill $(ps aux | grep YOUR_TRAINING_SCRIPT.py | grep -v grep | awk '{print $2}')

Here $(ps aux | grep YOUR_TRAINING_SCRIPT.py | grep -v grep | awk '{print $2}') returns the PIDs of all the processes running YOUR_TRAINING_SCRIPT.py: grep -v grep excludes the grep process itself from the list, and awk '{print $2}' extracts the PID column of the ps output.

$ ps aux | grep ddp.py | grep -v grep | awk '{print $2}'
1750
1751
1752
1753
1754
1755
1756
1757
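
For completeness, the same cleanup can also be done from Python instead of the shell. This is only a rough sketch of the one-liner above; ddp.py is just the example script name, and the processes are terminated with SIGTERM, which is also what kill sends by default:

import os
import signal
import subprocess

# The training script name from the example above; adjust as needed.
script_name = "ddp.py"

# List all processes, keep the lines that mention the training script, and
# send SIGTERM to the PID in the second column of the ps output.
output = subprocess.check_output(["ps", "aux"]).decode()
for line in output.splitlines()[1:]:
    if script_name in line:
        pid = int(line.split()[1])
        if pid != os.getpid():  # never kill this helper process itself
            os.kill(pid, signal.SIGTERM)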

Author: Lei Mao

Posted on: 04-25-2020

Updated on: 04-25-2020
