# PyTorch Distributed Evaluation

## Introduction

Data is arguably one of the most important ingredients of deep learning. Nowadays, in many applications, not only the training data but also the evaluation data keeps growing rapidly. In my previous post “PyTorch Distributed Training”, I discussed how to run PyTorch distributed training to accelerate model training, but in some cases model evaluation needs to be accelerated by distributed computing as well.

In this blog post, I would like to discuss how to use PyTorch and TorchMetrics to run PyTorch distributed evaluation. Specifically, I will evaluate the pre-trained ResNet-18 model from TorchVision models on a subset of the ImageNet evaluation dataset.

## Evaluation Dataset Preparation

Instead of using the full ImageNet dataset, we will use a smaller subset of the ImageNet dataset, ImageNet-1K, for evaluation. The dataset is roughly 260 MB and can be downloaded from MIT Han Lab.

## Docker Container

To make all the experiments reproducible, we used the NVIDIA NGC PyTorch Docker image.

In addition, please install TorchMetrics 0.7.1 inside the Docker container.

## Single-Node Single-GPU Evaluation

We created an implementation of single-node single-GPU evaluation, evaluated the pre-trained ResNet-18, and used the resulting evaluation accuracy as the reference. The implementation was derived from the official PyTorch ImageNet example and should be easy for most PyTorch users to understand.
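For readers who want a concrete picture of the reference evaluation, the following is a minimal sketch of a top-1 accuracy loop on a single device. It is not the implementation used for the numbers in this post: the function name `evaluate_top1` is hypothetical, and the model and data loader are assumed to be provided by the caller (for example, a pre-trained ResNet-18 and an ImageNet-style `DataLoader`).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate_top1(model: nn.Module, data_loader: DataLoader,
                  device: torch.device) -> float:
    """Return the top-1 accuracy, in [0, 1], of `model` over `data_loader`."""
    model.to(device)
    model.eval()  # Disable dropout and use running batch-norm statistics.
    num_correct = 0
    num_samples = 0
    for images, labels in data_loader:
        images = images.to(device)
        labels = labels.to(device)
        logits = model(images)
        predictions = logits.argmax(dim=1)
        num_correct += (predictions == labels).sum().item()
        num_samples += labels.size(0)
    return num_correct / num_samples


if __name__ == "__main__":
    # Stand-in random data and a tiny linear model so the sketch runs anywhere.
    # These are assumptions for illustration, not the ResNet-18 setup.
    torch.manual_seed(0)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    features = torch.randn(32, 8)
    labels = torch.randint(0, 4, (32,))
    loader = DataLoader(TensorDataset(features, labels), batch_size=8)
    accuracy = evaluate_top1(nn.Linear(8, 4), loader, device)
    print(f"Top-1 accuracy: {accuracy * 100:.3f}%")
```

Counting correct predictions and samples separately, rather than averaging per-batch accuracies, keeps the result exact even when the last batch is smaller than the others.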

Although the pre-trained ResNet-18 model was evaluated on a subset of the ImageNet evaluation dataset, the accuracy of 69.300% is quite close to the accuracy of 69.758% on the full ImageNet evaluation dataset, as reported on the TorchVision models webpage.

## TorchMetrics Single-Node Multi-GPU Evaluation

TorchMetrics provides module metrics to run evaluations using a single GPU, multiple GPUs, or multiple nodes. The ResNet-18 evaluation implementation in this section uses TorchMetrics for single-node multi-GPU evaluation.
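To illustrate the idea behind a module metric, here is a rough sketch of the accumulate-then-synchronize pattern written in plain PyTorch. The class `Top1Accuracy` is a hypothetical stand-in and is not the TorchMetrics implementation: each process accumulates its own counts in `update`, and `compute` sums the counts across processes only if a process group has been initialized.

```python
import torch
import torch.distributed as dist


class Top1Accuracy:
    """Hypothetical metric showing per-process state plus cross-process sync."""

    def __init__(self, device: torch.device) -> None:
        # Metric states live on the same device as the model outputs.
        self.correct = torch.zeros(1, dtype=torch.long, device=device)
        self.total = torch.zeros(1, dtype=torch.long, device=device)

    def update(self, logits: torch.Tensor, labels: torch.Tensor) -> None:
        """Accumulate counts from one batch seen by this process only."""
        predictions = logits.argmax(dim=1)
        self.correct += (predictions == labels).sum()
        self.total += labels.numel()

    def compute(self) -> float:
        """Return the accuracy over all processes (or just this one)."""
        correct = self.correct.clone()
        total = self.total.clone()
        if dist.is_available() and dist.is_initialized():
            # Sum the per-process counts so every rank sees the global totals.
            dist.all_reduce(correct, op=dist.ReduceOp.SUM)
            dist.all_reduce(total, op=dist.ReduceOp.SUM)
        return (correct.float() / total.float()).item()
```

Because the counts are summed before the division, the result does not depend on how the samples are split across processes, as long as every sample is evaluated exactly once.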

Notice that we intentionally set world_size to 1 to force the evaluation to use a single GPU. The multi-GPU evaluation implementation using a single GPU got exactly the same evaluation accuracy.

Let’s proceed to using two GPUs for evaluation by changing world_size from 1 to 2.

The multi-GPU evaluation implementation using two GPUs also got exactly the same evaluation accuracy. Also notice that the number of batches per process becomes smaller when multiple GPUs are used for evaluation.
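The drop in the number of batches can be worked out with a little arithmetic. The sketch below assumes the dataset is sharded evenly across processes the way `DistributedSampler` does with `drop_last=False`, and that the `DataLoader` also keeps the last partial batch. The dataset size of 1000 and batch size of 256 are hypothetical numbers chosen for illustration, not the settings used in this post.

```python
import math


def batches_per_process(num_samples: int, world_size: int, batch_size: int) -> int:
    """Number of batches each process iterates over during evaluation."""
    # Each process receives an equally sized shard, rounded up.
    samples_per_process = math.ceil(num_samples / world_size)
    # The last batch of a shard may be smaller than batch_size.
    return math.ceil(samples_per_process / batch_size)


if __name__ == "__main__":
    for world_size in (1, 2, 4):
        num_batches = batches_per_process(
            num_samples=1000, world_size=world_size, batch_size=256)
        print(f"world_size={world_size}: {num_batches} batches per process")
```

One thing to keep in mind is that when the dataset size is not divisible by the number of processes, rounding the shard size up means some samples are evaluated more than once, which can slightly change the measured accuracy.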

## TorchMetrics Multi-Node Multi-GPU Evaluation

Launching multi-node multi-GPU evaluation requires using tools such as torch.distributed.launch. I have discussed the usage of torch.distributed.launch for PyTorch distributed training in my previous post “PyTorch Distributed Training”, and I am not going to elaborate on it here. More information can also be found in the official PyTorch example “Launching and Configuring Distributed Data Parallel Applications”.
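As a small illustration of what changes in the evaluation script when a launcher is used, the sketch below reads the process layout from environment variables instead of receiving it from mp.spawn. It assumes the launcher exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, which recent PyTorch launchers do; older versions of torch.distributed.launch pass `--local_rank` as a command-line argument unless `--use_env` is given. The names `DistributedEnv` and `read_distributed_env` are hypothetical.

```python
import os
from typing import Mapping, NamedTuple


class DistributedEnv(NamedTuple):
    rank: int        # Global index of this process across all nodes.
    local_rank: int  # Index of this process within its node.
    world_size: int  # Total number of processes across all nodes.


def read_distributed_env(environ: Mapping[str, str] = os.environ) -> DistributedEnv:
    """Read the process layout, falling back to a single-process setup."""
    return DistributedEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    env = read_distributed_env()
    print(f"rank={env.rank} local_rank={env.local_rank} "
          f"world_size={env.world_size}")
```

The local rank is typically what selects the GPU on the node, while the global rank and world size are what the process group needs.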

## Caveats

Let $N$ be the number of nodes on which the application is running and $G$ be the number of GPUs per node. The total number of application processes running across all the nodes at one time is called the world_size, $W$, and the number of processes running on each node is referred to as the local_world_size, $L$.

In the single-node multi-GPU scenario, world_size and nprocs have the same value, and that value should be smaller than or equal to the number of GPUs in the node. The world_size in this context really means the local_world_size of the node, so world_size and nprocs have to be exactly the same by definition.

For example, in the single-node multi-GPU scenario, suppose $N = 1$ and $G = 8$. When $W = L = 8$, each process could use up to one single GPU; when $W = L = 1$, the single process could use up to 8 GPUs.
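The relationship between these quantities can be written down in a few lines. This is only a sketch, assuming every node runs the same number of processes so that $W = N \times L$, and assuming the GPUs of a node are split evenly among its processes. The function names are hypothetical.

```python
def total_world_size(num_nodes: int, local_world_size: int) -> int:
    """W = N * L when every node runs the same number of processes."""
    return num_nodes * local_world_size


def max_gpus_per_process(gpus_per_node: int, local_world_size: int) -> int:
    """Upper bound on the GPUs one process can use when G is split evenly."""
    if not 1 <= local_world_size <= gpus_per_node:
        raise ValueError("local_world_size must be between 1 and gpus_per_node.")
    return gpus_per_node // local_world_size


if __name__ == "__main__":
    # Single node with 8 GPUs, as in the example above.
    for local_world_size in (8, 1):
        print(f"W = {total_world_size(1, local_world_size)}, "
              f"L = {local_world_size}, up to "
              f"{max_gpus_per_process(8, local_world_size)} GPU(s) per process")
```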

That’s why our single-node multi-GPU evaluation implementation spawns the jobs with world_size = nprocs.
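A minimal sketch of such a spawning setup is shown below. It is not the exact code from the evaluation implementation: the names `check_single_node_config`, `evaluate_worker`, and `launch_single_node` are hypothetical, the port 29500 is assumed to be free, the evaluation itself is replaced by a placeholder all-reduce, and the sketch falls back to the gloo backend on machines without GPUs.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def check_single_node_config(world_size: int, nprocs: int, num_gpus: int) -> None:
    """Raise ValueError if the single-node constraints are violated."""
    if world_size != nprocs:
        raise ValueError("On a single node, world_size must equal nprocs.")
    # num_gpus == 0 means a CPU-only machine, where this sketch uses gloo.
    if num_gpus > 0 and nprocs > num_gpus:
        raise ValueError("nprocs must not exceed the number of GPUs in the node.")


def evaluate_worker(rank: int, world_size: int) -> None:
    """Entry point of each process; mp.spawn passes the process index first."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # On a single node, the global rank is also the local GPU index.
        torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method="tcp://127.0.0.1:29500",  # Assumed free local port.
        rank=rank,
        world_size=world_size,
    )
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
    # Placeholder for the real evaluation loop: every process contributes 1,
    # so the all-reduced sum equals the number of processes that joined.
    contribution = torch.ones(1, device=device)
    dist.all_reduce(contribution, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"Processes that joined: {int(contribution.item())}")
    dist.destroy_process_group()


def launch_single_node(world_size: int) -> None:
    """Spawn world_size processes on this node, one per GPU when available."""
    check_single_node_config(
        world_size, nprocs=world_size, num_gpus=torch.cuda.device_count())
    mp.spawn(evaluate_worker, args=(world_size,), nprocs=world_size, join=True)
```

Call `launch_single_node(2)` from under an `if __name__ == "__main__":` guard in a script, since the spawn start method re-imports the main module in every child process.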

Also notice that mp.spawn can only be used in the single-node multi-GPU scenario and should not be used in the multi-node multi-GPU scenario.

Lei Mao

02-05-2022
