
Introduction

Benchmarking is critical for developing fast PyTorch training and inference applications that use GPUs and CUDA.


In this blog post, I would like to discuss the correct way to benchmark PyTorch applications.

PyTorch Benchmark

Synchronization

PyTorch automatically performs the necessary synchronization when copying data between the CPU and a GPU, or between two GPUs. However, when there are no such operations, the CPU thread and the CUDA stream can be out of sync, and the CPU thread has no way of knowing when a particular CUDA operation has finished.


If the user uses a CPU timer to measure the elapsed time of a PyTorch application without synchronization, then when the timer stops in the CPU thread, the CUDA operations might still be running, and the benchmark results will be incorrect.
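
For example, the following is a minimal sketch of the pitfall, assuming model and input_tensor are already on the GPU. Without torch.cuda.synchronize(), the CPU timer stops as soon as the CUDA kernels have been enqueued, not when they have finished.

from timeit import default_timer as timer
import torch

# Assumes model and input_tensor already live on the GPU.
start = timer()
_ = model(input_tensor)
end = timer()
# Incorrect: CUDA kernels are launched asynchronously, so this mostly
# measures the kernel launch overhead, not the actual GPU computation.
print(f"Without synchronization: {(end - start) * 1000:.5f} ms")

start = timer()
_ = model(input_tensor)
torch.cuda.synchronize()  # wait for all queued CUDA work to finish
end = timer()
# Correct: the timer stops only after the GPU work has completed.
print(f"With synchronization: {(end - start) * 1000:.5f} ms")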

Warmup Runs

In a benchmark, the first few runs can be slow if the GPU has not been warmed up. As a best practice, we always run a few warm-up iterations that are not counted in the profiling results.
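
A minimal warm-up sketch, again assuming model and input_tensor are already on the GPU, might look like the following.

import torch

# Warm-up iterations are not timed. They absorb one-time costs such as
# CUDA context initialization, memory allocation, and kernel autotuning.
num_warmups = 10  # illustrative value
for _ in range(num_warmups):
    _ = model(input_tensor)
torch.cuda.synchronize()  # make sure the warm-up work has finished
# ... timed iterations start here ...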

CPU Timer, CUDA Timer and PyTorch Benchmark Utilities

Timestamps can be measured on the CPU using Python facilities such as time or timeit. They can also be measured on CUDA using CUDA events, such as PyTorch's torch.cuda.Event wrapper. In addition, PyTorch has its own benchmark utilities that help the user run benchmarks; they take care of the warm-up runs and synchronizations automatically, and they also include support for multi-threaded benchmarking.
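
For reference, here is a minimal sketch of timing with CUDA events, assuming model and input_tensor are already on the GPU. Because the events are recorded on the CUDA stream, the measured interval covers the GPU work itself.

import torch

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()  # enqueued on the current CUDA stream
_ = model(input_tensor)
end_event.record()

# elapsed_time() requires both events to have completed,
# so we synchronize the host with the stream first.
torch.cuda.synchronize()
print(f"Latency: {start_event.elapsed_time(end_event):.5f} ms")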

Implementation

Let’s benchmark a couple of PyTorch modules, including a custom convolution layer and a ResNet18, using the CPU timer, the CUDA timer, and the PyTorch benchmark utilities.


In our custom CPU-timer and CUDA-timer benchmark implementations, we will try placing the timer both outside and inside the iteration loop. We will also test the consequences of not synchronizing.

# benchmark_pytorch.py
from timeit import default_timer as timer
import torch
import torch.nn as nn
import torchvision
import torch.utils.benchmark as benchmark


@torch.no_grad()
def measure_time_host(
    model: nn.Module,
    input_tensor: torch.Tensor,
    num_repeats: int = 100,
    num_warmups: int = 10,
    synchronize: bool = True,
    continuous_measure: bool = True,
) -> float:

    # Warm-up runs are not timed.
    for _ in range(num_warmups):
        _ = model.forward(input_tensor)
    torch.cuda.synchronize()

    elapsed_time_ms = 0.0

    if continuous_measure:
        # Time the entire loop once and average over the repeats.
        start = timer()
        for _ in range(num_repeats):
            _ = model.forward(input_tensor)
        if synchronize:
            # Wait for all queued CUDA work before stopping the timer.
            torch.cuda.synchronize()
        end = timer()
        elapsed_time_ms = (end - start) * 1000

    else:
        # Time each iteration individually and accumulate the results.
        for _ in range(num_repeats):
            start = timer()
            _ = model.forward(input_tensor)
            if synchronize:
                torch.cuda.synchronize()
            end = timer()
            elapsed_time_ms += (end - start) * 1000

    return elapsed_time_ms / num_repeats


@torch.no_grad()
def measure_time_device(
    model: nn.Module,
    input_tensor: torch.Tensor,
    num_repeats: int = 100,
    num_warmups: int = 10,
    synchronize: bool = True,
    continuous_measure: bool = True,
) -> float:

    # Warm-up runs are not timed.
    for _ in range(num_warmups):
        _ = model.forward(input_tensor)
    torch.cuda.synchronize()

    elapsed_time_ms = 0.0

    if continuous_measure:
        # Record one pair of CUDA events around the entire loop.
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        for _ in range(num_repeats):
            _ = model.forward(input_tensor)
        end_event.record()
        if synchronize:
            # This has to be synchronized to compute the elapsed time.
            # Otherwise, there will be a runtime error.
            torch.cuda.synchronize()
        elapsed_time_ms = start_event.elapsed_time(end_event)

    else:
        # Record a pair of CUDA events around each iteration.
        for _ in range(num_repeats):
            start_event = torch.cuda.Event(enable_timing=True)
            end_event = torch.cuda.Event(enable_timing=True)
            start_event.record()
            _ = model.forward(input_tensor)
            end_event.record()
            if synchronize:
                # This has to be synchronized to compute the elapsed time.
                # Otherwise, there will be a runtime error.
                torch.cuda.synchronize()
            elapsed_time_ms += start_event.elapsed_time(end_event)

    return elapsed_time_ms / num_repeats


@torch.no_grad()
def run_inference(model: nn.Module,
                  input_tensor: torch.Tensor) -> torch.Tensor:

    return model.forward(input_tensor)


def main() -> None:

    num_warmups = 100
    num_repeats = 1000
    input_shape = (1, 3, 224, 224)

    device = torch.device("cuda:0")

    # model = torchvision.models.resnet18(pretrained=False)
    model = nn.Conv2d(in_channels=input_shape[1],
                      out_channels=256,
                      kernel_size=(5, 5))

    model.to(device)
    model.eval()

    # Input tensor
    input_tensor = torch.rand(input_shape, device=device)

    torch.cuda.synchronize()

    print("Latency Measurement Using CPU Timer...")
    for continuous_measure in [True, False]:
        for synchronize in [True, False]:
            try:
                latency_ms = measure_time_host(
                    model=model,
                    input_tensor=input_tensor,
                    num_repeats=num_repeats,
                    num_warmups=num_warmups,
                    synchronize=synchronize,
                    continuous_measure=continuous_measure,
                )
                print(f"|"
                      f"Synchronization: {synchronize!s:5}| "
                      f"Continuous Measurement: {continuous_measure!s:5}| "
                      f"Latency: {latency_ms:.5f} ms| ")
            except Exception:
                print(f"|"
                      f"Synchronization: {synchronize!s:5}| "
                      f"Continuous Measurement: {continuous_measure!s:5}| "
                      f"Latency: N/A     ms| ")
            torch.cuda.synchronize()

    print("Latency Measurement Using CUDA Timer...")
    for continuous_measure in [True, False]:
        for synchronize in [True, False]:
            try:
                latency_ms = measure_time_device(
                    model=model,
                    input_tensor=input_tensor,
                    num_repeats=num_repeats,
                    num_warmups=num_warmups,
                    synchronize=synchronize,
                    continuous_measure=continuous_measure,
                )
                print(f"|"
                      f"Synchronization: {synchronize!s:5}| "
                      f"Continuous Measurement: {continuous_measure!s:5}| "
                      f"Latency: {latency_ms:.5f} ms| ")
            except Exception:
                print(f"|"
                      f"Synchronization: {synchronize!s:5}| "
                      f"Continuous Measurement: {continuous_measure!s:5}| "
                      f"Latency: N/A     ms| ")
            torch.cuda.synchronize()

    print("Latency Measurement Using PyTorch Benchmark...")
    num_threads = 1
    benchmark_timer = benchmark.Timer(
        stmt="run_inference(model, input_tensor)",
        setup="from __main__ import run_inference",
        globals={
            "model": model,
            "input_tensor": input_tensor
        },
        num_threads=num_threads,
        label="Latency Measurement",
        sub_label="torch.utils.benchmark.")

    profile_result = benchmark_timer.timeit(num_repeats)
    # https://pytorch.org/docs/stable/_modules/torch/utils/benchmark/utils/common.html#Measurement
    print(f"Latency: {profile_result.mean * 1000:.5f} ms")


if __name__ == "__main__":

    main()

Docker Container

All the benchmarks were conducted using an NVIDIA NGC PyTorch Docker container, an Intel Core i9-9900K CPU, and an NVIDIA RTX 2080 Ti GPU.

$ docker run -it --rm --ipc=host --gpus all -v $(pwd):/mnt nvcr.io/nvidia/pytorch:21.08-py3

Benchmarks for Custom Convolution

$ python benchmark_pytorch.py
Latency Measurement Using CPU Timer...
|Synchronization: True | Continuous Measurement: True | Latency: 0.39115 ms| 
|Synchronization: False| Continuous Measurement: True | Latency: 0.24755 ms| 
|Synchronization: True | Continuous Measurement: False| Latency: 0.41779 ms| 
|Synchronization: False| Continuous Measurement: False| Latency: 0.24659 ms| 
Latency Measurement Using CUDA Timer...
|Synchronization: True | Continuous Measurement: True | Latency: 0.37559 ms| 
|Synchronization: False| Continuous Measurement: True | Latency: N/A     ms| 
|Synchronization: True | Continuous Measurement: False| Latency: 0.39968 ms| 
|Synchronization: False| Continuous Measurement: False| Latency: N/A     ms| 
Latency Measurement Using PyTorch Benchmark...
Latency: 0.37773 ms

As expected, the custom convolution layer benchmarks that use the CPU timer without synchronization underestimate the true latency of the PyTorch module.

Benchmarks for ResNet18

$ python benchmark_pytorch.py
Latency Measurement Using CPU Timer...
|Synchronization: True | Continuous Measurement: True | Latency: 2.08764 ms| 
|Synchronization: False| Continuous Measurement: True | Latency: 2.08360 ms| 
|Synchronization: True | Continuous Measurement: False| Latency: 2.23062 ms| 
|Synchronization: False| Continuous Measurement: False| Latency: 2.09605 ms| 
Latency Measurement Using CUDA Timer...
|Synchronization: True | Continuous Measurement: True | Latency: 2.08559 ms| 
|Synchronization: False| Continuous Measurement: True | Latency: N/A     ms| 
|Synchronization: True | Continuous Measurement: False| Latency: 2.21680 ms| 
|Synchronization: False| Continuous Measurement: False| Latency: N/A     ms| 
Latency Measurement Using PyTorch Benchmark...
Latency: 2.10802 ms

This time, the ResNet18 benchmarks using the CPU timer without synchronization are very close to those with synchronization. However, this does not mean that measuring the latency without synchronization was correct. For PyTorch modules that consist of many small CUDA kernels, each of which runs very fast on the GPU, the CPU thread's kernel launch overhead can become the bottleneck; the GPU keeps pace with the CPU, so little work remains outstanding when the timer stops, and the benchmarks with and without synchronization can end up very close.

Conclusions

Benchmarking PyTorch applications with the CPU timer, the CUDA timer, or the PyTorch benchmark utilities, and placing the timer either outside or inside the iteration loop, are all fine, as long as we do not forget to synchronize the CPU thread with the CUDA stream, and as long as the benchmarking methodology stays consistent across all experiments.
