Nsight Systems in Docker

Introduction

NVIDIA Nsight Systems is a low-overhead performance analysis tool designed to provide the insights developers need to optimize their software. Unbiased activity data is visualized within the tool to help users investigate bottlenecks, avoid inferring false positives, and pursue optimizations with a higher probability of performance gains.

In this blog post, I would like to discuss how to install and use Nsight Systems in a Docker container so that we can use it anywhere Docker is installed.

Nsight Systems

Build Docker Image

It is possible to install Nsight Systems inside a Docker image and use it anywhere. The Dockerfile for building the Nsight Systems image is as follows.

nsight-systems.Dockerfile
FROM nvcr.io/nvidia/cudagl:11.4.2-devel-ubuntu20.04

ARG GIT_USER_EMAIL="dukeleimao@gmail.com"
ARG GIT_USER_NAME="Lei Mao"

ENV DEBIAN_FRONTEND=noninteractive

# Install package dependencies
RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
        apt-transport-https \
        ca-certificates \
        dbus \
        fontconfig \
        gnupg \
        libasound2 \
        libfreetype6 \
        libglib2.0-0 \
        libnss3 \
        libsqlite3-0 \
        libx11-xcb1 \
        libxcb-glx0 \
        libxcb-xkb1 \
        libxcomposite1 \
        libxcursor1 \
        libxdamage1 \
        libxi6 \
        libxml2 \
        libxrandr2 \
        libxrender1 \
        libxtst6 \
        openssh-client \
        wget \
        xcb \
        xkb-data && \
    apt-get clean

# Install Qt 5 and the Nsight Systems package that matches CUDA 11.4
RUN apt-get update -y && \
    apt-get install -y qt5-default cuda-nsight-systems-11-4

To build the Docker image, please run the following command.

$ docker build -f nsight-systems.Dockerfile --no-cache --tag=nsight-systems:11.4 .
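
Once the build finishes, a quick smoke test can confirm that the `nsys` command-line tool is reachable inside the image. The sketch below is a convenience and not part of the original setup. It assumes the image tag from the build command above and that `nsys` is on the container's `PATH`. By default it only prints the command; it runs it when `RUN=1` is set and Docker is available.

```shell
#!/bin/sh
# Smoke test sketch: ask the nsys CLI inside the image for its version.
IMAGE="nsight-systems:11.4"   # tag used in the build command above

smoke_test_cmd() {
    # Print the command instead of running it, so it can be inspected first.
    printf 'docker run --rm %s nsys --version\n' "$1"
}

CMD="$(smoke_test_cmd "$IMAGE")"
echo "$CMD"

# Execute only on request (RUN=1) and only if Docker is installed.
if [ "${RUN:-0}" = "1" ] && command -v docker >/dev/null 2>&1; then
    sh -c "$CMD"
fi
```

If the version string is printed, the package installation step in the Dockerfile most likely succeeded.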

Run Docker Container

To run the Docker container, please run the following command.

$ xhost +
$ docker run -it --rm --gpus all -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix --cap-add=SYS_ADMIN --security-opt seccomp=unconfined -v $(pwd):/mnt --network=host nsight-systems:11.4
$ xhost -
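
The long `docker run` command above can be easier to read flag by flag. The sketch below rebuilds the same argument list one piece at a time with a comment per flag. The explanations of `--cap-add=SYS_ADMIN`, `--security-opt seccomp=unconfined`, and `--network=host` are my reading of why they are there, not statements from the original command. By default the script only prints the command; it runs it when `RUN=1` is set and Docker is available.

```shell
#!/bin/sh
# Annotated rebuild of the docker run command; prints it unless RUN=1 is set.
IMAGE="nsight-systems:11.4"

set -- run -it --rm                              # interactive shell, remove container on exit
set -- "$@" --gpus all                           # expose all host GPUs to the container
set -- "$@" -e "DISPLAY=${DISPLAY:-}"            # tell GUI programs which X display to use
set -- "$@" -v /tmp/.X11-unix:/tmp/.X11-unix     # share the host X11 socket so nsys-ui can open windows
set -- "$@" --cap-add=SYS_ADMIN                  # assumption: lets the profiler access restricted kernel features
set -- "$@" --security-opt seccomp=unconfined    # assumption: unblocks syscalls such as perf_event_open
set -- "$@" -v "$(pwd):/mnt"                     # mount the current host directory at /mnt
set -- "$@" --network=host                       # share the host network stack (assumed useful for remote targets)
set -- "$@" "$IMAGE"

DOCKER_RUN_CMD="docker $*"
echo "$DOCKER_RUN_CMD"

# Execute only on request (RUN=1) and only if Docker is installed.
if [ "${RUN:-0}" = "1" ] && command -v docker >/dev/null 2>&1; then
    docker "$@"
fi
```

Note that `xhost +` in the original commands disables X server access control for every client until `xhost -` is run, which is why the two calls bracket the container session.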

Run Nsight Systems

To run Nsight Systems with GUI, please run the following command.

$ nsys-ui

We can now profile applications running inside the Docker container, on the Docker host machine via a Docker mount, and on a remote host such as a remote workstation or an embedded device.
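
Besides the GUI, a trace can also be collected from the command line with `nsys profile` and opened in `nsys-ui` afterwards. The following is a minimal sketch under stated assumptions: `./my_cuda_app` is a hypothetical binary, the report name is made up, and the report is written to `/mnt` on the assumption that the container was started with the `-v $(pwd):/mnt` mount shown earlier. The script prints the command and runs it only if both `nsys` and the binary exist.

```shell
#!/bin/sh
# Sketch: collect a CUDA/NVTX/OS-runtime trace with the nsys CLI.
APP="./my_cuda_app"          # hypothetical application binary
REPORT="/mnt/my_report"      # hypothetical report name; /mnt is the mounted host directory

profile_cmd() {
    # $1 = application, $2 = report path (nsys appends the file extension itself)
    printf 'nsys profile --trace=cuda,nvtx,osrt --force-overwrite=true --output=%s %s\n' "$2" "$1"
}

CMD="$(profile_cmd "$APP" "$REPORT")"
echo "$CMD"

# Run only when the profiler and the application are actually present.
if command -v nsys >/dev/null 2>&1 && [ -x "$APP" ]; then
    sh -c "$CMD"
fi
```

Because the report lands in the mounted directory, it remains available on the host after the container exits.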

Examples

Pageable Memory VS Page-Locked Memory

To overlap data transfer and kernel execution using CUDA streams, we have to use page-locked (pinned) host memory. With pageable host memory, data transfer and kernel execution will not overlap.

I prepared two examples that try to use CUDA streams to overlap data transfer and kernel execution. One uses page-locked host memory and the other uses pageable host memory. The two examples are available on GitHub.
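
For readers who want something to experiment with right away, here is a minimal sketch of the idea. It is not the code from the GitHub examples. The file name, the `scale` kernel, the stream count, and the buffer sizes are all hypothetical. The script writes a small CUDA program that takes `pinned` or `pageable` as its argument, then builds it and profiles both variants only if `nvcc` and `nsys` are available, for example inside the container.

```shell
#!/bin/sh
# Sketch only: writes a small CUDA program, then builds and profiles it if the tools exist.
SRC="${TMPDIR:-/tmp}/overlap_demo.cu"   # hypothetical file name
BIN="${TMPDIR:-/tmp}/overlap_demo"

cat > "$SRC" <<'EOF'
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

__global__ void scale(float* data, size_t n, float factor)
{
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) { data[i] *= factor; }
}

int main(int argc, char** argv)
{
    // "pinned" selects page-locked host memory; anything else uses malloc.
    const bool pinned = (argc > 1 && std::strcmp(argv[1], "pinned") == 0);
    const int num_streams = 4;
    const size_t n = 1 << 22; // floats per stream
    const size_t bytes = n * sizeof(float);

    float* h = nullptr;
    if (pinned) { check(cudaMallocHost((void**)&h, num_streams * bytes), "cudaMallocHost"); }
    else { h = static_cast<float*>(std::malloc(num_streams * bytes)); }
    if (h == nullptr) { return EXIT_FAILURE; }
    for (size_t i = 0; i < num_streams * n; ++i) { h[i] = 1.0f; }

    float* d = nullptr;
    check(cudaMalloc((void**)&d, num_streams * bytes), "cudaMalloc");

    cudaStream_t streams[num_streams];
    for (int s = 0; s < num_streams; ++s) { check(cudaStreamCreate(&streams[s]), "cudaStreamCreate"); }

    // Each stream copies its own chunk in, scales it, and copies it back.
    const unsigned int threads = 256;
    const unsigned int blocks = static_cast<unsigned int>((n + threads - 1) / threads);
    for (int s = 0; s < num_streams; ++s)
    {
        float* hs = h + s * n;
        float* ds = d + s * n;
        check(cudaMemcpyAsync(ds, hs, bytes, cudaMemcpyHostToDevice, streams[s]), "H2D copy");
        scale<<<blocks, threads, 0, streams[s]>>>(ds, n, 2.0f);
        check(cudaGetLastError(), "kernel launch");
        check(cudaMemcpyAsync(hs, ds, bytes, cudaMemcpyDeviceToHost, streams[s]), "D2H copy");
    }
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    std::printf("%s host memory, h[0] = %.1f\n", pinned ? "pinned" : "pageable", h[0]);

    for (int s = 0; s < num_streams; ++s) { check(cudaStreamDestroy(streams[s]), "cudaStreamDestroy"); }
    check(cudaFree(d), "cudaFree");
    if (pinned) { check(cudaFreeHost(h), "cudaFreeHost"); } else { std::free(h); }
    return 0;
}
EOF

# Build and profile only when the CUDA toolchain is present (e.g. inside the container).
if command -v nvcc >/dev/null 2>&1 && nvcc -o "$BIN" "$SRC"; then
    if command -v nsys >/dev/null 2>&1; then
        nsys profile --force-overwrite=true --output=pageable "$BIN" pageable
        nsys profile --force-overwrite=true --output=pinned "$BIN" pinned
    fi
fi
echo "wrote $SRC"
```

The only difference between the two runs is how the host buffer is allocated (`cudaMallocHost` versus `malloc`), so any difference between the two timelines can be attributed to pinned versus pageable memory.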

Profiling the two implementations with Nsight Systems, we can clearly see that there is no overlap between data transfer and kernel execution in the implementation that does not use page-locked memory. A timeline like this tells us that either we made a mistake or there is an optimization opportunity.

No Data Transfer Overlap with Non-Pinned Host Memory

After switching to page-locked memory, we can see data transfer and kernel execution overlap.

Data Transfer Overlap with Pinned Host Memory

GitHub

All the Dockerfiles and examples are available on GitHub.

Miscellaneous

NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. To optimize the implementation of a CUDA kernel, we should therefore use Nsight Compute instead of Nsight Systems. Nsight Compute can be installed and used in a Docker container in much the same way as Nsight Systems.

Author

Lei Mao

Posted on

06-01-2022

Updated on

06-01-2022
