# PyTorch Dynamic Quantization

## Introduction

Dynamic quantization quantizes the weights of a neural network to integers ahead of time, while the activations are quantized dynamically during inference. Compared to a floating-point model, a dynamically quantized model is much smaller, since its weights are stored as low-bitwidth integers. Compared to other quantization techniques, dynamic quantization requires no data for calibration or fine-tuning. More details about the mathematical foundations of quantization for neural networks can be found in my article "Quantization for Neural Networks".

Given a pre-trained floating-point model, we can easily create a dynamically quantized model, run inference, and potentially achieve better latency without much additional effort. In this blog post, I would like to show how to use PyTorch to do dynamic quantization.

## PyTorch Dynamic Quantization

Unlike TensorFlow 2.3.0, which supports integer quantization with arbitrary bitwidths from 2 to 16, PyTorch 1.7.0 only supports 8-bit integer quantization. The workflow is as simple as loading a pre-trained floating-point model and applying a dynamic quantization wrapper.
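The wrapper in question is `torch.quantization.quantize_dynamic`. A minimal sketch, using a small hypothetical model in place of a real pre-trained network:

```python
import torch

# A small floating-point model standing in for a pre-trained network
# (hypothetical example; any torch.nn.Module works the same way).
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
model_fp32.eval()

# Wrap the model: the weights of every nn.Linear are quantized to INT8
# ahead of time, while activations are quantized dynamically at inference.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    y = model_int8(x)
print(y.shape)  # torch.Size([1, 10])
```

The second argument is the set of module types to quantize; modules not listed there are left in floating point.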

In this case, I would like to use the BERT-QA model from HuggingFace Transformers as an example. I dynamically quantized the torch.nn.Linear layers of the BERT-QA model, since the majority of the computation in Transformer-based models is matrix multiplication. The source code can also be downloaded from GitHub.
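The same wrapper applies directly to a HuggingFace model. The sketch below uses a tiny, randomly initialized BERT-QA model (the hyperparameters are assumptions chosen to keep the example small) so that it runs without downloading a checkpoint; in practice one would load a pre-trained checkpoint with `from_pretrained` instead:

```python
import torch
from transformers import BertConfig, BertForQuestionAnswering

# Tiny random BERT-QA model standing in for the pre-trained checkpoint
# (hypothetical configuration for illustration only).
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)
model = BertForQuestionAnswering(config)
model.eval()

# Quantize only the torch.nn.Linear layers, which dominate the compute
# in Transformer-based models.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Run question-answering inference on a dummy input sequence.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
with torch.no_grad():
    outputs = quantized_model(input_ids=input_ids)
print(outputs.start_logits.shape)  # torch.Size([1, 16])
```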

With PyTorch 1.7.0, we can do dynamic quantization on x86-64 and aarch64 CPUs. However, NVIDIA GPUs are not yet supported for PyTorch dynamic quantization.
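PyTorch dispatches quantized kernels to an architecture-specific backend: FBGEMM on x86-64 and QNNPACK on ARM/aarch64. A short sketch of how to inspect and select the engine (whether `"fbgemm"` is available depends on the machine and the PyTorch build):

```python
import torch

# List the quantized kernel backends this PyTorch build supports,
# e.g. "fbgemm" on x86-64 or "qnnpack" on ARM/aarch64.
engines = torch.backends.quantized.supported_engines
print(engines)

# Explicitly select an engine before running a quantized model
# (assumes an x86-64 machine; fall back to whatever is available).
if "fbgemm" in engines:
    torch.backends.quantized.engine = "fbgemm"
elif "qnnpack" in engines:
    torch.backends.quantized.engine = "qnnpack"
```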

We can see that the INT8 quantized model is much smaller than the FP32 model. The inference latency of INT8 dynamic quantization on CPU is also much lower than that of ordinary FP32 inference on CPU. However, FP32 inference on an NVIDIA GPU is still the fastest.
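The size and CPU latency comparison can be reproduced with a sketch like the following, using a stack of linear layers as a stand-in for a real model (the layer sizes and iteration counts are arbitrary choices for illustration; the exact numbers will vary by machine):

```python
import os
import tempfile
import time

import torch

# A linear-heavy model standing in for a real network.
model_fp32 = torch.nn.Sequential(
    *[torch.nn.Linear(512, 512) for _ in range(8)]
)
model_fp32.eval()
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

def model_size_mb(model):
    # Serialize the state dict to disk and measure the file size.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

def latency_ms(model, x, num_iters=50):
    # Average wall-clock latency over num_iters runs, after warm-up.
    with torch.no_grad():
        for _ in range(5):
            model(x)
        start = time.time()
        for _ in range(num_iters):
            model(x)
    return (time.time() - start) / num_iters * 1e3

x = torch.randn(32, 512)
print(f"FP32 size: {model_size_mb(model_fp32):.2f} MB, "
      f"INT8 size: {model_size_mb(model_int8):.2f} MB")
print(f"FP32 latency: {latency_ms(model_fp32, x):.3f} ms, "
      f"INT8 latency: {latency_ms(model_int8, x):.3f} ms")
```

The INT8 model should be roughly a quarter of the FP32 size, since each weight shrinks from 32 bits to 8 bits plus a small per-tensor overhead for scales and zero points.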

Lei Mao

11-14-2020

04-29-2021