# PyTorch Eager Mode Quantization TensorRT Acceleration

## Introduction

As of PyTorch 2.3.0, PyTorch has three quantization interfaces: eager mode quantization, FX graph mode quantization, and PyTorch 2 Export quantization. Because the latest PyTorch 2 Export quantization interface prevents the quantized PyTorch model from being exported to ONNX, it’s not really possible to accelerate the model inference using TensorRT without developing a custom quantization backend for the PyTorch FX graph (fx2trt is coming soon). Both the eager mode quantization and FX graph mode quantization interfaces allow the quantized PyTorch model to be exported to ONNX, which can be further optimized and accelerated using TensorRT. While the FX graph mode quantization interface is more flexible and powerful, using the eager mode quantization interface is sometimes unavoidable for some use cases.

In this post, I would like to show how to accelerate the quantized PyTorch model from the PyTorch eager mode quantization interface using TensorRT. The same approach can also be applied to the quantized PyTorch model from the PyTorch FX graph mode quantization interface, since both quantized PyTorch models can be exported to ONNX.

## PyTorch Eager Mode Quantization TensorRT Acceleration

The TensorRT acceleration for the quantized PyTorch model from the PyTorch eager mode quantization interface involves three steps:

- Perform PyTorch eager mode quantization on the floating-point PyTorch model and export the quantized PyTorch model to ONNX.
- Fix the quantized ONNX model graph so that it can be parsed by the TensorRT parser.
- Build a TensorRT engine from the quantized ONNX model, profile the performance, and verify the accuracy.

The source code for this post can be found on GitHub.

### TensorRT INT8 Quantization Requirements

TensorRT INT8 explicit quantization requires per-channel symmetric quantization for weights and per-tensor symmetric quantization for activations in a quantized model. Therefore, when performing the post-training static quantization calibration or quantization aware training in PyTorch, it’s important to make sure the quantization configuration meets the TensorRT INT8 quantization requirements.
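To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization. It assumes the common convention that the scale is `max(|x|) / 127` and the zero point is fixed at 0. The helper names `symmetric_scale` and `quantize` are illustrative only and are not part of PyTorch or TensorRT.

```python
from typing import List


def symmetric_scale(values: List[float], qmax: int = 127) -> float:
    """Scale for symmetric quantization; the zero point is implicitly 0."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor so the scale is never zero.
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float,
             qmin: int = -128, qmax: int = 127) -> List[int]:
    """Round to the nearest integer step and clamp to the INT8 range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


# Activations: a single scale shared by the whole tensor (per-tensor).
activation = [0.5, -1.27, 0.0, 1.0]
act_scale = symmetric_scale(activation)
act_q = quantize(activation, act_scale)

# Weights: one scale for each output channel (per-channel).
weight = [[0.254, -0.1], [2.54, -1.0]]  # two output channels
weight_scales = [symmetric_scale(channel) for channel in weight]
weight_q = [quantize(channel, s) for channel, s in zip(weight, weight_scales)]
```

In this toy example the second weight channel is ten times larger than the first, so it gets a scale ten times larger. A single per-tensor scale would have to cover both channels and would leave the smaller one with far fewer usable integer levels, which is the usual argument for per-channel weight quantization.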

### PyTorch Eager Mode Quantization

Slightly different from the PyTorch static quantization recipe I posted previously, which is only applicable to CPU inference, the quantization recipe this time uses per-channel symmetric quantization for weights and per-tensor symmetric quantization for activations. The PyTorch quantization backend is set to `qnnpack` instead of `fbgemm`, because `fbgemm` does not support INT8 symmetric quantization inference very well and would therefore prevent the model from being exported to ONNX by tracing.

```python
torch.backends.quantized.engine = 'qnnpack'
```
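A quantization configuration matching this description might look like the following sketch. It uses the observer classes from `torch.ao.quantization`. The exact observers and arguments used in the original source code are not shown in this post, so the choices below (plain min-max observers with default ranges) are assumptions.

```python
import torch
from torch.ao.quantization import (
    MinMaxObserver,
    PerChannelMinMaxObserver,
    QConfig,
)

# Activations: per-tensor symmetric quantization with a signed 8-bit type.
activation_observer = MinMaxObserver.with_args(
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
)

# Weights: per-channel symmetric quantization with a signed 8-bit type.
weight_observer = PerChannelMinMaxObserver.with_args(
    dtype=torch.qint8,
    qscheme=torch.per_channel_symmetric,
)

# Assumed configuration; the original recipe may use different observers.
qconfig = QConfig(activation=activation_observer, weight=weight_observer)
```

In eager mode, such a `qconfig` would typically be assigned to the model's `qconfig` attribute before calling `torch.ao.quantization.prepare` for calibration and `torch.ao.quantization.convert` afterwards.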

### Quantized ONNX Model Graph Surgery

The exported quantized ONNX model from PyTorch eager mode quantization has some bugs and issues with the TensorRT parser. Therefore, we will need to fix the quantized ONNX model before building it to a TensorRT engine.

More specifically, a `Cast` node is inserted between the `QuantizeLinear` and `DequantizeLinear` nodes in the quantized ONNX model graph, and the cast data type is `uint8` instead of `int8`, despite the fact that we have explicitly set the quantization configuration to `torch.qint8` for activations. Thus, these incorrect `Cast` nodes need to be removed from the quantized ONNX model graph.
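The `Cast` removal can be sketched on a toy graph. The `Node` class below is a plain-Python stand-in that mirrors the `op_type`, `input`, and `output` fields of an ONNX node. It is not the `onnx` API, and the function name and tensor names are hypothetical. On a real model, the same traversal would run over the nodes of the exported ONNX graph.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    """Plain-Python stand-in for an ONNX node (illustrative only)."""
    op_type: str
    inputs: List[str]
    outputs: List[str]


def remove_casts_between_q_and_dq(nodes: List[Node]) -> List[Node]:
    """Drop each Cast that sits directly between Q and DQ nodes."""
    # Map every tensor name to the node producing it and the nodes using it.
    producer: Dict[str, Node] = {out: n for n in nodes for out in n.outputs}
    consumers: Dict[str, List[Node]] = {}
    for n in nodes:
        for name in n.inputs:
            consumers.setdefault(name, []).append(n)

    kept: List[Node] = []
    for node in nodes:
        if node.op_type == "Cast":
            source = producer.get(node.inputs[0])
            users = consumers.get(node.outputs[0], [])
            if (
                source is not None
                and source.op_type == "QuantizeLinear"
                and users
                and all(u.op_type == "DequantizeLinear" for u in users)
            ):
                # Reconnect each DequantizeLinear to the QuantizeLinear output.
                for user in users:
                    user.inputs = [
                        node.inputs[0] if name == node.outputs[0] else name
                        for name in user.inputs
                    ]
                continue  # Leave the Cast node out of the result.
        kept.append(node)
    return kept


graph = [
    Node("QuantizeLinear", ["x", "x_scale", "x_zero_point"], ["x_q"]),
    Node("Cast", ["x_q"], ["x_q_cast"]),
    Node("DequantizeLinear", ["x_q_cast", "x_scale", "x_zero_point"], ["x_dq"]),
    Node("Conv", ["x_dq", "w_dq"], ["y"]),
]
fixed = remove_casts_between_q_and_dq(graph)
```

After the call, `fixed` no longer contains the `Cast` node, and the `DequantizeLinear` node reads `x_q` directly. The sketch only rewires the connections. Any data type fixes on the zero-point tensors of a real model would need to be handled separately.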

In addition, the floating-point bias term for the `Conv` node cannot be parsed by the TensorRT parser. It needs to be computed and added to the quantized ONNX model graph as a constant tensor.

Before the quantized ONNX model graph surgery, the quantized ONNX model graph looks like this.

After the quantized ONNX model graph surgery, the quantized ONNX model graph will look like this.

There might still be a few places where this quantized ONNX model graph can be improved for optimum TensorRT performance, such as fusing the skip connection add with one of the previous convolution layers by removing the `QuantizeLinear` and `DequantizeLinear` nodes between the `Conv` and `Add` nodes. This is a little bit tricky to do using PyTorch eager mode quantization, as mentioned previously, and it’s not covered in this article. For optimum TensorRT quantized engine performance, please check the TensorRT Q/DQ placement recommendations and consider also using the NVIDIA PyTorch Quantization Toolkit.

### Build and Verify Quantized TensorRT Engine

The floating-point PyTorch ResNet18 model and the INT8-quantized PyTorch ResNet18 model have an accuracy of 0.854 and 0.852, respectively, on the CIFAR10 test dataset. The quantized TensorRT engine has an accuracy of 0.851 on the CIFAR10 test dataset, which is consistent with the quantized PyTorch model.

For a batch size of 1 and an input image size of 32 x 32, the FP16 and INT8 ResNet18 TensorRT engines have inference latencies of 0.208177 ms and 0.17584 ms, respectively. Although the math utilization of the inference is low because of the small batch size and small image size, the INT8-quantized ResNet18 engine still achieves a 1.2x latency improvement compared to the FP16 ResNet18 engine. With a larger batch size and a larger image size, the latency improvement could be more significant.
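The improvement figure follows directly from the two measured latencies. The short calculation below only restates that arithmetic, and the variable names are illustrative.

```python
# Measured latencies reported above, in milliseconds.
fp16_latency_ms = 0.208177
int8_latency_ms = 0.17584

# Speedup of the INT8 engine relative to the FP16 engine.
speedup = fp16_latency_ms / int8_latency_ms

# Equivalent relative reduction in latency.
latency_reduction = 1.0 - int8_latency_ms / fp16_latency_ms

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {latency_reduction:.1%}")
```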

## References

PyTorch Eager Mode Quantization TensorRT Acceleration

https://leimao.github.io/blog/PyTorch-Eager-Mode-Quantization-TensorRT-Acceleration/