
Introduction

Dynamic quantization quantizes the weights of neural networks to integers ahead of time, whereas the activations are quantized dynamically during inference. Compared to floating point neural networks, a dynamically quantized model is much smaller in size since the weights are stored as low-bitwidth integers. Compared to other quantization techniques, dynamic quantization does not require any data for calibration or fine-tuning. More details about the mathematical foundations of quantization for neural networks can be found in my article "Quantization for Neural Networks".
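
As a rough, minimal sketch of the arithmetic involved (asymmetric per-tensor quantization here, not PyTorch's internal kernels), a tensor can be mapped to int8 and back using a scale and zero point computed from its observed value range, which is what "dynamic" refers to for the activations:

import torch

def quantize_tensor_int8(x):
    # A sketch of asymmetric per-tensor quantization to int8.
    # Scale and zero point are derived from the observed min/max of the tensor.
    qmin, qmax = -128, 127
    x_min, x_max = x.min().item(), x.max().item()
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x_min / scale))
    x_q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return x_q, scale, zero_point

def dequantize_tensor(x_q, scale, zero_point):
    # Recover an approximation of the original floating point tensor.
    return scale * (x_q.to(torch.float32) - zero_point)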


Given a pre-trained floating point model, we can easily create a dynamically quantized model, run inference, and potentially achieve better latency without much additional effort. In this blog post, I would like to show how to use PyTorch to perform dynamic quantization.

PyTorch Dynamic Quantization

Unlike TensorFlow 2.3.0, which supports integer quantization using arbitrary bitwidths from 2 to 16, PyTorch 1.7.0 only supports 8-bit integer quantization. The workflow is as easy as loading a pre-trained floating point model and applying a dynamic quantization wrapper.
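
As a minimal sketch on a toy model (not the BERT model used below), the wrapper is a single call to torch.quantization.quantize_dynamic:

import torch

# A toy floating point model; a pre-trained model works the same way.
fp32_model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
fp32_model.eval()

# Replace every torch.nn.Linear with a dynamically quantized counterpart.
# Weights are quantized to int8 ahead of time; activations are quantized
# on the fly during inference.
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    outputs = int8_model(torch.randn(1, 128))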


In this case, I would like to use the BERT-QA model from HuggingFace Transformers as an example. I dynamically quantized the torch.nn.Linear layers of the BERT-QA model, since the majority of the computation in Transformer based models is matrix multiplication. The source code could also be downloaded from GitHub.

# qa.py

import os
import time
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

def measure_inference_latency(model, inputs, num_samples=100):

    # Measure the average latency of one forward pass over num_samples runs.
    # Gradient tracking is disabled since we only run inference.
    with torch.no_grad():
        start_time = time.time()
        for _ in range(num_samples):
            _ = model(**inputs)
        end_time = time.time()
    elapsed_time = end_time - start_time
    elapsed_time_ave = elapsed_time / num_samples

    return elapsed_time_ave

def get_bert_qa_model(model_name="deepset/bert-base-cased-squad2", cache_dir="./saved_models"):

    # https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForQuestionAnswering
    tokenizer = BertTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    model = BertForQuestionAnswering.from_pretrained(model_name, cache_dir=cache_dir, return_dict=True)

    return model, tokenizer

def prepare_qa_inputs(question, text, tokenizer, device=None):

    # Tokenize the question-context pair and optionally move the tensors
    # to the target device.
    inputs = tokenizer(question, text, return_tensors="pt")
    if device is not None:
        inputs = move_inputs_to_device(inputs, device=device)

    return inputs

def move_inputs_to_device(inputs, device=None):

    inputs_cuda = dict()
    for input_name in inputs.keys():
        inputs_cuda[input_name] = inputs[input_name].to(device)

    return inputs_cuda

def run_qa(model, tokenizer, question, text, device=None):

    inputs = prepare_qa_inputs(question=question, text=text, tokenizer=tokenizer)

    all_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].numpy()[0])

    if device is not None:
        inputs = move_inputs_to_device(inputs, device=device)
        model = model.to(device)

    outputs = model(**inputs)

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # The answer span runs from the argmax of the start logits to the
    # argmax of the end logits (inclusive, hence the +1 for slicing).
    answer_start_idx = torch.argmax(start_scores, 1)[0]
    answer_end_idx = torch.argmax(end_scores, 1)[0] + 1

    answer = " ".join(all_tokens[answer_start_idx : answer_end_idx])

    return answer

def get_model_size(model, temp_dir="/tmp"):

    # Serialize the state dict to a temporary file and report its size on disk.
    model_path = os.path.join(temp_dir, "temp")
    torch.save(model.state_dict(), model_path)
    # model.save_pretrained(model_path)
    size = os.path.getsize(model_path)
    os.remove(model_path)

    return size

def main():

    cuda_device = torch.device("cuda:0")
    num_samples = 100

    model, tokenizer = get_bert_qa_model(model_name="deepset/bert-base-cased-squad2")
    model.eval()
    # https://pytorch.org/docs/stable/torch.quantization.html?highlight=torch%20quantization%20quantize_dynamic#torch.quantization.quantize_dynamic
    quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

    print("=" * 75)
    print("Model Sizes")
    print("=" * 75)

    model_size = get_model_size(model=model)
    quantized_model_size = get_model_size(model=quantized_model)

    print("FP32 Model Size: {:.2f} MB".format(model_size / (2 ** 20)))
    print("INT8 Model Size: {:.2f} MB".format(quantized_model_size / (2 ** 20)))

    question = "What publication printed that the wealthiest 1% have more money than those in the bottom 90%?"

    text = "According to PolitiFact the top 400 richest Americans \"have more wealth than half of all Americans combined.\" According to the New York Times on July 22, 2014, the \"richest 1 percent in the United States now own more wealth than the bottom 90 percent\". Inherited wealth may help explain why many Americans who have become rich may have had a \"substantial head start\". In September 2012, according to the Institute for Policy Studies, \"over 60 percent\" of the Forbes richest 400 Americans \"grew up in substantial privilege\"."

    inputs = prepare_qa_inputs(question=question, text=text, tokenizer=tokenizer)
    answer = run_qa(model=model, tokenizer=tokenizer, question=question, text=text)
    answer_quantized = run_qa(model=quantized_model, tokenizer=tokenizer, question=question, text=text)

    print("=" * 75)
    print("BERT QA Example")
    print("=" * 75)

    print("Text: ")
    print(text)
    print("Question: ")
    print(question)
    print("Model Answer: ")
    print(answer)
    print("Dynamic Quantized Model Answer: ")
    print(answer_quantized)

    print("=" * 75)
    print("BERT QA Inference Latencies")
    print("=" * 75)

    model_latency = measure_inference_latency(model=model, inputs=inputs, num_samples=num_samples)
    print("CPU Inference Latency: {:.2f} ms / sample".format(model_latency * 1000))

    quantized_model_latency = measure_inference_latency(model=quantized_model, inputs=inputs, num_samples=num_samples)
    print("Dynamic Quantized CPU Inference Latency: {:.2f} ms / sample".format(quantized_model_latency * 1000))

    inputs_cuda = move_inputs_to_device(inputs, device=cuda_device)
    model.to(cuda_device)
    model_cuda_latency = measure_inference_latency(model=model, inputs=inputs_cuda, num_samples=num_samples)
    print("CUDA Inference Latency: {:.2f} ms / sample".format(model_cuda_latency * 1000))

    # No CUDA backend for dynamic quantization in PyTorch 1.7.0
    # quantized_model_cuda = quantized_model.to(cuda_device)
    # quantized_model_cuda_latency = measure_inference_latency(model=quantized_model_cuda, inputs=inputs_cuda, num_samples=num_samples)
    # print("Dynamic Quantized GPU Inference Latency: {:.2f} ms / sample".format(quantized_model_cuda_latency * 1000))

if __name__ == "__main__":

    main()

With PyTorch 1.7.0, we can run dynamic quantization on x86-64 and aarch64 CPUs. However, NVIDIA GPUs are not yet supported for PyTorch dynamic quantization.
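
Which CPU kernels are used for the int8 computation depends on the quantized engine. As a sketch, assuming the corresponding backend is compiled into your PyTorch build, the engine can be inspected and selected explicitly before running inference; FBGEMM targets x86-64 CPUs and QNNPACK targets ARM (aarch64) CPUs:

import torch

# Quantized engines available in this PyTorch build, e.g. ['fbgemm', 'qnnpack'].
print(torch.backends.quantized.supported_engines)

# Select the engine explicitly: 'fbgemm' for x86-64 CPUs, 'qnnpack' for aarch64 CPUs.
torch.backends.quantized.engine = "fbgemm"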

$ python qa.py 
===========================================================================
Model Sizes
===========================================================================
FP32 Model Size: 411.00 MB
INT8 Model Size: 168.05 MB
===========================================================================
BERT QA Example
===========================================================================
Text: 
According to PolitiFact the top 400 richest Americans "have more wealth than half of all Americans combined." According to the New York Times on July 22, 2014, the "richest 1 percent in the United States now own more wealth than the bottom 90 percent". Inherited wealth may help explain why many Americans who have become rich may have had a "substantial head start". In September 2012, according to the Institute for Policy Studies, "over 60 percent" of the Forbes richest 400 Americans "grew up in substantial privilege".
Question: 
What publication printed that the wealthiest 1% have more money than those in the bottom 90%?
Model Answer: 
New York Times
Dynamic Quantized Model Answer: 
New York Times
===========================================================================
BERT QA Inference Latencies
===========================================================================
CPU Inference Latency: 78.91 ms / sample
Dynamic Quantized CPU Inference Latency: 47.83 ms / sample
CUDA Inference Latency: 10.40 ms / sample

We can see that the INT8 dynamically quantized model is much smaller than the FP32 model. The inference latency of INT8 dynamic quantization on the CPU is also much lower than that of ordinary FP32 inference on the CPU. However, FP32 inference on an NVIDIA GPU is still the fastest.
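
The INT8 model is not a quarter of the FP32 size because only the torch.nn.Linear weights are quantized; the embeddings, LayerNorm, and other parameters stay in FP32. One way to confirm what was actually converted is to print a submodule of the quantized model and look for DynamicQuantizedLinear layers. A minimal sketch, assuming the quantized_model object from qa.py above:

# Assuming `quantized_model` from qa.py above. The attention and feed-forward
# projections of the first encoder layer should now show up as
# DynamicQuantizedLinear modules, while the embedding modules remain
# ordinary FP32 modules.
print(quantized_model.bert.encoder.layer[0])
print(quantized_model.bert.embeddings)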

References