ONNX Runtime C++ Inference
ONNX is the open standard format for neural network model interoperability. The ONNX ecosystem also includes ONNX Runtime, which can execute neural network models using different execution providers, such as CPU, CUDA, and TensorRT. While there have been many examples of running inference using the ONNX Runtime Python APIs, examples using the ONNX Runtime C++ APIs are quite limited.
In this blog post, I would like to discuss how to do image processing using the OpenCV C++ APIs and run inference using the ONNX Runtime C++ APIs.
In this example, I used the public SqueezeNet ONNX model and royalty-free images from Pixabay. With slight modifications, however, the implementation should also be compatible with most ImageNet classification neural networks and images from other sources. In addition, I compared the inference latencies measured with the CPU and CUDA execution providers.
The implementation and the Docker container are available on GitHub.
In this example, we used OpenCV for image processing and ONNX Runtime for inference. The C++ headers and libraries for OpenCV and ONNX Runtime are usually not available on the system or in a well-maintained Docker container, so we would have to build OpenCV and ONNX Runtime from source and install them. Both OpenCV and ONNX Runtime support CUDA, and we would have to build the CUDA components for at least ONNX Runtime. The build takes a very long time, so I recommend using the prepared Dockerfile to build a Docker container instead of building the libraries manually.
Image preprocessing using the OpenCV C++ APIs is not as straightforward as using the OpenCV Python APIs. We would have to:
- Read an image in HWC BGR UINT8 format.
- Resize the image.
- Convert the image to HWC RGB UINT8 format.
- Convert the image to HWC RGB float format by dividing each pixel by 255.
- Split the RGB channels from the image.
- Normalize each channel.
- Merge the RGB channels back to the image.
- Convert the image to CHW RGB float format.
The implementation looks as follows.
cv::Mat imageBGR = cv::imread(imageFilepath, cv::ImreadModes::IMREAD_COLOR);
To run inference using ONNX Runtime, the user is responsible for creating and managing the input and output buffers. These buffers can be created and managed via std::vector. The linear-format input data should be copied to the buffer used for ONNX Runtime inference.
size_t inputTensorSize = vectorProduct(inputDims);
Once the buffers are created, they are used to create instances of Ort::Value, which is the tensor format for ONNX Runtime. A neural network can have multiple inputs, so we have to prepare arrays of Ort::Value instances for the inputs and outputs respectively, even if we only have one input and one output.
Creating an ONNX Runtime inference session and querying the input and output names, dimensions, and types are trivial, so I will skip them here.
To run inference, we provide the run options, an array of input names corresponding to the input tensors, the array of input tensors, the number of inputs, an array of output names corresponding to the output tensors, the array of output tensors, and the number of outputs.
The inference result can be found in the buffers for the output tensors, which are usually the std::vector buffers we created and wrapped earlier.
We fed a bee eater image to the neural network and ran inference using the CPU and CUDA execution providers.
$ ./inference --use_cpu
$ ./inference --use_cuda
The ONNX Runtime inference implementation successfully classified the bee eater image as a bee eater with high confidence. The inference latency using CUDA is 0.98 ms on an NVIDIA RTX 2080 Ti GPU, whereas the inference latency using CPU is 7.45 ms on an Intel i9-9900K CPU.
Using the TensorRT execution provider might result in even better inference latency. However, I did not measure it because creating a correct Docker container and building the correct libraries is very tedious.