How to run half precision inference on a TensorRT model, written with TensorRT C++ API?


I'm trying to run half precision inference with a model written natively in the TensorRT C++ API (not parsed from another framework such as Caffe or TensorFlow). To the best of my knowledge, there is no public working example of this; the closest thing I found is the sampleMLP sample code released with TensorRT 4.0.0.3, but its release notes say fp16 is not supported.

My toy example code can be found in this repo. It contains the API-implemented architecture and inference routine, plus the Python script I use to convert my dictionary of trained weights to the wtd TensorRT format.

My toy architecture consists of a single convolution. The goal is to obtain similar results between fp32 and fp16, apart from some reasonable loss of precision. The code works as expected with fp32, whereas with fp16 inference I get values of a completely different order of magnitude (~1e40), so it looks like I'm doing something wrong during the conversions. A rough sketch of the kind of network definition involved is shown below.
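For reference, this is roughly what the single-convolution network definition looks like with the C++ API (an illustrative sketch only, with placeholder names and dimensions rather than the exact code from the repo):

```cpp
#include <NvInfer.h>
#include <vector>

// Illustrative sketch (not the exact repo code): a single-convolution network
// defined with the TensorRT 4/5-era C++ API. Kernel/bias values come from the
// trained-weights dictionary converted by the Python script.
nvinfer1::INetworkDefinition* buildToyNetwork(nvinfer1::IBuilder& builder,
                                              const std::vector<float>& kernel,
                                              const std::vector<float>& bias)
{
    nvinfer1::INetworkDefinition* network = builder.createNetwork();

    // Input tensor; the CHW dimensions here are placeholders.
    auto* data = network->addInput("data", nvinfer1::DataType::kFLOAT,
                                   nvinfer1::DimsCHW{3, 224, 224});

    // Trained kernel/bias values supplied as float32 Weights structs.
    nvinfer1::Weights kernelWeights{nvinfer1::DataType::kFLOAT, kernel.data(),
                                    static_cast<int64_t>(kernel.size())};
    nvinfer1::Weights biasWeights{nvinfer1::DataType::kFLOAT, bias.data(),
                                  static_cast<int64_t>(bias.size())};

    auto* conv = network->addConvolution(*data, /*nbOutputMaps=*/16,
                                         nvinfer1::DimsHW{3, 3},
                                         kernelWeights, biasWeights);

    conv->getOutput(0)->setName("output");
    network->markOutput(*conv->getOutput(0));
    return network;
}
```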

I'd appreciate any help in understanding the problem.

Thanks,

f

1 Answer

After quickly reading through your code, I can see you did more than is necessary to get a half-precision optimized network. You shouldn't manually convert the loaded weights from float32 to float16 yourself. Instead, create your network as you normally would, with float32 weights, and call nvinfer1::IBuilder::setFp16Mode(true) on your nvinfer1::IBuilder object to let TensorRT do the conversions for you where suitable.
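A rough sketch of what that looks like at engine-build time (assuming the TensorRT 4/5-era builder API; everything apart from the IBuilder methods is a placeholder, not your actual code):

```cpp
#include <NvInfer.h>
#include <iostream>

// Minimal logger, just so the sketch is self-contained.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

nvinfer1::ICudaEngine* buildFp16Engine()
{
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
    nvinfer1::INetworkDefinition* network = builder->createNetwork();

    // ... define the network exactly as for fp32, with float32 weights ...

    // Ask TensorRT to run layers in fp16 where the hardware supports it;
    // the weight conversion happens inside the builder.
    if (builder->platformHasFastFp16())
        builder->setFp16Mode(true);

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 20);

    nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);

    network->destroy();
    builder->destroy();
    return engine;
}
```

Note that with this approach the input and output bindings typically remain float32, so the host-side buffers you copy in and out of the execution context don't need to change either.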