ONNX model converted with tf2onnx runs on CPU only in Python


I'm trying to use the model from the repository (on Google Drive) with ONNX instead of TensorFlow.
I converted it with:
python3 -m tf2onnx.convert --graphdef mars-small128.pb --output mars-small128_nchw.onnx --inputs-as-nchw "images:0" --inputs "images:0" --outputs-as-nchw "features:0" --outputs "features:0"
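
(As a sanity check of the conversion, the converted file's input and output layout can be inspected with something like the following sketch; the file name is taken from the command above.)

import onnxruntime as ort

# Inspect the converted model's I/O to confirm the NCHW conversion took effect.
sess = ort.InferenceSession("mars-small128_nchw.onnx",
                            providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)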

Now I try to use it this way:

import onnxruntime as ort


class ImageEncoder(object):

    def __init__(self, checkpoint_filename):
        self.image_shape = [128, 64]

        # Request the CUDA execution provider explicitly.
        self.session = ort.InferenceSession(checkpoint_filename,
                                            providers=['CUDAExecutionProvider'])
        print(f"the result of get_provider is: {self.session.get_providers()}")

    def __call__(self, data_x):
        # Feed the NCHW batch and fetch the "features:0" output.
        out = self.session.run(["features:0"], {'images:0': data_x})[0]
        return out

The print statement gives me: the result of get_provider is: ['CUDAExecutionProvider', 'CPUExecutionProvider']

But I see that my CPU load rises to 100% on all cores(!) while my GPU load in nvidia-smi stays low.

>>> onnx.__version__
'1.11.0'
>>> onnxruntime.__version__
'1.10.0'
>>> tf2onnx.__version__
'1.12.1'

CUDA 11.0

There is 1 answer below.

Setting CUDAExecutionProvider for ort.InferenceSession does not guarantee that the entire model will be executed on the GPU. Some operations may not have a CUDA implementation; when creating the inference session, onnxruntime assigns those operations to the CPU. You can cross-check the operations present in the model against the ONNX Runtime Supported Operators table for any that aren't listed under CUDAExecutionProvider.
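
As a rough first check, you can list the operator types that actually occur in the converted graph and compare them against that table by hand. A minimal sketch using the onnx package (the file name is taken from the conversion command in the question):

import collections

import onnx

# Count the operator types present in the converted graph.
model = onnx.load("mars-small128_nchw.onnx")
op_counts = collections.Counter(node.op_type for node in model.graph.node)

# Any op type not listed under CUDAExecutionProvider in the Supported
# Operators table will fall back to the CPU at session creation time.
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")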

Another possible explanation is that your graphics processor simply doesn't have much to do. Running the model always comes with some CPU overhead, and your pipeline probably has pre- and postprocessing steps as well. Your input images are also rather small, only [64, 128]. Unless the model is overwhelmingly deep, any modern GPU will execute it very quickly, so the GPU spends most of its time waiting for inputs, which shows up as low load.

To try and resolve the issue, I recommend first profiling your script. If most of the execution time is spent outside of InferenceSession.run, there's your answer; in that case, increasing the batch size could help by reducing the per-call CPU overhead (see the timing sketch after the profiling example below). If running the model is the bottleneck, you can profile the model execution itself with

import onnxruntime

# Enable the built-in profiler so every executed node is timed.
options = onnxruntime.SessionOptions()
options.enable_profiling = True
session = onnxruntime.InferenceSession(
        'model.onnx',
        sess_options=options,
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
session.run(None, batch)
# end_profiling() writes the trace and returns the path of the JSON file.
profile_file = session.end_profiling()

and inspect the generated JSON file with onnxruntime perf view. This shows the runtimes of the executed operations grouped by provider, and you can use that information to modify the original model so that more of it runs on CUDA.
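
For the script-level check mentioned earlier, a plain wall-clock comparison is often enough before reaching for a full profiler. Below is a minimal sketch under a few assumptions: it reuses the ImageEncoder class from your question, the model path from your conversion command, and a made-up NCHW batch in place of your real preprocessing.

import time

import numpy as np

# Assumes the ImageEncoder class from the question is defined in this script.
encoder = ImageEncoder("mars-small128_nchw.onnx")

run_time = 0.0
loop_start = time.perf_counter()
for _ in range(100):
    # Placeholder for your real preprocessing; the shape is only illustrative.
    batch = np.random.rand(32, 3, 128, 64).astype(np.float32)

    t0 = time.perf_counter()
    features = encoder(batch)  # time spent inside InferenceSession.run
    run_time += time.perf_counter() - t0

total_time = time.perf_counter() - loop_start
print(f"inference: {run_time:.2f}s of {total_time:.2f}s total "
      f"({100 * run_time / total_time:.0f}%)")

If that percentage is low, the GPU is not the limiting factor, and a larger batch size or lighter pre-/postprocessing is the place to look.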