I'm trying to use the model from the repository (on Google Drive) with ONNX instead of TensorFlow.
I converted it with:
python3 -m tf2onnx.convert --graphdef mars-small128.pb --output mars-small128_nchw.onnx --inputs-as-nchw "images:0" --inputs "images:0" --outputs-as-nchw "features:0" --outputs "features:0"
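For reference, the converted model's input and output names can be double-checked with onnx (a minimal sketch, using the filename produced by the command above):

import onnx

# Print the graph's input and output names so the names used at inference
# time ("images:0" / "features:0") can be verified.
model = onnx.load("mars-small128_nchw.onnx")
print([i.name for i in model.graph.input])    # expected: ['images:0']
print([o.name for o in model.graph.output])   # expected: ['features:0']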
Now I try to use it this way:
import onnxruntime as ort


class ImageEncoder(object):
    def __init__(self, checkpoint_filename):
        self.image_shape = [128, 64]
        self.session = ort.InferenceSession(checkpoint_filename, providers=['CUDAExecutionProvider'])
        print(f"the result of get_provider is: {self.session.get_providers()}")

    def __call__(self, data_x):
        out = self.session.run(["features:0"], {'images:0': data_x})[0]
        return out
The print statement gives me:
the result of get_provider is: ['CUDAExecutionProvider', 'CPUExecutionProvider']
But I see that my CPU load rises to 100% on all cores(!) while my GPU load in nvidia-smi stays low.
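For completeness, the encoder is called roughly like this (an illustrative sketch only; the batch shape and dtype are my assumptions based on the NCHW conversion, so adjust them to whatever the model actually expects):

import numpy as np

# Illustrative usage with an assumed dummy NCHW batch:
# 16 images, 3 channels, 128x64 pixels.
encoder = ImageEncoder("mars-small128_nchw.onnx")
batch = np.zeros((16, 3, 128, 64), dtype=np.float32)
features = encoder(batch)
print(features.shape)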
>>> onnx.__version__
'1.11.0'
>>> onnxruntime.__version__
'1.10.0'
>>> tf2onnx.__version__
'1.12.1'
CUDA 11.0
Setting CUDAExecutionProvider for ort.InferenceSession does not guarantee that the entire model will be executed on the GPU. Some operations may not have an implementation for CUDA - when creating the inference session, onnxruntime will assign those operations to be executed on the CPU. You can cross-check the operations present in the model against the ONNX Runtime Supported Operators list for any that aren't listed under CUDAExecutionProvider.
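For example, you could list the distinct op types in the converted model and check each one against that table (a minimal sketch, assuming the filename from the question):

import onnx

# Collect the distinct op types used by the converted graph so they can be
# cross-checked against the ONNX Runtime "Supported Operators" table for
# CUDAExecutionProvider.
model = onnx.load("mars-small128_nchw.onnx")
print(sorted({node.op_type for node in model.graph.node}))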
Another possible explanation is that your graphics processor simply doesn't have much to do. Running the model always comes with some CPU overhead, and your pipeline probably has some pre- and postprocessing steps as well. Now, your input images are rather small, only [64, 128]. If the model isn't overwhelmingly deep, any modern GPU will execute it very fast, resulting in your GPU mostly just waiting for inputs and thus a low load.

To try and resolve the issue, I recommend first profiling your script. If most of the execution time is spent outside of InferenceSession.run - there's your answer. If this is the case, increasing the batch size could help by reducing the per-call CPU overhead. If running the model is the bottleneck, you can profile the model execution itself with ONNX Runtime's built-in profiling (see the sketch below) and inspect the generated json file with onnxruntime perf view. This will show the runtimes of the executed operations grouped by provider, and you can use that information to try and modify the original model to make it run on CUDA instead.
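A minimal profiling sketch (the model path, input/output names and the dummy input shape/dtype are assumptions taken from the question, so adjust as needed):

import numpy as np
import onnxruntime as ort

# Enable ONNX Runtime's built-in profiler via SessionOptions.
sess_options = ort.SessionOptions()
sess_options.enable_profiling = True
session = ort.InferenceSession("mars-small128_nchw.onnx",
                               sess_options=sess_options,
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Run a few dummy batches so the profile contains representative timings.
dummy = np.zeros((16, 3, 128, 64), dtype=np.float32)
for _ in range(10):
    session.run(["features:0"], {"images:0": dummy})

# Writes an onnxruntime_profile_*.json trace file and returns its path.
print(session.end_profiling())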