Deploy a quantized encoder-decoder model as an ensemble on Triton server


The problem

I am trying to deploy a machine translation model from the M2M family in a production setting using Triton Inference Server.

What I have tried so far

I have exported my model to ONNX format and quantized it, so I now have the encoder, the decoder, and the decoder-with-past as ONNX models.

I would like to find the optimal way to deploy the model in production.

I was thinking about an ensemble model with this pipeline:

Raw Text -> Preprocessing -> Encoder -> Decoder -> Probability for next Token
                                           ↑                     ↓
                                     Top k Tokens      <-    Beam Search

I already have an ensemble model with the tokenizer and the encoder part. It uses the quantized version of the encoder, leveraging the ONNX Runtime backend on Triton server, and returns the last hidden state of the model.
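For context, calling that ensemble from a Python client looks roughly like this. The model name (`encoder_ensemble`) and the tensor names (`TEXT`, `last_hidden_state`) are placeholders from my setup and will differ depending on your config:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The ensemble takes raw text as a BYTES tensor; names are placeholders.
text = np.array([["Hello world"]], dtype=object)
inp = httpclient.InferInput("TEXT", text.shape, "BYTES")
inp.set_data_from_numpy(text)

out = httpclient.InferRequestedOutput("last_hidden_state")
result = client.infer(model_name="encoder_ensemble", inputs=[inp], outputs=[out])

# Encoder output that the decoder step would consume.
hidden_states = result.as_numpy("last_hidden_state")
print(hidden_states.shape)
```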

What I want

I want to implement the decoder as a Python backend model on Triton that takes the output of the encoder plus the other decoder inputs and runs the generation process using beam search.

  • Load the quantized decoder, call generate on it, and return the generated tokens.

How can I achieve this? I have looked at the code for the generate method in Hugging Face Transformers, but there is too much of it and I got lost.

Does anyone have an easy way to implement this?
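To make the question concrete, here is a rough sketch of the kind of Python backend model I have in mind for the decoder step. It uses a greedy decoding loop instead of beam search just to keep it short, and the ONNX file path, input/output tensor names, and target language are placeholders that depend on how the model was exported:

```python
import numpy as np
import onnxruntime as ort
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Placeholder paths/names; adjust to your model repository layout.
        self.tokenizer = AutoTokenizer.from_pretrained("facebook/m2m100_418M")
        self.decoder = ort.InferenceSession("/models/decoder/1/decoder_quantized.onnx")
        self.max_new_tokens = 128

    def execute(self, requests):
        responses = []
        for request in requests:
            # Encoder output produced by the first part of the ensemble.
            hidden = pb_utils.get_input_tensor_by_name(
                request, "encoder_hidden_states").as_numpy()
            mask = pb_utils.get_input_tensor_by_name(
                request, "encoder_attention_mask").as_numpy()

            # M2M100 starts decoding with EOS and forces the target-language
            # token first; "fr" is a placeholder target language.
            ids = np.array(
                [[self.tokenizer.eos_token_id, self.tokenizer.get_lang_id("fr")]],
                dtype=np.int64,
            )

            # Greedy loop; a real implementation would use beam search and the
            # decoder-with-past model to avoid recomputing previous steps.
            for _ in range(self.max_new_tokens):
                logits = self.decoder.run(
                    None,
                    {
                        "input_ids": ids,
                        "encoder_hidden_states": hidden,
                        "encoder_attention_mask": mask,
                    },
                )[0]
                next_id = logits[:, -1, :].argmax(axis=-1).reshape(1, 1)
                ids = np.concatenate([ids, next_id.astype(np.int64)], axis=-1)
                if next_id[0, 0] == self.tokenizer.eos_token_id:
                    break

            text = self.tokenizer.batch_decode(ids, skip_special_tokens=True)
            out = pb_utils.Tensor("OUTPUT_TEXT", np.array(text, dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```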

What I can do

I can also have a Triton model using the Python backend, make it load both the encoder and the decoder, and then perform the generation there. But with this, I will lose the advantage of Triton's native ONNX Runtime backend.
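Concretely, I imagine that alternative looking something like the sketch below inside the Python backend, using Optimum's ONNX Runtime wrapper so that generate (with beam search) just works. The model directory and languages are placeholders, and the exported ONNX filenames must match what Optimum expects:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

# Placeholder directory containing the exported encoder/decoder ONNX files.
model_dir = "m2m100_onnx"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = ORTModelForSeq2SeqLM.from_pretrained(model_dir)

tokenizer.src_lang = "en"
inputs = tokenizer("Hello world", return_tensors="pt")

# Beam search, forcing the target language as the first generated token,
# as M2M100 requires.
generated = model.generate(
    **inputs,
    num_beams=5,
    forced_bos_token_id=tokenizer.get_lang_id("fr"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```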

My question is: is this approach optimal/fast compared to the one I described above?

This seems to be the approach used in this repository, and it appears to work.
