Slow prediction speed for translation model opus-mt-en-ro

919 Views Asked by lenhhoxung At 12 April 2022 at 16:41

I'm using the Helsinki-NLP/opus-mt-en-ro model from Hugging Face. To produce output, I use the following code:

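    # questions: list of source sentences for one batch
    inputs = tokenizer(
        questions,
        max_length=max_input_length,
        truncation=True,
        return_tensors='pt',
        padding=True,  # pads every sentence to the longest one in the batch
    ).to('cuda')
    translation = model.generate(**inputs)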

When questions contains only a few sentences, it works fine. However, as the number of sentences increases (e.g., batch size = 128), it becomes very slow. I have a dataset of 100K examples to translate. How can I make it faster? (I already checked GPU utilization, and it varies between 25% and 70%.)
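Since padding=True pads each sentence to the longest one in its batch, one thing that may be worth ruling out is wasted computation on padding when sentence lengths vary a lot. Below is a minimal sketch of length-sorted batching, assuming questions is a plain list of strings. The helper names length_sorted_batches and restore_order are hypothetical (they are not part of transformers), and whitespace word count is used only as a rough proxy for token length. Whether this helps in this particular case is not confirmed.

```python
from typing import Iterator, List, Tuple


def length_sorted_batches(
    sentences: List[str], batch_size: int
) -> Iterator[Tuple[List[int], List[str]]]:
    """Yield (original_indices, batch) so each batch holds similar-length sentences."""
    # Assumption: whitespace word count roughly tracks tokenizer length.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [sentences[i] for i in idx]


def restore_order(indices: List[int], outputs: List[str], total: int) -> List[str]:
    """Put per-batch outputs back into the original input order."""
    restored = [""] * total
    for i, out in zip(indices, outputs):
        restored[i] = out
    return restored


if __name__ == "__main__":
    questions = ["a b c d e", "a", "a b c", "a b"]
    all_idx, all_out = [], []
    for idx, batch in length_sorted_batches(questions, batch_size=2):
        # In the real pipeline, `batch` would be passed to tokenizer(...) and
        # model.generate(...); str.upper() is only a stand-in here.
        all_idx.extend(idx)
        all_out.extend(s.upper() for s in batch)
    print(restore_order(all_idx, all_out, len(questions)))
```

Each batch would then be tokenized and passed to model.generate exactly as in the snippet above, with the decoded outputs collected alongside their original indices.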

Update: Following dennlinger's comment, here is some additional information:

Average question length: around 30 tokens.
Definition of slowness: With a batch of 128 questions, generation takes around 25 seconds, so my dataset of 100K examples will take more than 5 hours. I'm using an Nvidia V100 GPU (16 GB), hence to('cuda') in the code. I cannot increase the batch size because that results in an out-of-memory error.
Generation parameters: I haven't tried different parameters, but as far as I know, the number of beams defaults to 1.
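The "more than 5 hours" estimate can be checked with a few lines of arithmetic. This sketch uses only the figures stated above (100K examples, batch size 128, about 25 seconds per batch); the variable names are illustrative.

```python
import math

num_examples = 100_000
batch_size = 128
seconds_per_batch = 25  # measured figure quoted in the question

num_batches = math.ceil(num_examples / batch_size)    # last batch is partial
total_seconds = num_batches * seconds_per_batch
total_hours = total_seconds / 3600
sentences_per_second = batch_size / seconds_per_batch

print(num_batches)                     # 782
print(round(total_hours, 2))           # 5.43
print(round(sentences_per_second, 2))  # 5.12
```

That works out to roughly 5 sentences per second, which is the throughput figure to compare against after changing any generation or batching setting.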
