I'm using the Helsinki-NLP/opus-mt-en-ro model from Hugging Face. To produce output, I'm using the following code:
    inputs = tokenizer(
        questions,
        max_length=max_input_length,
        truncation=True,
        return_tensors='pt',
        padding=True,
    ).to('cuda')
    translation = model.generate(**inputs)
For small inputs (i.e., a small number of sentences in questions), it works fine. However, when the number of sentences increases (e.g., batch size = 128), it is very slow.
I have a dataset of 100K examples and I have to produce output for all of them. How can I make it faster? (I already checked GPU utilization, and it varies between 25% and 70%.)
Update: Following the comment of dennlinger, here is the additional information:
- Average question length: Around 30 tokens
- Definition of slowness: With a batch of 128 questions, it takes around 25 seconds, so my dataset of 100K examples will take more than 5 hours. I'm using an Nvidia V100 GPU (16GB) (hence to('cuda') in the code). I cannot increase the batch size because it results in an out of memory error.
- I didn't try different parameters, but I know that by default the number of beams equals 1.
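For reference, the "more than 5 hours" figure follows directly from the numbers above. This is only a quick arithmetic sketch: the function name estimated_hours is made up for illustration, and it assumes every batch takes roughly the same 25 seconds.

```python
import math


def estimated_hours(num_examples: int, batch_size: int, seconds_per_batch: float) -> float:
    """Rough wall-clock estimate, assuming every batch takes the same time."""
    # The last, partially filled batch still costs one full generate() call.
    num_batches = math.ceil(num_examples / batch_size)
    return num_batches * seconds_per_batch / 3600


# Numbers from the question: 100K examples, batch size 128, ~25 s per batch.
hours = estimated_hours(100_000, 128, 25)
print(f"{hours:.2f} hours")  # prints "5.43 hours"
```

That is 782 batches at about 25 seconds each, so any speedup per batch translates directly into the total time.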