Inference with INT4 ONNX version of Llama-2 very slow on Google Colab


I am using the INT4 quantized version of Llama-2 13B to run inference on the T4 GPU in Google Colab.
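For reproducibility, a Colab cell along these lines installs the dependencies (I'm assuming the onnxruntime-gpu extra is the right one for GPU execution; versions are unpinned):

!pip install optimum[onnxruntime-gpu] accelerate transformers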

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import accelerate

model_name = 'Intel/Llama-2-13b-chat-hf-onnx-int4'
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'  # device is 'cuda:0'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForCausalLM.from_pretrained(model_name,
    use_cache=False, use_io_binding=False, device_map='auto')

model.to(device)

def chat(model, tokenizer, device, prompt, **kwargs):
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    generate_ids = model.generate(inputs.input_ids, **kwargs)
    return tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

prompt = 'How do I mark an email as spam in gmail?'
response = chat(model, tokenizer, device, prompt,
    max_new_tokens=20, do_sample=False)
print(response)

But this is far too slow: generating just 20 tokens (barely the start of an answer) takes about 15 minutes. I am not hitting any resource limits according to the metrics Colab shows -- RAM: 4.4/12.7 GB; GPU memory: 5.6/15 GB; disk: 35.7/78.2 GB.
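For reference, a quick way to put a number on it (a minimal sketch that just times one call to the chat helper above; nothing beyond the standard library):

import time

# Time a single short generation and report seconds per new token.
start = time.perf_counter()
_ = chat(model, tokenizer, device, prompt, max_new_tokens=20, do_sample=False)
elapsed = time.perf_counter() - start
print(f'{elapsed:.0f} s total, ~{elapsed / 20:.0f} s per generated token')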

I'm pretty sure the model is running on the GPU, because when I skip model.to(device) I get a warning that my tokens and my model are on different devices. I also set do_sample=False in the hope of reducing compute, but it made no noticeable difference to the inference speed.
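One way to double-check the execution-provider side of this (a minimal sketch; the providers attribute on the optimum model is an assumption about the installed version, while the plain onnxruntime calls are standard):

import onnxruntime as ort

# The stock 'onnxruntime' wheel is CPU-only; GPU execution needs 'onnxruntime-gpu'.
print(ort.get_device())               # 'GPU' only for a GPU-enabled build
print(ort.get_available_providers())  # should include 'CUDAExecutionProvider'

# Which providers the loaded session actually uses (attribute name assumed):
print(getattr(model, 'providers', 'no providers attribute'))

# The provider can also be requested explicitly at load time, e.g.:
# model = ORTModelForCausalLM.from_pretrained(
#     model_name, use_cache=False, use_io_binding=False,
#     provider='CUDAExecutionProvider')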

Is this just how slow the model actually is?

P.S. I also tried running the regular, non-quantized Llama-2 7B (13B wouldn't fit in memory) with transformers.AutoModelForCausalLM, and saw similar inference speed (~30-60 seconds per token).
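For concreteness, a minimal sketch of such a baseline (fp16 and the gated meta-llama/Llama-2-7b-chat-hf checkpoint are assumptions on my part; the chat helper is reused from above):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Non-quantized 7B baseline via plain transformers (fp16 so it fits in the T4's 15 GB).
base_name = 'meta-llama/Llama-2-7b-chat-hf'
base_tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_name, torch_dtype=torch.float16, device_map='auto')

print(chat(base_model, base_tokenizer, 'cuda:0', prompt,
           max_new_tokens=20, do_sample=False))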
