I am running inference with the INT4-quantized ONNX version of Llama-2 13B Chat on a T4 GPU in Google Colab.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import accelerate

model_name = 'Intel/Llama-2-13b-chat-hf-onnx-int4'
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'  # resolves to 'cuda:0'

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the INT4 ONNX export through optimum's ONNX Runtime wrapper
model = ORTModelForCausalLM.from_pretrained(model_name,
                                            use_cache=False,
                                            use_io_binding=False,
                                            device_map='auto')
model.to(device)

def chat(model, tokenizer, device, prompt, **kwargs):
    # Tokenize the prompt, move it to the model's device, generate, and decode
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    generate_ids = model.generate(inputs.input_ids, **kwargs)
    return tokenizer.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]

prompt = 'How do I mark an email as spam in gmail?'
response = chat(model, tokenizer, device, prompt,
                max_new_tokens=20, do_sample=False)
print(response)
But this is far too slow: generating only 20 tokens (where the model barely starts giving the answer) takes about 15 minutes. Based on the metrics Colab shows, I am not hitting any resource limits: RAM 4.4/12.7 GB, GPU memory 5.6/15.0 GB, disk 35.7/78.2 GB.
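For reference, the ~15 minutes figure is just the wall-clock time of the call; a minimal sketch of how I'm timing it (time.perf_counter is the standard-library timer, nothing special):

import time

start = time.perf_counter()
response = chat(model, tokenizer, device, prompt,
                max_new_tokens=20, do_sample=False)
elapsed = time.perf_counter() - start
print(f'{elapsed:.1f} s total, {elapsed / 20:.1f} s per generated token')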
I'm fairly sure the model is running on the GPU, because if I skip model.to(device) I get a warning that my tokens and model are on different devices. I also set do_sample=False (greedy decoding instead of sampling) in the hope of reducing compute, but that made no noticeable difference in inference speed.
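To double-check where ONNX Runtime is actually executing, this is the kind of sanity check I have in mind (onnxruntime.get_device() and get_available_providers() are standard; model.providers and model.device are my assumption about the optimum attribute names and may differ between versions):

import onnxruntime as ort

print(ort.get_device())               # 'GPU' only if the GPU build of onnxruntime is installed
print(ort.get_available_providers())  # should include 'CUDAExecutionProvider'
print(model.providers)                # assumed attribute: providers used by the loaded session
print(model.device)                   # assumed attribute: device reported by optimum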
Is this just how slow the model actually is?
P.S. I also tried running the regular, non-quantized Llama-2 7B (13B wouldn't fit in memory) via transformers.AutoModelForCausalLM, and saw similarly slow inference (~30-60 seconds per token).
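For completeness, this is roughly how I loaded that 7B baseline; the exact checkpoint name and fp16 setting are from memory, so treat it as a sketch rather than the exact code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

baseline_name = 'meta-llama/Llama-2-7b-chat-hf'  # assumed checkpoint (gated, needs HF access)
baseline_tok = AutoTokenizer.from_pretrained(baseline_name)
baseline_model = AutoModelForCausalLM.from_pretrained(
    baseline_name,
    torch_dtype=torch.float16,  # half precision so it fits in the T4's 15 GB
    device_map='auto')

response = chat(baseline_model, baseline_tok, device, prompt,
                max_new_tokens=20, do_sample=False)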