Is there any way to load a Hugging Face model across multiple GPUs and use those GPUs for inference as well?
For example, the model below can be loaded on a single GPU (default cuda:0) and run for inference like this:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
# load the model onto a single GPU (cuda:0)
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", torch_dtype=torch.float16).to("cuda:0")

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
How should I load and run this model for inference on two or more GPUs using Accelerate or DeepSpeed?
Please keep in mind that this is not about training or fine-tuning a model, only inference.
Any guidance/help would be highly appreciated, thanks in anticipation!
You can leverage Hugging Face's accelerate for multi-GPU inference, something like this:
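A minimal sketch of that pattern (not the exact code from the post referenced below; it assumes the script is started with accelerate launch so that one process runs per GPU, and the prompt list and generation settings are placeholders):

import torch
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForCausalLM

# Launch with: accelerate launch your_script.py  -> one process per GPU
accelerator = Accelerator()

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
# Each process loads a full copy of the model onto its own GPU
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
).to(accelerator.device)

prompts = ["Your text here", "Another prompt", "Yet another prompt"]  # placeholder prompts

# split_between_processes hands each process a disjoint slice of the prompts,
# so the GPUs generate in parallel (data-parallel inference)
with accelerator.split_between_processes(prompts) as subset:
    for prompt in subset:
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(accelerator.device)
        output = model.generate(input_ids, max_length=256, temperature=0.7)
        print(f"GPU {accelerator.process_index}: {tokenizer.decode(output[0], skip_special_tokens=True)}")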
The approach is adapted from LLM Inference on multiple GPUs with Accelerate by Geronimo.
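If, instead, you want to shard a single copy of the model across several GPUs (e.g. because it does not fit on one card), transformers can do that through device_map="auto", which uses accelerate under the hood. A minimal sketch of that variant:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
# device_map="auto" lets accelerate spread the model's layers over all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
    device_map="auto",
)

input_context = "Your text here"
# Inputs go to the device of the first model shard; accelerate's hooks move
# activations between GPUs during the forward pass
input_ids = tokenizer.encode(input_context, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))

This keeps the generate call identical to your single-GPU version; only the loading step changes.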
More info: