Loading a HF Model in Multiple GPUs and Run Inferences in those GPUs (Not Training or Finetuning)


Is there any way to load a Hugging Face model across multiple GPUs and run inference on those GPUs as well?

For example, the following model can be loaded on a single GPU (cuda:0 by default) and run for inference like this:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", torch_dtype=torch.float16)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

How should I load and run this model for inference on two or more GPUs using Accelerate or DeepSpeed?

Please keep in mind that this is not about training or finetuning a model; it is inference only.

Any guidance/help would be highly appreciated, thanks in anticipation!


You can leverage Hugging Face's Accelerate library to run multi-GPU inference, along these lines:

import torch
from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoTokenizer, AutoModelForCausalLM

accelerator = Accelerator()

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    device_map={"": accelerator.process_index},
    torch_dtype=torch.float16
)

prompts_all = [
    "prompt 1...",
    "prompt 2..."
]

# sync processes before splitting the work
accelerator.wait_for_everyone()

# divide the prompt list across the available GPUs
with accelerator.split_between_processes(prompts_all) as prompts:
    # store the generations of this process in a dict
    results = dict(outputs=[], num_tokens=0)

    # have each GPU run inference, prompt by prompt
    for prompt in prompts:
        prompt_tokenized = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

        # drop the prompt tokens from the output
        output_tokenized = output_tokenized[len(prompt_tokenized["input_ids"][0]):]

        # store outputs and number of generated tokens in results
        results["outputs"].append(tokenizer.decode(output_tokenized, skip_special_tokens=True))
        results["num_tokens"] += len(output_tokenized)

    # wrap in a list, otherwise gather_object() will not collect correctly
    results = [results]

# collect the results from all the processes
results_gathered = gather_object(results)

print(results_gathered)

The above example is adapted from LLM Inference on multiple GPUs with Accelerate by Geronimo.
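Note that the script above needs one process per GPU, which Accelerate's launcher spawns for you; running it with plain `python` would start a single process and use only one GPU. Assuming the code is saved as inference.py (a hypothetical filename), a typical invocation on two GPUs would look like:

```shell
# spawn 2 processes, one per GPU; each process loads its own copy of the model
accelerate launch --num_processes 2 inference.py
```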

More info:

  1. Accelerate: https://huggingface.co/docs/accelerate/index
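For intuition about the two Accelerate helpers used in the answer: `split_between_processes` hands each process a contiguous slice of the input list, and `gather_object` collects the per-process results back together in rank order. A minimal pure-Python sketch of those semantics (my own illustration, not Accelerate's actual implementation):

```python
def split_between_processes(items, num_processes, process_index):
    """Give each process a contiguous, near-even slice of `items`
    (illustrates how work is divided across processes)."""
    base, extra = divmod(len(items), num_processes)
    # the first `extra` processes each receive one additional item
    start = process_index * base + min(process_index, extra)
    end = start + base + (1 if process_index < extra else 0)
    return items[start:end]

def gather_object(per_process_lists):
    """Concatenate each process's list, in rank order
    (illustrates collecting results on all processes)."""
    return [obj for sub in per_process_lists for obj in sub]

prompts = ["p1", "p2", "p3", "p4", "p5"]
shards = [split_between_processes(prompts, 2, rank) for rank in range(2)]
print(shards)                 # [['p1', 'p2', 'p3'], ['p4', 'p5']]
print(gather_object(shards))  # ['p1', 'p2', 'p3', 'p4', 'p5']
```

This is why the answer wraps each process's `results` dict in a list before calling `gather_object`: gathering concatenates the lists, yielding one dict per process.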