I came across this problem while trying out LLaMA 2 (the 13B version) on a server with eight 32 GB GPUs. The pipeline is set up as follows:
import torch
import transformers

# model, tokenizer, and prompt are defined earlier in the script
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=512,
)
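For completeness, model and tokenizer are created beforehand along these lines (simplified; the checkpoint id below is the public Hugging Face name standing in for my local copy):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder for my local copy

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)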
When I run the script, the nvidia-smi output looks like this and never changes, which indicates that the GPUs are being ignored:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB Off | 00000000:06:00.0 Off | 0 |
| N/A 35C P0 45W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2-32GB Off | 00000000:07:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2-32GB Off | 00000000:0A:00.0 Off | 0 |
| N/A 36C P0 45W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2-32GB Off | 00000000:0B:00.0 Off | 0 |
| N/A 34C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2-32GB Off | 00000000:85:00.0 Off | 0 |
| N/A 34C P0 46W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2-32GB Off | 00000000:86:00.0 Off | 0 |
| N/A 35C P0 44W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2-32GB Off | 00000000:89:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2-32GB Off | 00000000:8A:00.0 Off | 0 |
| N/A 34C P0 44W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
How can I set the pipeline to work with multiple GPUs instead of the CPU?
Many thanks.
I tried specifying an exact CUDA device with the argument device="cuda:0" in transformers.pipeline, and this did force the pipeline to use cuda:0 instead of the CPU.
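Concretely, that attempt only changed the pipeline construction:

# Pins the whole model to the first GPU instead of the CPU
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device="cuda:0",
)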
But LLaMA-2-13b requires more than 32 GB of memory to run on a single GPU, which is exactly the capacity of my Tesla V100. So I need to find a way to spread the workload across multiple GPUs in order to make LLaMA-2-13b run.
All the solutions I found by googling tell me that device_map="auto" can automatically allocate the model across different GPUs, but as shown above, that is not what happens in my environment.
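From what I can tell from the Hugging Face documentation, device_map="auto" relies on the accelerate package and is applied when the weights are loaded with from_pretrained, and the resulting placement can then be inspected via model.hf_device_map. Below is a sketch of what I understand the intended usage to be (the checkpoint id is again a placeholder for my local copy); I may well be misreading the docs, so corrections are welcome:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder for my local copy

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" at load time asks accelerate to spread the layers
# across all visible GPUs (requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Should print a dict mapping module names to device indices,
# e.g. {"model.embed_tokens": 0, "model.layers.0": 0, ...}
print(model.hf_device_map)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)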