M2 Max GPU utilization steadily dropping while running inference with huggingface distilbert-base-cased

I have an M2 Max (96 GB). I am running inference on a column of text from a pandas DataFrame, looping over the rows one at a time without bothering to batch them. The model is a fine-tuned Hugging Face distilbert-base-cased. GPU utilization is around ~50% when the loop starts, but it slowly drops to 1% or less. I thought this might be a thermal throttling issue, so I pointed an external fan at the machine, but that didn't seem to help, so I am not sure heat is actually the problem. Either way, inference is excruciatingly slow.

Has anyone experienced the same? Any pointers on how to debug what's really going on?

Code sample:

import numpy as np
from tqdm import tqdm
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset

imdb = load_dataset('imdb')
sentences = imdb['train']['text'][:500]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')

# Run inference one sentence at a time (no batching, no padding),
# so every call sees a different input length.
for i, sentence in tqdm(enumerate(sentences)):
  inputs = tokenizer(sentence, truncation=True, return_tensors='tf')
  output = model(inputs).logits
  pred = np.argmax(output.numpy(), axis=1)

  if i % 100 == 0:
    print(f"len(input_ids): {inputs['input_ids'].shape[-1]}")

The prints show:

Metal device set to: Apple M2 Max

systemMemory: 96.00 GB
maxCacheSize: 36.00 GB

3it [00:00, 10.87it/s]
len(input_ids): 391
101it [00:13,  6.38it/s]
len(input_ids): 215
201it [00:34,  4.78it/s]
len(input_ids): 237
301it [00:55,  4.26it/s]
len(input_ids): 256
401it [01:54,  1.12it/s]
len(input_ids): 55
500it [03:40,  2.27it/s]
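
For reference, here is a batched variant I am considering trying instead. This is only a sketch: the batch_size of 32 and the fixed max_length of 512 are my own guesses, on the theory that feeding the model a constant input shape (rather than a new shape per sentence) avoids per-shape compilation work on the Metal backend, which would fit the gradual slowdown.

import numpy as np
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset

imdb = load_dataset('imdb')
sentences = imdb['train']['text'][:500]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')

batch_size = 32  # assumption: untested value, tune for memory/throughput
preds = []
for start in range(0, len(sentences), batch_size):
  batch = sentences[start:start + batch_size]
  # padding='max_length' pads every batch to the same 512-token shape,
  # so the model always sees identically shaped tensors.
  inputs = tokenizer(batch, truncation=True, padding='max_length',
                     max_length=512, return_tensors='tf')
  logits = model(inputs).logits
  preds.extend(np.argmax(logits.numpy(), axis=1))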