I have a large collection of documents, each consisting of ~10 sentences. For each document, I wish to find the sentence that maximises perplexity, or equivalently the loss from a fine-tuned causal LM. I have decided to use Hugging Face and the distilgpt2 model for this purpose. I have 2 problems when trying to do this in an efficient (vectorized) fashion:
1. The tokenizer requires padding to work in batch mode, but when computing the loss on padded input_ids, those pad tokens contribute to the loss. So the loss of a given sentence depends on the length of the longest sentence in the batch, which is clearly wrong.
2. When I pass a batch of input IDs to the model and compute the loss, I get a scalar because it (mean?) pools across the batch. I instead need the loss per item, not the pooled one.
I made a version that operates on a sentence-by-sentence basis and, while correct, it is extremely slow (I want to process ~25m sentences total). Any advice?
Minimal example below:
# Init
import spacy
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000")
segmenter = spacy.load('en_core_web_sm')

# That's the part I need to vectorise, surely within a document (bsize ~ 10)
# and ideally across documents (bsize as big as my GPU can handle)
def select_sentence(sentences):
    """We pick the sentence that maximizes perplexity"""
    max_loss, best_index = 0, 0
    for i, sentence in enumerate(sentences):
        # One forward pass per sentence: no padding issue, but painfully slow
        encodings = tokenizer(sentence, return_tensors="pt")
        input_ids = encodings.input_ids
        loss = model(input_ids, labels=input_ids).loss.item()
        if loss > max_loss:
            max_loss = loss
            best_index = i
    return sentences[best_index]

# documents is an iterable of raw document strings
for document in documents:
    sentences = [sentence.text.strip() for sentence in segmenter(document).sents]
    best_sentence = select_sentence(sentences)
    write(best_sentence)
If the goal is to compute perplexity and then select the sentences, there's a better way to do the perplexity computation without messing around with tokens/models.
Install https://huggingface.co/spaces/evaluate-metric/perplexity:
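That space is backed by the evaluate library, so the install is presumably just:

pip install evaluate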
Then:
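A minimal usage sketch, assuming the API documented in that space (the input texts here are just placeholders):

import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")

input_texts = ["The cat sat on the mat.", "Colorless green ideas sleep furiously."]

results = perplexity.compute(model_id="distilgpt2", predictions=input_texts)

print(list(results.keys()))
print(results["perplexities"])     # one perplexity per input text
print(results["mean_perplexity"])  # pooled value across the batch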
[out]:
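The keys are fixed by the metric; the actual numbers depend on the model and inputs, so only the shape of the output is shown here:

['perplexities', 'mean_perplexity']
[<perplexity of text 1>, <perplexity of text 2>]
<mean of the above>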
Q: That's great, but how do I use it for a custom model that can't be fetched with model_id=...?
A: For that, let's look under the hood: https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py
This is how the code initializes the model:
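Paraphrasing the relevant lines of perplexity.py: inside Perplexity._compute, the model_id string is handed straight to from_pretrained, so only Hub-style identifiers are anticipated:

# inside Perplexity._compute(...), paraphrased
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)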
Argh, there's no support for local models!
What if we make some simple changes to the code =)
See Load a pre-trained model from disk with Huggingface Transformers
Technically, if you have a local model that you can load with AutoModelForCausalLM.from_pretrained(...), you should be able to pass its path as the model_id after the code change (see the sketch below).
Opened a pull-request: https://huggingface.co/spaces/evaluate-metric/perplexity/discussions/4