I have a large collection of documents, each consisting of ~10 sentences. For each document, I wish to find the sentence that maximises perplexity, or equivalently the loss from a fine-tuned causal LM. I have decided to use Hugging Face and the distilgpt2 model for this purpose. I have 2 problems when trying to do this in an efficient (vectorized) fashion:
1. The tokenizer requires padding to work in batch mode, but when computing the loss on padded input_ids, those pad tokens contribute to the loss. So the loss of a given sentence depends on the length of the longest sentence in the batch, which is clearly wrong.
2. When I pass a batch of input IDs to the model and compute the loss, I get a scalar because it (mean?) pools across the batch. I instead need the loss per item, not the pooled one.
I made a version that operates on a sentence-by-sentence basis and, while correct, it is extremely slow (I want to process ~25m sentences total). Any advice?
Minimal example below:
# Init
import spacy
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000")
segmenter = spacy.load('en_core_web_sm')

# That's the part I need to vectorise, surely within a document (bsize ~ 10)
# and ideally across documents (bsize as big as my GPU can handle)
def select_sentence(sentences):
    """We pick the sentence that maximizes perplexity"""
    max_loss, best_index = 0, 0
    for i, sentence in enumerate(sentences):
        # One forward pass per sentence: no padding issue, but painfully slow
        encodings = tokenizer(sentence, return_tensors="pt")
        input_ids = encodings.input_ids
        loss = model(input_ids, labels=input_ids).loss.item()
        if loss > max_loss:
            max_loss = loss
            best_index = i
    return sentences[best_index]

# documents is an iterable of raw document strings
for document in documents:
    sentences = [sentence.text.strip() for sentence in segmenter(document).sents]
    best_sentence = select_sentence(sentences)
    write(best_sentence)
If the goal is to compute perplexity and then select the sentences, there's a better way to do the perplexity computation without messing around with tokens/models.
Install https://huggingface.co/spaces/evaluate-metric/perplexity:
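That space is backed by the evaluate library, so the install is presumably just:

pip install evaluate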
Then:
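A minimal usage sketch, assuming the API documented in that space (the input texts here are just placeholders):

import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")

input_texts = ["The cat sat on the mat.", "Colorless green ideas sleep furiously."]

results = perplexity.compute(model_id="distilgpt2", predictions=input_texts)

print(list(results.keys()))
print(results["perplexities"])     # one perplexity per input text
print(results["mean_perplexity"])  # pooled value across the batch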
[out]:
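The keys are fixed by the metric; the actual numbers depend on the model and inputs, so only the shape of the output is shown here:

['perplexities', 'mean_perplexity']
[<perplexity of text 1>, <perplexity of text 2>]
<mean of the above>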
Q: That's great, but how do I use it for a custom model that can't be fetched with model_id=...?
A: For that, let's look under the hood: https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py
This is how the code initializes the model:
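Paraphrasing the relevant lines of perplexity.py: inside Perplexity._compute, the model_id string is handed straight to from_pretrained, so only Hub-style identifiers are anticipated:

# inside Perplexity._compute(...), paraphrased
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)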
Argh, there's no support for local models!
What if we make some simple changes to the code =)
See Load a pre-trained model from disk with Huggingface Transformers
Technically, if you have a local model that you can load with AutoModelForCausalLM.from_pretrained(...), you should be able to pass its path as the model_id after the code change (see the sketch below).
Opened a pull-request: https://huggingface.co/spaces/evaluate-metric/perplexity/discussions/4