Perplexity metric for GPT2 model is lower for non-English text


I am currently working on a project where I calculate the perplexity of various causal LLMs for different languages, to estimate how a model behaves when the input is in a language it was not trained on.

However, I noticed that the perplexity metric is sometimes lower when the input is in a different language than the one the model was trained on. What is the theoretical explanation for this?

I expected the perplexity to be lower for the language the model was trained on than for an out-of-distribution language. I would be grateful for any help with this.
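
To be precise, by perplexity I mean the usual exponentiated average negative log-likelihood of the token sequence, which, as far as I understand, is also what the metric in the example below computes:

$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$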

To illustrate, here is the example from the documentation of the Hugging Face "evaluate" library, which shows exactly this behaviour:

import evaluate

# Load the perplexity metric and score three short strings with GPT-2
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)

print(results)



>> {'perplexities': [32.25198745727539, 1499.620361328125, 408.2679748535156], 'mean_perplexity': 646.713441212972}
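
For comparison, here is a minimal sketch of how I understand this number to be produced, using the transformers library directly (the helper name compute_perplexity is mine, and the exact values can differ slightly from the evaluate output because of details such as batching and start-token handling):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load the same "gpt2" checkpoint used by the metric above
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def compute_perplexity(text):
    # With labels equal to input_ids, the model returns the mean
    # cross-entropy over the predicted tokens; exp of that is the perplexity.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

for text in ["lorem ipsum", "Happy Birthday!", "Bienvenue"]:
    print(text, compute_perplexity(text))

This also makes it easy to inspect how each string is split into subword tokens (tokenizer.tokenize(text)), since the three strings differ in length and tokenization.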

Therefore, the perplexity is lowest for the "lorem ipsum" string, even though GPT2 was trained on English text, so I would have expected the English sentence "Happy Birthday!" to have the lowest value rather than the highest.

Thank you for your help!
