KenLM perplexity weirdness


I have 96 files, each containing ~10K lines of English text (tokenized, downcased). If I loop through the files (essentially doing k-fold cross-validation with k = #files), build an LM (using bin/lmplz) on 95 of them, and run bin/query on the held-out file against it, I see a PPL (including OOVs) of 1.0 every time. But if I run a file against an LM built from all 96 files (so the test doc is included in the LM's training data), I get a PPL of 27.8.

I have more experience with SRILM than KenLM, but I've never seen a perplexity score of 1. Something feels wrong about that. Even if I accepted it and attributed it to the test document's sentences also occurring in the training data, that wouldn't explain why the score gets *higher* when I deliberately include the test data in the training data. What's going on?
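For context on why a PPL of exactly 1.0 is suspicious: as I understand KenLM's bin/query output, perplexity is 10 raised to the negative average per-token log10 probability. A minimal sketch of that formula (the perplexity helper below is mine for illustration, not KenLM code) shows that PPL hits exactly 1.0 only when every single token is scored with probability 1, which no smoothed n-gram model should do on held-out text:

```python
import math

def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities: 10^(-avg log10 p)."""
    avg = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg)

# PPL == 1.0 only if every token had probability exactly 1 (log10 p == 0)
print(perplexity([0.0, 0.0, 0.0]))        # 1.0
# Four tokens each scored p = 0.5 -> PPL of about 2
print(perplexity([math.log10(0.5)] * 4))
```

So a reported 1.0 usually points at something degenerate in the evaluation (e.g. the query consuming no real tokens), not at a genuinely perfect model.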

=============================

This also seems strange:

Perplexity including OOVs:  1
Perplexity excluding OOVs:  0.795685
OOVs:   0
0
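That "excluding OOVs" value is even stranger: if perplexity is 10^(-average log10 probability), a value below 1 requires a *positive* average log10 probability, i.e. token "probabilities" greater than 1, which is impossible for a properly normalized model. A hedged sketch of that arithmetic (the helper is hypothetical, not KenLM code):

```python
import math

def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities: 10^(-avg log10 p)."""
    avg = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg)

# A "probability" above 1 is the only way to drive perplexity below 1
print(perplexity([math.log10(1.26)]))  # below 1.0 -- impossible for a real LM
```

So 0.795685 suggests the numbers being averaged aren't genuine conditional probabilities, again hinting the query run isn't measuring what it appears to.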
