I have 96 files, each containing ~10K lines of English text (tokenized, downcased). If I loop over the files (essentially k-fold cross-validation with k = number of files), build an LM with bin/lmplz on the other 95, and run bin/query on the held-out file against it, I see a PPL (including OOVs) of 1.0 every time. But if I query a file against an LM built from all 96 files (so the test doc is included in the LM's training data), I get a PPL of 27.8.
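For concreteness, here is a minimal sketch of the loop I'm running (a sketch only: the corpus directory, glob pattern, n-gram order, and ARPA filename below are placeholders, not my actual setup):

#!/usr/bin/env python3
# Sketch of the leave-one-file-out loop around KenLM's lmplz and
# query binaries. Paths, glob pattern, n-gram order, and the
# temporary ARPA filename are placeholders.
import subprocess
from pathlib import Path

files = sorted(Path("corpus").glob("*.txt"))  # the 96 text files

for held_out in files:
    # Concatenate the other 95 files as training text.
    train_text = "\n".join(
        f.read_text() for f in files if f != held_out
    )

    # lmplz reads training text on stdin and writes an ARPA LM to stdout.
    arpa = subprocess.run(
        ["bin/lmplz", "-o", "3"],  # order 3 chosen arbitrarily here
        input=train_text, capture_output=True, text=True, check=True,
    ).stdout
    Path("model.arpa").write_text(arpa)

    # query reads test sentences on stdin and prints per-word scores
    # followed by the "Perplexity including/excluding OOVs" summary.
    summary = subprocess.run(
        ["bin/query", "model.arpa"],
        input=held_out.read_text(), capture_output=True, text=True,
        check=True,
    ).stdout
    print(held_out.name)
    print("\n".join(summary.splitlines()[-4:]))  # the summary lines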
I have more experience with SRILM than with KenLM, but I've never seen a perplexity of 1, and something feels wrong about it. Even if I accepted the score and attributed it to the test document's sentences also appearing in the other 95 files, that wouldn't explain why the perplexity goes up (to 27.8) when I guarantee the test data is in the training data; if anything, it should go down. What's going on?
=============================
This also seems strange:
Perplexity including OOVs: 1
Perplexity excluding OOVs: 0.795685
OOVs: 0
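For reference, under the standard definition (which, as far as I know, is what query reports), perplexity is the reciprocal geometric mean of the per-token probabilities, so it can never drop below 1:

\mathrm{PPL} = \Bigl(\prod_{i=1}^{N} p(w_i \mid h_i)\Bigr)^{-1/N} = \exp\Bigl(-\tfrac{1}{N}\sum_{i=1}^{N}\ln p(w_i \mid h_i)\Bigr) \ge 1

since every p(w_i | h_i) <= 1. And with OOVs: 0, I'd expect the "including" and "excluding" numbers to be identical, so the 0.795685 looks doubly impossible to me.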