language model with SRILM


I'm trying to build a language model using SRILM. I have a list of phrases and I create the model using:

./ngram-count -text corpus.txt -order 3 -ukndiscount -interpolate -unk -lm corpus.lm

After this I tried some examples to see the probabilities of different phrases, and it turned out that `<unk>` has a log probability of -0.9.

The problem is that some words seen in the training data get a lower log probability than `<unk>`. For example, "abatantuono" appears 5 times in the training data, yet its log probability is -4.8.

I find this strange: the phrase `<s> <unk> </s>` comes out more probable than `<s> abatantuono </s>`, even though the 3-gram `<s> abatantuono </s>` is present in the training set!

This can be seen here:

 % ./ngram -lm corpus.lm -ppl ../../../corpus.txt.test -debug 2 -unk
 reading 52147 1-grams
 reading 316818 2-grams
 reading 91463 3-grams
 abatantuono
     p( abatantuono | <s> )     = [2gram] 1.6643e-05 [ -4.77877 ]
     p( </s> | abatantuono ...)     = [3gram] 0.717486 [ -0.144186 ]
 1 sentences, 1 words, 0 OOVs
 0 zeroprobs, logprob= -4.92296 ppl= 289.386 ppl1= 83744.3

 abatantonno
     p( <unk> | <s> )   = [1gram] 0.00700236 [ -2.15476 ]
     p( </s> | <unk> ...)   = [1gram] 0.112416 [ -0.949172 ]
 1 sentences, 1 words, 0 OOVs
 0 zeroprobs, logprob= -3.10393 ppl= 35.6422 ppl1= 1270.36

 file ../../../corpus.txt.test: 2 sentences, 2 words, 0 OOVs
 0 zeroprobs, logprob= -8.02688 ppl= 101.56 ppl1= 10314.3
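The sentence-level log probabilities in the debug output are just the sums of the per-token log10 terms, so the comparison is easy to verify by hand (the numbers below are copied from the output above):

```python
# Per-token log10 probabilities reported by SRILM's -debug 2 output.
seen_rare = [-4.77877, -0.144186]   # <s> abatantuono </s>
unknown   = [-2.15476, -0.949172]   # <s> <unk> </s>

logprob_seen = sum(seen_rare)   # -4.92296, matches SRILM's reported total
logprob_unk  = sum(unknown)     # -3.10393, matches SRILM's reported total

# The unknown-word sentence is roughly 66x more probable than the
# sentence containing the word actually seen in training.
print(logprob_seen, logprob_unk, 10 ** (logprob_unk - logprob_seen))
```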

What do you think the problem could be?

Thank you

Accepted answer:

This is a known problem with SRILM (see Kenneth Heafield's thesis, the footnote on page 30, and his website notes on SRILM). The way SRILM allocates probability mass to the unknown word can give it a higher probability than rare words actually seen in the training data. You could take a look at the KenLM package: it implements only modified Kneser-Ney smoothing (which generally performs better than plain Kneser-Ney), but it allocates mass to unknown words in a way that prevents this from happening.
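If you want to try KenLM, a minimal invocation looks roughly like this (the paths are assumptions; adjust them to wherever you built KenLM):

```shell
# Estimate a trigram model with modified Kneser-Ney smoothing
# (KenLM's lmplz tool; -o sets the n-gram order).
bin/lmplz -o 3 < corpus.txt > corpus.arpa

# Optionally compile the ARPA file into KenLM's binary format
# for faster loading and querying.
bin/build_binary corpus.arpa corpus.binary
```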