How to compute a simple maximum likelihood LM with SRILM


I want to build a simple maximum likelihood language model, i.e. p(w | w_history) = c(w_history, w) / c(w_history) and nothing else, without any tricks like smoothing. I am using a small corpus on purpose, so that I can check the computed numbers against the ones I get by hand.

$ cat abcd.txt
a b c d
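
For reference, the hand calculation described above can be sketched in Python. This is only a minimal sketch under stated assumptions, not SRILM's implementation: the function name `mle_ngram_probs` is made up for illustration, each line is assumed to be one sentence padded with a single `<s>` and `</s>`, and `<s>` is assumed to be excluded from the unigram denominator because it is never predicted.

```python
from collections import Counter
from math import log10

def mle_ngram_probs(sentences, order=3):
    """Unsmoothed estimates p(w | history) = c(history, w) / c(history)."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Unigram denominator: every token except <s> (assumption, see above).
    total = sum(c for ng, c in counts.items()
                if len(ng) == 1 and ng != ("<s>",))
    probs = {}
    for ngram, c in counts.items():
        if ngram == ("<s>",):
            continue  # <s> is only ever used as history
        denom = total if len(ngram) == 1 else counts[ngram[:-1]]
        probs[ngram] = c / denom
    return counts, probs

counts, probs = mle_ngram_probs(["a b c d"])
for ngram in sorted(probs, key=lambda g: (len(g), g)):
    # Print the n-gram, its probability, and the log10 value an ARPA file would hold.
    print(" ".join(ngram), probs[ngram], round(log10(probs[ngram]), 5))
```

On this corpus every unigram other than `<s>` gets 1/5, and every bigram and trigram gets 1, since each history is followed by exactly one word.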

I know that SRILM smooths by default, and that I therefore need to pass -addsmooth 0.

The command I use is:

ngram-count -order 3 -text abcd.txt -lm abcd.arpa -addsmooth 0 -write abcd.counts

The counts file matches my expectations:

<s> 1
<s> a   1
<s> a b 1
a   1
a b 1
a b c   1
b   1
b c 1
b c d   1
c   1
c d 1
c d </s>    1
d   1
d </s>  1
</s>    1

but the generated language model does not:


\data\
ngram 1=6
ngram 2=5
ngram 3=0

\1-grams:
-0.69897    </s>
-99 <s> -99
-0.69897    a   -99
-0.69897    b   -99
-0.69897    c   -99
-0.69897    d   -99

\2-grams:
0   <s> a
0   a b
0   b c
0   c d
0   d </s>

\3-grams:

\end\
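
To compare these values with hand-computed probabilities, note that ARPA files store base-10 log probabilities, so each value x corresponds to a probability of 10**x. The sketch below relies on that convention. `parse_arpa_probs` is a hypothetical helper, not part of SRILM, and `ARPA_TEXT` is just an excerpt of the model above.

```python
def parse_arpa_probs(text):
    """Map each n-gram (tuple of words) to its probability, 10 ** log10prob."""
    probs = {}
    order = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line in ("\\data\\", "\\end\\") or line.startswith("ngram "):
            continue
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section header such as "\2-grams:" sets the current n-gram order.
            order = int(line[1:line.index("-")])
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])  # any trailing field is the backoff weight
        probs[words] = 10 ** logprob
    return probs

ARPA_TEXT = r"""
\1-grams:
-0.69897    </s>
-99 <s> -99
-0.69897    a   -99
\2-grams:
0   <s> a
0   a b
"""

arpa_probs = parse_arpa_probs(ARPA_TEXT)
print(arpa_probs[("a",)], arpa_probs[("<s>", "a")])
```

Read this way, the unigram entries come out at roughly 0.2 and the bigram entries at 1, which agrees with c(w_history, w) / c(w_history) for the unigrams and bigrams of this corpus. Only the 3-grams are missing.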

  1. There are no 3-grams listed, despite them being present in the counts. Is that because they do not occur more often than some threshold? I see no indication of that behavior on the manpage. Is there a way to get the 3-grams computed as well?

  2. It seems that backoff weights are assigned. Can this be suppressed?
