I want to build a plain maximum-likelihood language model, i.e. p(w | w_history) = c(w_history, w) / c(w_history) and nothing else, without any tricks like smoothing. I am using a tiny corpus on purpose, so that I can check the computed numbers against the ones I calculate by hand.
$ cat abcd.txt
a b c d
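For reference, this is roughly how I compute the expected numbers by hand. It is only a minimal Python sketch: the function names ngram_counts and mle_prob are my own and have nothing to do with SRILM, and I am assuming that <s> is left out of the unigram denominator, which is what the log10(1/5) = -0.69897 unigram entries in the model further down suggest.

```python
import math
from collections import Counter


def ngram_counts(sentences, order=3):
    """Count all n-grams up to `order`, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def mle_prob(counts, history, word):
    """Unsmoothed estimate p(word | history) = c(history, word) / c(history)."""
    history = tuple(history)
    if history:
        denom = counts[history]
    else:
        # Assumption: for unigrams, normalise over all tokens except <s>,
        # since <s> is never predicted.
        denom = sum(c for ng, c in counts.items()
                    if len(ng) == 1 and ng != ("<s>",))
    if denom == 0:
        return 0.0  # unseen history: no estimate without smoothing or backoff
    return counts[history + (word,)] / denom


counts = ngram_counts(["a b c d"])
for ngram in sorted(k for k in counts if len(k) == 3):
    p = mle_prob(counts, ngram[:-1], ngram[-1])
    print(" ".join(ngram), p, math.log10(p))
```

With this one-sentence corpus every observed trigram has probability 1 (log10 = 0), so these are the four entries I would expect to find in the \3-grams: section.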
I know that SRILM smooths by default and that I therefore need to pass -addsmooth 0.
The command I use is:
ngram-count -order 3 -text abcd.txt -lm abcd.arpa -addsmooth 0 -write abcd.counts
The counts file matches my expectations:
<s> 1
<s> a 1
<s> a b 1
a 1
a b 1
a b c 1
b 1
b c 1
b c d 1
c 1
c d 1
c d </s> 1
d 1
d </s> 1
</s> 1
but the generated language model does not:
\data\
ngram 1=6
ngram 2=5
ngram 3=0
\1-grams:
-0.69897 </s>
-99 <s> -99
-0.69897 a -99
-0.69897 b -99
-0.69897 c -99
-0.69897 d -99
\2-grams:
0 <s> a
0 a b
0 b c
0 c d
0 d </s>
\3-grams:
\end\
There are no 3-grams listed, despite them being present in the counts file. Is that because they do not occur more often than some threshold? I see no indication of such behavior on the man page. Is there a way to get the 3-grams computed as well?
It also seems that backoff weights are assigned (the -99 in the third column of the 1-grams). Can this be suppressed?