English is not my native language, sorry for any grammatical mistakes.
I have seen many documents about add-one smoothing in language models, and I am still very confused about the variable V in the formula:
P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
Take this example corpus, with a bigram model:
<s> John read Moby Dick </s>
<s> Mary read a different book </s>
<s> She read a book by Cher </s>
If I want to calculate any P(w_i | w_{i-1}), then V will be 11, because there are 11 word types that can fill the w slot of a [w_{i-1}, w] combination (John, read, Moby, Dick, Mary, a, different, book, She, by, Cher). But I found that this does not include the case [w_{i-1}, </s>] (which would make V 12). Why don't we need to include this case? Isn't it possible that w_{i-1} is at the end of an article or sentence?
There's a nice tutorial here: https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
Consider an ngram language model (without smoothing); for bigrams:

P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
In code:
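Here is a minimal sketch of what that could look like, assuming plain Counter-based counts over the padded sentences from the question (the `corpus` list and the `prob` helper are just illustrative names):

```python
from collections import Counter
from itertools import chain

# Toy corpus from the question, with sentence boundary markers.
corpus = [
    "<s> John read Moby Dick </s>".split(),
    "<s> Mary read a different book </s>".split(),
    "<s> She read a book by Cher </s>".split(),
]

# Count unigrams and bigrams over the padded sentences.
unigram_counts = Counter(chain.from_iterable(corpus))
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def prob(word, prev_word):
    """Unsmoothed (maximum likelihood) bigram probability P(word | prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(f"P(read | John) = {prob('read', 'John'):.4f}")
print(f"P(a | read) = {prob('a', 'read'):.4f}")
print(f"P(Cher | read) = {prob('Cher', 'read'):.4f}")
```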
[out]:
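(Output of the sketch above.)

```
P(read | John) = 1.0000
P(a | read) = 0.6667
P(Cher | read) = 0.0000
```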
With add-one smoothing, aka Laplace smoothing:

P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + |V|)

where |V| is the size of the vocabulary, i.e. the number of word types (usually without <s> and </s>). So in code:
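Continuing the sketch above (`prob_add1` is just an illustrative name; it reuses `unigram_counts` and `bigram_counts` from the previous snippet):

```python
def prob_add1(word, prev_word):
    """Bigram probability with add-one (Laplace) smoothing."""
    # |V| = number of word types, excluding the <s> and </s> markers.
    V = len(unigram_counts) - 2
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(f"P(read | John) = {prob_add1('read', 'John'):.4f}")  # (1 + 1) / (1 + 11)
print(f"P(a | read) = {prob_add1('a', 'read'):.4f}")        # (2 + 1) / (3 + 11)
print(f"P(Cher | read) = {prob_add1('Cher', 'read'):.4f}")  # (0 + 1) / (3 + 11)
```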
[out]:
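(Output of the smoothed version above.)

```
P(read | John) = 0.1667
P(a | read) = 0.2143
P(Cher | read) = 0.0714
```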
Note: len(unigram_counts) - 2 accounts for removing <s> and </s> from the number of words in the vocabulary.

The above is the how.
Q: Why doesn't |V| take <s> and </s> into account?

A: One possible reason is that we never consider empty sentences in language models, so <s> and </s> can't stand by themselves, and the vocabulary |V| excludes them.

Q: Is it okay to add them to |V|?

A: Actually, if |V| is sufficiently large, having +2 for <s> and </s> makes little difference. As long as |V| is fixed and used consistently in all the computations, and it's sufficiently large, the language model probabilities of any sentence relative to another sentence under the same language model shouldn't be too different.
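A quick numerical illustration of that last point, using c(read, a) = 2 and c(read) = 3 from the toy corpus but a made-up, "sufficiently large" vocabulary size of 10000:

```python
# Add-one probability of "a" given "read", with and without the +2 for <s> and </s>,
# using a hypothetical large vocabulary size (10000 is made up for illustration).
count_bigram, count_prev = 2, 3   # c(read, a) and c(read) from the toy corpus
for V in (10000, 10000 + 2):
    print(f"V = {V}: {(count_bigram + 1) / (count_prev + V):.8f}")
# V = 10000: 0.00029991
# V = 10002: 0.00029985
```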