Calculate the number of posting entries using the Zipf approximation


Given 100 million documents containing 9 million distinct terms, how do you calculate the total number of posting entries using a simple Zipf approximation?

My approach:

Zipf approximation:

[image: Zipf approximation formula]

Substituting N = 9,000,000 into the formula gives the following probabilities:

word1 -> 1/16
word2 -> 1/(2*16)
...
wordN -> 1/(N*16), where N = 9,000,000
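
As a sanity check on the constant 16, here is a minimal sketch of where it could come from. The formula itself is not reproduced in the text, so the form p_i = 1 / (i * c) is inferred from the values listed; the 16 is consistent with c = ln(N), since ln(9,000,000) is about 16.01. A normalised Zipf distribution would use the harmonic number H_N, approximated here as ln(N) + gamma, which is about 16.59. The function names are hypothetical and only for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic_approx(n: int) -> float:
    """Approximate H_n = 1 + 1/2 + ... + 1/n by ln(n) + gamma."""
    return math.log(n) + EULER_GAMMA


def zipf_probability(rank: int, n_terms: int) -> float:
    """Probability of the term at `rank`, assuming p_i = 1 / (i * H_n)."""
    return 1.0 / (rank * harmonic_approx(n_terms))


n_terms = 9_000_000
print(f"ln(N)         = {math.log(n_terms):.2f}")
print(f"ln(N) + gamma = {harmonic_approx(n_terms):.2f}")
for rank in (1, 2, n_terms):
    print(f"word{rank} -> {zipf_probability(rank, n_terms):.3e}")
```

With the harmonic-number constant the probabilities sum to approximately 1; with ln(N) alone they sum to slightly more than 1.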

But how do I proceed from here? Do I assume all words are distributed equally across the documents? The total word count is also unknown, so the probabilities alone do not help in any way.
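
To make the difficulty concrete, here is a rough sketch of one possible model, not a confirmed solution. It assumes something the problem does not state: every document holds the same hypothetical number of tokens, `doc_len`, each drawn independently from the Zipf distribution. Under that assumption a term with probability p appears in a given document with probability 1 - (1 - p)^doc_len, and the expected total number of postings is the sum of that quantity over all terms, multiplied by the number of documents. The function name and the demo values are illustrative only.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_postings(n_docs: int, n_terms: int, doc_len: int) -> float:
    """Expected posting entries, assuming i.i.d. Zipf tokens per document.

    Assumptions (not given in the problem): every document has exactly
    `doc_len` tokens, and p_i = 1 / (i * (ln(n_terms) + gamma)).
    Requires n_terms >= 2 so that every p_i is below 1.
    """
    norm = math.log(n_terms) + EULER_GAMMA
    total = 0.0
    for rank in range(1, n_terms + 1):
        p = 1.0 / (rank * norm)
        # P(term occurs at least once in a document) = 1 - (1 - p)^doc_len,
        # computed with log1p/expm1 to stay accurate for very small p.
        p_in_doc = -math.expm1(doc_len * math.log1p(-p))
        total += n_docs * p_in_doc
    return total


# Scaled-down demo values so the loop finishes quickly; doc_len is a guess.
for doc_len in (10, 100, 1000):
    estimate = expected_postings(n_docs=1_000, n_terms=10_000, doc_len=doc_len)
    print(f"doc_len={doc_len}: about {estimate:,.0f} postings")
```

The estimate changes substantially with `doc_len`, which is why the term probabilities alone are not enough: some assumption about document length or total token count seems to be required.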

Any help would be appreciated.
