Given 100 million documents containing 9 million terms, how do you calculate the total number of posting entries using a simple Zipf approximation?
My approach:
Zipf approximation: the i-th most frequent word has probability 1/(i * ln N).
Substituting N = 9,000,000 (so ln N is about 16) gives the probabilities:
word1 -> 1/16
word2 -> 1/(2*16)
...
wordN -> 1/(N*16), where N = 9,000,000
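
To double-check these numbers, here is a minimal Python sketch. It assumes the constant 16 comes from ln(N), which is how I read the approximation; the function name zipf_probability is just something I made up for illustration.

```python
import math

def zipf_probability(rank, vocab_size):
    """Probability of the rank-th most frequent term.

    Assumes the simple approximation P(rank) = 1 / (rank * ln(vocab_size)).
    """
    return 1.0 / (rank * math.log(vocab_size))

N = 9_000_000  # number of terms in the vocabulary

# ln(N) is the constant that shows up as "16" in the list above
print("ln(N) =", math.log(N))

# Probabilities for the first, second, and last-ranked terms
for rank in (1, 2, N):
    print(rank, zipf_probability(rank, N))
```

This only reproduces the per-term probabilities; it does not yet say anything about how many documents each term appears in.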
But how do I proceed from here? Do I assume all words are distributed equally throughout the documents? The total word count is also unknown, so the probabilities alone do not help in any way.
Any help would be appreciated.