I want to generate a popularity distribution for a small data set, which should follow Zipf
law.
The available parameters are:
Total number of viewers : 1 Million
Total number of videos : 36
I want to associate total number of viewers to each video according to Zipf
law. For instance, how many of viewers will watch video1, video2 .. and so on.
Can anyone please tell me the formula or methodology?
The Wikipedia article on Zipf's law includes a few descriptions of the distribution, including some ways to calculate it:
https://en.wikipedia.org/wiki/Zipf%27s_law
The first equation under the Theoretical review section might help. Using this, we can write a short Python script to associate the total number of viewers for each of 36 videos:
The result adds up to 1,000,002 viewers, but that's not a big deal. Why isn't it a big deal, you ask? It seems that while many different things follow a general Zipfian distribution, they tend to be slightly different based on what type of thing it is. The video_exponent variable can be adjusted so that the Zipfian distribution simulated above can more closely match actual video statistics. The difference is usually much greater than 2 in 1,000,000.
You can get an idea of what the real-world video_exponent is by finding some real ranked videos, and adjusting video_exponent and N_elements until the code matches the real numbers. Then, reset N_elements back to 1,000,000, and you'll have a realistic video-watching data set.