Zipf Distribution based number generation

759 Views Asked by At

I want to generate a popularity distribution for a small data set, which should follow Zipf law.

The available parameters are:
Total number of viewers : 1 Million
Total number of videos : 36

I want to associate total number of viewers to each video according to Zipf law. For instance, how many of viewers will watch video1, video2 .. and so on.
Can anyone please tell me the formula or methodology?

1

There are 1 best solutions below

3
On

The Wikipedia article on Zipf's law includes a few descriptions of the distribution, including some ways to calculate it:

https://en.wikipedia.org/wiki/Zipf%27s_law

The first equation under the Theoretical review section might help. Using this, we can write a short Python script to associate the total number of viewers for each of 36 videos:

N_elements = 1000000
video_exponent = 1

distribution_sum = 0
total_viewers = 0


# First, add up the relative number of viewers across all 36 movie ranks
for k_rank in range(1,36):

    sum = 0
    for n in range(1, N_elements):
        sum = sum + 1/(n**video_exponent)

    distribution_sum = distribution_sum + (1/(k_rank**video_exponent))/sum


# Next, distribute the number of viewers so that the total comes to 1,000,000
print("Movie Rank | # of Viewers")
for k_rank in range(1,36):

    sum = 0
    for n in range(1, N_elements):
        sum = sum + 1/(n**video_exponent)

    viewers_at_k_rank = round((N_elements/(k_rank**video_exponent))/(sum * distribution_sum))

    print(k_rank, end="|")
    print(viewers_at_k_rank)

    total_viewers = total_viewers + viewers_at_k_rank


print("\nSum of all viewers accounted for so far, to make sure we're at 1,000,000")
print(total_viewers)

The result adds up to 1,000,002 viewers, but that's not a big deal. Why isn't it a big deal, you ask? It seems that while many different things follow a general Zipfian distribution, they tend to be slightly different based on what type of thing it is. The video_exponent variable can be adjusted so that the Zipfian distribution simulated above can more closely match actual video statistics. The difference is usually much greater than 2 in 1,000,000.

You can get an idea of what the real-world video_exponent is by finding some real ranked videos, and adjusting video_exponent and N_elements until the code matches the real numbers. Then, reset N_elements back to 1,000,000, and you'll have a realistic video-watching data set.