Zipf Distribution based number generation

Question

Zipf Distribution based number generation

752 Views Asked by yours At 16 August 2025 at 22:37

I want to generate a popularity distribution for a small data set, which should follow Zipf law.

The available parameters are:
Total number of viewers : 1 Million
Total number of videos : 36

I want to associate total number of viewers to each video according to Zipf law. For instance, how many of viewers will watch video1, video2 .. and so on.
Can anyone please tell me the formula or methodology?

Original Q&A

There are 1 best solutions below

**bey** · Answer 1

The Wikipedia article on Zipf's law includes a few descriptions of the distribution, including some ways to calculate it:

https://en.wikipedia.org/wiki/Zipf%27s_law

The first equation under the Theoretical review section might help. Using this, we can write a short Python script to associate the total number of viewers for each of 36 videos:

N_elements = 1000000
video_exponent = 1

distribution_sum = 0
total_viewers = 0


# First, add up the relative number of viewers across all 36 movie ranks
for k_rank in range(1,36):

    sum = 0
    for n in range(1, N_elements):
        sum = sum + 1/(n**video_exponent)

    distribution_sum = distribution_sum + (1/(k_rank**video_exponent))/sum


# Next, distribute the number of viewers so that the total comes to 1,000,000
print("Movie Rank | # of Viewers")
for k_rank in range(1,36):

    sum = 0
    for n in range(1, N_elements):
        sum = sum + 1/(n**video_exponent)

    viewers_at_k_rank = round((N_elements/(k_rank**video_exponent))/(sum * distribution_sum))

    print(k_rank, end="|")
    print(viewers_at_k_rank)

    total_viewers = total_viewers + viewers_at_k_rank


print("\nSum of all viewers accounted for so far, to make sure we're at 1,000,000")
print(total_viewers)

The result adds up to 1,000,002 viewers, but that's not a big deal. Why isn't it a big deal, you ask? It seems that while many different things follow a general Zipfian distribution, they tend to be slightly different based on what type of thing it is. The video_exponent variable can be adjusted so that the Zipfian distribution simulated above can more closely match actual video statistics. The difference is usually much greater than 2 in 1,000,000.

You can get an idea of what the real-world video_exponent is by finding some real ranked videos, and adjusting video_exponent and N_elements until the code matches the real numbers. Then, reset N_elements back to 1,000,000, and you'll have a realistic video-watching data set.

Zipf Distribution based number generation

There are 1 best solutions below

Related Questions in RANDOM

Related Questions in STATISTICS

Related Questions in DISTRIBUTION

Related Questions in ZIPF

Trending Questions

Popular # Hahtags

Popular Questions