Explanation on HyperLogLog algorithm

533 Views Asked by At

First of all let me start off by saying that I read this question.

So as I was strolling through the internet and I came across that algorithm and I was wondering how it worked. After reading about it I did understand how it counts the views by hashing and using bits.

What I haven't quite understand yet, is how can be sure to avoid counting the same view again. Do we store each hashed value we come across and before incrementing the count check if it already exists in our array or whatever?

Doesn't that make it a lot less efficient if we have 1000k+ items?

1

There are 1 best solutions below

0
On

The cool thing about HyperLogLog is that you do not have to store the entire array you have seen which would be O(n), and not even the unique values. What you do need to store is of oreder O(log(log(n)) which is much lower.

Basically, if two objects have the same value, then their hash will be the same. This means that the leading bits will also be the same. So having multiple objects with the same value won't affect the computation at all.

This fact also allows easy parallelism - you can divide your population, and calculate the max separately, combining them later by calculating the maximum of your separate maxes.