If I want to get the unique count of a list of elements that can be added and deleted, is there a way to do that?
For example
add key1
delete key1
add key1
should give a unique count of 1
but if I use the naive method of two HLLs, one for deletes and one for adds, it returns 0.
Is there a way I can dedup keys in an HLL?
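(For concreteness, here is a sketch of why the naive two-sketch approach breaks, using exact Python sets as stand-ins for the two HLLs — an HLL only approximates a set's cardinality, so the failure is in the semantics, not the estimation error:)

```python
# Two exact sets standing in for the "added" and "deleted" HLLs.
added, deleted = set(), set()

added.add("key1")    # add key1
deleted.add("key1")  # delete key1
added.add("key1")    # add key1 again: the set (like an HLL) dedupes, so no change

# Naive estimate: |added| - |deleted|
print(len(added) - len(deleted))  # prints 0, but the true unique count is 1
```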
I do not see how to do it with HyperLogLog, but I do see how to do it with less efficient cardinality estimators.
Here is a simple cardinality estimator, which you can find in http://www.cse.unsw.edu.au/~cs9314/07s1/lectures/Lin_CS9314_References/fm85.pdf. Calculate the hash value of each element. Keep the smallest m hash values, and use the m'th smallest hash value itself to estimate the cardinality of the whole set: if hashes are scaled to lie in (0, 1) and the m'th smallest is x, the cardinality is roughly m / x. (Let's ignore hash collisions.)

Now here is an adaptation to handle some deletions. Keep the smallest 2m hash values, but still use the m'th smallest to estimate the cardinality of the whole set. If an element is deleted, just remove its hash value from the set. As long as your set doesn't fall in size by around a factor of 2, this should work pretty well.

And if you need to handle more? Add the idea of "ghost" elements. When you delete a hash value, add a "ghost" hash value where the 2m+1'th hash value would be expected to be. When you delete a real hash value, each "ghost" element has a random chance of being deleted with it, matching the fraction of real elements that got deleted. If a ghost is deleted, you insert more. If you insert enough real values that a ghost gets too big to be in the smallest 2m, you let it fall off like any other value.

The resulting algorithm needs more memory, but handles adds and deletes. It should be reasonably accurate even if a large fraction of your data gets deleted.
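A minimal sketch of the 2m-values scheme (my own naming; it omits the "ghost" refinement, so deleting a key whose hash is not among the tracked values is silently ignored):

```python
import bisect
import hashlib

class KMVEstimator:
    """k-minimum-values sketch with the keep-2m / estimate-from-m'th trick.

    Keeps the 2m smallest hash values seen so far and estimates the
    cardinality from the m'th smallest. Simplified sketch: no "ghost"
    elements, so accuracy degrades if the set shrinks by much more
    than a factor of 2.
    """

    def __init__(self, m=256):
        self.m = m
        self.capacity = 2 * m
        self.values = []  # sorted list of the tracked (smallest) hash values

    def _hash(self, key):
        # Deterministic hash of the key, mapped to a float in (0, 1].
        h = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:8], "big")
        return (h + 1) / 2.0**64

    def add(self, key):
        x = self._hash(key)
        i = bisect.bisect_left(self.values, x)
        if i < len(self.values) and self.values[i] == x:
            return  # already tracked: re-adding the same key dedupes naturally
        if len(self.values) < self.capacity:
            self.values.insert(i, x)
        elif x < self.values[-1]:
            self.values.pop()        # evict the largest tracked value
            self.values.insert(i, x)

    def delete(self, key):
        x = self._hash(key)
        i = bisect.bisect_left(self.values, x)
        if i < len(self.values) and self.values[i] == x:
            self.values.pop(i)       # only tracked hashes can be removed

    def estimate(self):
        if len(self.values) < self.m:
            return len(self.values)  # everything fits in the sketch: exact
        # The m'th smallest of n uniform (0,1) hashes sits near m / n,
        # so n is roughly m divided by that value.
        return round(self.m / self.values[self.m - 1])
```

With this, the question's sequence behaves as wanted: `add("key1")`, `delete("key1")`, `add("key1")` leaves exactly one tracked hash, so the estimate is 1. The parameter m trades memory for accuracy (relative error on the order of 1/sqrt(m)).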