I need to store top k most frequent elements in a stream. To estimate the frequency i use count-min-sketch algorithms. My stream is consist of keys(as string). So basically every time i encounter a new key in my stream, i calculate the frequency of current key so far by looking into the count-min-sketch data structure. However i am unable to store the top k most frequent keys.
My first idea was store them in a min-heap with fixed size of k. And i store [frequency, key] inside this min-heap with comparator comnpare frequency. So every time i get a frequency of a key, i try to see if the heap size is over k, if it is, then i compare frequency of current key with the top(smallest) frequency in min-heap, if my current key's frequency is larger then i pop the top, and insert my key into the heap.
However i realize the min-heap is not a set, which means it allows duplication. Let's say i have a very hot key, and i keep counting it in the stream, so every time i insert this [frequency, key] into the heap, eventually my heap will be full of the same key, only different frequencies.
So i am wondering is there a good way to store the top k distinct more frequent element in count-min-sketch.
Makes sense to maintain a hashmap of
[key,<frequency,position>]pairs already in the heap.positionrefers to the index of the key within the heap (assuming array-based heap). When the key arrives, you check 2 conditions:-the key is in the hashmap
-its frequency has changed
If both are true, you find the key within the heap in
O(1)time since you already have its position stored in the hashmap, then modify the key's frequency and depending on whether the frequency increased or decreased, you do bubble-down or bubble-up from that position (O(logn)). After altering the position, you update the hashmap entry for that key with new frequency and position values.If 1st one is false, you follow a usual logic, i.e. compare the key with root, if heap is full and the key's frequency is larger than the root's frequency then you pop the root out of heap and remove it from hashmap, and meanwhile insert the key into the heap and hashmap.
If key is in the hashmap but its frequency has not been changed you do nothing.