How can I create a set data structure which compresses a sorted array of longs?
Goal:
Must support the following operations:
- insertion of new numbers
- deletion of a certain value
- intersection with another set
- union with another set
- Sequence should consume as little memory as possible
Characteristics about the sequence:
- Derived from the postings list of an inverted index
- All the numbers are unique (no repeats)
- 64 bit longs
- Original input length of sequence can range from very small (3 numbers) to extremely large (100k numbers)
- The numbers are often close in proximity (almost consecutive), but then suddenly make a big jump e.g. 1, 2, 3, 1000, 1001, 1002, 15001, 15002
My thinking:
- Create a red-black-tree-based set data structure that supports insertion/deletion/intersection/union
- Numbers would be represented as nodes in the tree, but that would consume more space than the original array! How could I compress the numbers while still allowing insertion (compressing a new value in) and deletion (removing a value from the compressed sequence) when needed?
I would try a simple bitmap (by position, a 1 bit means the value is in the set, a 0 means it's not), with long runs of 0 bits run-length compressed. That enables very fast setting and unsetting of bits, as well as very fast intersections and unions using bit operations, `&` and `|` respectively.

One way to implement this is as a set of bit vectors, each storing its starting number and length. You would leave out any all-zero bit vector longer than some cutoff you choose. Experiment with that cutoff to find what works best with your data; it will depend on how big your "big jumps" tend to be. The bit vectors that remain will always start with a 1 and end with a 1, so you could leave those bits out of the representation as well.
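A minimal sketch of that segment idea (names like `GAP_CUTOFF` and `build_segments` are my own, and Python ints stand in for real bit vectors):

```python
# Each segment is (start, bits): bit i of 'bits' set means (start + i)
# is in the set.  A gap wider than GAP_CUTOFF starts a new segment
# instead of storing a long run of zero bits.

GAP_CUTOFF = 256  # tune this from a histogram of your zero-run lengths

def build_segments(sorted_values):
    """Build [(start, bits), ...] from a sorted list of unique values."""
    segments = []
    prev = None
    for v in sorted_values:
        if prev is not None and v - prev <= GAP_CUTOFF:
            start, bits = segments[-1]
            segments[-1] = (start, bits | (1 << (v - start)))
        else:
            segments.append((v, 1))  # new segment, its first bit set
        prev = v
    return segments

def contains(segments, v):
    """Membership test: find the covering segment and test one bit."""
    for start, bits in segments:
        if start <= v < start + bits.bit_length():
            return bool((bits >> (v - start)) & 1)
    return False

def intersect(a, b):
    """Intersect two segment lists by aligning overlapping segments
    and AND-ing their bit vectors (quadratic sketch; a real version
    would walk both sorted lists in one pass)."""
    out = []
    for s1, b1 in a:
        for s2, b2 in b:
            lo = max(s1, s2)
            hi = min(s1 + b1.bit_length(), s2 + b2.bit_length())
            if lo < hi:
                bits = (b1 >> (lo - s1)) & (b2 >> (lo - s2))
                bits &= (1 << (hi - lo)) - 1
                if bits:
                    # renormalize so the segment starts with a 1 bit
                    shift = (bits & -bits).bit_length() - 1
                    out.append((lo + shift, bits >> shift))
    return out
```

With your example input `[1, 2, 3, 1000, 1001, 1002, 15001, 15002]` and a cutoff of 256, this produces three small segments instead of a 15002-bit map.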
I would start by making a histogram of the lengths of runs of zero bits in a bit-set representation of your data. If what you're saying is correct, then the distribution should be strongly bimodal, and you can pick your cutoff in the middle of the dip.