How can I create a set data structure which compresses a sorted array of longs?
Goal:
Must support the following operations:
- insertion of new numbers
- deletion of a certain value
- intersection with another set
- union with another set
- Sequence should consume as little memory as possible
Characteristics about the sequence:
- Derived from the postings list of an inverted index
- All the numbers are unique (no repeats)
- 64 bit longs
- Original input length of sequence can range from very small (3 numbers) to extremely large (100k numbers)
- The numbers are often close in proximity (almost consecutive), but then suddenly make a big jump e.g. 1, 2, 3, 1000, 1001, 1002, 15001, 15002
My thinking:
- Create a red-black-tree-based set data structure that supports insertion/deletion/intersection/union
- Numbers would be represented as nodes in the tree, but that would consume more space than the original array! How could I compress the numbers while still allowing insertion (compressing a new value in) and deletion (removing a value from the compressed sequence) when needed?
I would try a simple bitmap (by position, a 1 bit means the value is in the set, a 0 means it's not), with long runs of 0 bits run-length compressed. That enables very fast setting and unsetting of bits, as well as very fast intersections and unions using bit operations, `&` and `|` respectively.

One way to implement this is as a set of bit vectors, each storing its starting number and length. You would leave out any all-zero bit vector longer than some cutoff you choose. Experiment with that cutoff to find what works best with your data; it will depend on how big your "big jumps" tend to be. The bit vectors that remain will always start with a 1 and end with a 1, so you could leave those bits out of the representation as well.
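A minimal sketch of that segment idea (names like `GAP_CUTOFF` and `build_segments` are my own, and Python ints stand in for real bit vectors):

```python
# Each segment is (start, bits): bit i of 'bits' set means (start + i)
# is in the set.  A gap wider than GAP_CUTOFF starts a new segment
# instead of storing a long run of zero bits.

GAP_CUTOFF = 256  # tune this from a histogram of your zero-run lengths

def build_segments(sorted_values):
    """Build [(start, bits), ...] from a sorted list of unique values."""
    segments = []
    prev = None
    for v in sorted_values:
        if prev is not None and v - prev <= GAP_CUTOFF:
            start, bits = segments[-1]
            segments[-1] = (start, bits | (1 << (v - start)))
        else:
            segments.append((v, 1))  # new segment, its first bit set
        prev = v
    return segments

def contains(segments, v):
    """Membership test: find the covering segment and test one bit."""
    for start, bits in segments:
        if start <= v < start + bits.bit_length():
            return bool((bits >> (v - start)) & 1)
    return False

def intersect(a, b):
    """Intersect two segment lists by aligning overlapping segments
    and AND-ing their bit vectors (quadratic sketch; a real version
    would walk both sorted lists in one pass)."""
    out = []
    for s1, b1 in a:
        for s2, b2 in b:
            lo = max(s1, s2)
            hi = min(s1 + b1.bit_length(), s2 + b2.bit_length())
            if lo < hi:
                bits = (b1 >> (lo - s1)) & (b2 >> (lo - s2))
                bits &= (1 << (hi - lo)) - 1
                if bits:
                    # renormalize so the segment starts with a 1 bit
                    shift = (bits & -bits).bit_length() - 1
                    out.append((lo + shift, bits >> shift))
    return out
```

With your example input `[1, 2, 3, 1000, 1001, 1002, 15001, 15002]` and a cutoff of 256, this produces three small segments instead of a 15002-bit map.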
I would start by making a histogram of the lengths of runs of zero bits in a bit-set representation of your data. If what you're saying is correct, then the distribution should be strongly bimodal, and you can pick your cutoff in the middle of the dip.