I have a dataset of around 200 MB containing only integers. My goal is to use FP-trees to compress this dataset and reduce the final output file size. This is for academic purposes, so I can't use general-purpose compression software and am restricted to FP-trees. The size of the output file is determined by the total count of integers in the output plus the size of the associated mapping table. To illustrate, consider this example:
T1 - 1, 2, 3, 4, 5
T2 - 1, 2, 3, 4, 6
T3 - 1, 2, 3, 4, 5, 7
T4 - 1, 2, 3, 4, 5, 6, 7
With a mapping like:
{
10 : {1, 2, 3, 4},
11 : {5, 7},
}
The transformed dataset becomes:
T1 - 10, 5
T2 - 10, 6
T3 - 10, 11
T4 - 10, 11, 6
However, alternative mappings are possible, like:
{10: {1, 2}, 11: {3, 4, 5}, 12: {1, 2, 3}, 13: {4, 5}}
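For reference, the cost I'm trying to minimize can be computed roughly like this (a minimal sketch; I'm assuming the mapping table costs one integer per key plus one per stored item):

```python
def output_size(transformed, mapping):
    """Total integers written out: every integer in the transformed
    transactions plus every key and item stored in the mapping table."""
    data_ints = sum(len(t) for t in transformed)
    table_ints = sum(1 + len(items) for items in mapping.values())
    return data_ints + table_ints

# The first example above: 4 transformed transactions and a 2-entry mapping.
mapping = {10: {1, 2, 3, 4}, 11: {5, 7}}
transformed = [[10, 5], [10, 6], [10, 11], [10, 11, 6]]
print(output_size(transformed, mapping))  # 9 data ints + 8 table ints = 17
```

The original four transactions contain 23 integers, so under this cost model the first mapping already saves 6 integers.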
I've successfully implemented the FP-tree algorithm to identify frequent itemsets along with their counts. However, because constructing an FP-tree over the full dataset takes a long time, I've divided the dataset into smaller chunks and mine each chunk separately.
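To make the setup concrete, this is roughly what the chunked mining step looks like (a simplified sketch; I'm using mlxtend's `fpgrowth` here as a stand-in for my own FP-tree implementation, and the chunk size and `min_support` values are placeholders):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

def mine_chunk(transactions, min_support=0.05):
    """Mine frequent itemsets (with counts) from one chunk of transactions."""
    te = TransactionEncoder()
    onehot = te.fit(transactions).transform(transactions)
    df = pd.DataFrame(onehot, columns=te.columns_)
    itemsets = fpgrowth(df, min_support=min_support, use_colnames=True)
    # Convert relative support back to absolute counts for this chunk.
    itemsets["count"] = (itemsets["support"] * len(transactions)).round().astype(int)
    return itemsets

def chunks(transactions, size=50_000):
    for i in range(0, len(transactions), size):
        yield transactions[i:i + size]

# transactions: list of integer lists loaded from the 200 MB file
# per_chunk_itemsets = [mine_chunk(c) for c in chunks(transactions)]
```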
My main objective is to convert each transaction into a set of keys from the mapping so that the overall size shrinks. The difficulty lies in identifying which segments within each transaction to replace with a key so that the replacements actually reduce the total size once the mapping table itself is accounted for. I'm looking for effective strategies or heuristics for making these selections and optimizing the compression, and I welcome any insights or recommendations.
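To illustrate the replacement step I mean, here is the only approach I can think of so far: a naive greedy pass that tries the largest mapped itemsets first and replaces them only when fully contained in what's left of the transaction (assuming transactions are duplicate-free, as in the example). This is exactly the part where I suspect better heuristics exist:

```python
def encode_transaction(transaction, mapping):
    """Greedily replace subsets of a transaction with mapping keys.

    mapping: {key: set_of_items}. Returns the encoded transaction
    (keys plus any leftover raw items).
    """
    remaining = set(transaction)
    encoded = []
    # Try the biggest itemsets first -- a heuristic, not necessarily optimal.
    for key, items in sorted(mapping.items(), key=lambda kv: -len(kv[1])):
        if items <= remaining:
            encoded.append(key)
            remaining -= items
    return encoded + sorted(remaining)

mapping = {10: {1, 2, 3, 4}, 11: {5, 7}}
print(encode_transaction([1, 2, 3, 4, 5, 6, 7], mapping))  # [10, 11, 6]
```

The problem is that "largest first" says nothing about which itemsets should be in the mapping in the first place, or whether a smaller itemset used across many transactions beats a larger one used rarely.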