I have a set of JSON files containing run-length encoded data in the following format:
{
    "Name": "Column A",
    "data": [
        {
            "Value": 15,
            "Count": 2
        },
        {
            "Value": 9,
            "Count": 6
        },
        {
            "Value": 3,
            "Count": 5
        }
    ]
}
Each JSON file stores the data for one column of the Parquet file I eventually want to create. The "Value" field corresponds to an enum, and the "Count" field represents how many times to repeat the value when decoding. The decoded data for the example above would be: 15, 15, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3.
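For concreteness, decoding a single column just means repeating each value Count times; a minimal sketch (the file name is a placeholder):

import json

# "column_a.json" is a placeholder name for one of the per-column files.
with open("column_a.json") as f:
    column = json.load(f)

# Expand each (Value, Count) pair into Count repeats of Value.
decoded = []
for entry in column["data"]:
    decoded.extend([entry["Value"]] * entry["Count"])

print(decoded)  # [15, 15, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3]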
Currently I am expanding all of the data and then writing it out to Parquet using PyArrow.
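My current approach looks roughly like this (column names, file names, and dtypes are simplified placeholders):

import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed layout: one JSON file per column, and all columns decode to the same length.
column_files = {"Column A": "column_a.json", "Column B": "column_b.json"}

arrays = {}
for name, path in column_files.items():
    with open(path) as f:
        col = json.load(f)
    values = np.array([d["Value"] for d in col["data"]], dtype=np.int32)
    counts = np.array([d["Count"] for d in col["data"]], dtype=np.int64)
    # np.repeat materialises the fully expanded column in memory,
    # which is where the memory pressure comes from.
    arrays[name] = pa.array(np.repeat(values, counts))

table = pa.table(arrays)
pq.write_table(table, "output.parquet")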
However, the data set I'm dealing with is so large that I run into performance and memory issues when I expand it; I can't hold the full table in memory without the program crashing. I have access to a computing cluster for running Spark jobs, but I'm not very familiar with that approach and I'm not sure it's the best option.
I want to find out whether there is a way to convert the RLE JSON data directly into Parquet format without expanding it first and using up a ton of memory.
I have not been able to find any way to do this, and I'm not sure it's even possible, given that Parquet uses a combination of compression techniques (RLE and bit packing).
I gave up on trying to convert the data directly and focused on making the expansion as memory efficient as possible. I tried using only PyArrow and NumPy operations for the conversion, but that did not seem to help. I also tried NumPy's memmap() to stage the expanded data on disk (roughly sketched below), but I still ran into memory and performance issues when writing the table. I've also tried standard Python functions and a pandas DataFrame, but still no luck.
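The memmap attempt looked roughly like this for a single column (file names and dtype are assumptions):

import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

with open("column_a.json") as f:
    col = json.load(f)

values = np.array([d["Value"] for d in col["data"]], dtype=np.int32)
counts = np.array([d["Count"] for d in col["data"]], dtype=np.int64)
total = int(counts.sum())

# Expand into a disk-backed array instead of RAM.
expanded = np.memmap("column_a.dat", dtype=np.int32, mode="w+", shape=(total,))
offset = 0
for value, count in zip(values, counts):
    expanded[offset:offset + count] = value
    offset += count
expanded.flush()

# Building and writing the Arrow table still pulls the whole column through
# memory at write time, which is where the problems reappeared.
table = pa.table({col["Name"]: pa.array(expanded)})
pq.write_table(table, "column_a.parquet")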
If there isn't a way to convert the data directly and expansion is my only option, are there any more efficient ways to do it? Would all of this be easier to do in Spark?