I have a set of JSON files containing run-length encoded data in the following format:
{
    "Name": "Column A",
    "data": [
        {
            "Value": 15,
            "Count": 2
        },
        {
            "Value": 9,
            "Count": 6
        },
        {
            "Value": 3,
            "Count": 5
        }
    ]
}
Each JSON file stores the data for one column of the Parquet file I eventually want to create. The "Value" field corresponds to an enum, and the "Count" field represents how many times to repeat the value when decoding. The decoded data for the example above would be: 15, 15, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3.
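For concreteness, decoding a single column just means repeating each value Count times; a minimal sketch (the file name is a placeholder):

import json

# "column_a.json" is a placeholder name for one of the per-column files.
with open("column_a.json") as f:
    column = json.load(f)

# Expand each (Value, Count) pair into Count repeats of Value.
decoded = []
for entry in column["data"]:
    decoded.extend([entry["Value"]] * entry["Count"])

print(decoded)  # [15, 15, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3]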
Currently I am expanding all of the data and then writing it out to Parquet using PyArrow.
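My current approach looks roughly like this (column names, file names, and dtypes are simplified placeholders):

import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed layout: one JSON file per column, and all columns decode to the same length.
column_files = {"Column A": "column_a.json", "Column B": "column_b.json"}

arrays = {}
for name, path in column_files.items():
    with open(path) as f:
        col = json.load(f)
    values = np.array([d["Value"] for d in col["data"]], dtype=np.int32)
    counts = np.array([d["Count"] for d in col["data"]], dtype=np.int64)
    # np.repeat materialises the fully expanded column in memory,
    # which is where the memory pressure comes from.
    arrays[name] = pa.array(np.repeat(values, counts))

table = pa.table(arrays)
pq.write_table(table, "output.parquet")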
However, the data set I'm dealing with is so large that I run into performance and memory issues when I expand it; I can't hold the full table in memory without the program crashing. I have access to a computing cluster for running Spark jobs, but I'm not very familiar with that approach and I'm not sure it's the best option.
I want to find out whether there is a way to convert the RLE JSON data directly into Parquet format without expanding it first and using up a ton of memory.
I have not been able to find any way to do this, and I'm not sure it's even possible, given that Parquet uses a combination of compression techniques (RLE and bit packing).
I gave up on trying to convert the data directly and focused on making the expansion as memory efficient as possible. I tried using only PyArrow and NumPy operations for the conversion, but that did not seem to help. I also tried NumPy's memmap() to stage the expanded data on disk (roughly sketched below), but I still ran into memory and performance issues when writing the table. I've also tried standard Python functions and a pandas DataFrame, but still no luck.
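The memmap attempt looked roughly like this for a single column (file names and dtype are assumptions):

import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

with open("column_a.json") as f:
    col = json.load(f)

values = np.array([d["Value"] for d in col["data"]], dtype=np.int32)
counts = np.array([d["Count"] for d in col["data"]], dtype=np.int64)
total = int(counts.sum())

# Expand into a disk-backed array instead of RAM.
expanded = np.memmap("column_a.dat", dtype=np.int32, mode="w+", shape=(total,))
offset = 0
for value, count in zip(values, counts):
    expanded[offset:offset + count] = value
    offset += count
expanded.flush()

# Building and writing the Arrow table still pulls the whole column through
# memory at write time, which is where the problems reappeared.
table = pa.table({col["Name"]: pa.array(expanded)})
pq.write_table(table, "column_a.parquet")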
If there isn't a way to convert the data directly and expansion is my only option, are there any more efficient ways to do it? Would all of this be easier to do in Spark?