Process parquet file row-wise


I have a high-scale distributed system which downloads a lot of large .csv files and indexes the data every day. Let's say our file (file.csv) is:

col1   col2   col3
user11 val12 val13
user21 val22 val23

Then we read this file row-wise and store the byte offset of where the row for user11 or user21 is located in the file, e.g.:

Index table -
user11 -> 1120-2130 (byte offset)
user21 -> 2130-3545 (byte offset)

When someone says "delete the data for user11", we consult this table, download and open the file, and delete the bytes at that offset. Please note, this byte offset covers the entire row.
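For context, the current CSV flow is roughly like the sketch below (simplified; the real system stores the index in a distributed table rather than an in-memory dictionary, and the delimiter/newline handling is more careful):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Scan the CSV once and record the byte range of each user's row.
var index = new Dictionary<string, (long Start, long End)>();

using (var fs = File.OpenRead("file.csv"))
using (var reader = new StreamReader(fs, Encoding.UTF8))
{
    long offset = 0;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // +1 for the newline; real code handles \r\n vs \n, the header row and the delimiter.
        long next = offset + Encoding.UTF8.GetByteCount(line) + 1;
        string user = line.Split(',')[0];
        index[user] = (offset, next);
        offset = next;
    }
}

// Deleting user11 then means rewriting the file without the bytes in index["user11"].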

How can I design the system to process parquet files? Parquet files are organized column-wise. To get an entire row of, say, 10 columns, will I have to make 10 calls, form the entire row, calculate the byte range and then store it in the table? Then, while deleting, will I have to form the row again and delete those bytes?

The other option is to store the byte offset of each column instead and process the data column-wise, but that would blow up the index table.
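For the first option, the row reassembly I have in mind looks roughly like the sketch below (written against the Parquet.Net 4.x API as I understand it; namespaces and method names may differ in other versions):

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Read one row group at a time and zip its column chunks back into logical rows.
static async Task ReadRowsAsync(string path)
{
    using Stream fs = File.OpenRead(path);
    using ParquetReader reader = await ParquetReader.CreateAsync(fs);
    DataField[] fields = reader.Schema.GetDataFields();

    for (int g = 0; g < reader.RowGroupCount; g++)
    {
        using ParquetRowGroupReader rg = reader.OpenRowGroupReader(g);

        // One read per column chunk per row group, not one call per row.
        var columns = new DataColumn[fields.Length];
        for (int c = 0; c < fields.Length; c++)
            columns[c] = await rg.ReadColumnAsync(fields[c]);

        for (long r = 0; r < rg.RowCount; r++)
        {
            // row[0] would be the user id in my case
            object[] row = columns.Select(col => col.Data.GetValue(r)).ToArray();
        }
    }
}

The catch compared to CSV is that the bytes of one logical row are spread across the column chunks, so a single contiguous byte range per row doesn't exist in the file.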

How can parquet files be processed efficiently in a row-wise manner? The current system is a background job in C#.


There is 1 best solution below


You can use Cinchoo ETL, an open-source library, to convert the CSV to a parquet file easily.

using ChoETL;

string csv = @"Id,Name
1,Tom
2,Carl
3,Mark";

// Stream the CSV records into a parquet file.
using (var r = ChoCSVReader.LoadText(csv)
   .WithFirstLineHeader()
   )
{
    using (var w = new ChoParquetWriter("*** PARQUET FILE PATH ***"))
        w.Write(r);
}
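To then process the parquet file back in a row-wise fashion, the same library provides ChoParquetReader, which enumerates the file record by record like the other Cinchoo readers; a minimal sketch (reusing the Id/Name columns from the sample above):

using System;
using ChoETL;

// Enumerate the parquet file record by record (row-wise).
using (var r = new ChoParquetReader("*** PARQUET FILE PATH ***"))
{
    foreach (dynamic rec in r)
        Console.WriteLine($"{rec.Id}: {rec.Name}");
}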

For more information, please check the https://www.codeproject.com/Articles/5270332/Cinchoo-ETL-Parquet-Reader article.

Sample fiddle: https://dotnetfiddle.net/Ra8yf4

Disclaimer: I'm the author of this library.