Is there any way to stream to a parquet file in Ruby?

I am trying to create an archival tool for a Ruby on Rails app.

To this end, I wish to store the data in parquet files, ideally with one parquet file per table per time interval.

However, I do not have the resources to hold an entire time interval's data in memory at once for every table. I was hoping there would be some way to stream the data to a parquet file in batches: keep a writer to the parquet file open, write each batch as it is loaded, and close the writer only once all the data for the time interval has been written.
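To make the pattern concrete, here is a rough sketch of what I have in mind. It is not red-parquet code: `FakeBatchWriter`, `write_batch`, and `stream_in_batches` are hypothetical names I made up for illustration, and the fake writer only records batches in memory so the sketch runs without any gems. The real version would need a parquet writer in its place that can accept one batch at a time.

```ruby
# Hypothetical stand-in for a real parquet writer. It only records the
# batches it receives so the pattern can run without red-parquet.
class FakeBatchWriter
  attr_reader :batches, :closed

  def initialize
    @batches = []
    @closed = false
  end

  def write_batch(rows)
    raise IOError, "writer already closed" if @closed

    @batches << rows
  end

  def close
    @closed = true
  end
end

# Feeds rows to the writer in fixed-size batches, so at most batch_size
# rows are materialized at once, and closes the writer even if a write
# raises. Returns the number of rows written.
def stream_in_batches(rows, writer, batch_size: 1_000)
  total = 0
  rows.each_slice(batch_size) do |batch|
    writer.write_batch(batch)
    total += batch.size
  end
  total
ensure
  writer.close
end

# Rows are produced one at a time, standing in for a database cursor.
rows = Enumerator.new do |yielder|
  1.upto(2_500) { |id| yielder << { id: id, name: "row #{id}" } }
end

writer = FakeBatchWriter.new
written = stream_in_batches(rows, writer, batch_size: 1_000)
puts "wrote #{written} rows in #{writer.batches.size} batches"
```

In the Rails app the rows would presumably come from something like ActiveRecord's `find_in_batches` rather than an `Enumerator`, and the open question is what to use in place of the fake writer.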

I am currently using the red-parquet and red-arrow gems, and unfortunately I have been unable to figure out how to do this with them. Any ideas or solutions would be appreciated.

I have looked at the documentation, as well as the code in the Apache Arrow GitHub repository. Several tests there open and use a 'Parquet::ArrowFileWriter', but I cannot find any documentation explaining how to use it, and when I try to use it in my own project, it appears that Parquet::ArrowFileWriter doesn't exist.

I am using the latest versions of the red-arrow and red-parquet gems.
