Parquet ReadAsTable() method takes too long for big files

I have this code snippet:

private Table getParquetAsTable(BlobClient blob)
{
     var stream = blob.OpenRead();
     var parquetReader = new ParquetReader(stream);

     return parquetReader.ReadAsTable();
}

What this code does is read a Parquet file from Azure Blob Storage. If my file has <= 10 columns, it is returned quickly; for bigger files, however, I have to wait more than 40 seconds. While debugging, I noticed that the slow part is the call to parquetReader.ReadAsTable(). I use the ParquetDotNet library for reading Parquet files. Is there a way to speed this up? Can I limit the stream, to the first 100 bytes for example, and have it returned faster? If so, how can I do this?

There is 1 best solution below

I would suggest reading the "Reading Files" section of the official website, which shows how to read a file one row group at a time. Obviously, overall this will take the same amount of time (or even longer), but it means you can process each row group individually, rather than loading everything at once.

using System;
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

using (Stream fileStream = System.IO.File.OpenRead("c:\\test.parquet"))
{
   // open parquet file reader
   using (var parquetReader = new ParquetReader(fileStream))
   {
      // get file schema (available straight after opening parquet reader)
      // however, get only data fields as only they contain data values
      DataField[] dataFields = parquetReader.Schema.GetDataFields();

      // enumerate through row groups in this file
      for(int i = 0; i < parquetReader.RowGroupCount; i++)
      {
         // create row group reader
         using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
         {
            // read all columns inside each row group (you have the option to read only
            // the required columns if you need to)
            DataColumn[] columns = dataFields.Select(groupReader.ReadColumn).ToArray();

            // get first column, for instance
            DataColumn firstColumn = columns[0];

            // .Data member contains a typed array of column data you can cast to the type of the column
            Array data = firstColumn.Data;
            // this cast assumes the first column holds non-nullable int values
            int[] ids = (int[])data;
         }
      }
   }
}
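
The comment in the code above mentions that you can read only the columns you need. A minimal sketch of that idea follows, restricted to the first row group so that less data is decoded. It uses only the reader calls already shown above. The class ParquetPartialRead, the helper ReadSelectedColumns, and the wantedColumns parameter are hypothetical names introduced for illustration, and the sketch assumes the same version of the library as the snippet above. Note that a Parquet file keeps its metadata in a footer at the end of the file, so cutting the stream off after the first 100 bytes is unlikely to work; selecting columns and row groups is the more plausible way to read less.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

static class ParquetPartialRead
{
    // Hypothetical helper: reads only the named columns from the first row group.
    // The stream must be seekable, because the reader needs the footer at the end of the file.
    public static DataColumn[] ReadSelectedColumns(Stream stream, ISet<string> wantedColumns)
    {
        using (var parquetReader = new ParquetReader(stream))
        {
            // nothing to read if the file has no row groups
            if (parquetReader.RowGroupCount == 0)
            {
                return Array.Empty<DataColumn>();
            }

            // keep only the data fields whose names were requested
            DataField[] wantedFields = parquetReader.Schema.GetDataFields()
                .Where(field => wantedColumns.Contains(field.Name))
                .ToArray();

            // open the first row group only and read just the selected columns
            using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(0))
            {
                // ToArray forces the reads to happen before the group reader is disposed
                return wantedFields.Select(groupReader.ReadColumn).ToArray();
            }
        }
    }
}
```

In the question's method this could be called as ParquetPartialRead.ReadSelectedColumns(blob.OpenRead(), new HashSet<string> { "id" }), where "id" stands in for whatever column names your file actually has. Whether this is fast enough depends on how large the first row group is and on how much of the blob has to be downloaded to reach the selected columns.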