Read first 100 rows from Parquet file in C#


I have large Parquet files stored in a blob, each with more than 600k rows, and I'd like to retrieve only the first 100 rows so I can send them to my client app. This is the code I currently use:

private async Task<IEnumerable<Row>> getParquetAsTable(BlobClient blob)
{
    Table table;
    using (var stream = await blob.OpenReadAsync())
    using (var memory = new MemoryStream())
    {
        // Copies the entire blob into memory before anything is parsed.
        await stream.CopyToAsync(memory);
        var parquetReader = new ParquetReader(memory);

        // Materializes every row of the file as a Table.
        table = parquetReader.ReadAsTable();
    }
    // Only the first 100 rows are actually needed.
    return table.Take(100);
}

However, this process is slow: await stream.CopyToAsync(memory); takes 20 seconds and table = parquetReader.ReadAsTable(); takes 15 more, so in total I have to wait 35 seconds.

Is there a way to limit the stream and get just the first 100 rows, without having to download all of the rows, format them with ReadAsTable, and only then take the first 100?


There is 1 solution below


With Cinchoo ETL, an open-source library, you can stream a Parquet file as shown below. (It uses Parquet.Net under the hood.)

Install the NuGet package

Install-Package ChoETL.Parquet

Sample code

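using ChoETL;

// Pass the path to your Parquet file in place of the placeholder.
using (var r = new ChoParquetReader(@"*** Your Parquet file ***")
    .ParquetOptions(o => o.TreatByteArrayAsString = true))
{
    // Take the first 100 records and load them into a DataTable.
    var dt = r.Take(100).AsDataTable();
}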

For more information, see the CodeProject article.
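
To illustrate why Take(100) on a streaming reader can be cheaper than reading a whole table first, here is a minimal sketch using only the .NET base library. It does not use Cinchoo ETL or Parquet.Net: StreamRows is a hypothetical stand-in for any reader that yields rows lazily, and RowsProduced is a counter added purely for the demonstration. Whether a given Parquet reader behaves this way depends on how it is implemented, so treat this as an illustration of lazy enumeration rather than a measurement of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyTakeDemo
{
    // Counts how many rows the (hypothetical) source actually produced.
    public static int RowsProduced;

    // Hypothetical stand-in for a streaming row source: rows are yielded
    // one at a time, only when the consumer asks for the next one.
    public static IEnumerable<int> StreamRows(int totalRows)
    {
        for (var i = 0; i < totalRows; i++)
        {
            RowsProduced++;
            yield return i;
        }
    }

    // Returns the first `count` rows; Take stops pulling from the source
    // once it has handed out `count` items.
    public static List<int> FirstRows(int totalRows, int count)
    {
        RowsProduced = 0;
        return StreamRows(totalRows).Take(count).ToList();
    }

    public static void Main()
    {
        var rows = FirstRows(600_000, 100);
        Console.WriteLine($"{rows.Count} rows returned, {RowsProduced} rows produced");
    }
}
```

With a lazily evaluated source, the number of rows produced tracks the number requested rather than the size of the file. By contrast, the ReadAsTable-then-Take approach in the question builds all 600k rows before discarding most of them.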