I have some huge Parquet files stored in blob storage, each with more than 600k rows, and I'd like to retrieve only the first 100 rows so I can send them to my client app. This is the code I currently use:
private async Task<List<Row>> getParquetAsTable(BlobClient blob)
{
    using (var stream = await blob.OpenReadAsync())
    using (var memory = new MemoryStream())
    {
        // Downloads the entire blob into memory
        await stream.CopyToAsync(memory);
        var parquetReader = new ParquetReader(memory);
        // Materializes every row of the file
        Table table = parquetReader.ReadAsTable();
        // Only the first 100 rows are actually needed
        return table.Take(100).ToList();
    }
}
However, this process is slow: await stream.CopyToAsync(memory); takes 20 seconds and table = parquetReader.ReadAsTable(); takes another 15, so I have to wait 35 seconds in total.
Is there a way to limit the stream and read just the first 100 rows, without having to download all of the rows, convert them with ReadAsTable, and only then take the first 100?
With Cinchoo ETL, an open source library, you can stream a Parquet file as shown below. (It uses Parquet.net under the hood.)
Install the NuGet package
Sample code
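The snippet that originally appeared here is not available, so what follows is only a minimal sketch of the idea, not the answer's exact code. It assumes the ChoETL.Parquet NuGet package, that ChoParquetReader can be constructed from a Stream, and that the stream returned by OpenReadAsync is seekable in your version of Azure.Storage.Blobs. The class and method names (ParquetPreview, ReadFirstRowsAsync) are hypothetical. Because the reader is enumerated lazily, Take(count) should stop reading once enough rows have been produced instead of materializing the whole file.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using ChoETL; // assumption: provided by the ChoETL.Parquet NuGet package

public static class ParquetPreview
{
    // Hypothetical helper: returns the first `count` records of a Parquet blob.
    public static async Task<List<dynamic>> ReadFirstRowsAsync(BlobClient blob, int count = 100)
    {
        // Read from the blob stream directly instead of copying it into a MemoryStream.
        using (var stream = await blob.OpenReadAsync())
        // Assumption: ChoParquetReader has a constructor that accepts a Stream.
        using (var reader = new ChoParquetReader(stream))
        {
            // The reader is enumerated lazily; ToList() materializes the
            // records before the reader and stream are disposed.
            return reader.Take(count).ToList();
        }
    }
}
```

A call such as var rows = await ParquetPreview.ReadFirstRowsAsync(blob); would then replace the original method. How much data is actually downloaded depends on the file's row group size, since Parquet is read one row group at a time, so it is worth measuring against your own files.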
For more information, see the CodeProject article.