For Parquet file operations, the Parquet.Net package usually takes about the same time as Python, or less. My initial experiments don't match that: reading 5 million data points from a Parquet file takes around 1 second in Python, while the .NET package takes around 20 seconds. I am posting the sample code here. Can anybody point out the reason for this behavior?
In C#:
{
    List<string> metadata = new List<string>();
    List<double[]> dataValues = new List<double[]>();
    var watch = Stopwatch.StartNew();
    using (Stream fileStream = File.OpenRead(path))
    {
        using (var parquetReader = new ParquetReader(fileStream))
        {
            DataField[] dataFields = parquetReader.Schema.GetDataFields();
            for (int currentRowGroup = 0; currentRowGroup < parquetReader.RowGroupCount; currentRowGroup++)
            {
                using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(currentRowGroup))
                {
                    for (int i = 0; i < yColIndex.Count(); i++)
                    {
                        var dataColumn = parquetReader.OpenRowGroupReader(currentRowGroup).ReadColumn(dataFields[yColIndex[i]]);
                        Array reData = dataColumn.Data;
                        dataValues.Add((double[])reData);
                    }
                }
            }
        }
    }
}
In Python:
import pyarrow.parquet as pq

def read_column_data_v1(file_path, file_name, columns):
    file_path = f"{file_path}\\{file_name}.parquet"
    file_data = pq.ParquetFile(file_path)
    # Read only the requested columns, one row group at a time
    for i in range(file_data.metadata.num_row_groups):
        data = file_data.read_row_group(i, columns)
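
The Python snippet above does not show how the read was timed. A timing of that kind could be taken with a small helper along these lines. This is only a sketch: `time_call` and `repeats` are hypothetical names for illustration, not part of pyarrow or of the code above, and it uses nothing beyond the standard library's `time.perf_counter`.

```python
import time

def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times.

    Returns (best_seconds, last_result). Taking the best of several
    runs reduces noise from disk caching and other one-off effects.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result
```

It would be used as `best, _ = time_call(read_column_data_v1, file_path, file_name, columns)`, which reports the fastest of three reads in seconds.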