Reading parquet files using Parquet.NET takes more time than pyarrow (python)


Usually, when it comes to Parquet file operations, the Parquet.Net package takes the same time as Python or less. My initial experiments don't match that. Reading 5 million data points from a Parquet file takes around 1 second in Python (pyarrow), while the .NET package takes around 20 seconds, so the .NET read is roughly 20 times slower. I am posting the sample code here; can anybody point out the reason for this behavior?

In C#:

    {
        List<string> metadata = new List<string>();
        List<double[]> dataValues = new List<double[]>();
        var watch = Stopwatch.StartNew();

        using (Stream fileStream = File.OpenRead(path))
        {
            using (var parquetReader = new ParquetReader(fileStream))
            {
                DataField[] dataFields = parquetReader.Schema.GetDataFields();

                for (int currentRowGroup = 0; currentRowGroup < parquetReader.RowGroupCount; currentRowGroup++)
                {
                    using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(currentRowGroup))
                    {
                        for (int i = 0; i < yColIndex.Count(); i++)
                        {
                            // Read through the groupReader opened above instead of opening another reader per column.
                            var dataColumn = groupReader.ReadColumn(dataFields[yColIndex[i]]);
                            Array reData = dataColumn.Data;
                            dataValues.Add((double[])reData);
                        }
                    }
                }
            }
        }
    }

In Python:

    import pyarrow.parquet as pq

    def read_column_data_v1(file_path, file_name, columns):
        file_path = f"{file_path}\\{file_name}.parquet"
        file_data = pq.ParquetFile(file_path)
        # Read only the requested columns, one row group at a time.
        for i in range(file_data.metadata.num_row_groups):
            data = file_data.read_row_group(i, columns)
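
For reference, this is roughly how the Python side could be timed so that it is comparable to the `Stopwatch` measurement in C#. It is only a sketch: `time_call` is a hypothetical helper that is not part of my original code, and the path and column names in the usage comment are placeholders.

```python
import time

def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times.

    Returns (best_seconds, last_result). Taking the best of several runs
    reduces the effect of a cold file cache on the first read.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result

# Hypothetical usage with the reader function above (placeholder arguments):
# seconds, _ = time_call(read_column_data_v1, "C:\\data", "sample", ["y1", "y2"])
# print(f"best of 3 reads: {seconds:.3f} s")
```

Repeating the read on both sides should help separate a disk or cache effect from time spent in the library itself.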
       