How do I use LINQ on an array produced by reading with Parquet.net?

I am not experienced with C#. I need to read a parquet file and then use LINQ to query the data read from the file. I don't know if I need to deserialise.

The following is the data in the parquet file

[image: screenshot of the parquet file data]

The data is being read into the 'records' variable. But when I use LINQ on it, I get the error, "Unable to cast object of type 'Parquet.Data.DataColumn' to type 'LinqAndParquet.DataFrame'." at the LINQ query.

public class Program
{
    public static DataColumn[] allData;
    public static DataColumn[] ReadParquetFile()
    {
        using (Stream fileStream = File.OpenRead(@"F:\AutomationRunStation\11_12.parquet"))
        {
            // open parquet file reader
            using (var parquetReader = new Parquet.ParquetReader(fileStream))
            {
                // get file schema (available straight after opening parquet reader)
                // however, get only data fields as only they contain data values
                DataField[] dataFields = parquetReader.Schema.GetDataFields();

                // enumerate through row groups in this file
                for (int i = 0; i < parquetReader.RowGroupCount; i++)
                {
                    // create row group reader
                    using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
                    {
                        // read all columns inside each row group (you have the option to read only
                        // the required columns if you need to)
                        allData = dataFields.Select(groupReader.ReadColumn).ToArray();
                    }
                }

                return allData;
            }
        }
    }

    static void Main(string[] args)
    {
        var records = ReadParquetFile();
        
        var queryResult = from DataFrame data in records
                          where data.EventId == 280000001
                          select data.Loss;

        Console.WriteLine(queryResult);
        Console.ReadKey();
    }
}
There are 2 best solutions below

With Cinchoo ETL, an open-source library, you can parse a parquet file and use LINQ to query its records.

using (var r = new ChoParquetReader("*** YOUR PARQUET FILE PATH ***"))
{
    foreach (var rec in r)
        Console.WriteLine(rec.Dump());
}

Disclaimer: I'm the author of this library.

public static DataColumn[] ReadParquetFile()

This returns an array of DataColumn. So in

    var records = ReadParquetFile();
    
    var queryResult = from DataFrame data in records
                      where data.EventId == 280000001
                      select data.Loss;

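records is an array of DataColumn, but the query declares the range variable data as DataFrame. A typed range variable makes LINQ call records.Cast<DataFrame>(), which tries to cast every DataColumn element to DataFrame. That cast is not valid, so you get the exception.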
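
One way around this is to turn the columns into one object per row first, and then run LINQ over those rows. The following is only a minimal sketch under some assumptions: LossRecord and ColumnQuery are hypothetical names, the column names EventId and Loss are taken from the query in the question, and the sample values are made up. With Parquet.Net, the two arrays would come from the Data property of the matching DataColumn (selected by its Field.Name); plain arrays are used here so the example runs without the parquet file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; property names mirror those used in the question's query.
public class LossRecord
{
    public long EventId { get; set; }
    public double Loss { get; set; }
}

public static class ColumnQuery
{
    // Zips two column arrays of equal length into one object per row.
    // Assumption: with Parquet.Net each array would be DataColumn.Data, e.g.
    // records.First(c => c.Field.Name == "EventId").Data
    public static IEnumerable<LossRecord> ToRecords(Array eventIds, Array losses)
    {
        if (eventIds.Length != losses.Length)
            throw new ArgumentException("Columns must have the same number of rows.");

        for (int i = 0; i < eventIds.Length; i++)
        {
            // Convert.* accepts int/long/float/double elements; a null cell becomes 0.
            yield return new LossRecord
            {
                EventId = Convert.ToInt64(eventIds.GetValue(i)),
                Loss = Convert.ToDouble(losses.GetValue(i))
            };
        }
    }

    public static void Main()
    {
        // Made-up sample values standing in for the file's columns.
        Array eventIds = new[] { 280000001, 280000002, 280000001 };
        Array losses = new[] { 10.5, 20.0, 3.25 };

        // No type on the range variable, so no Cast<T>() is inserted.
        var queryResult = from data in ToRecords(eventIds, losses)
                          where data.EventId == 280000001
                          select data.Loss;

        // Enumerate the query; Console.WriteLine(queryResult) would print its type name.
        foreach (double loss in queryResult)
            Console.WriteLine(loss);
    }
}
```

Note that the ReadParquetFile method in the question overwrites allData on every pass of the row-group loop, so with more than one row group only the last group's columns would reach a helper like this.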