Reading parquet file error 'Destination is too short' with Parquet.Net


In this project there is a C# API in which I need to build a simple program that reads a Parquet file and returns its contents as JSON. I normally use Python, where reading a Parquet file is a one-liner, but here I'm stuck with C# (I'm a beginner). Below is a snippet from the overall program, which takes an S3 URL and downloads the Parquet file into a temp file; the code below picks up from that point.

The code is failing at this line - DataColumn column = await groupReader.ReadColumnAsync(dataFields[c]); ///ERROR

I am not entirely sure what the error message means. Is the data too big? Is it talking about a specific column, a data type mismatch, or even a column name that is too long? I am trying to figure out what the error is, why it occurs, and how to deal with it. Reading the same Parquet file in Python (pd.read_parquet(filename)) shows that all columns are of type float64, and that there are 90k rows and 30 columns.

System.ArgumentException
  HResult=0x80070057
  Message=Destination is too short. (Parameter 'destination')
  Source=System.Private.CoreLib
  StackTrace:
   at System.ThrowHelper.ThrowArgumentException_DestinationTooShort()
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
   at Parquet.File.DataColumnReader.<ReadDataPageV1Async>d__13.MoveNext()
   at Parquet.File.DataColumnReader.<ReadAsync>d__8.MoveNext()
   at ConvertController.<ConvertToJSON>d__2.MoveNext() in C:\Users\myuser\Desktop\repos\frontend\project\Controllers\WebAPI_ParquetController.cs:line 78

  This exception was originally thrown at this call stack:
    [External Code]
    ConvertController.ConvertToJSON(string) in WebAPI_ParquetController.cs

Code from the point the file is downloaded to a temporary file -

        // Open the parquet file stream
        using (Stream fileStream = System.IO.File.OpenRead(tempFilePath))
        {
            // Open parquet file reader
            using (ParquetReader parquetReader = await ParquetReader.CreateAsync(fileStream))
            {
                // Get file schema
                DataField[] dataFields = parquetReader.Schema.GetDataFields();

                var result = new List<Dictionary<string, object>>();

                // Enumerate through row groups in this file
                for (int i = 0; i < parquetReader.RowGroupCount; i++)
                {
                    // Create row group reader
                    using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
                    {
                        var rowGroupResult = new Dictionary<string, object>();

                        // Read all columns inside each row group
                        for (int c = 0; c < dataFields.Length; c++)
                        {


                            DataColumn column = await groupReader.ReadColumnAsync(dataFields[c]); ///ERROR

                            // Cast column data to the appropriate type
                            var columnData = column.Data;
                            var decodedData = new object[columnData.Length];
                            // Decode the column data
                            for (int idx = 0; idx < columnData.Length; idx++)
                            {
                                decodedData[idx] = column.Data.GetValue(idx);
                            }
                            string columnName = dataFields[c].Name;

                            rowGroupResult[columnName] = decodedData;
                        }

                        result.Add(rowGroupResult);
                    }
                }

                // Convert the result to JSON
                var jsonResult = JsonConvert.SerializeObject(result);

                return Ok(jsonResult);
            }
        }
    } 

There is 1 best solution below


I also ran into errors reading Parquet files with Parquet.Net through methods such as ReadColumn and ReadRow, so I switched to the ReadAsTableAsync() method, which works for me. The following method is an example of reading a Parquet file's contents as a Parquet table:

    public List<IClass> ReadStuffFromParquetFile(string dirpath)
    {
        List<IClass> results = new List<IClass>();
        try
        {
            // Take the first .parquet file found in the directory
            string[] parquetFiles = System.IO.Directory.GetFiles(dirpath, "*.parquet");

            // Blocking on .Result keeps this method synchronous; prefer await in async code
            using (ParquetReader reader = ParquetReader.CreateAsync(parquetFiles[0]).Result)
            {
                // Read the whole file into an in-memory table of rows
                Table parquetTable = reader.ReadAsTableAsync().Result;
                for (int i = 0; i < parquetTable.Count; i++)
                {
                    // Parse the row's string form as JSON and pick out the fields needed
                    dynamic json = JsonConvert.DeserializeObject(parquetTable[i].ToString());

                    IClass obj = new IClass
                    {
                        Name = json["Name"]
                    };
                    results.Add(obj);
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }
        return results;
    }
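
As for what the message in the question literally means: the top frame of the stack trace (ThrowArgumentException_DestinationTooShort) is the helper .NET uses when a span is copied into a destination span that is shorter than the source. So it is presumably not about column names or the overall file size; my guess is that the decoder found more values in a page than the buffer it had allocated for them, which points at the library version or the file's metadata rather than at your loop. The sketch below is independent of Parquet.Net and only reproduces the same kind of exception; the class and method names are made up for illustration.

    using System;

    public static class DestinationTooShortDemo
    {
        // Copies sourceLength doubles into a buffer of destinationLength doubles.
        // Returns "copied" on success, otherwise the ArgumentException message.
        public static string Describe(int sourceLength, int destinationLength)
        {
            double[] source = new double[sourceLength];
            double[] destination = new double[destinationLength];
            try
            {
                // Span<T>.CopyTo throws ArgumentException when the destination is shorter
                source.AsSpan().CopyTo(destination.AsSpan());
                return "copied";
            }
            catch (ArgumentException ex)
            {
                return ex.Message;
            }
        }

        public static void Main()
        {
            Console.WriteLine(Describe(4, 4)); // destination is large enough
            Console.WriteLine(Describe(5, 4)); // destination is one element short
        }
    }

If that guess is right, upgrading Parquet.Net or reading through ReadAsTableAsync() as above are the things I would try first.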

Hope it helps