parquet4s not returning all records


I have a simple Scala application that uses parquet4s with fs2 to read a set of partitioned records (spread across directories, generated by a Spark job).

When I run the app, it only returns a fraction of the records from the partitioned directories. There are no errors; it simply stops after a certain number of records.

The console output has no useful information:

2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] RecordReader initialized will read a total of 83 records.
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] at row 0. reading next block
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] block read in memory in 0 ms. row count = 83

An equivalent app written using pyArrow in Python is able to retrieve all records.

Any help in debugging this issue is appreciated.

Thank you.

PS - For reference, this is the sample program I use: https://mjakubowski84.github.io/parquet4s/docs/partitioning/
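
For context, a minimal sketch of such a read, along the lines of the linked sample (assuming parquet4s 2.x with the fs2 module; the case class, field names, and path below are placeholders, not my actual schema):

import cats.effect.{IO, IOApp}
import com.github.mjakubowski84.parquet4s.Path
import com.github.mjakubowski84.parquet4s.parquet.fromParquet

object CountRecords extends IOApp.Simple {

  // Placeholder schema: partition columns (e.g. `year`) come back as String fields.
  case class Record(id: Long, payload: String, year: String)

  override def run: IO[Unit] =
    fromParquet[IO]
      .as[Record]
      .read(Path("/data/output")) // root directory of the partitioned dataset
      .compile
      .count                      // total number of records streamed
      .flatMap(n => IO.println(s"read $n records"))
}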

There is 1 answer below.


It looks like the names of the directories containing the partitioned data had characters that are not admissible, which caused the reader to silently skip those directories and therefore return fewer records.
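
As a quick sanity check (a minimal sketch, not part of the original answer; the root path and the allowed-character pattern are assumptions), you can list the partition directories under the dataset root and flag any name that does not follow the Hive-style column=value convention that Spark uses when writing partitioned output:

import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

object CheckPartitionDirs extends App {

  // Assumption: a well-formed partition directory looks like `column=value`,
  // with the column name limited to letters, digits, underscores, dots and dashes.
  val expected = "^[A-Za-z0-9_.-]+=[^/]*$".r

  val root: Path = Paths.get("/data/output") // placeholder: root of the partitioned dataset

  Files.walk(root).iterator().asScala
    .filter(p => Files.isDirectory(p) && p != root)
    .map(_.getFileName.toString)
    .toList
    .distinct
    .filterNot(name => expected.findFirstIn(name).isDefined)
    .foreach(name => println(s"Suspicious partition directory name: $name"))
}

Any directory flagged this way is a candidate for being skipped during partition discovery; renaming it so that it matches the column=value pattern should let the reader pick up the missing records.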