I have a simple Scala application that uses parquet4s with fs2 to read a set of partitioned records (spread across directories and generated by a Spark job).
When I run the app, it returns only a fraction of the records from the partitioned directories. There are no errors; it simply stops after a specific number of records.
The console output has no useful information:
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] RecordReader initialized will read a total of 83 records.
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] at row 0. reading next block
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] block read in memory in 0 ms. row count = 83
An equivalent app written in Python with PyArrow retrieves all of the records.
Any help in debugging this issue is appreciated.
Thank you.
PS - For reference, this is the sample program I use: https://mjakubowski84.github.io/parquet4s/docs/partitioning/
 
It looks like the names of some directories containing the partitioned data included characters that the reader does not accept. The reader skipped those directories, so it returned fewer records.
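
One way to check for this is to walk the output directory and list every sub-directory whose name does not look like a plain `column=value` partition. The snippet below is only a rough sketch: the `PartitionNameCheck` object and its `Admissible` pattern are hypothetical and deliberately conservative, not taken from parquet4s, so adjust the pattern to whatever your version of the library accepts. It also assumes the data is on the local file system and that you are on Scala 2.13 or later.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

object PartitionNameCheck {
  // ASSUMPTION: a conservative pattern for Hive-style "column=value" names.
  // It is NOT the pattern parquet4s uses internally; treat it as a placeholder.
  private val Admissible = """[A-Za-z0-9_.]+=[A-Za-z0-9_.\-]+""".r

  // True when the directory name matches the assumed pattern in full.
  def isAdmissible(dirName: String): Boolean =
    Admissible.pattern.matcher(dirName).matches()

  // Every directory below `root` (excluding `root` itself) whose name fails the check.
  def suspiciousDirs(root: Path): List[Path] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isDirectory(p) && p != root)
        .filterNot(p => isAdmissible(p.getFileName.toString))
        .toList
    } finally {
      stream.close()
    }
  }
}
```

Calling `PartitionNameCheck.suspiciousDirs(Paths.get("/path/to/dataset")).foreach(println)` (the path is a placeholder) prints the directories worth a closer look. Non-partition directories that Spark may leave behind, such as `_temporary`, will also show up and can be ignored.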