parquet4s not returning all records


I have a simple Scala application that uses parquet4s with fs2 to read a set of partitioned records (spread across directories, generated by a Spark job).

When I run the app, it only returns a fraction of the records from the partitioned directories. There are no errors; it simply stops after a certain number of records.

The console output has no useful information:

2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] RecordReader initialized will read a total of 83 records.
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] at row 0. reading next block
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] block read in memory in 0 ms. row count = 83

An equivalent app written using pyArrow in Python is able to retrieve all records.

Any help in debugging this issue is appreciated.

Thank you.

PS - For reference, this is the sample program I use: https://mjakubowski84.github.io/parquet4s/docs/partitioning/
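
For context, a minimal sketch of such a read, along the lines of the linked sample (assuming parquet4s 2.x with the fs2 module; the case class, field names, and path below are placeholders, not my actual schema):

import cats.effect.{IO, IOApp}
import com.github.mjakubowski84.parquet4s.Path
import com.github.mjakubowski84.parquet4s.parquet.fromParquet

object CountRecords extends IOApp.Simple {

  // Placeholder schema: partition columns (e.g. `year`) come back as String fields.
  case class Record(id: Long, payload: String, year: String)

  override def run: IO[Unit] =
    fromParquet[IO]
      .as[Record]
      .read(Path("/data/output")) // root directory of the partitioned dataset
      .compile
      .count                      // total number of records streamed
      .flatMap(n => IO.println(s"read $n records"))
}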

There is 1 answer below.


It looks like the names of the directories containing the partitioned data had characters that are not admissible, which caused the reader to silently skip those directories and therefore return fewer records.
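
As a quick sanity check (a minimal sketch, not part of the original answer; the root path and the allowed-character pattern are assumptions), you can list the partition directories under the dataset root and flag any name that does not follow the Hive-style column=value convention that Spark uses when writing partitioned output:

import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

object CheckPartitionDirs extends App {

  // Assumption: a well-formed partition directory looks like `column=value`,
  // with the column name limited to letters, digits, underscores, dots and dashes.
  val expected = "^[A-Za-z0-9_.-]+=[^/]*$".r

  val root: Path = Paths.get("/data/output") // placeholder: root of the partitioned dataset

  Files.walk(root).iterator().asScala
    .filter(p => Files.isDirectory(p) && p != root)
    .map(_.getFileName.toString)
    .toList
    .distinct
    .filterNot(name => expected.findFirstIn(name).isDefined)
    .foreach(name => println(s"Suspicious partition directory name: $name"))
}

Any directory flagged this way is a candidate for being skipped during partition discovery; renaming it so that it matches the column=value pattern should let the reader pick up the missing records.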