While reading the petastorm/etl/dataset_metadata.py script, I found this code:
if row_groups_key != ".":
    for row_group in range(row_groups_per_file[row_groups_key]):
        rowgroups.append(pq.ParquetDatasetPiece(
            piece.path,
            open_file_func=dataset.fs.open,
            row_group=row_group,
            partition_keys=piece.partition_keys
        ))
where pq is imported like this:
from pyarrow import parquet as pq
I've searched everywhere for the ParquetDatasetPiece class and can't find it. Can somebody tell me where the ParquetDatasetPiece class is defined?
You can find it in the parquet part of the pyarrow codebase: https://github.com/apache/arrow/blob/951663a41c183c8fec5a4da9a8f9daf45ed85451/python/pyarrow/parquet/core.py#L1059-L1084
Note: it is deprecated as of pyarrow version 5.0.
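If you want to check where the class lives in your own installation, something like the following might help. This is only a rough sketch: locate_class is a hypothetical helper written for this answer (it is not part of pyarrow or petastorm), and it relies only on the standard importlib and inspect modules. It returns None when the package is not installed or when the installed version no longer exposes the class, which may be the case for newer pyarrow releases given the deprecation.

```python
import importlib
import inspect


def locate_class(module_name, class_name):
    """Return (defining module, source file) for a class, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # the package is not installed in this environment

    cls = getattr(module, class_name, None)
    if cls is None:
        return None  # the installed version does not expose this name

    try:
        source_file = inspect.getsourcefile(cls)
    except (TypeError, OSError):
        source_file = None  # built-in or compiled classes have no Python source

    # cls.__module__ is where the class is actually defined, which can differ
    # from the module it was imported through (pq is just a re-exporting alias).
    return cls.__module__, source_file


# Prints None if pyarrow is missing or the class has been removed.
print(locate_class("pyarrow.parquet", "ParquetDatasetPiece"))
```

The first element of the result tells you which module to search for in the source tree, and the second is the file on disk that you can open directly.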