ParquetDataset not taking the partitions from the filters

294 Views Asked by Mhmd Dar At 24 May 2025 at 11:50

I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I am doing this with pyarrow.

My S3 dataset is partitioned by client, year, month, and day using hive partitioning (client=, year= ...). I pass ParquetDataset filters on client, year, month, and day, but it takes a long time to return the result.

Here's a code snippet:

from pyarrow import fs
from pyarrow import parquet as pq
import pathlib

s3_file_system = fs.S3FileSystem()
filters = [
    ("client_id", "=", 'client'),
    ("year", "=", year),
    ("month", "=", month),
    ("day", "=", day),
]
dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path')),
    filesystem=s3_file_system,
    filters=filters,
)

I also tried putting the partition directly into the s3_path:

dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path/client=/year=/month=/day=')),
    filesystem=s3_file_system,
    filters=filters,
)

and it worked perfectly. I don't know why ParquetDataset scans all the files outside the partitions named in the filters.
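For reference, the workaround of pointing the dataset at the partition directory can be sketched as a small helper that builds the hive-style prefix from the same values used in the filters. This is only an illustration: the function name `hive_partition_prefix`, the base path `my-bucket/dataset`, and the example values are hypothetical, and it assumes the directory names are plain `key=value` segments written in exactly this order and format (for example, no zero-padding of month or day).

```python
from typing import Mapping, Union


def hive_partition_prefix(base: str, partitions: Mapping[str, Union[str, int]]) -> str:
    """Join a base path with hive-style key=value segments, in the given order."""
    segments = [f"{key}={value}" for key, value in partitions.items()]
    # Strip any trailing slash on the base so the result has no empty segment.
    return "/".join([base.rstrip("/"), *segments])


# Hypothetical base path and partition values, for illustration only.
prefix = hive_partition_prefix(
    "my-bucket/dataset",
    {"client": "acme", "year": 2025, "month": 5, "day": 24},
)
print(prefix)  # my-bucket/dataset/client=acme/year=2025/month=5/day=24
```

The resulting string would take the place of the hand-written 's3_path/client=/year=/month=/day=' path in the second snippet. One thing to check when doing this: with the dataset rooted inside the partition directory, the partition keys are no longer part of the path being scanned, so they may not show up as columns in the result.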
