ParquetDataset not taking the partitions from the filters


I have a Parquet dataset stored on S3, and I would like to query specific rows from it using pyarrow.

My S3 dataset is partitioned by client, year, month, and day using Hive-style partitioning (client=, year=, ...). I pass ParquetDataset filters on client, year, month, and day, but it takes a long time to return the result.

Here's some code snippet:

from pyarrow import fs
from pyarrow import parquet as pq
import pathlib

s3_file_system = fs.S3FileSystem()
filters = [
    ("client_id", "=", "client"),
    ("year", "=", year),
    ("month", "=", month),
    ("day", "=", day),
]
dataset = pq.ParquetDataset(
    str(pathlib.Path("s3_path")),
    filesystem=s3_file_system,
    filters=filters,
)

I also tried encoding the partition values directly in the s3_path:

dataset = pq.ParquetDataset(
    str(pathlib.Path("s3_path/client=/year=/month=/day=")),
    filesystem=s3_file_system,
    filters=filters,
)

and it worked perfectly. I don't know why ParquetDataset is scanning files outside the partitions selected by the filters.
