I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I am doing this with pyarrow.
My S3 dataset is hive-partitioned by client, year, month, and day (client=, year=, ...). I pass filters on client, year, month, and day to ParquetDataset, but it takes a long time to return the result.
Here's a code snippet:
from pyarrow import fs
from pyarrow import parquet as pq
import pathlib
s3_file_system = fs.S3FileSystem()
filters = [
    ("client_id", "=", 'client'),
    ("year", "=", year),
    ("month", "=", month),
    ("day", "=", day),
]
dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path')),
    filesystem=s3_file_system,
    filters=filters,
)
I also tried putting the partition directly in the s3_path:
dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path/client=/year=/month=/day=')),
    filesystem=s3_file_system,
    filters=filters,
)
and it worked perfectly. I don't understand why ParquetDataset scans files outside the partitions named in the filters.
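For reference, the path-based workaround can be sketched roughly as follows. This is only a minimal illustration: `partition_prefix` is a hypothetical helper (not part of pyarrow), the values are placeholders, and it assumes the directory names match the filter values exactly (for example, no zero-padding of month or day).

```python
from typing import Any, Sequence, Tuple


def partition_prefix(base: str, parts: Sequence[Tuple[str, Any]]) -> str:
    """Append hive-style key=value segments to a base path.

    Hypothetical helper for illustration; it does no escaping and
    assumes values appear in directory names exactly as given.
    """
    segments = [f"{key}={value}" for key, value in parts]
    return "/".join([base.rstrip("/"), *segments])


# Placeholder values; the real client/year/month/day come from the query.
prefix = partition_prefix(
    "s3_path",
    [("client", "some_client"), ("year", 2023), ("month", 1), ("day", 15)],
)
print(prefix)  # s3_path/client=some_client/year=2023/month=1/day=15
```

The resulting string is what I pass as the dataset path in the second call above, so only that one partition directory has to be listed.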