I have a partitioned Parquet dataset stored on an internal S3-compatible store. I am reading it with pyarrow.dataset:
import pyarrow.dataset as ds
my_dataset = ds.dataset(ds_name, format="parquet", filesystem=s3file, partitioning="hive")
fragments = list(my_dataset.get_fragments())
required_fragment = fragments.pop()
The metadata from the required fragment shows the following:
required_fragment.metadata
<pyarrow._parquet.FileMetaData object at 0x00000291798EDF48>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 22
num_rows: 949650
num_row_groups: 29
format_version: 1.0
serialized_size: 68750
Converting this fragment to a table, however, takes a long time:
%timeit required_fragment.to_table()
6min 29s ± 1min 15s per loop (mean ± std. dev. of 7 runs, 1 loop each)
The size of the table itself is about 272 MB:
required_fragment.to_table().nbytes
272850898
Any ideas how I can speed up converting the fragment to a table?
Updates
So instead of pyarrow.dataset, I tried using pyarrow.parquet. The only part of my code that changed is:
import pyarrow.parquet as pq
my_dataset = pq.ParquetDataset(ds_name, filesystem=s3file, partitioning="hive", use_legacy_dataset=False)
fragments = my_dataset.fragments
required_fragment = fragments.pop()
When I timed the code again, the performance was much better:
%timeit required_fragment.to_table()
12.4 s ± 1.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
While I am happy with the better performance, it is still confusing: with use_legacy_dataset=False, both APIs should be going through the same code path under the hood and produce similar timings.
PC information
Installed RAM: 21.0 GB
OS: Windows 10 Enterprise
Internet speed: 10 Mbps / 156 Mbps (download / upload)
S3 location: Asia