Improving read performance of pyarrow

I have a partitioned dataset stored on an internal S3 cloud. I am reading the dataset with pyarrow.dataset:

import pyarrow.dataset as ds
my_dataset = ds.dataset(ds_name, format="parquet", filesystem=s3file, partitioning="hive")
fragments = list(my_dataset.get_fragments())
required_fragment = fragments.pop()
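
(As an aside: to read one specific partition, the dataset can also be filtered directly instead of popping a fragment; a sketch below, where the partition column name and value are placeholders.)

# Hypothetical partition column/value -- adjust to the actual hive partitioning scheme.
single_partition = my_dataset.to_table(filter=ds.field("date") == "2022-01-01")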

The metadata from the required fragment shows the following:

required_fragment.metadata
<pyarrow._parquet.FileMetaData object at 0x00000291798EDF48>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 22
  num_rows: 949650
  num_row_groups: 29
  format_version: 1.0
  serialized_size: 68750

Converting this fragment to a table, however, takes a long time:

%timeit required_fragment.to_table()
6min 29s ± 1min 15s per loop (mean ± std. dev. of 7 runs, 1 loop each)

The size of the table itself is about 272 MB:

required_fragment.to_table().nbytes
272850898

Any ideas how I can speed up converting the dataset fragment to a table?
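
(For reference, one standard lever here is column projection combined with threaded reads; both are regular to_table() parameters. A minimal sketch, where the column names are placeholders for whatever subset is actually needed:)

# Hypothetical column subset -- fetching fewer columns means fewer Parquet
# column chunks pulled from S3; use_threads=True (the default) reads row
# groups in parallel.
needed_columns = ["col_a", "col_b"]
table = required_fragment.to_table(columns=needed_columns, use_threads=True)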

Updates

So instead of pyarrow.dataset, I tried using pyarrow.parquet. The only part of my code that changed is:

import pyarrow.parquet as pq
my_dataset = pq.ParquetDataset(ds_name, filesystem=s3file, partitioning="hive", use_legacy_dataset=False)
fragments = my_dataset.fragments
required_fragment = fragments.pop()

When I tried the code again, the performance was much better:

%timeit required_fragment.to_table()
12.4 s ± 1.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

While I am happy with the better performance, it is still confusing: with use_legacy_dataset=False, both approaches should be doing similar work under the hood and produce similar outcomes.
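
One plausible explanation (an assumption, not verified here) is the pre_buffer scan option: the ParquetDataset path turns it on by default, which coalesces the many small column-chunk reads into fewer, larger S3 requests, while ds.dataset leaves it off unless asked. A sketch of enabling it on the ds.dataset path, reusing ds_name and s3file from above:

import pyarrow.dataset as ds

# Assumption: the slow path is dominated by many small S3 range requests.
# pre_buffer=True coalesces column-chunk reads into fewer, larger requests.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)
my_dataset = ds.dataset(ds_name, format=parquet_format, filesystem=s3file, partitioning="hive")
required_fragment = list(my_dataset.get_fragments()).pop()
table = required_fragment.to_table()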

PC information:
Installed RAM: 21.0 GB
Software: Windows 10 Enterprise
Internet speed: 10 Mbps / 156 Mbps (download / upload)
S3 location: Asia
