PyArrow Parquet can't read dataset with large metadata

I used Petastorm's row_group_indexer to build an index for a column in a Petastorm dataset. After that, the size of the metadata file increased significantly and PyArrow can no longer load the dataset, failing with this error:

OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit

Here is the code I'm using to load the dataset:

from pyarrow import parquet as pq

dataset_path = "path/to/dataset/"

dataset = pq.ParquetDataset(path_or_paths=dataset_path)
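For reference, I understand that newer PyArrow releases expose the Thrift deserialization limits as keyword arguments; a minimal sketch of raising them when opening the dataset (keyword availability depends on the installed PyArrow version, and the limit values below are arbitrary illustrative numbers):

from pyarrow import parquet as pq

dataset_path = "path/to/dataset/"

# Sketch only: thrift_string_size_limit / thrift_container_size_limit are
# accepted by recent PyArrow versions; the values here are arbitrary large
# limits chosen for illustration.
dataset = pq.ParquetDataset(
    path_or_paths=dataset_path,
    thrift_string_size_limit=1_000_000_000,     # bytes
    thrift_container_size_limit=1_000_000_000,  # element count
)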

Code used to index the materialized Petastorm dataset:

from pyspark.sql import SparkSession
from petastorm.etl.rowgroup_indexers import SingleFieldIndexer
from petastorm.etl.rowgroup_indexing import build_rowgroup_index

dataset_url = "file:///path/to/dataset"

spark = SparkSession.builder.appName("demo").config("spark.jars").getOrCreate()

indexer = [SingleFieldIndexer(index_name="my_index", index_field="COLUMN1")]

build_rowgroup_index(dataset_url, spark.sparkContext, indexer)
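As a diagnostic (not part of the original run), a quick sketch I could use to confirm that the Parquet summary files are what grew after indexing; the path is the same hypothetical local path as above:

import os

dataset_dir = "/path/to/dataset"  # same hypothetical path as above

# My assumption: if the row-group index was written into the dataset's shared
# metadata, a sharply larger summary file after build_rowgroup_index would
# explain PyArrow hitting its Thrift size limit while parsing the footer.
for name in ("_metadata", "_common_metadata"):
    summary = os.path.join(dataset_dir, name)
    if os.path.exists(summary):
        print(name, f"{os.path.getsize(summary) / 1e6:.1f} MB")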