I used Petastorm's row_group_indexer to build an index for a column in a Petastorm dataset. After that, the size of the metadata file increased significantly, and PyArrow can no longer load the dataset. It fails with this error:
OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
Here is the code I'm using to load the dataset:
from pyarrow import parquet as pq
dataset_path = "path/to/dataset/"
dataset = pq.ParquetDataset(path_or_paths=dataset_path)
Here is the code I used to index the materialized Petastorm dataset:
from pyspark.sql import SparkSession
from petastorm.etl.rowgroup_indexers import SingleFieldIndexer
from petastorm.etl.rowgroup_indexing import build_rowgroup_index
dataset_url = "file:///path/to/dataset"
spark = SparkSession.builder.appName("demo").config("spark.jars").getOrCreate()
indexer = [SingleFieldIndexer(index_name="my_index", index_field="COLUMN1")]
build_rowgroup_index(dataset_url, spark.sparkContext, indexer)
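To put a number on the metadata growth described above, a minimal sketch like the following could report the size of the Parquet summary files before and after indexing. It assumes the dataset directory holds its metadata in files named _common_metadata and/or _metadata (the conventional Parquet sidecar names, which may not match every setup). The helper name metadata_file_sizes is hypothetical, and the path is the same placeholder used in the snippets above.

```python
import os

# Conventional names of Parquet summary ("sidecar") metadata files.
# Assumption: the dataset directory uses these names.
METADATA_FILENAMES = ("_common_metadata", "_metadata")


def metadata_file_sizes(dataset_dir):
    """Return {filename: size_in_bytes} for metadata files found in dataset_dir.

    Hypothetical helper for illustration; it only inspects the file system
    and does not parse the Parquet footer.
    """
    sizes = {}
    for name in METADATA_FILENAMES:
        path = os.path.join(dataset_dir, name)
        if os.path.isfile(path):
            sizes[name] = os.path.getsize(path)
    return sizes


if __name__ == "__main__":
    # Placeholder path, as in the loading snippet above.
    for name, size in metadata_file_sizes("path/to/dataset/").items():
        print(f"{name}: {size / 1024 ** 2:.2f} MiB")
```

Running this once before and once after build_rowgroup_index would show how much of the growth comes from the index itself.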