Python: write dataset as hive-partitioned and clustered parquet files (no JVM)


I would like to write a table stored in a dataframe-like object (e.g. a pandas DataFrame, a DuckDB table, or a PyArrow table) to the Parquet format, both hive-partitioned and clustered. Here's what I mean:

  • Hive partitioning, i.e. I can specify a set of partitioning columns (e.g. year, month, day, foo_col1), and data from each partition is written to a different path, e.g. year=2024/month=01/day=01/foo_col1=bar_val/.
  • Clustering (a.k.a. bucketing), i.e. I can also specify a set of clustering columns, and rows with the same values in those columns are co-located in adjacent rows of the Parquet files within each partition.

Note that to achieve clustering, it is also sufficient to be able to sort rows within each partition by a set of columns.

I can do this in Spark (and PySpark) by sorting a DataFrame and then writing it out as Parquet with the partitionBy columns specified. However, Spark is a JVM-based framework that I am trying to avoid. I would love to achieve this with a package like PyArrow, pandas, or DuckDB that does not require an external runtime like Java (and that often has lower serialization/deserialization cost).
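
For reference, the PySpark version I'm trying to reproduce looks roughly like this (column names and paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("input_dir")  # placeholder input

(
    df.sort("year", "month", "day", "foo_col1", "cluster_col")  # sort so equal cluster values end up adjacent
      .write.mode("overwrite")
      .partitionBy("year", "month", "day", "foo_col1")          # hive-style directory layout
      .parquet("output_dir")
)
```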

I have tried doing this in DuckDB by first creating a sorted table and then using COPY TO with the relevant hive-partitioning options. This produces a hive-partitioned dataset that's not quite right: within each individual file the sort order is respected, but it is not respected across all files in the same hive partition. This defeats the optimization offered by clustering/bucketing, in which all identical values of the clustered column appear in a contiguous block of rows within the partition.
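
Roughly what I tried in DuckDB (table, column, and path names are placeholders):

```python
import duckdb

con = duckdb.connect()

# First materialize a sorted copy of the table...
con.sql("""
    CREATE TABLE sorted_tbl AS
    SELECT * FROM my_table
    ORDER BY year, month, day, foo_col1, cluster_col
""")

# ...then write it out as hive-partitioned parquet.
# Each output file ends up sorted, but the sort order is not
# preserved across the multiple files written per partition.
con.sql("""
    COPY sorted_tbl TO 'output_dir'
    (FORMAT PARQUET, PARTITION_BY (year, month, day, foo_col1))
""")
```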
