"hoodie.parquet.max.file.size" and "hoodie.parquet.small.file.limit" Property is Being Ignored


I want my Hudi file sizes to fall between a small-file limit of 50 MB and a maximum of 100 MB.

The following configs are being used as map options for upsert:

val hudiOptions = Map[String, String](
      HoodieWriteConfig.TBL_NAME.key -> hudiTableConfig.tableName,
      DataSourceWriteOptions.TABLE_TYPE.key -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
      DataSourceWriteOptions.RECORDKEY_FIELD.key() -> hudiTableConfig.recordKey,
      DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> hudiTableConfig.combineKey,
      "hoodie.parquet.max.file.size" -> "125829120",
      "hoodie.parquet.small.file.limit" -> "52428800")

updatedDataFrame.write
      .format(HudiConstants.HudiFormat)
      .options(hudiOptions)
      .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
      // .option("hoodie.upsert.shuffle.parallelism", "200") // Default shuffle parallelism is 200
      .mode(saveMode.get)
      .save(s"$storageSystemPath/${hudiTableConfig.tableName}/")

My input df is date partitioned and the partition sizes are roughly:

date=2024-03-07 -> 1.00 GB
date=2024-03-06 -> 52.2 MB
date=2024-03-06 -> 54.4 MB
date=2024-03-06 -> 60 MB

After reading and upserting, I consistently get output files of about 11.7 MB:

(screenshot: the resulting parquet files, each around 11.7 MB)
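For reference, a minimal sketch (assuming a SparkSession named spark is in scope and the table lives on Hadoop-compatible storage; this is not part of the original post) of one way to list the resulting parquet file sizes under a partition:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical partition path; substitute the actual table/partition location.
val partitionPath = new Path(s"$storageSystemPath/${hudiTableConfig.tableName}/date=2024-03-07")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(partitionPath)
  .filter(_.getPath.getName.endsWith(".parquet"))
  .foreach(f => println(f"${f.getPath.getName}%s -> ${f.getLen / 1024.0 / 1024.0}%.1f MB"))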

Any suggestion as to where things are going wrong?


There is 1 answer below.

Answered by parisni

From your details I assume you have only upserted into the table once, i.e. a single commit.

To size files, Hudi uses the previous commit's stats to understand the workload (how large a row typically is for the current table).

So you would expect Hudi to converge to the specified sizes after a few commits.
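If you also want reasonable file sizes on the very first commit (before any commit stats exist), Hudi falls back to a configured record-size estimate. A hedged sketch of overriding it, where the 500-byte value is purely illustrative and not taken from the question:

val sizingOptions = hudiOptions ++ Map(
  // Estimated average record size used by copy-on-write file sizing when no
  // prior commit stats are available (the default is believed to be 1024 bytes).
  "hoodie.copyonwrite.record.size.estimate" -> "500"
)

updatedDataFrame.write
  .format(HudiConstants.HudiFormat)
  .options(sizingOptions)
  .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .mode(saveMode.get)
  .save(s"$storageSystemPath/${hudiTableConfig.tableName}/")

Once a few commits have been written, Hudi's own stats take over and the estimate matters less.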