Spark set minimum output file size from Dataset write


I want to control the size of the ORC files my Java application writes to HDFS from a Spark Dataset.

My Datasets vary greatly in size:

  • 1st: 200 partitions, each 30MB (ORC)
  • 2nd: 200 partitions, each 0.6MB (ORC)
  • 3rd: 200 partitions, each 0.2MB (ORC)

I need to ensure that the minimum file size written is 120 MB; if the whole dataset is smaller than that, it should be written as a single partition.

I tried the following approach:

dataset.repartition(calcNumPartitions(dataset)).write().mode("overwrite").orc(path);

where:

static int calcNumPartitions(Dataset<Row> dataset) {
  // Spark's estimated dataset size, taken from the optimized logical plan.
  scala.math.BigInt sizeBytes = dataset.queryExecution().optimizedPlan().stats().sizeInBytes();
  // Parenthesize so the divisor is 120 * 1024 * 1024 (120 MB),
  // and clamp to at least one partition for small or empty datasets.
  return (int) Math.max(1, Math.ceil(sizeBytes.longValue() / (120.0 * 1024 * 1024)));
}

On the 1st dataset, this gave me 23 partitions with ~190 MB files.
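My suspicion is that stats().sizeInBytes() reflects Spark's estimate of the in-memory data size, not the compressed ORC footprint on disk, which would explain why the files come out larger than the 120 MB target. A quick comparison I could run after a write (the Hadoop FileSystem usage here is my own sketch; path is the same HDFS target as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Spark's logical-plan size estimate (what calcNumPartitions uses above).
long planEstimate = dataset.queryExecution().optimizedPlan()
        .stats().sizeInBytes().longValue();

// Actual compressed ORC bytes on HDFS after the write.
Path outPath = new Path(path);
FileSystem fs = outPath.getFileSystem(new Configuration());
long onDiskBytes = fs.getContentSummary(outPath).getLength();

System.out.printf("plan estimate: %d bytes, on disk: %d bytes%n", planEstimate, onDiskBytes);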

What would be a better solution to this problem?
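One direction I'm considering is a two-pass write: write once, measure the real on-disk size, then rewrite with a partition count derived from it. This is only a sketch (tmpPath is a scratch HDFS directory and spark is the SparkSession, both my assumptions; imports as in the previous snippet; the double write is obviously wasteful):

// First pass: write with Spark's current partitioning to a scratch directory.
dataset.write().mode("overwrite").orc(tmpPath);

// Measure the real compressed ORC footprint on HDFS.
FileSystem fs = new Path(tmpPath).getFileSystem(new Configuration());
long actualBytes = fs.getContentSummary(new Path(tmpPath)).getLength();

// Floor division so every file ends up at least ~120 MB
// (ceil would instead cap files at 120 MB); clamp to one partition.
long targetFileBytes = 120L * 1024 * 1024;
int numPartitions = (int) Math.max(1L, actualBytes / targetFileBytes);

// Second pass: read back and rewrite with the derived partition count.
spark.read().orc(tmpPath)
     .repartition(numPartitions)
     .write().mode("overwrite")
     .orc(path);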

So far I have only tried to control the written file size from the Dataset write path shown above.
