Spark repartition issue with file size


I need to merge small Parquet files. I have multiple small Parquet files in HDFS, and I would like to combine them into files of roughly 128 MB each. So I read all the files with spark.read(), called repartition() on the result, and wrote it back to an HDFS location.
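Roughly, the pipeline looks like this (a minimal sketch; the HDFS paths and the partition count are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

# Read all the small Parquet files under the input directory.
df = spark.read.parquet("hdfs:///data/input/")

# ~7.9 GB of input / ~128 MB per target file ≈ 62 partitions.
df.repartition(62).write.mode("overwrite").parquet("hdfs:///data/output/")
```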

My issue is that I have roughly 7.9 GB of data, but after the repartition and save to HDFS the output grows to nearly 22 GB.

I have tried repartition(), repartitionByRange(), and coalesce(), but none of them solve the problem.


1 Answer


I think it may be connected to your repartition operation. You are using .repartition(10), so Spark will use round-robin partitioning to redistribute your data, which means the row ordering will probably change. The order of the data matters for compression; you can read more in this question.
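You can verify this in the physical plan; a sketch (df is any DataFrame read as above):

```python
# A number-only repartition appears as a round-robin exchange in the
# physical plan, e.g. a line like "Exchange RoundRobinPartitioning(10)".
df.repartition(10).explain()
```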

You may try adding a sort, or repartitioning your data by an expression instead of by a number of partitions only, to optimize the file size.
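For example, something like this (a sketch; some_column is a hypothetical column that clusters similar rows together):

```python
# Option 1: repartition by an expression instead of only a number,
# so rows with the same key land in the same output file.
df.repartition(62, "some_column") \
  .write.mode("overwrite").parquet("hdfs:///data/output/")

# Option 2: keep the partition count but sort within each partition,
# which helps Parquet's run-length and dictionary encoding.
df.repartition(62) \
  .sortWithinPartitions("some_column") \
  .write.mode("overwrite").parquet("hdfs:///data/output/")
```

Either way, similar values stored next to each other compress much better than round-robin-shuffled rows, which is usually why the output size balloons.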