I need to merge small Parquet files. I have many small Parquet files in HDFS, and I would like to combine them into files of roughly 128 MB each. I read all the files with spark.read(), called repartition() on the resulting DataFrame, and wrote it back to an HDFS location.
My issue is that the original data is approximately 7.9 GB, but after repartitioning and saving to HDFS it grows to nearly 22 GB.
I have tried repartition(), repartitionByRange(), and coalesce(), but none of them solved the problem.
I think it is connected to your repartition operation. You are using .repartition(10), so Spark uses round-robin partitioning to redistribute the data, which means the row ordering is likely to change. The order of the data matters a lot for compression; you can read more in this question.
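To see why ordering matters so much, here is a small self-contained illustration (plain Python with zlib, not Spark, and synthetic data) of how the same values compress far better when they are clustered than when they are shuffled, which is roughly what round-robin repartitioning does to Parquet row groups:

```python
import random
import zlib

# A low-cardinality "column": 100,000 values drawn from 100 distinct codes.
values = [i % 100 for i in range(100_000)]

# Clustered order: long runs of identical bytes, very compression-friendly.
sorted_bytes = bytes(sorted(values))

# Shuffled order: same values, but the locality is destroyed,
# similar to what happens after a round-robin repartition.
random.seed(42)
shuffled = values[:]
random.shuffle(shuffled)
shuffled_bytes = bytes(shuffled)

sorted_size = len(zlib.compress(sorted_bytes))
shuffled_size = len(zlib.compress(shuffled_bytes))
print(sorted_size, shuffled_size)  # sorted data compresses dramatically better
```

The same effect applies to Parquet's run-length and dictionary encodings plus the page-level codec (snappy, gzip, zstd): identical data can occupy several times more space on disk once its ordering is randomized.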
You may try adding a sort, or repartitioning by an expression instead of only a number of partitions, to optimize the file size.