Spark repartition issue with file size


I need to merge small Parquet files. I have multiple small Parquet files in HDFS, and I would like to combine them into files of roughly 128 MB each. So I read all the files with spark.read(), called repartition() on the result, and wrote it back to an HDFS location.
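Roughly, the pipeline looks like this (a minimal sketch; the HDFS paths and the partition count are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

# Read all the small Parquet files under the input directory.
df = spark.read.parquet("hdfs:///data/input/")

# ~7.9 GB of input / ~128 MB per target file ≈ 62 partitions.
df.repartition(62).write.mode("overwrite").parquet("hdfs:///data/output/")
```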

My issue is that I have roughly 7.9 GB of data, but after the repartition and save to HDFS the output grows to nearly 22 GB.

I have tried repartition(), repartitionByRange(), and coalesce(), but none of them solve the problem.


1 Answer


I think it may be connected to your repartition operation. You are using .repartition(10), so Spark will use round-robin partitioning to redistribute your data, which means the row ordering will probably change. The order of the data matters for compression; you can read more in this question.
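You can verify this in the physical plan; a sketch (df is any DataFrame read as above):

```python
# A number-only repartition appears as a round-robin exchange in the
# physical plan, e.g. a line like "Exchange RoundRobinPartitioning(10)".
df.repartition(10).explain()
```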

You may try adding a sort, or repartitioning your data by an expression instead of by a number of partitions only, to optimize the file size.
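For example, something like this (a sketch; some_column is a hypothetical column that clusters similar rows together):

```python
# Option 1: repartition by an expression instead of only a number,
# so rows with the same key land in the same output file.
df.repartition(62, "some_column") \
  .write.mode("overwrite").parquet("hdfs:///data/output/")

# Option 2: keep the partition count but sort within each partition,
# which helps Parquet's run-length and dictionary encoding.
df.repartition(62) \
  .sortWithinPartitions("some_column") \
  .write.mode("overwrite").parquet("hdfs:///data/output/")
```

Either way, similar values stored next to each other compress much better than round-robin-shuffled rows, which is usually why the output size balloons.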