Suppose I have a 10 GB DataFrame in which one of the columns, "c1", has the same value for every record, and each partition is at most 128 MB (the default value). If I call repartition($"c1"), will all the records be shuffled into the same partition? If so, wouldn't that exceed the maximum size per partition? How does repartition work in this case?
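
For reference, a minimal sketch of the setup being described (the column name and the small row count are illustrative stand-ins for the 10 GB dataset):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("repartition-question")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Every row carries the same value in c1, mimicking the dataset in question.
val df = spark.range(0L, 1000000L).withColumn("c1", lit("same-value"))

// The call in question: hash-partition by a column with a single distinct value.
val byC1 = df.repartition($"c1")
```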


Accepted answer:

The configuration spark.sql.files.maxPartitionBytes only takes effect when reading from file-based sources. When you call repartition, you reshuffle the existing DataFrame, and the placement of rows is determined by the repartitioning expression: with repartition($"c1") and a single value in c1, every row hashes to the same shuffle partition, so all of your data ends up in one (possibly very large) partition.
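
A quick way to see this (a sketch using a small constant-key DataFrame in place of the 10 GB one): repartition($"c1") still produces spark.sql.shuffle.partitions output partitions, but only one of them receives any rows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("repartition-skew-check")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = spark.range(0L, 1000000L).withColumn("c1", lit("same-value"))
val repartitioned = df.repartition($"c1")

// spark.sql.shuffle.partitions (200 by default, absent AQE coalescing)
// partitions exist after the shuffle...
println(s"Partitions: ${repartitioned.rdd.getNumPartitions}")

// ...but all rows hash to the same key, so only one partition is non-empty.
val nonEmpty = repartitioned.rdd
  .mapPartitions(it => Iterator(if (it.hasNext) 1 else 0))
  .sum()
println(s"Non-empty partitions: $nonEmpty")
```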

Another answer:

The 128 MB value comes from the Spark property spark.sql.files.maxPartitionBytes, which only applies when you create a DataFrame by reading a file-based source; see https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options for details. Its purpose is to maximize parallelism while reading. If you create a DataFrame by transforming another DataFrame or by joining two DataFrames, the partitions are not affected by this value. For example, you can read 10 GB of data and then do a df.repartition(1), and it will work without any issues (assuming your executor has enough memory to hold that single partition).
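
A sketch of the contrast (the path is hypothetical): maxPartitionBytes shapes only the initial scan, while an explicit repartition overrides whatever partitioning the scan produced.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("maxPartitionBytes-vs-repartition")
  .master("local[*]")
  .getOrCreate()

// Reading a file-based source: Spark splits the input into chunks of at most
// spark.sql.files.maxPartitionBytes (128 MB by default), so a 10 GB input
// yields on the order of 80 scan partitions.
val input = spark.read.parquet("/data/ten-gb-table") // hypothetical path
println(s"Scan partitions: ${input.rdd.getNumPartitions}")

// An explicit repartition ignores that limit entirely: all 10 GB land in
// one partition, which works as long as the executor can hold it.
val single = input.repartition(1)
println(s"After repartition(1): ${single.rdd.getNumPartitions}")
```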