Suppose I have a DataFrame of 10 GB in which one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"c1"); will all the records then be shuffled to the same partition? If so, wouldn't that exceed the maximum size per partition? How would repartition work in this case?
There are 2 best solutions below

The value of 128 MB comes from the Spark property spark.sql.files.maxPartitionBytes, which applies only when you create a DataFrame by reading from a file-based source. See https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options for details. Its purpose is to achieve maximum parallelism while reading. So if you create a DataFrame by transforming another DataFrame or by joining two DataFrames, its partitions are not affected by this value. For example, you can read 10 GB of data and do a df.repartition(1), and this will work without any issues (assuming your executor has enough memory).
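A minimal sketch of that behaviour, assuming a local SparkSession and a hypothetical 10 GB Parquet path (any file-based source would behave the same):

```scala
import org.apache.spark.sql.SparkSession

object MaxPartitionBytesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxPartitionBytes demo")
      .master("local[*]")
      .getOrCreate()

    // spark.sql.files.maxPartitionBytes (128 MB by default) only shapes the
    // partitions created while reading a file-based source.
    val df = spark.read.parquet("/path/to/10gb-dataset") // hypothetical path
    println(s"Partitions after read: ${df.rdd.getNumPartitions}") // roughly 10 GB / 128 MB

    // Once you shuffle, the setting no longer applies: repartition(1)
    // collapses everything into one partition, however large that makes it.
    val single = df.repartition(1)
    println(s"Partitions after repartition(1): ${single.rdd.getNumPartitions}") // 1

    spark.stop()
  }
}
```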
The configuration spark.sql.files.maxPartitionBytes is effective only when reading files from file-based sources. So when you execute repartition, you reshuffle your existing DataFrame, and the number of output partitions is defined by repartition's logic. In your case, since every record has the same value of c1, hash partitioning sends all of them to the same partition: you end up with a single non-empty partition holding the entire 10 GB, while the remaining shuffle partitions stay empty. Nothing enforces the 128 MB limit after a shuffle, so the oversized partition is allowed, although it can cause memory pressure or spilling on the executor that processes it.
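To make this concrete, here is a small sketch (column name and data invented to mirror the question) that counts how many shuffle partitions actually receive rows after repartition($"c1"):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object ConstantKeyRepartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition on a constant column")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Every row carries the same value in c1, mirroring the question.
    val df = spark.range(1000000L).withColumn("c1", lit("same-value"))

    val byC1 = df.repartition($"c1")

    // repartition($"c1") hash-partitions rows into spark.sql.shuffle.partitions
    // buckets (200 by default), but a single distinct key can only hash to
    // one bucket, so exactly one partition holds all the data.
    val sizes = byC1.rdd.mapPartitions(it => Iterator(it.size)).collect()
    println(s"Total shuffle partitions: ${sizes.length}")   // 200 by default
    println(s"Non-empty partitions: ${sizes.count(_ > 0)}") // 1

    spark.stop()
  }
}
```

With a real 10 GB dataset, that one non-empty partition would hold the full 10 GB, well past the 128 MB read-time target, which is exactly the skew the question describes.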