How do wide transformations actually work with respect to the shuffle partitions configuration?
If I have the following program:
spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark
.read
.option("inferSchema", "true")
.option("header", "true")
.csv("...\input.csv")
df.sort("sal").take(200)
Does that mean the sort would output 5 new partitions (as configured), and that Spark then takes 200 records from those 5 partitions?
As mentioned in the comment, your sample code is not affected, because this sort does not trigger a shuffle: a sort immediately followed by take(200) is planned as a single TakeOrderedAndProject step, so in the plan you will find something like this
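You can verify this yourself by printing the physical plan. This is a sketch assuming a local SparkSession and a CSV with a numeric sal column (the path and column name are taken from the question and purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "5")

val df = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("input.csv") // illustrative path

// sort + limit/take is collapsed into TakeOrderedAndProject, so no
// Exchange (shuffle) node appears and the "5" setting is not used here.
df.sort("sal").limit(200).explain()
```

If the plan contained an Exchange node, that would be the point where the shuffle (and the partition setting) comes into play.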
But when you later do a join, for example (or any other wide transformation that does trigger a shuffle), you can see that during the exchange the value of this parameter is used (check the "number of partitions" row)
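A minimal sketch of this, assuming the same SparkSession; the broadcast threshold is disabled here only to force a shuffle-based join on this small synthetic data, and AQE is turned off so the configured count is not changed at runtime:

```scala
// Force a shuffle-based join and keep the configured partition count fixed.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.shuffle.partitions", "5")

val left  = spark.range(0, 1000).withColumnRenamed("id", "k")
val right = spark.range(0, 1000).withColumnRenamed("id", "k")

val joined = left.join(right, "k")
joined.explain() // plan shows Exchange hashpartitioning(k, 5)

// The post-shuffle result has exactly the configured number of partitions.
println(joined.rdd.getNumPartitions) // 5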
This may not be the case when adaptive query execution (AQE) is enabled; in that situation it may look like this
Now you can see that at the beginning the value from spark.sql.shuffle.partitions was used, but later, due to AQE, Spark changed the plan and on the shuffle read the number of partitions was changed to 8 (you may also notice that the sort-merge join was changed to a broadcast hash join - that was also done by AQE)
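To observe AQE adjusting the partition count yourself, a sketch along these lines (again assuming an existing SparkSession; the aggregation is hypothetical and only serves to produce a shuffle):

```scala
import org.apache.spark.sql.functions.expr

// With AQE on, shuffle partitions can be coalesced (or split, for skew)
// at runtime, so the final count may differ from the configured value.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "5")

val grouped = spark.range(0, 1000000)
  .groupBy(expr("id % 7").as("bucket"))
  .count()

grouped.collect() // materialize so AQE can re-optimize the plan
// After execution the number of shuffle partitions may no longer be 5;
// the Spark UI's SQL tab shows the AQE-adjusted plan.
println(grouped.rdd.getNumPartitions)
```

Whether AQE coalesces or splits depends on the actual shuffle data sizes, which is exactly why the number seen in the plan can differ from the static configuration.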