My code looks like this:
sql = '''
SELECT ...
FROM a
LEFT JOIN b ON ...
LEFT JOIN c ON ...
LEFT JOIN d ON ...
'''
df = spark.sql(sql)
(df
.repartition('col')
.write
.format('parquet')
.mode('overwrite')
.partitionBy('col')
.option(...)
.saveAsTable('...')
)
The final plan shows 2 broadcast joins and 1 SortMergeJoin. The SortMergeJoin is a LEFT JOIN between two tables with 100+ and 200+ million rows, and it is skewed. I enabled AQE and played with some configs (e.g. spark.sql.shuffle.partitions=40000, spark.default.parallelism=400), but I don't see AQE coalescing partitions, and the plan has no AdaptiveSparkPlan node. Many of the AQE examples I have seen use GROUP BY. Does AQE only work with GROUP BY? Is there any reason why the AdaptiveSparkPlan node is not shown for my query?
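For reference, here is a minimal sketch of the AQE-related settings in question and a simple way to check a plan for the AdaptiveSparkPlan node. It assumes an existing SparkSession named spark on Spark 3.x. The helper names apply_aqe_settings and plan_uses_aqe are hypothetical and only for illustration; the three config keys are the standard AQE switches, not necessarily the full set of configs I tried.

```python
# Standard Spark 3.x AQE switches (values are strings, as Spark stores them).
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe_settings(spark, settings=AQE_SETTINGS):
    """Apply each setting through spark.conf.set (hypothetical helper)."""
    for key, value in settings.items():
        spark.conf.set(key, value)


def plan_uses_aqe(plan_text):
    """Return True if the plan text contains an AdaptiveSparkPlan node."""
    return "AdaptiveSparkPlan" in plan_text


# Typical usage with a real session (not executed here):
#   apply_aqe_settings(spark)
#   df = spark.sql(sql)
#   df.explain()   # look for "AdaptiveSparkPlan isFinalPlan=..." at the root
```

The explain() output before execution normally shows isFinalPlan=false when AQE is applied; the final plan only appears after the query has actually run.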
Thanks