We have a Scalatest + Spark based test. The data transformation is fairly complex, with lots of transformations, and the final step is a Delta merge.
The tests have minimal input data (around 20 KB) and produce a small Delta table (the .parquet output is around 10 KB). On profiling, it was found that the code below takes around 90% of the total execution time (around 3 minutes). Note that the Delta table was empty before the merge.
deltaTable.as("existing")
.merge(dfNewData.as("new"), "new.GUID = existing.GUID and new.RSid = existing.RSid")
.whenMatched("existing.GUID <> 'U'")
.update(Map("Id1" -> col("new.Id1"),... => around 20 columns
))
.whenMatched("existing.GUID = 'U' and new.Id1 = existing.Id1")
.delete()
.whenNotMatched()
.insertAll()
.execute()
The time taken is almost identical in a Windows IntelliJ setup and in Azure DevOps CI.
Any idea why it is taking such a long time for such a small amount of data, and how can it be improved?
Per the comments, it takes 3 minutes to do the merge and 1.5 minutes to do a simple insert/append.
If dfNewData involves joins, calculations, etc., you may incur two runs of the dfNewData plan, since a Delta merge typically scans the source more than once; caching should reduce that. Note that caching can be more expensive than writing the intermediate result to a temporary location and reading it back from there.
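For example, a minimal sketch of both approaches, assuming the merge from the question; dfCached, dfStable, and the staging path are illustrative names, not part of the original code:

import org.apache.spark.storage.StorageLevel

// Persist the derived DataFrame so the merge does not re-run the full
// upstream plan (joins, calculations, ...) on each pass over the source.
val dfCached = dfNewData.persist(StorageLevel.MEMORY_AND_DISK)
dfCached.count() // force materialization before the merge

deltaTable.as("existing")
  .merge(dfCached.as("new"), "new.GUID = existing.GUID and new.RSid = existing.RSid")
  // ... same whenMatched / whenNotMatched clauses as in the question ...
  .execute()

dfCached.unpersist()

// Alternative: stage the data through storage instead of the cache.
// The path below is purely illustrative.
dfNewData.write.mode("overwrite").parquet("/tmp/dfNewData_staging")
val dfStable = spark.read.parquet("/tmp/dfNewData_staging")

Staging through Parquet trades a one-off write for a stable, cheap-to-rescan source, which can work out cheaper than caching when the upstream plan is wide or executor memory is tight.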
Please attempt: