We have a Scalatest + Spark based test. The data transformation is fairly complex, with lots of transformations, and the final step is a Delta merge.
The tests have minimal input data (around 20 KB) and produce a small Delta table (the .parquet output is around 10 KB). On profiling, it was found that the code below takes around 90% of the total execution time (around 3 minutes). Note that the Delta table was empty before the merge.
deltaTable.as("existing")
.merge(dfNewData.as("new"), "new.GUID = existing.GUID and new.RSid = existing.RSid")
.whenMatched("existing.GUID <> 'U'")
.update(Map("Id1" -> col("new.Id1"),... => around 20 columns
))
.whenMatched("existing.GUID = 'U' and new.Id1 = existing.Id1")
.delete()
.whenNotMatched()
.insertAll()
.execute()
The time taken is almost identical in a Windows IntelliJ setup and in Azure DevOps CI.
Any idea why it is taking such a long time for such a small amount of data, and how can it be improved?
Per the comments, it takes 3 minutes to do the merge and 1.5 minutes to do a simple insert/append.
If dfNewData involves joins, calculations, etc., you may incur two runs of the dfNewData plan, since a Delta merge typically scans the source more than once; caching should reduce that. Note that caching can be more expensive than writing the intermediate result to a temporary location and reading it back from there.
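For example, a minimal sketch of both approaches, assuming the merge from the question; dfCached, dfStable, and the staging path are illustrative names, not part of the original code:

import org.apache.spark.storage.StorageLevel

// Persist the derived DataFrame so the merge does not re-run the full
// upstream plan (joins, calculations, ...) on each pass over the source.
val dfCached = dfNewData.persist(StorageLevel.MEMORY_AND_DISK)
dfCached.count() // force materialization before the merge

deltaTable.as("existing")
  .merge(dfCached.as("new"), "new.GUID = existing.GUID and new.RSid = existing.RSid")
  // ... same whenMatched / whenNotMatched clauses as in the question ...
  .execute()

dfCached.unpersist()

// Alternative: stage the data through storage instead of the cache.
// The path below is purely illustrative.
dfNewData.write.mode("overwrite").parquet("/tmp/dfNewData_staging")
val dfStable = spark.read.parquet("/tmp/dfNewData_staging")

Staging through Parquet trades a one-off write for a stable, cheap-to-rescan source, which can work out cheaper than caching when the upstream plan is wide or executor memory is tight.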
Please attempt: