PySpark - Perform Merge in Synapse using Databricks Spark


We are in a tricky situation while performing ACID operations using Databricks Spark. We want to perform an UPSERT on an Azure Synapse table over a JDBC connection using PySpark. We are aware that of the write modes Spark provides, only two are useful in our case: APPEND and OVERWRITE. Based on these two modes, we thought of the options below:

  1. We will write the whole dataframe into a stage table, and use this stage table to perform a MERGE operation (~ UPSERT) with the final table. The stage table will be truncated / dropped after that.

  2. We will bring the target table data into Spark as well. Inside Spark we will perform the MERGE using Delta Lake and generate a final dataframe, which will be written back to the target table in OVERWRITE mode (a rough sketch of this is below).
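
To make option 2 concrete, here is roughly what we have in mind (just a sketch; jdbc_url, connection_props, updates_df, the join key id, and the paths / table names are placeholders):

```python
# Option 2 sketch: bring the target into Spark, MERGE with Delta Lake, write back in OVERWRITE mode.
from delta.tables import DeltaTable

# Bring the current target table into Spark
target_df = spark.read.jdbc(url=jdbc_url, table="dbo.target_table", properties=connection_props)

# Materialise it as a Delta table so we can run MERGE on it
target_df.write.format("delta").mode("overwrite").save("/tmp/delta/target_copy")
delta_target = DeltaTable.forPath(spark, "/tmp/delta/target_copy")

# MERGE the incoming dataframe (updates_df) into the Delta copy
(delta_target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Write the merged result back to Synapse, replacing the target
final_df = spark.read.format("delta").load("/tmp/delta/target_copy")
final_df.write.jdbc(url=jdbc_url, table="dbo.target_table", mode="overwrite", properties=connection_props)
```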

Considering the downsides:

In option 1, we have to use two tables just to write the final data. And in case both the stage and target tables are big, performing the MERGE operation inside Synapse is another herculean task and may take time.

In option 2, we have to bring the target table into Spark in-memory. Even though network IO is not much of a concern, as both Databricks and Synapse will be in the same Azure availability zone, it may lead to memory issues on the Spark side.

Are there any other feasible options? Or any recommendations?


There are 2 answers below.


Did you try creating a checksum so the merge/upsert only touches rows that have an actual data change?
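
A rough sketch of the idea (key and column names are placeholders; source_df is the incoming data and target_df is the current table contents):

```python
from pyspark.sql import functions as F

def with_checksum(df, cols):
    # Deterministic hash over the tracked columns; nulls are normalised so they hash consistently
    return df.withColumn(
        "row_checksum",
        F.sha2(F.concat_ws("||", *[F.coalesce(F.col(c).cast("string"), F.lit("")) for c in cols]), 256),
    )

tracked_cols = ["col_a", "col_b", "col_c"]        # non-key columns to compare
source = with_checksum(source_df, tracked_cols)   # incoming data
target = with_checksum(target_df, tracked_cols)   # current contents of the target table

# Keep only rows that are new, or whose contents actually changed; only these go through the upsert
changed = (source.alias("s")
    .join(target.alias("t"), F.col("s.id") == F.col("t.id"), "left")
    .where(F.col("t.id").isNull() | (F.col("s.row_checksum") != F.col("t.row_checksum")))
    .select("s.*"))
```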


The answer would depend on many factors not listed in your question; it's a very open-ended question.

(Given the way your question is phrased, I'm assuming you're using Dedicated SQL Pools and not on-demand Synapse.)

Here are some thoughts:

  • You'll be using Synapse's compute in option 1 and the Spark cluster's compute in option 2. Compare the cost.
    • Pick the lower cost.
  • Reads and writes between Spark and Synapse using their driver go through the Data Lake as a stage. I.e., while reading a table from Synapse into a dataframe in Spark, the driver will first make Synapse export the data to the Data Lake (as Parquet, IIRC) and then read those files from the Data Lake to create the dataframe. This scales nicely if you're talking about tens of millions or billions of rows, but the staging overhead can hurt performance if row counts are low (tens to hundreds of thousands). A sketch of this staged read/write is after this list.
    • Test and pick the faster one.
  • Remember that Synapse is not like a traditional MySQL or SQL-Server. It's an MPP DB.
    • "performing MERGE operation inside Synapse is another herculean task and May take time" is a wrong statement. It scales just like a Spark cluster.
    • "It may lead to memory issues on the Spark side": yes and no. On one hand, all the data isn't going to be loaded into a single worker node. On the other hand, yes, you do need enough memory on each node to do its own part.
  • Although Synapse can be scaled up and down dynamically, I've seen it take up to 40 minutes to complete a scale-up. Databricks, on the other hand, is fully on-demand, and you can probably get away with turning on the cluster, doing the upsert, and shutting the cluster down. With Synapse you'll probably have other clients using it, so you may not be able to shut it down.
    • So with Synapse you'll either have to live with 40-80 minutes of downtime for each upsert (scale up, upsert, scale down), OR
    • pay for a high-DWU flat rate all the time, even though your usage is high only when you upsert and otherwise pretty low.
  • Lastly, remember that MERGE is in preview at the time of writing this. That means no Sev-A support cases / immediate support if something breaks in your prod because you're using MERGE.
    • You can always use DELETE + INSERT instead (a sketch is after this list). This assumes the delta you receive has all columns from the target table and not just the updated ones.
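
To illustrate the staged read/write mentioned in the second bullet, this is roughly what it looks like with the Databricks Synapse connector (a sketch only; the JDBC URL, storage container, and table/column names are placeholders, and I'm relying on the connector's postActions option to run the MERGE against the stage table):

```python
# Staged read/write between Spark and Synapse: the connector unloads/loads data
# through tempDir in the Data Lake, which is the staging behaviour described above.
temp_dir = "abfss://staging@yourstorageaccount.dfs.core.windows.net/synapse-tmp"

# Reading: Synapse first exports the table to temp_dir, then Spark reads those files
target_df = (spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", jdbc_url)
    .option("tempDir", temp_dir)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.target_table")
    .load())

# Writing (option 1 style): load a stage table, then run the MERGE in Synapse as a post-action
merge_sql = """
MERGE dbo.target_table AS t
USING dbo.stage_table AS s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.col_a = s.col_a, t.col_b = s.col_b
WHEN NOT MATCHED THEN INSERT (id, col_a, col_b) VALUES (s.id, s.col_a, s.col_b);
"""

(updates_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbc_url)
    .option("tempDir", temp_dir)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.stage_table")
    .option("postActions", merge_sql)
    .mode("overwrite")
    .save())
```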
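
And for the last point, the DELETE + INSERT alternative is just two statements run against the stage table instead of the MERGE, reusing the same write-to-stage pattern (again a sketch with placeholder names; it assumes the incoming delta carries every target column):

```python
# DELETE + INSERT instead of MERGE (MERGE is still in preview); runs after the stage table is loaded
delete_insert_sql = """
DELETE FROM dbo.target_table WHERE id IN (SELECT id FROM dbo.stage_table);
INSERT INTO dbo.target_table (id, col_a, col_b) SELECT id, col_a, col_b FROM dbo.stage_table;
"""

(updates_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbc_url)
    .option("tempDir", temp_dir)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.stage_table")
    .option("postActions", delete_insert_sql)
    .mode("overwrite")
    .save())
```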