I'm encountering an issue with a MERGE operation in a notebook that reads tables from a lakehouse. The merge fails with a duplicate error, yet when I query the table through SQL Server Management Studio (SSMS) connected to the lakehouse, it shows zero duplicates. Suspecting a caching problem, I tried disabling the cache with the following code:
spark.conf.set("spark.synapse.vegas.useCache", "false")
df.cache()
df.unpersist()
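For completeness, I also considered clearing the whole session cache and forcing a metadata refresh on the target table, along these lines (a sketch, assuming Scala and the lakehouse2.table1 target from the merge sample below):

// Drop every cached table and DataFrame in this Spark session
spark.catalog.clearCache()
// Invalidate any cached metadata/file listings for the merge target
spark.sql("REFRESH TABLE lakehouse2.table1")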
I also switched environments manually within the notebook and again found no duplicates. The perplexing part is that the issue only occurs when the notebook is triggered via a pipeline; when I run it manually there are no duplicates. What could cause this discrepancy, and how can it be addressed?
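To narrow it down, my next step is to log the duplicate key count from inside the pipeline-triggered run itself, something like this (a sketch, assuming Scala and the source view from the merge sample below):

// Count source keys that occur more than once at the moment the pipeline runs
val dupes = spark.sql("""
  SELECT Col1, Col2, COUNT(*) AS cnt
  FROM ViewCreatedOnTable2InAnotherLakeHouse
  GROUP BY Col1, Col2
  HAVING COUNT(*) > 1
""")
println(s"Duplicate source keys seen by this run: ${dupes.count()}")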
spark.conf.set("spark.synapse.vegas.useCache", "false")
df.cache()
df.unpersist()
The sample merge code looks like this:
%%sql
MERGE INTO lakehouse2.table1 AS Target
USING ViewCreatedOnTable2InAnotherLakeHouse AS Source
ON Target.Col1 = Source.Col1 AND Target.Col2 = Source.Col2
WHEN MATCHED THEN UPDATE SET
  Target.Col3 = Source.Col3,
  Target.Col4 = Source.Col4
WHEN NOT MATCHED THEN INSERT
  (Col1, Col2, Col3, Col4)
  VALUES (Source.Col1, Source.Col2, Source.Col3, Source.Col4)
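Since Delta's MERGE raises an error when more than one source row matches the same target row on the ON clause, one workaround I'm considering is de-duplicating the source inline, along these lines (a sketch reusing the names from the sample above; ORDER BY Col3 is just a placeholder tie-break):

%%sql
MERGE INTO lakehouse2.table1 AS Target
USING (
  -- keep exactly one row per (Col1, Col2) key
  SELECT Col1, Col2, Col3, Col4
  FROM (
    SELECT v.*, ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col3) AS rn
    FROM ViewCreatedOnTable2InAnotherLakeHouse v
  ) ranked
  WHERE rn = 1
) AS Source
ON Target.Col1 = Source.Col1 AND Target.Col2 = Source.Col2
WHEN MATCHED THEN UPDATE SET
  Target.Col3 = Source.Col3,
  Target.Col4 = Source.Col4
WHEN NOT MATCHED THEN INSERT
  (Col1, Col2, Col3, Col4)
  VALUES (Source.Col1, Source.Col2, Source.Col3, Source.Col4)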
The notebook uses Spark SQL and Scala.