I want to perform a delete operation on a DeltaTable, where the keys to be deleted are already present in a DataFrame.
Currently I collect the DataFrame on the driver and then run the delete operation. However, this seems very inefficient to me.
(Something like the code below.)
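import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import spark.implicits._ // provides the Encoder[Long] needed by .map

// Pull every key onto the driver as a local array
val keys = keysDF
  .select("key")
  .map(_.getLong(0))
  .collect()
// Delete the rows whose key appears in the collected array
DeltaTable.forPath(spark, "/path/to/table")
  .delete(col("key").isInCollection(keys))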
Is there a more efficient way to achieve this? I was thinking of somehow leveraging the fact that my keys are already distributed over the cluster.
Yes, there is a very nice API in Delta for this.
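
The answer does not name the API, so the following is only a guess at what it means: Delta's merge API, which can delete matched rows without collecting the keys on the driver. It is a minimal sketch, assuming the same `keysDF` with a `key` column and the same table path as in the question. The helper name `deleteByKeys`, its `tablePath` parameter, and the aliases `target` and `source` are made up for illustration.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: deletes every row of the Delta table at `tablePath`
// whose "key" matches a key in `keysDF`. The keys stay distributed; nothing
// is collected on the driver.
def deleteByKeys(spark: SparkSession, keysDF: DataFrame, tablePath: String): Unit = {
  DeltaTable.forPath(spark, tablePath)
    .as("target")
    .merge(
      // distinct() is a precaution so each target row matches at most one source row
      keysDF.select("key").distinct().as("source"),
      "target.key = source.key")
    .whenMatched()
    .delete()
    .execute()
}

// Usage, with the names from the question:
// deleteByKeys(spark, keysDF, "/path/to/table")
```

The merge runs as a join between the table and `keysDF`, so it should scale with the cluster rather than with driver memory. Whether it is actually faster than `isInCollection` for a small key set depends on the data, so it is worth measuring both.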