Deleting from a DeltaTable using a dataframe of keys


I want to perform a delete operation on a DeltaTable, where the keys to be deleted are already present in a DataFrame.

Currently I am collecting the DataFrame on the driver and then running the delete operation. However, this seems very inefficient to me.

(Something like the code below.)

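import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import spark.implicits._  // encoder needed by map(_.getLong(0))

// Pull every key onto the driver
val keys = keysDF
            .select("key")
            .map(_.getLong(0))
            .collect()

// Delete the rows whose key is in the collected array
DeltaTable.forPath(spark, "/path/to/table")
        .delete(col("key").isInCollection(keys))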

Is there a more efficient way to achieve this? I was thinking of somehow leveraging the fact that my keys are already distributed across the cluster.

Best answer

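Yes, Delta has a nice API for this: merge the keys DataFrame into the table and delete the matching rows, so the keys never have to be collected on the driver.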

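import io.delta.tables.DeltaTable

// Keep the keys as a distributed DataFrame; no collect() needed
val keys = keysDF
            .select("key")

// path is the table location, e.g. "/path/to/table"
val targetDeltaTable = DeltaTable.forPath(spark, path)

// Join the table (t) with the keys (k) and delete every matching row
targetDeltaTable.alias("t")
      .merge(
        keys.alias("k"),
        "t.key = k.key")
      .whenMatched().delete()
      .execute()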