I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).
I would like to insert the new data if:
- the key is not present
- if the key is present, update the row only if the timestamp column of the new row is more recent
I think what you need is a left outer join of the new data with the existing table, the result of which you first have to save into a temporary table, and then move it to the original table, with
SaveMode.Append
.You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (didn't check how well it works, though, and whether it takes existing data into account).