pyspark: insert into dataframe if key not present or row.timestamp is more recent


I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).

I would like to insert the new data if:

  • the key is not present
  • the key is present, but the timestamp column of the new row is more recent than that of the existing row (in which case the existing row should be updated)

There is 1 solution below

I think what you need is a left outer join of the new data with the existing table on the key, keeping only the new rows whose key has no match or whose timestamp is more recent than the stored one. Save the result into a temporary table first, and then move it to the original table with SaveMode.Append.

You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (I didn't check how well it works, though, or whether it takes existing data into account).