pyspark: insert into dataframe if key not present or row.timestamp is more recent

386 Views Asked by Federico Ponzi At 17 August 2025 at 10:58

I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).

I would like to write each new row as follows:

if the key is not present, insert the row
if the key is present, update the row only if the timestamp column of the new row is more recent
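
To pin down the rule before involving Spark or Kudu, here is a minimal plain-Python sketch of it. The function name `upsert_if_newer`, the dict-of-rows layout, and the `timestamp` and `value` field names are all assumptions made for illustration; they do not come from the actual table.

```python
from datetime import datetime


def upsert_if_newer(existing, incoming):
    """Return a merged copy of `existing` after applying `incoming`.

    Both arguments map a key to a row (a dict with a "timestamp" field).
    A row from `incoming` is taken when its key is absent from `existing`,
    or when its timestamp is strictly more recent than the stored one.
    """
    merged = dict(existing)
    for key, row in incoming.items():
        current = merged.get(key)
        if current is None or row["timestamp"] > current["timestamp"]:
            merged[key] = row
    return merged


# Hypothetical sample data.
existing = {
    "a": {"timestamp": datetime(2018, 11, 1), "value": 1},
    "b": {"timestamp": datetime(2018, 11, 5), "value": 2},
}
incoming = {
    "a": {"timestamp": datetime(2018, 11, 8), "value": 10},  # newer: replaces
    "b": {"timestamp": datetime(2018, 11, 3), "value": 20},  # older: ignored
    "c": {"timestamp": datetime(2018, 11, 8), "value": 30},  # new key: inserted
}

merged = upsert_if_newer(existing, incoming)
```

Rows with an equal timestamp are left untouched here, since "more recent" is read as strictly greater; that reading is an assumption.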


Bernhard Stadler On 09 November 2018 at 13:35

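I think what you need is a left outer join of the new data with the existing table. From the joined result, keep the rows whose key has no match in the existing table or whose timestamp is more recent than the existing one. Save that result into a temporary table first, and then move it into the original table with SaveMode.Append.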

You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (didn't check how well it works, though, and whether it takes existing data into account).
