Pre-populate a silver Delta table from a bronze table using a batch job, then stream to it from the same table


I have a pipeline like this:

kafka->bronze->silver

The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming.

I changed the silver schema, so I want to reload from the bronze into silver using the new schema. Unfortunately, the reload is taking forever, and I'm wondering if I can load the data more quickly using a batch job, and then turn the stream back on.

I am concerned that the checkpoint will tell the stream from bronze->silver to pick up where it left off and it will write a bunch of duplicates that I will then need to remove. Is there a way I can advance the checkpoint with the batch load, or play other tricks?
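One pattern that avoids the duplicate problem is to not reuse the old checkpoint at all: note the bronze table's current Delta version, batch-load everything up to that version into silver, then start the stream from the *next* version with a fresh checkpoint directory. Below is a minimal sketch of that idea; the paths, the `to_silver_schema` transformation, and the checkpoint location are hypothetical placeholders, and it assumes the Delta Lake streaming source's `startingVersion` option and the `DeltaTable.history` API.

```python
# Sketch: backfill silver in one batch, then stream only the new bronze data.
# Paths and to_silver_schema are placeholders, not from the original post.
from delta.tables import DeltaTable

bronze_path = "/mnt/delta/bronze"   # hypothetical
silver_path = "/mnt/delta/silver"   # hypothetical

# 1. Note the bronze version the batch load will cover.
start_version = DeltaTable.forPath(spark, bronze_path).history(1).head()["version"]

# 2. Batch-load bronze as of that version into silver (one big transaction,
#    no per-microbatch commit overhead).
(spark.read.format("delta")
      .option("versionAsOf", start_version)
      .load(bronze_path)
      .transform(to_silver_schema)          # your existing transformation
      .write.format("delta")
      .mode("overwrite")
      .option("overwriteSchema", "true")    # silver has a new schema
      .save(silver_path))

# 3. Restart the stream from the *next* bronze version with a brand-new
#    checkpoint location, so it never replays what the batch job wrote.
(spark.readStream.format("delta")
      .option("startingVersion", start_version + 1)
      .load(bronze_path)
      .transform(to_silver_schema)
      .writeStream.format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/silver_v2")  # new location
      .start(silver_path))
```

The key point is that a fresh checkpoint plus `startingVersion` effectively "advances" the stream past the batch-loaded data, rather than trying to rewrite the old checkpoint's offsets in place.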

Will that be faster than just letting the stream run? I get the feeling that it is spending a lot of resources writing microbatch transactions.
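On the microbatch-overhead point: if the goal is just to let the existing stream catch up faster, an alternative to a separate batch job is to keep the same checkpoint but drain the backlog in fewer, larger batches. A sketch, assuming Spark 3.3+'s `availableNow` trigger and again using hypothetical paths:

```python
# Keep the existing checkpoint, but process the backlog in large batches
# and stop when caught up. Paths and to_silver_schema are placeholders.
(spark.readStream.format("delta")
      .option("maxFilesPerTrigger", 10000)   # much larger microbatches than the default 1000
      .load("/mnt/delta/bronze")
      .transform(to_silver_schema)
      .writeStream.format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/silver")  # same checkpoint as before
      .trigger(availableNow=True)            # process everything available, then stop
      .start("/mnt/delta/silver"))
```

This trades per-batch commit overhead for bigger batches, which is usually where a backfill spends most of its time; whether it beats a dedicated batch job depends on cluster size and how much history is in bronze.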

Any suggestions greatly appreciated!!!
