I am working with a layered Hudi table architecture. I have a Spark Structured Streaming job that reads from a Hudi source table, applies some transformations, and writes to a separate destination Hudi table. Both tables are on Amazon S3. When the streaming job is paused (due to a failure or otherwise) and later restarted, it resumes from the source-table commit at which it paused. Depending on how long the job was paused, that commit may have since been archived or cleaned.
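For reference, here is a minimal sketch of the pipeline, assuming a spark-shell style snippet; the S3 paths, table name, record key, precombine field, and transformation are hypothetical placeholders, not my actual job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("hudi-layered-stream")
  .getOrCreate()

// Streaming read from the source Hudi table on S3 (hypothetical path).
val sourceDf = spark.readStream
  .format("hudi")
  .load("s3://my-bucket/source_hudi_table")

// Placeholder for the real transformation logic.
val transformed = sourceDf.filter("some_column IS NOT NULL")

// Streaming write to the destination Hudi table. checkpointLocation is
// where Spark persists its read progress between restarts.
transformed.writeStream
  .format("hudi")
  .option("hoodie.table.name", "destination_hudi_table")          // hypothetical
  .option("hoodie.datasource.write.recordkey.field", "id")        // hypothetical
  .option("hoodie.datasource.write.precombine.field", "ts")       // hypothetical
  .option("checkpointLocation", "s3://my-bucket/checkpoints/job") // hypothetical
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("s3://my-bucket/destination_hudi_table")                 // hypothetical
  .awaitTermination()
```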
How does the Spark Structured Streaming job pick which commit on the source table to start reading from in the following scenarios?
- The job was previously streaming but was paused for a duration n (where n is long enough for the last commit the job read to have been archived), while the source table kept receiving new commits?
- The job was previously streaming but was paused for a duration n (where n is NOT long enough for the last commit the job read to have been archived), while the source table kept receiving new commits?
- The job is being started for the first time, but the source table has been receiving commits for a while?
Where, in the table metadata or folder structure, does the Spark Structured Streaming job store the information about which Hudi source-table commit it read last and which it will read next?