I am working with a layered Hudi table architecture. I have a Spark Structured Streaming job that reads from a Hudi source table, applies some transformations, and writes to a separate destination Hudi table. Both tables are on Amazon S3. When the streaming job is paused (due to a failure or otherwise) and later restarted, it resumes from the source-table commit at which it paused. Depending on how long the job was paused, that commit may have since been archived or cleaned.
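For reference, here is a minimal sketch of the pipeline, assuming a spark-shell style snippet; the S3 paths, table name, record key, precombine field, and transformation are hypothetical placeholders, not my actual job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("hudi-layered-stream")
  .getOrCreate()

// Streaming read from the source Hudi table on S3 (hypothetical path).
val sourceDf = spark.readStream
  .format("hudi")
  .load("s3://my-bucket/source_hudi_table")

// Placeholder for the real transformation logic.
val transformed = sourceDf.filter("some_column IS NOT NULL")

// Streaming write to the destination Hudi table. checkpointLocation is
// where Spark persists its read progress between restarts.
transformed.writeStream
  .format("hudi")
  .option("hoodie.table.name", "destination_hudi_table")          // hypothetical
  .option("hoodie.datasource.write.recordkey.field", "id")        // hypothetical
  .option("hoodie.datasource.write.precombine.field", "ts")       // hypothetical
  .option("checkpointLocation", "s3://my-bucket/checkpoints/job") // hypothetical
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("s3://my-bucket/destination_hudi_table")                 // hypothetical
  .awaitTermination()
```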
How does the Spark Structured Streaming job pick which commit on the source table to start reading from in the following scenarios?
- The job was previously streaming but was paused for a duration n (where n is long enough for the last commit the job read to have been archived), while the source table kept receiving new commits?
- The job was previously streaming but was paused for a duration n (where n is NOT long enough for the last commit the job read to have been archived), while the source table kept receiving new commits?
- The job is being started for the first time, but the source table has been receiving commits for a while?
Where, in the table metadata or folder structure, does the Spark Structured Streaming job store the information about which Hudi source-table commit it read last and which it will read next?