I am building a pipeline which reads data incrementally from a source table based on the timestamp value, let's say the timestamp column name is warehouse_created_time. The value of this warehouse_created_time is assigned in the source datawarehouse as part of a batch load. The incremental pipeline needs to run multiple times a day.
The problem I am trying to solve is, there can be a situation, when the records of a long running batch will have less warehouse_created_time (as timestamp will be reserved at the start of the batch) than a short batch submitted later. And if the incremental process ran while long batch was still in progress then it will take the committed records of the short job, and therefore the target table will have records which have higher warehouse_created_time than the batch which was still running. The records of the longer batch will never be brought to the target database.
How to handle this situation, or better prevent it from happening ?