I'm working on a Streaming Pipeline in Beam, and one of my PTransforms requires a Side input which needs to be refreshed every hour.
Despite reading the docs, I am having trouble understanding if my approach is correct.
For context: The side input data comes from BigQuery, It should be PCollection[String], which will later be aggregated into a PCollectionView[Set[String]]
So I am creating a FixedWindow of 1 hour with triggering of AfterProcessingTime.pastFirstElementInPane()
My concern is that I'm not sure what happens during the first hour the pipeline is executed at.
Here is the code applying the windowing and triggering strategy:
pipeline
.apply(
"ReadFromBQ",
// elided
)
.apply(
"ApplyWindow",
Window
.into[String](FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterProcessingTime.pastFirstElementInPane())
.withAllowedLateness(Duration.standardSeconds(30))
)
Supposing we have this idealised timeline:
00:00:00 -> Pipeline starts
00:00:00 -> 1st Window starts
00:01:00 -> First element arrives into pipeline and is placed into the window
00:01:05 -> Last element arrives into the pipeline and is placed into the window
00:59:59 -> Window closes
01:00:00 -> next window starts
Would the first pane be emitted at approximately 00:01:05 when the last element arrived or 00:59:59 when the window closes? Assume that the entire BQ result is very small and will be read into the pipeline in less than a few seconds.
If the first pane is emitted only after the window closes, would it mean that the transform requiring the input won't run until 1 hour has passed?