It is really important for my application to always emit a "window finished" message, even if the window was empty, and I cannot figure out how to do this. My initial idea was to output an int for each record processed and use `Sum.integersGlobally`, giving me a singleton per window. I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use `withoutDefaults`, which then emits nothing if the window was empty.
How can I emit summary data for each window even if a given window was empty?
Asked by mr blobby
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in the gaps, since doing so would be cost-prohibitive in many cases. For a use case like yours, where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main `PCollection` with a heartbeat `PCollection` consisting of empty values. So for the example of `Sum.integersGlobally`, you would `Flatten` your main `PCollection<Integer>` with a secondary `PCollection<Integer>` that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. `FixedWindows` or `SlidingWindows`; `Sessions` are by definition non-enumerable).

Currently, the only way to do this is to write a data generator program that injects the necessary stream of zeroes into Pub/Sub, with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a `Flatten` to your code. The downside is that you have to run this as a separate job somewhere.

In the future (once our Custom Source API is available), we should be able to provide a `PSource` that accepts an enumerable `WindowFn` plus a default value and generates an appropriate unbounded `PCollection`.
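
To make the heartbeat idea concrete, here is a minimal plain-Java sketch of the arithmetic a generator would need. It is a simulation only and does not call the Dataflow SDK or Pub/Sub. The class and method names (`HeartbeatSketch`, `windowStart`, `heartbeatTimestamps`, `sumPerWindow`) are hypothetical, and it assumes fixed windows aligned to the epoch. It produces one zero-valued heartbeat per window, merges those with the real events, and sums per window, so an empty window still yields 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the heartbeat workaround; no Dataflow SDK involved. */
final class HeartbeatSketch {

    /** Start of the fixed window (assumed epoch-aligned) containing the timestamp. */
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return Math.floorDiv(timestampMillis, windowSizeMillis) * windowSizeMillis;
    }

    /** One heartbeat timestamp per fixed window overlapping [fromMillis, toMillis). */
    static List<Long> heartbeatTimestamps(long fromMillis, long toMillis, long windowSizeMillis) {
        List<Long> timestamps = new ArrayList<>();
        for (long start = windowStart(fromMillis, windowSizeMillis);
                start < toMillis;
                start += windowSizeMillis) {
            timestamps.add(start);
        }
        return timestamps;
    }

    /** Sums values per window after merging real events with zero-valued heartbeats. */
    static Map<Long, Integer> sumPerWindow(
            long[] eventTimestamps, int[] eventValues, List<Long> heartbeats, long windowSizeMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        // Heartbeats contribute a zero, which guarantees an entry for every window.
        for (long heartbeat : heartbeats) {
            sums.merge(windowStart(heartbeat, windowSizeMillis), 0, Integer::sum);
        }
        // Real events add their values to whichever window they fall into.
        for (int i = 0; i < eventTimestamps.length; i++) {
            sums.merge(windowStart(eventTimestamps[i], windowSizeMillis), eventValues[i], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long size = 60_000L; // one-minute fixed windows
        long[] timestamps = {5_000L, 10_000L, 130_000L}; // nothing lands in the second window
        int[] values = {1, 1, 1};
        List<Long> heartbeats = heartbeatTimestamps(0L, 180_000L, size);
        System.out.println(sumPerWindow(timestamps, values, heartbeats, size));
        // prints {0=2, 60000=0, 120000=1}
    }
}
```

In a real pipeline the heartbeat timestamps would be attached to the zero-valued messages published by the separate generator job, and the per-window sum would be done by `Sum.integersGlobally` rather than by hand.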