Is watermark based on processing time or event time or both?

903 Views Asked by At

Is watermark in Structured Streaming always set using processing time or event time or both?

1

There are 1 best solutions below

0
On BEST ANSWER

In Structured Streaming 2.2 streaming watermark is tracked based on event time as defined by eventTime column in Dataset.withWatermark operator.

withWatermark Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.

That gives you event time watermark by default.

But your initial Dataset can have no event time column initially and thus you can auto-generate one using current_date or current_timestamp functions or some other way at processing time. That would give you processing time watermark (based on the custom-generated column).

In the most generic solution using KeyValueGroupedDataset.flatMapGroupsWithState, you can pre-define the strategies or write a custom one. That's why they call it a solution for Arbitrary Stateful Aggregations in Structured Streaming.

flatMapGroupsWithState Applies the given function to each group of data, while maintaining a user-defined per-group state.