Currently, we have a keyed-window Flink job that works as expected: events come into the window, some processing is done in a reduce function, and a trigger causes output to a sink.
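For reference, the job has roughly this shape (a minimal sketch only; the source, key, window size, aggregation, and sink are illustrative placeholders, and the assigner's default trigger stands in for the real one):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KeyedWindowReduceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source: (key, value) pairs; the real job reads from an actual source.
        env.fromElements(
                Tuple2.of("device-1", 1L),
                Tuple2.of("device-1", 2L),
                Tuple2.of("device-2", 5L))
           // One window per key; the assigner's default trigger fires the window,
           // and a custom trigger could be plugged in via .trigger(...).
           .keyBy(t -> t.f0)
           .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
           // Incremental aggregation: Flink keeps a single reduced value per key/window.
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
           // Stand-in sink.
           .print();

        env.execute("keyed-window-reduce-sketch");
    }
}
```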
Now we have a scenario where the volume of events scales up significantly and the length of time the keyed window needs to stay open grows from a few minutes to hours or days.
While the job definition could stay more or less the same as before, additional storage requirements come into play, along with horizontal scaling: scaling out the task slots, increasing the storage, and so on.
Is this the recommended approach, i.e., scale out to support the increased volume of events and the longer window times?
That somehow feels wasteful, since the longer window times also mean a window stays needlessly open when there are NO events coming into it. So, is there a way in Flink to seamlessly move a window's state to disk during idle periods and bring it back when fresh events show up? Something like a savepoint per window?
For windows processed by a reduce function, the state should be just a single object per key per window, so hopefully the total volume of state won't be outrageous.
You can always use the RocksDB state backend, in which case the state will be kept on disk, with the active state in off-heap memory.
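A hedged sketch of what that can look like, assuming the flink-statebackend-rocksdb dependency is on the classpath (the checkpoint path and interval below are placeholders; the same effect can also be achieved via `state.backend: rocksdb` and `state.checkpoints.dir` in flink-conf.yaml):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDBBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keyed state (including the per-key/per-window reduce value) lives in RocksDB
        // on local disk; only the working set sits in off-heap memory (block cache, memtables).
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints

        // Checkpoints go to durable storage; path and interval are placeholders.
        env.enableCheckpointing(60_000L);
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");

        // ... build the keyed windowed reduce pipeline as before, then env.execute(...).
    }
}
```

With incremental checkpoints enabled, checkpointing cost scales with the state that changed since the previous checkpoint rather than with the total state size, which helps when windows stay open for hours or days.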