Merge collections with different window strategies


We have 3 different data sources, and eventually we need to do some kind of inner join between them. We create all the PCollections with a group by key:

pcollectionA - Implemented using state (the data does not change).
pcollectionB - Windowed for 5 hours. If another event arrives within that time, we would like to extend the window by another 5 hours. Implemented using a custom window.
pcollectionC - Windowed for 30 minutes. Implemented using a fixed window.

Our goal is to send pcollectionC events only if the relevant key exists in both pcollectionA and pcollectionB.

What is the best way to implement this, given that regular joins do not work due to the window differences?


There is 1 solution below


From your question I am assuming that the first two streams (pcollectionA and pcollectionB) are more like control/reference data used by the stream processor of pcollectionC.

Have you tried the Broadcast State Pattern? You can convert pcollectionA and pcollectionB to Broadcast streams and then look them up in the processor of pcollectionC.

Here is a very high-level design:

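// EventA and EventB are placeholder names for the element types of the two reference streams.
// stateDescriptor is a MapStateDescriptor describing the broadcast state.

// broadcast pcollectionA
BroadcastStream<EventA> A_BroadcastStream = pcollectionA_stream
                        .broadcast(stateDescriptor);

// broadcast pcollectionB
BroadcastStream<EventB> B_BroadcastStream = pcollectionB_stream
                        .broadcast(stateDescriptor);

// Process pcollectionC with broadcast data from A and B.
// Each process(...) call takes a BroadcastProcessFunction
// (a KeyedBroadcastProcessFunction when the non-broadcast stream is keyed).

DataStream<String> output = pcollectionC_Stream
                 .connect(A_BroadcastStream)
                 .process(
                      // process/join with pcollectionA items
                  )
                 .connect(B_BroadcastStream)
                 .process(
                     // process/join with pcollectionB items
                  );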
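
To make the lookup logic concrete, here is a minimal, dependency-free sketch in plain Java of what the process functions above would do. It is a simulation, not Flink API: the class name BroadcastJoinSketch, its method names, and the use of String keys and payloads are assumptions made for illustration. The broadcast side records the keys seen on pcollectionA and pcollectionB, and the main side forwards a pcollectionC event only when its key is present in both.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Dependency-free simulation of the broadcast lookup (hypothetical names, not Flink API). */
public class BroadcastJoinSketch {
    // Stand-ins for the broadcast state built from pcollectionA and pcollectionB.
    private final Set<String> keysFromA = new HashSet<>();
    private final Set<String> keysFromB = new HashSet<>();

    // Broadcast side: remember each key seen on A or B.
    public void processBroadcastA(String key) {
        keysFromA.add(key);
    }

    public void processBroadcastB(String key) {
        keysFromB.add(key);
    }

    // Main side: forward a C event only if its key is known to both A and B.
    public void processElement(String key, String payload, List<String> out) {
        if (keysFromA.contains(key) && keysFromB.contains(key)) {
            out.add(key + ":" + payload);
        }
    }

    public static void main(String[] args) {
        BroadcastJoinSketch join = new BroadcastJoinSketch();
        List<String> out = new ArrayList<>();

        join.processBroadcastA("k1");
        join.processBroadcastA("k2");
        join.processBroadcastB("k1");

        join.processElement("k1", "c-event-1", out); // forwarded: key is in A and B
        join.processElement("k2", "c-event-2", out); // dropped: key is missing from B

        System.out.println(out); // prints [k1:c-event-1]
    }
}
```

Note that this sketch never expires keys. To mirror the 5-hour extending window on pcollectionB, entries would also need to be removed once they age out, and a pcollectionC event that arrives before its key has been broadcast is simply dropped here.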

This documentation has a great example for this design pattern - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/broadcast_state/