How can I broadcast a DStream computed over a window? For instance, over the last 10 minutes I find the subset of lines satisfying a condition (call it the send_events DStream). I then need to find the set of lines satisfying another condition (call it the ack_events_for_send_events DStream) in the same 10-minute window, using the send_events DStream. I do not want to use groupByKey because of the large shuffle. When I do use groupByKey, each group is very small, at most 10 lines, so I end up with many small groups. (I am not sure whether this helps to optimize my operations; I just wanted to share it.)
Example (each line is id, type, time):
id1, type1, time1
id1, type2, time3
id2, type1, time5
id1, type1, time2
id2, type2, time4
id1, type2, time6
I want to find the minimum time difference between type1 and type2 per id. Each id has at most 10 lines, but I have 10,000 ids in a given window.
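To make the goal concrete, here is a minimal plain-Python sketch of the per-id result I am after. It is not Spark code and says nothing about how to avoid the shuffle. The numeric timestamps (time1 = 1 through time6 = 6), the use of the absolute difference, and the function name min_type_gap are assumptions made only for illustration.

```python
from collections import defaultdict


def min_type_gap(lines):
    """Return {id: smallest |t1 - t2|} over all type1/type2 pairs of that id.

    `lines` is an iterable of (id, type, time) tuples. Ids that lack either
    a type1 or a type2 line are left out of the result.
    """
    # Collect the timestamps of each type separately for every id.
    times = defaultdict(lambda: {"type1": [], "type2": []})
    for event_id, event_type, timestamp in lines:
        if event_type in ("type1", "type2"):
            times[event_id][event_type].append(timestamp)

    gaps = {}
    for event_id, by_type in times.items():
        if by_type["type1"] and by_type["type2"]:
            # Each id has at most ~10 lines, so comparing all pairs is cheap.
            gaps[event_id] = min(
                abs(t1 - t2)
                for t1 in by_type["type1"]
                for t2 in by_type["type2"]
            )
    return gaps


# The example window from above, with timeN replaced by the number N (assumed).
window = [
    ("id1", "type1", 1),
    ("id1", "type2", 3),
    ("id2", "type1", 5),
    ("id1", "type1", 2),
    ("id2", "type2", 4),
    ("id1", "type2", 6),
]
result = min_type_gap(window)
```

With these assumed timestamps, id1 has type1 times 1 and 2 and type2 times 3 and 6, so its closest pair is (2, 3); id2 has the single pair (5, 4).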
Maybe the following would work?
Then in somefunc: