I am using Spark Streaming (coding in Java), and I need help designing the algorithm for the following problem. I am relatively new to MapReduce, so some assistance with the design would be appreciated.
Here is the problem in detail.
Problem Details
Input:
- The initial input lines of my text have the form:
(Pattern)(timestamp, message)
(3)(12/5/2014 01:00:01, message)
where 3 is the pattern type.
- I have converted this into a DStream keyed by (P1,P2), where P1 and P2 are the pattern classes paired for the input line, and whose value is (pattern_id, timestamp, message). Hence each tuple of the DStream is as follows:
Template: (P1,P2),(pattern_id, timestamp, string)
Example: (3,4),(3, 12/5/2014 01:00:01, message)
Here 3 and 4 are the pattern types that are paired together.
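Roughly, the parsing step I do looks like the following plain-Java sketch (outside Spark, simplified). The class name and regex are illustrative, and the actual (P1,P2) pairing comes from my model, so it is not shown here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for lines like "(3)(12/5/2014 01:00:01, message)".
public class LineParser {
    // groups: pattern id, timestamp, message
    private static final Pattern LINE =
        Pattern.compile("\\((\\d+)\\)\\(([^,]+),\\s*(.*)\\)");

    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected line format: " + line);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("(3)(12/5/2014 01:00:01, message)");
        System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        // prints: 3 | 12/5/2014 01:00:01 | message
    }
}
```

In the real job this runs inside a mapToPair over the input DStream, producing the ((P1,P2),(pattern_id, timestamp, message)) tuples shown above.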
- I have a model in which each pair has an associated time difference. The model is stored in a HashMap whose key-value entries look like:
Template: (P1,P2)(Time Difference)
Example: (3,4)(2:20)
where the time difference in the model is 2:20; if the stream contains two messages of pattern types 3 and 4 respectively, and the time between them is greater than 2:20, the program should output an anomaly.
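For concreteness, the model lookup and the check for a single pair of messages could look like this plain-Java sketch (names are illustrative; I am assuming "2:20" means 2 minutes 20 seconds and that timestamps are day/month/year):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

// Sketch of the model lookup and the anomaly test for one (P1,P2) pair.
public class AnomalyCheck {
    // model: "P1,P2" -> maximum allowed gap in milliseconds
    static final Map<String, Long> MODEL = new HashMap<>();
    static {
        MODEL.put("3,4", (2 * 60 + 20) * 1000L); // 2:20 read as 2 min 20 s (assumption)
    }

    // assumed timestamp layout: day/month/year hour:minute:second
    static final SimpleDateFormat FMT = new SimpleDateFormat("d/M/yyyy HH:mm:ss");

    static long toMillis(String ts) {
        try {
            return FMT.parse(ts).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp: " + ts, e);
        }
    }

    // true if the gap between the two timestamps exceeds the model's threshold
    public static boolean isAnomaly(String pairKey, String ts1, String ts2) {
        long gap = Math.abs(toMillis(ts2) - toMillis(ts1));
        return gap > MODEL.get(pairKey);
    }

    public static void main(String[] args) {
        // 89 s gap: within the 140 s threshold
        System.out.println(isAnomaly("3,4", "12/5/2014 01:00:01", "12/5/2014 01:01:30"));
        // 179 s gap: exceeds the threshold
        System.out.println(isAnomaly("3,4", "12/5/2014 01:00:01", "12/5/2014 01:03:00"));
    }
}
```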
What would be the best way to model this in Spark Streaming?
What I have tried so far
- I created the DStream described in step 2 from the input in step 1
- Created a broadcast variable to send the map learned by the model (step 3 above) to all the workers
- I am stuck trying to design the algorithm that generates the anomaly stream in Spark Streaming. In particular, I cannot figure out how to express this check as an associative function usable in a stream operation.
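To illustrate what I mean, here is the kind of per-key matching logic I have in mind, written as plain Java (all names are illustrative). My understanding is that something like this would have to run inside a stateful operation such as updateStateByKey, with unmatched P1 events carried as state between batches, rather than as a simple associative reduce:

```java
import java.util.ArrayList;
import java.util.List;

// Per-key matching step, sketched outside Spark. All events below share one
// (P1,P2) key; an event is {patternId, epochMillis}, assumed sorted by time.
public class PairMatcher {
    // returns a list of {startMillis, gapMillis} for each matched P1->P2 pair;
    // the gap would then be compared against the broadcast model's threshold
    static List<long[]> matchGaps(int p1, int p2, List<long[]> events) {
        List<long[]> gaps = new ArrayList<>();
        Long pending = null;                     // timestamp of an unmatched P1 event
        for (long[] e : events) {
            if (e[0] == p1) {
                pending = e[1];                  // remember latest P1 occurrence
            } else if (e[0] == p2 && pending != null) {
                gaps.add(new long[] { pending, e[1] - pending });
                pending = null;                  // this P2 consumes the pending P1
            }
        }
        return gaps;                             // a still-pending P1 would become state
    }

    public static void main(String[] args) {
        List<long[]> events = new ArrayList<>();
        events.add(new long[] { 3, 0 });
        events.add(new long[] { 4, 150_000 });   // gap 150 s: exceeds a 2:20 threshold
        List<long[]> gaps = matchGaps(3, 4, events);
        System.out.println(gaps.get(0)[1]);      // prints: 150000
    }
}
```

Is this state-per-key approach the right direction, or is there a better way to structure it as a stream operation?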
Here is the current code: https://gist.github.com/nipunarora/22c8e336063a2a1cc4a9