Use case: I have messages that carry a messageId, and multiple messages can share the same messageId. These messages flow through a streaming pipeline (e.g. Kafka) partitioned by messageId, so I am making sure all messages with the same messageId land in the same partition.
I need to write a job that buffers messages for some time (say, 1 minute) and, after that time, combines all messages with the same messageId into a single large message.
I am thinking this can be done with Spark Datasets and Spark SQL (or something else?), but I could not find any example/documentation on how to hold messages for some time for a given messageId and then aggregate them.
I think what you're looking for is Spark Streaming. Spark has a Kafka Connector that can link into a Spark Streaming Context.
Here's a really basic example that creates an RDD for all messages received on a given set of topics over a 1-minute interval, then groups them by a message id field (your value deserializer would have to expose such a getMessageId method, of course).
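Below is a minimal sketch using the spark-streaming-kafka-0-10 direct stream API. The topic name, broker address, the getMessageId extractor, and the concatenation combine step are all placeholder assumptions you'd swap for your own:

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object MessageCombiner {

  // Hypothetical extractor: pulls the messageId out of the raw string value.
  // In practice your value deserializer would expose a proper getMessageId.
  def getMessageId(value: String): String = value.takeWhile(_ != ':')

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MessageCombiner")
    // 1-minute batch interval: each micro-batch RDD holds one minute of messages
    val ssc = new StreamingContext(conf, Minutes(1))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "message-combiner",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Set("messages"), kafkaParams)
    )

    // Group each one-minute batch by messageId and combine the values into one
    // large message (plain concatenation here; substitute your own merge logic)
    stream
      .map(record => (getMessageId(record.value()), record.value()))
      .groupByKey()
      .mapValues(_.mkString("\n"))
      .foreachRDD { rdd =>
        rdd.foreach { case (messageId, combined) =>
          println(s"$messageId -> ${combined.length} chars")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the batch interval is one minute, each micro-batch naturally gives you the "buffer for 1 minute, then combine" behaviour you described, without any extra state management.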
There are several other ways to group the messages within the streaming API, such as the windowed variant sketched below. Have a look at the documentation for more examples.
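For instance, a sliding-window version of the same grouping (the 5-minute window and 1-minute slide here are just illustrative values):

```scala
// Alternative: regroup the last 5 minutes of messages every minute,
// instead of only the current 1-minute batch
stream
  .map(record => (getMessageId(record.value()), record.value()))
  .groupByKeyAndWindow(Minutes(5), Minutes(1))
  .mapValues(_.mkString("\n"))
```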