Use case: I have messages that carry a messageId, and multiple messages can share the same messageId. These messages flow through a streaming pipeline (e.g. Kafka) partitioned by messageId, so I am making sure all messages with the same messageId land in the same partition.
I need to write a job that buffers messages for some time (say, 1 minute) and, after that time, combines all messages with the same messageId into a single large message.
I am thinking this can be done with Spark Datasets and Spark SQL (or something else?), but I could not find any example/documentation on how to hold messages for some time for a given messageId and then aggregate them.
I think what you're looking for is Spark Streaming. Spark has a Kafka Connector that can link into a Spark Streaming Context.
Here's a really basic example that creates an RDD for all messages received on a given set of topics over a 1-minute interval, then groups them by a message id field (your value deserializer would have to expose such a getMessageId method, of course).
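Below is a minimal sketch using the spark-streaming-kafka-0-10 direct stream API. The topic name, broker address, the getMessageId extractor, and the concatenation combine step are all placeholder assumptions you'd swap for your own:

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object MessageCombiner {

  // Hypothetical extractor: pulls the messageId out of the raw string value.
  // In practice your value deserializer would expose a proper getMessageId.
  def getMessageId(value: String): String = value.takeWhile(_ != ':')

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MessageCombiner")
    // 1-minute batch interval: each micro-batch RDD holds one minute of messages
    val ssc = new StreamingContext(conf, Minutes(1))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "message-combiner",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Set("messages"), kafkaParams)
    )

    // Group each one-minute batch by messageId and combine the values into one
    // large message (plain concatenation here; substitute your own merge logic)
    stream
      .map(record => (getMessageId(record.value()), record.value()))
      .groupByKey()
      .mapValues(_.mkString("\n"))
      .foreachRDD { rdd =>
        rdd.foreach { case (messageId, combined) =>
          println(s"$messageId -> ${combined.length} chars")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the batch interval is one minute, each micro-batch naturally gives you the "buffer for 1 minute, then combine" behaviour you described, without any extra state management.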
There are several other ways to group the messages within the streaming API, such as the windowed variant sketched below. Have a look at the documentation for more examples.
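For instance, a sliding-window version of the same grouping (the 5-minute window and 1-minute slide here are just illustrative values):

```scala
// Alternative: regroup the last 5 minutes of messages every minute,
// instead of only the current 1-minute batch
stream
  .map(record => (getMessageId(record.value()), record.value()))
  .groupByKeyAndWindow(Minutes(5), Minutes(1))
  .mapValues(_.mkString("\n"))
```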