I'm new to Kafka and I would like to do the following:
- I have a bunch of servers that push some data every 10 minutes to Kafka.
- I have a Spark application that needs the latest data pushed by all the servers.
E.g.: I have 2 servers that push, respectively, 'a' and 'b'. I need the Spark app to receive in a DataFrame the values 'a' and 'b' so that they can be processed together. 10 minutes later, the 2 servers push 'c' and 'd'. The Spark app should receive the values 'c' and 'd' at the same time, and so on.
My Spark application needs all the latest data pushed, so I believe a streaming approach is not the right fit and that a batch approach (or maybe it is called something else) should be taken instead. My Spark app expects a DataFrame.
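To make the expectation concrete, here is roughly what I picture: a one-shot batch read of a topic into a DataFrame. This is an untested sketch; the broker address and the topic name "server-data" are just placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-batch-read").getOrCreate()

# One-shot batch read of everything currently in the topic.
df = (
    spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "server-data")                    # placeholder topic
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string.
values = df.selectExpr("CAST(value AS STRING) AS value")
values.show()
```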
Your problem does not sound like a typical Kafka use case. However, if using Kafka is a must, you can use Kafka topics to group the data. By creating topics A_B and C_D you ensure that the values 'a' and 'b' will be consumed together and kept separate from the 'c' and 'd' values. Your Spark app must then verify that it got all the needed data from A_B and C_D before proceeding with execution. This design will work if your Spark application is able to buffer all the data and determine when all the needed messages have been consumed.
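A rough, untested sketch of that check, using Spark's batch Kafka source (it assumes the spark-sql-kafka package is on the classpath and uses a placeholder broker address):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-grouped-topics").getOrCreate()

# Batch-read both grouping topics in one go.
df = (
    spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "A_B,C_D")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
)

# Each row carries a 'topic' column, so the app can tell which group it came from.
topics_seen = {row.topic for row in df.select("topic").distinct().collect()}

if {"A_B", "C_D"}.issubset(topics_seen):
    # All groups have produced data; process the payloads together.
    values = df.selectExpr("topic", "CAST(value AS STRING) AS value")
    values.show()
else:
    print("Not every group has produced data yet; keep buffering and retry later.")
```

Whether this is enough depends on how the app keeps track of offsets between runs, which is the buffering concern mentioned above.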