Spark Streaming + Kafka throughput


In my Spark application I'm reading from a Kafka topic. The topic has 10 partitions, so I've created 10 receivers with one thread per receiver (a sketch of this setup follows the list below). With this configuration I observe weird behavior of the receivers. The median rates for these consumers are:

Receiver    Node    Median rate (msgs/s)
Receiver-0  node-1  10K
Receiver-1  node-2  2.5K
Receiver-2  node-3  2.5K
Receiver-3  node-4  2.5K
Receiver-4  node-5  2.5K
Receiver-5  node-1  10K
Receiver-6  node-2  2.6K
Receiver-7  node-3  2.5K
Receiver-8  node-4  2.5K
Receiver-9  node-5  2.5K
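
For reference, a minimal sketch of what such a setup might look like, assuming the receiver-based KafkaUtils.createStream API; the ZooKeeper host, topic, and consumer group names are hypothetical placeholders, not values from the question:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-throughput")
val ssc = new StreamingContext(conf, Seconds(30))   // 30-second batches

// One receiver per Kafka partition, each with a single consumer thread.
val streams = (0 until 10).map { _ =>
  KafkaUtils.createStream(
    ssc,
    "zk-host:2181",            // hypothetical ZooKeeper quorum
    "my-consumer-group",       // hypothetical consumer group id
    Map("my-topic" -> 1),      // topic -> number of consumer threads
    StorageLevel.MEMORY_AND_DISK_SER)
}

// Union the receiver streams so downstream processing sees a single DStream.
val unified = ssc.union(streams)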

Problem 1: node-1 is receiving as many messages as the other four nodes combined.

Problem 2: The app is not reaching its batch processing limit (30-second batches are computed in a median time of 17 seconds). I would like it to consume enough messages to push computation time to at least 25 seconds.

Where should I look for the bottleneck?

To be clear, there are more messages to be consumed.

Edit: I had lag on only two partitions, so the first problem is solved. Still, reading 10K msgs per second is not very much.


There is 1 answer below.


Use Spark's built-in backpressure (available since Spark 1.5, which hadn't been released at the time of your question): https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-streaming-backpressure.adoc

Just set

spark.streaming.backpressure.enabled=true
spark.streaming.kafka.maxRatePerPartition=X (really high in your case)
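
If you prefer to set these in application code rather than on the command line, a minimal sketch follows; the app name and the 100000 cap are placeholder values, and note that maxRatePerPartition is documented for the direct Kafka stream API:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-throughput")
  // Let the rate controller adapt the ingestion rate to actual processing time.
  .set("spark.streaming.backpressure.enabled", "true")
  // Per-partition upper bound (records/sec); backpressure keeps the actual rate below it.
  .set("spark.streaming.kafka.maxRatePerPartition", "100000")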

To find the bottleneck, use the Spark Streaming web UI and look at the DAG of the stage that takes the most time.