Apache Flume stuck after ChannelFullException is occured 500 times

102 Views Asked by At

I have flume configuration with rabbitmq source, file channel and solr sink. Sometimes sink becomes so busy and file channel is filling up. At that time ChannelFullException is being thrown by file channel. After 500 number of ChannelFullException are thrown flume stuck and never responds and recover itself. I want to learn that, where does 500 value come from? How can I change it? 500 is strict because when flume stucks, I count exceptions and I find 500 number of ChannelFullException log line everytime.

1

There are 1 best solutions below

0
Cloudkollektiv On

You are walking into a typical producer-consumer problem, where one is working faster than the other. In your case, there are two possibilities (or a combination of both):

  • RabbitMQ is sending messages faster than Flume can process.
  • Solr cannot ingest messages fast enough so that they remain stuck in Flume.

The solution is to send messages more slowly (i.e. throttle RabbitMQ) or tweak Flume so that it can process messages faster. I think the last thing is what you want. Furthermore, the unresponsiveness of Flume is probably caused by the java heap size being full. Increase the heap size and try again until the error disappears.

# Modify java maximum memory size
vi bin/flume-ng
JAVA_OPTS="-Xmx2048m"

Additionally, you can also increase the number of agents, channels, or the capacity of those channels. This would naturally cause a higher footprint on the java heap size, so try that first.

# Example configuration
agent1.channels = ch1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 10000
agent1.channels.ch1.byteCapacityBufferPercentage = 20
agent1.channels.ch1.byteCapacity = 800000

I don't know where the exact number 500 comes from, a wild guess would be that when there are 500 exceptions thrown the java heap size is full and Flume does not respond.

Another possibility is that the default configuration above causes it to be exactly 500. So try tweaking it so if it ends up being different or better, that it does not occur anymore.