RabbitMQ high memory usage


I'm currently running RabbitMQ (3.10.25) in production with 3 nodes, and the cluster contains several queues:

  • one classic queue
  • one quorum queue (to handle NServiceBus commands - NServiceBus.RabbitMQ package 8.0.2)
  • one quorum queue 'error'
  • 28 NServiceBus quorum queues (nsb.v2.delay-level-xx)

The classic queue handles 2 messages per second. The quorum queues do nothing.
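
(For completeness: the queue types, message counts and per-queue memory can also be listed from the CLI; a rough sketch using standard rabbitmqctl queue info items:)

# list every queue with its type, message count and memory footprint
rabbitmqctl list_queues name type messages memory

# basic overview of the cluster members and partitions
rabbitmqctl cluster_status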

After a few hours one node is still stable (the one used by the classic queue), while the other two show high memory usage and have hit the memory high watermark. In addition, there is a lot of preallocated unused memory (which seems abnormal and keeps increasing), and the 'other' entry under Tables is at 1.2 GB and growing as well.

What is the best approach to tackle this issue? Does RabbitMQ or NServiceBus provide a setting to reduce the memory usage?

The RabbitMQ documentation contains some information, but it is unclear which settings should be adjusted as a starting point.

(screenshots: RabbitMQ management UI memory details for the affected nodes)

UPDATE 1: The Write/Sync IO is high as well. (The screenshots above were taken while some producers/consumers were connected. Even with no producers or consumers, memory consumption continues to rise. The IO screenshot was taken with 0 producers/consumers.)

(screenshot: Write/Sync IO statistics)
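
To see which Erlang processes are busy even when nothing is connected, the built-in observer view can be used (a sketch; it needs an interactive terminal on one of the cluster nodes):

# top-like live view of processes, their memory and reductions inside the node
rabbitmq-diagnostics observer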

UPDATE 2: I noticed that one of the queues, nsb.v2.verify-stream-flag-enabled, displays 'Cluster is in minority'. Could you explain what that means? Is it causing the memory issue described above?
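
As far as I understand, 'cluster is in minority' means that a majority of the queue's members (replicas) are not currently available, so the queue cannot elect a leader. The member status can be checked from the CLI with something like the following sketch (quorum_status for quorum queues, stream_status for streams; the queue names are examples from this setup and the default vhost is assumed):

# Raft members and their state for a quorum queue
rabbitmq-queues quorum_status "nsb.v2.delay-level-00"

# leader/replica placement and offsets for a stream
rabbitmq-streams stream_status "nsb.v2.verify-stream-flag-enabled"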

UPDATE 3: A memory breakdown from rabbitmqctl report. As mentioned before, the values of allocated_unused and other_ets are far too high.

Total memory used: 3.7986 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 3.3458 gb
allocated_unused: 2.4707 gb (65.04 %)
other_ets: 1.2057 gb (31.74 %)
other_proc: 0.0518 gb (1.36 %)
code: 0.0336 gb (0.89 %)
other_system: 0.0148 gb (0.39 %)
connection_other: 0.0048 gb (0.13 %)
plugins: 0.0043 gb (0.11 %)
quorum_queue_procs: 0.0025 gb (0.07 %)
reserved_unallocated: 0.0024 gb (0.06 %)
binary: 0.002 gb (0.05 %)
atom: 0.0015 gb (0.04 %)
mgmt_db: 0.0012 gb (0.03 %)
metrics: 0.001 gb (0.03 %)
mnesia: 0.0008 gb (0.02 %)
connection_channels: 0.0007 gb (0.02 %)
connection_readers: 0.0004 gb (0.01 %)
connection_writers: 0.0003 gb (0.01 %)
stream_queue_procs: 0.0001 gb (0.0 %)
quorum_ets: 0.0 gb (0.0 %)
msg_index: 0.0 gb (0.0 %)
quorum_queue_dlx_procs: 0.0 gb (0.0 %)
stream_queue_replica_reader_procs: 0.0 gb (0.0 %)
queue_procs: 0.0 gb (0.0 %)
queue_slave_procs: 0.0 gb (0.0 %)
stream_queue_coordinator_procs: 0.0 gb (0.0 %)
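
For reference, this breakdown can also be pulled per node directly (a sketch):

# per-category memory breakdown for the local node (run on each node)
rabbitmq-diagnostics memory_breakdown --unit "MB"

# node totals, including the computed memory high watermark
rabbitmqctl status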

UPDATE 4: The Top Processes plugin has been installed and I noticed that the following entry keeps growing in the Top ETS tables view. After a while it drops and then starts increasing again; the memory on the nodes never seems to be fully reclaimed (see the CLI sketch after this list):

  • name: rabbit_stream_coordinator
  • owner name: ra_coordination_log_ets
  • type: set
  • named: false
  • protection: public
  • compressed: false
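
For reference, the plugin can be enabled with the first command below; the eval one-liner is just a sketch for listing ETS tables sorted by memory from the CLI (plain Erlang, sizes are in machine words, not bytes):

# enable the Top view (per-process and per-ETS-table) in the management UI
rabbitmq-plugins enable rabbitmq_top

# list ETS tables sorted by memory (largest last)
rabbitmqctl eval 'lists:keysort(2, [{ets:info(T, name), ets:info(T, memory)} || T <- ets:all()]).'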

UPDATE 5: The issue seems to be caused by the stream queue (created by NServiceBus): 'nsb_v2_verify-stream-flag-enabled'

rabbit_stream_coordinator: Error while starting replica for nsb_v2_verify-stream-flag-enabled

could not connect osiris to replica.

2023-10-03 07:58:25.972915+00:00 [error] <0.11581.46>   crasher:
2023-10-03 07:58:25.972915+00:00 [error] <0.11581.46>     initial call: osiris_replica_reader:init/1
2023-10-03 07:58:25.972915+00:00 [error] <0.11581.46>     registered_name: []
2023-10-03 07:58:25.972915+00:00 [error] <0.11581.46>     exception exit: connection_refused
2023-10-03 07:58:25.972915+00:00 [error] <0.11581.46>       in function  gen_server:init_it/6 (gen_server.erl, line 835)
2023-10-03 07:58:25.972915+00:00 [error] <0.11581.46>     ancestors: [osiris_replica_reader_sup,osiris_sup,<0.235.0>]

Additional logging:

2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>   crasher:
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>     initial call: osiris_replica:init/1
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>     registered_name: []
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>     exception error: no case clause matching
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                      {error,
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                       {connection_refused,
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                        {child,undefined,#Ref<0.840260531.4208984066.201970>,
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                         {osiris_replica_reader,start_link,
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                          [#{connection_token =>
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                             hosts =>
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                             name =>
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                             reference =>
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                              {resource,
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                               queue,<<"nsb.v2.verify-stream-flag-enabled">>},
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                             start_offset => {0,empty},
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                             transport => ssl}]},
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                         temporary,false,5000,worker,
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>                         [osiris_replica_reader]}}}
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>       in function  osiris_replica_reader:start/2 (src/osiris_replica_reader.erl, line 108)
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>       in call from osiris_replica:handle_continue/2 (src/osiris_replica.erl, line 246)
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>       in call from gen_server:try_dispatch/4 (gen_server.erl, line 1123)
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>       in call from gen_server:loop/7 (gen_server.erl, line 865)
2023-10-03 08:29:37.869613+00:00 [error] <0.10360.34>     ancestors: [osiris_server_sup,osiris_sup,<0.235.0>]

As soon as we delete the queue, memory usage becomes stable and the GC numbers go down.

(Update 5 was tested with RabbitMQ 3.11.20 and NServiceBus NuGet package 8.1.3.)

After restarting the app (which uses NServiceBus), 'nsb_v2_verify-stream-flag-enabled' is re-created and the followers still display 'Cluster in minority', as sketched below.

(screenshot: queue details showing 'Cluster in minority' for the followers)
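
For reference, deleting the stream or adjusting its replica placement from the CLI would look roughly like the sketch below (queue name taken from the log above, node names are placeholders; restart_stream only exists on newer releases, and the queue can also be deleted from the management UI):

# delete the stream; NServiceBus re-creates it on the next endpoint start
rabbitmqctl delete_queue "nsb.v2.verify-stream-flag-enabled"

# or add/remove stream replicas on specific nodes
rabbitmq-streams add_replica "nsb.v2.verify-stream-flag-enabled" rabbit@node2
rabbitmq-streams delete_replica "nsb.v2.verify-stream-flag-enabled" rabbit@node3

# newer versions can restart the stream's machinery in place
rabbitmq-streams restart_stream "nsb.v2.verify-stream-flag-enabled"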

2 Answers

Answer by Ramon Smits:

The stream flag queue is created at startup by NServiceBus to check whether the broker supports stream queues, which are used indirectly by quorum queues so that the timeout (delayed delivery) infrastructure is reliable.

The queue itself isn't used for any messaging, so it cannot be the cause of any memory issues.

Check the RabbitMQ logs for any issues and run rabbitmqctl report as suggested by Adam.
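
For example, collecting a full report and the most recent log entries could look like this (a sketch; the output file name is arbitrary):

# capture a full node report for inspection or sharing
rabbitmqctl report > rabbitmq-report.txt

# show the last lines of the node's log
rabbitmq-diagnostics log_tail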

Answer by Tjad Clark:

In your rabbitmq.conf file (which defaults to /etc/rabbitmq/rabbitmq.conf) you can configure the memory high watermark as a soft limit, using either a relative or an absolute value.

By default, the vm_memory_high_watermark is set to use a relative amount, which is 40% of your system RAM.

Absolute limit

vm_memory_high_watermark.absolute = 512MB

Relative limit (10%)

vm_memory_high_watermark.relative = 0.1

Note that these thresholds trigger memory alarms, which make publisher flow control kick in (it is a soft limit). The broker may therefore use slightly more memory than specified, but once flow control is in effect, no more memory should be consumed until the alarm is cleared (i.e. when memory usage drops again).
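
To check whether an alarm is currently in effect on a node, the standard health check can be used (a sketch):

# fails (non-zero exit code) if the local node has any alarms in effect
rabbitmq-diagnostics check_local_alarms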

other_ets is memory used by ETS (in-memory) tables, which plugins, among other things, may use to store their state; you may want to look into any plugins that have been installed.
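
Listing the plugins on a node is straightforward:

# show available plugins and whether they are enabled
rabbitmq-plugins list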

See the RabbitMQ documentation for a better understanding of memory usage, memory alarms, and configuration.