ejabberd server overloads at 1080 active users


We have an XMPP system used by our software that uses an ejabberd server to send realtime messages. Think of this as a 2010 era homegrown version of Firebase Cloud Messaging.

We recently upgraded from ejabberd-16 to ejabberd-22.10 (we had to jump that far because of LetsEncrypt issues with v18 through v20).

Our normal load is 3000 to 4000 active users.

Since the upgrade, whenever our server gets above 1000 active users, the number of running beam.smp processes explodes. Each one takes 10-20% of CPU, which pulls our server down. I can fix this by turning off ejabberd for a few minutes and restarting it, which brings the number of active users down. But I really need to get back to our full volume of 3000-4000 active users.

top - 08:05:09 up 20:50,  2 users,  load average: 40.03, 22.40, 15.82
Tasks: 643 total,  11 running, 497 sleeping,   0 stopped,   0 zombie
%Cpu(s): 61.1 us, 35.8 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.4 si,  2.7 st
KiB Mem : 16367432 total,   186740 free,  3427940 used, 12752752 buff/cache
KiB Swap:   262140 total,   258300 free,     3840 used. 12440420 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11019 ejabberd  20   0 2781448  38864  12584 S  19.0  0.2   0:00.64 beam.smp
10096 ejabberd  20   0 2787856  45624  15536 S  15.7  0.3   0:01.10 beam.smp
10543 ejabberd  20   0 2781700  39608  13056 S  15.7  0.2   0:00.74 beam.smp
10678 ejabberd  20   0 2783768  39916  12892 S  15.4  0.2   0:00.66 beam.smp
10749 ejabberd  20   0 2781712  39396  14616 S  14.8  0.2   0:00.87 beam.smp
10745 ejabberd  20   0 2782452  37120  12688 S  12.8  0.2   0:00.50 beam.smp
 2088 ejabberd  20   0 2893856 148116  44624 S  12.5  0.9  11:26.30 beam.smp
10755 ejabberd  20   0 2785552  40760  12472 S  12.1  0.2   0:00.44 beam.smp
 9260 ejabberd  20   0 2786804  49224  17136 S  11.5  0.3   0:00.95 beam.smp
11319 ejabberd  20   0 2782480  31788  11204 S  11.1  0.2   0:00.34 beam.smp
10093 ejabberd  20   0 2782224  42140  15008 S  10.8  0.3   0:00.91 beam.smp
 9986 ejabberd  20   0 2782704  43572  15112 S  10.5  0.3   0:00.87 beam.smp
10169 ejabberd  20   0 2782736  38956  12904 S   9.8  0.2   0:00.73 beam.smp
10407 ejabberd  20   0 2781700  39708  13052 S   9.8  0.2   0:00.72 beam.smp

What configuration am I missing to get my active user count higher? We are using the Mnesia database and wish to keep using it.

Answer by Badlop:

I don't have a clear answer, so I'll give several ideas, hoping that one of them points you to something useful. If none of them gives you a clue, you can update your original post with answers to these questions, and somebody else may spot something.

A) Around 1000 concurrent user connections? What a curious number: it reminds me of "ulimit -n" (the limit on open file descriptors), which was 1024 by default, see https://www.ejabberd.im/benchmark/index.html

B) You are now using Mnesia. I imagine it was also being used in the old deployment, so this probably isn't the problem.

C) Are you using some custom module not included in standard ejabberd? Maybe from ejabberd-contrib or elsewhere. Maybe it has some limit, or some incompatibility with the new ejabberd version.

D) Are those clients idle (and just consuming the TCP connection and some RAM), or are they actively doing things (like sending messages to MUC rooms, or changing presences, which consume CPU)?

E) Do all the users use the same XMPP client? Maybe that client behaves strangely with the new ejabberd version.

F) Does the problem increase slowly from 1 client up to 1000? Or does the problem appear suddenly around 1000 connections?

G) BEAM is a virtual machine, which internally runs "Erlang processes" that you can inspect using something similar to "top". Maybe one Erlang process, or a few of them, is consuming all this CPU...

I can think of two methods to view the Erlang processes that exist inside the Erlang virtual machine:

An easy method is using the "etop" tool. Simply run:

ejabberdctl etop

Alternatively, you can install ejabberd_observer_cli which provides more details:

1 Install it:

ejabberdctl modules_update_specs
ejabberdctl module_install ejabberd_observer_cli

2 Now run

ejabberdctl debug

3 In that shell, run:

ejabberd_observer_cli:start().

4 Press H and then Enter to view the Home screen

What you are looking for: processes that have a lot of Reds/Reductions, which means they are executing many functions many times; or that have a large Message Queue, which means they are saturated and can't handle the load fast enough; or that consume a lot of memory.
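
Going back to idea A, here is a rough sketch of how the file descriptor limit could be checked against the running server. It assumes Linux (it reads /proc), that pgrep is installed, and that you run it as root or as the ejabberd user. The helper names soft_nofile and open_fds are made up for this example; connected_users_number is the ejabberdctl command that reports the number of connected users.

```shell
#!/bin/sh
# Linux only: reads /proc. Run as root or as the ejabberd user.

# Soft "Max open files" limit of a process ($1 = PID).
# In /proc/<pid>/limits the fourth field of that row is the soft limit.
soft_nofile() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# Number of file descriptors the process currently holds ($1 = PID).
open_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Report every running beam.smp (assumes pgrep is available).
for pid in $(pgrep -x beam.smp); do
    echo "beam.smp $pid: $(open_fds "$pid") fds open, soft limit $(soft_nofile "$pid")"
done

# Connected users, if ejabberdctl is on PATH.
if command -v ejabberdctl >/dev/null 2>&1; then
    echo "connected users: $(ejabberdctl connected_users_number)"
fi
```

If the number of open descriptors of the main beam.smp gets close to its soft limit at around 1000 connected users, that limit is a likely suspect. If it stays far below the limit, idea A can probably be discarded.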