There are two C++ processes, each with a single thread. The thread handles Diameter network traffic from 32 incoming TCP connections, parses it, and forwards the split messages over 32 outgoing TCP connections. Let's call this C++ process a DiameterFE.
If only one DiameterFE process is running, it can handle 70 000 messages/sec.
If two DiameterFE processes are running, they can handle 35 000 messages/sec each, so the total is still 70 000 messages/sec.
Why don't they scale? What is the bottleneck?
Details:
There are 32 clients (seagull) and 32 servers (seagull) per Diameter Front End process, running on separate hosts.
A dedicated host runs these two processes: 2 E5-2670 @ 2.60 GHz CPUs x 8 cores/socket x 2 HW threads/core = 32 hardware threads in total.
10 Gbit/s network.
Average Diameter message size is 700 bytes, so 70 000 messages/sec is roughly 0.4 Gbit/s, well below the link capacity.
It looks like only Cpu0 handles the network traffic (58.7% si). Do I have to explicitly assign different network queues to different CPUs?
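A quick way to check this, before changing anything, is to look at which CPUs actually service the NIC's receive-queue interrupts in /proc/interrupts. The following is only a minimal C++ sketch of that check; the "eth" substring used to pick out the interface rows is an assumption and must be adjusted to the real NIC name.

    // Sketch: print the header row and the NIC rows of /proc/interrupts,
    // so you can see how many RX queues exist and which CPU columns
    // accumulate their interrupt counts. "eth" is an assumed device name.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        std::ifstream irqs("/proc/interrupts");
        std::string line;
        bool header = true;
        while (std::getline(irqs, line)) {
            if (header || line.find("eth") != std::string::npos) {
                std::cout << line << '\n';
            }
            header = false;
        }
    }

If the counters grow only in the CPU0 column, the interrupts are not being spread at all, and RSS/irqbalance/RPS configuration is worth looking at.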
The first process (PID=7615) takes 89.0% CPU and is running on Cpu0.
The second process (PID=59349) takes 70.8% CPU and is running on Cpu8.
On the other hand, Cpu0 is loaded at 95.2% = 9.7%us + 26.8%sy + 58.7%si,
whereas Cpu8 is loaded at only 70.3% = 14.8%us + 55.5%sy.
It looks like Cpu0 is also doing work on behalf of the second process: there is very high softirq load, and only on Cpu0 (58.7% si). Why?
Here is the top output with the "1" key pressed (per-CPU view):
top - 15:31:55 up 3 days, 9:28, 5 users, load average: 0.08, 0.20, 0.47
Tasks: 973 total, 3 running, 970 sleeping, 0 stopped, 0 zombie
Cpu0 : 9.7%us, 26.8%sy, 0.0%ni, 4.8%id, 0.0%wa, 0.0%hi, 58.7%si, 0.0%st
...
Cpu8 : 14.8%us, 55.5%sy, 0.0%ni, 29.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu31 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 396762772k total, 5471576k used, 391291196k free, 354920k buffers
Swap: 1048568k total, 0k used, 1048568k free, 2164532k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7615 test1 20 0 18720 2120 1388 R 89.0 0.0 52:35.76 diameterfe
59349 test1 20 0 18712 2112 1388 R 70.8 0.0 121:02.37 diameterfe
610 root 20 0 36080 1364 1112 S 2.6 0.0 126:45.58 plymouthd
3064 root 20 0 10960 788 432 S 0.3 0.0 2:13.35 irqbalance
16891 root 20 0 15700 2076 1004 R 0.3 0.0 0:01.09 top
1 root 20 0 19364 1540 1232 S 0.0 0.0 0:05.20 init
...
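As a side note (not the fix that was eventually applied): the worker thread of a DiameterFE can be pinned to a chosen CPU so that it at least does not compete with the softirq processing on Cpu0. This is only an illustrative sketch; the CPU number is an arbitrary example, and pinning the thread does not by itself move the softirq work anywhere.

    // Illustrative sketch: pin the calling worker thread to one CPU
    // (here 8, an arbitrary example) so it stays off the CPU that is
    // saturated by network softirq. Uses glibc's pthread_setaffinity_np.
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    static bool pin_current_thread_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
        if (rc != 0) {
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return false;
        }
        return true;
    }

    int main()
    {
        pin_current_thread_to_cpu(8);   // keep this worker on Cpu8, away from Cpu0
        // ... the DiameterFE message processing loop would run here ...
        return 0;
    }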
The fix for this issue was to upgrade the kernel to 2.6.32-431.20.3.el6.x86_64.
After that, network interrupts and message queues are distributed across different CPUs.
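To verify the distribution after such a change, one can compare the per-CPU packet counters in /proc/net/softnet_stat (one line per online CPU; the first hexadecimal field is the number of packets processed by that CPU's NET_RX softirq). A minimal C++ sketch, assuming the standard format of that file:

    // Sketch: print the per-CPU processed-packet counters from
    // /proc/net/softnet_stat. Each line corresponds to an online CPU,
    // and the first hex field is the number of packets its softirq
    // handler has processed. Roughly even values mean the RX load is spread.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main()
    {
        std::ifstream stat("/proc/net/softnet_stat");
        std::string line;
        int cpu = 0;
        while (std::getline(stat, line)) {
            std::istringstream fields(line);
            unsigned long processed = 0;
            fields >> std::hex >> processed;   // first column: packets processed
            std::cout << "cpu" << cpu++ << ": " << processed << " packets processed\n";
        }
    }

With the old kernel essentially all of the count would sit on the first line (Cpu0); after the upgrade it should be spread across several lines.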