I have around 8000 zombie processes with PPID 1. About 90% of them are defunct nrpe processes. The original nrpe process is running fine, and its PID is the PGID and SID of all the defunct nrpe processes. The remaining 10% of defunct processes are salt-minion and sshd, again with PPID 1.
Three days ago there were no zombies. Starting on 21 July, about 2000 appeared (1900 of them nrpe), and the count has been growing at an accelerating rate since.
# top
top - 08:18:34 up 296 days, 22:07, 1 user, load average: 2.07, 2.04, 1.85
Tasks: 7659 total, 1 running, 173 sleeping, 0 stopped, 7485 zombie
%Cpu(s): 23.5 us, 1.4 sy, 0.0 ni, 74.6 id, 0.2 wa, 0.0 hi, 0.1 si, 0.1 st
KiB Mem : 32779896 total, 220536 free, 29965856 used, 2593504 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 684308 avail Mem
# ps -ef | grep defunct | grep Jul21 | wc -l
2108
# ps -ef | grep defunct | grep Jul22 | wc -l
4063
Sample output of ps -ajx | grep defunct:
1 302 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 304 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 311 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 315 309 309 ? -1 Z 74 0:00 [sshd] <defunct>
1 323 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 325 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 351 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 358 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 370 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 372 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 375 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 388 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 389 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 392 386 386 ? -1 Z 74 0:00 [sshd] <defunct>
1 395 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 409 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 411 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 412 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 414 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 426 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 428 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 440 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 460 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 462 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 464 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
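For a per-command breakdown instead of a per-date count, the defunct entries can be grouped by command name and parent PID. This is a minimal sketch, assuming procps `ps` (as shipped on CentOS 7), where zombies show a state starting with `Z`; the function name `count_zombies` is only illustrative and not part of any tool.

```shell
# Group defunct (state Z) processes by command name and parent PID.
# Expects "STAT PPID COMM ..." lines on stdin, e.g. from `ps -eo stat=,ppid=,comm=`.
count_zombies() {
    awk '$1 ~ /^Z/ { n[$3 " (ppid " $2 ")"]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Run against the live process table when ps is available.
command -v ps >/dev/null && ps -eo stat=,ppid=,comm= | count_zombies
```

Each output line is a count followed by the command and its parent, largest group first, which makes it easy to see whether everything really hangs off PID 1.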
My /var/log/messages is filling up with entries like these:
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:14 xxxxxhostnamexxxxx sshd[8908]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx newrelic-infra: time="2020-07-23T09:37:24Z" level=error msg="unable to get systemd service status" error="exit status 1"
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
journalctl -xe shows similar errors.
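To see which services the timeouts concentrate on, the log lines can be tallied by service name. This is a rough sketch that assumes the message format shown in the excerpt above; `count_activation_timeouts` is a made-up helper name. Note that each event appears twice in this log (once tagged `dbus[610]` and once `dbus-daemon: dbus[610]`), so the counts are doubled.

```shell
# Tally dbus "Failed to activate service ... timed out" lines per service name.
# Reads syslog-style lines on stdin, e.g. from /var/log/messages.
count_activation_timeouts() {
    grep -F "Failed to activate service" \
      | sed -n "s/.*Failed to activate service '\([^']*\)': timed out.*/\1/p" \
      | sort | uniq -c | sort -rn
}

# Example: count_activation_timeouts < /var/log/messages
```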
This has been a recurring problem for some months now: a random production host ends up with a very large number of nrpe zombies and a non-functioning systemd. To remediate, I reboot the server (it's an AWS EC2 instance), but I badly want to understand what's happening here. Any pointers or thoughts would be of immense help.
OS - CentOS Linux release 7.1.1503 (Core)
# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN