Zombie processes with PPID 1 and error "Failed to get properties: Activation of org.freedesktop.systemd1 timed out"

I have around 8000 zombie processes with PPID 1. About 90% of them are defunct nrpe processes. The original nrpe process is running fine, and its PID is the PGID and SID of all the defunct nrpe processes. The remaining 10% are defunct salt-minion and sshd processes, again with PPID 1.

Three days ago there were no zombies. On 21 July about 2000 appeared (1900 of them nrpe), and the count has been growing at an accelerating rate since.

# top
top - 08:18:34 up 296 days, 22:07,  1 user,  load average: 2.07, 2.04, 1.85
Tasks: 7659 total,   1 running, 173 sleeping,   0 stopped, 7485 zombie
%Cpu(s): 23.5 us,  1.4 sy,  0.0 ni, 74.6 id,  0.2 wa,  0.0 hi,  0.1 si,  0.1 st
KiB Mem : 32779896 total,   220536 free, 29965856 used,  2593504 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   684308 avail Mem
# ps -ef | grep defunct | grep Jul21 | wc -l
2108
# ps -ef | grep defunct | grep Jul22 | wc -l
4063

Sample output of ps -ajx | grep defunct:

    1   302  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   304  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   311  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   315   309   309 ?           -1 Z       74   0:00 [sshd] <defunct>
    1   323  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   325  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   351  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   358  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   370  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   372  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   375  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   388  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   389  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   392   386   386 ?           -1 Z       74   0:00 [sshd] <defunct>
    1   395  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   409  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   411  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   412  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   414  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   426  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   428  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   440  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   460  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   462  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   464  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
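
Grepping for "defunct" works, but a per-command breakdown is easier to compare between runs. The following is only a sketch of how that could be done: the function name count_zombies is made up for illustration, and it assumes the procps ps shipped with CentOS 7, whose stat, ppid, and comm output columns are used here.

```shell
# Hypothetical helper: count zombie processes grouped by command name and parent PID.
# Expects lines of the form "STAT PPID COMMAND ..." on stdin,
# as printed by: ps -eo stat=,ppid=,comm=
count_zombies() {
    # A process state starting with "Z" marks a zombie.
    # Only the first word of the command name is used as the key.
    awk '$1 ~ /^Z/ { n[$3 " ppid=" $2]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Usage:
#   ps -eo stat=,ppid=,comm= | count_zombies
```

Each output line is a count followed by the command name and parent PID, largest group first. Command names that contain spaces are cut at the first space, which is acceptable for nrpe, sshd, and salt-minion.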

My /var/log/messages is filled with entries like these:

Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:14 xxxxxhostnamexxxxx sshd[8908]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx newrelic-infra: time="2020-07-23T09:37:24Z" level=error msg="unable to get systemd service status" error="exit status 1"
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out

journalctl -xe shows similar errors.
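
To see which D-Bus service names are timing out and how often, the log lines can be summarised per service. This is a rough sketch, not something taken from the affected host: the function name count_activation_failures is hypothetical, and it assumes the message format shown in the excerpt above.

```shell
# Hypothetical helper: count "Failed to activate service" messages per service name.
# Reads syslog lines on stdin.
count_activation_failures() {
    # Keep only the matching fragment, then split on single quotes
    # so that field 2 is the service name.
    grep -o "Failed to activate service '[^']*'" \
        | awk -F"'" '{ n[$2]++ } END { for (s in n) print n[s], s }' \
        | sort -rn
}

# Usage:
#   count_activation_failures < /var/log/messages
```

In the excerpt above each failure is logged twice, once tagged dbus[610] and once tagged dbus-daemon, so the counts will be roughly double the number of actual timeouts.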

This has been a recurring problem for some months now: a random production host ends up with a very large number of nrpe zombies and a non-functioning systemd. To remediate, I reboot the server (it's an AWS EC2 instance), but I badly want to understand what is happening here. Any pointers or thoughts would be of immense help.

OS - CentOS Linux release 7.1.1503 (Core)

# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN