Frequent high CPU load from a netdata installation in a Docker environment


We are running netdata in a Docker environment on big machines (64 GB RAM, 10 CPUs), each hosting many containers (>40) running the same setup, including Postgres, MongoDB, Tomcat, httpd, and Solr.

Inside each container we have a netdata service which collects detailed data and sends it to a central netdata instance. We are running 6 of these big machines across two data centers.
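For context, each per-container netdata is set up as a streaming child pointing at the central instance, roughly like the sketch below. The destination host and API key are placeholders, not our real values; the option names are the ones from a stock stream.conf.

```
# /etc/netdata/stream.conf (inside each container)
[stream]
    enabled = yes
    # placeholder destination; in reality this points at our central netdata
    destination = central-netdata.example.com:19999
    api key = 00000000-0000-0000-0000-000000000000
```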

Everything works fine, but there is one strange problem we are facing: since we integrated netdata into all containers, the CPU load spikes every 90 minutes up to a load of 120, which is much too high for a 10-CPU system, where 20 would be acceptable for a short time.

The load only stays high for a few minutes and then goes back to a level of 2-4 (which simply means most of the containers are idle most of the time, which is true).

We checked the processes and found no single process that produces a high load. The only thing we noticed is that the netdata Python scripts of all the different containers seem to run at the same time and together produce the high load.

[Screenshot: monitoring of one of the big servers]

What we already did:

- most of the netdata plugins are turned off: we only monitor CPU, network, disk, Tomcat, and Apache
- the netdata plugins run only every 5 seconds (any higher frequency produces even more load, and the server does not come back to a normal load)
- the plugins for Postgres and MongoDB are turned off (I would like to monitor them, but they completely break the server by causing too much load)
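For reference, our stripped-down configuration looks roughly like this. The paths and option names are from a stock netdata install, and the exact set of modules in python.d.conf is specific to our setup, so treat this as a sketch rather than a verbatim copy:

```
# /etc/netdata/netdata.conf (inside each container)
[global]
    # collect only every 5 seconds instead of every second
    update every = 5

[plugins]
    # plugin families we do not use at all
    charts.d = no
    node.d = no
    apps = no

# /etc/netdata/python.d.conf
# keep only the python.d collectors we actually use
tomcat: yes
apache: yes
# these two caused the biggest load spikes, so they stay off for now
postgres: no
mongodb: no
```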

My question is:

How can we change the netdata configuration in such a way that these regular high peaks of CPU load do not occur? We have 40 identical configurations, 40 Tomcat/Apache/SQL instances, etc. Is it the Docker environment in combination with netdata inside the containers?

We can only guess why it happens every 90 minutes. Maybe there is some pattern in the timing of how netdata schedules the plugins; I don't know.
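To verify this guess, a minimal diagnostic we could run on one of the big hosts is sketched below. It just samples the 1-minute load average and counts the running python.d.plugin processes; the process-matching string and the 10-second interval are our own assumptions, not anything netdata provides.

```python
#!/usr/bin/env python3
# Rough diagnostic sketch: log the 1-minute load average together with the
# number of netdata python.d.plugin processes currently running, so we can
# see whether the 90-minute spikes coincide with all containers' collectors
# waking up at the same moment.
import subprocess
import time

def count_python_d():
    # The string "python.d.plugin" is an assumption based on how the
    # collector shows up in `ps` on a stock netdata install.
    out = subprocess.run(["ps", "-eo", "args"],
                         capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines() if "python.d.plugin" in line)

while True:
    with open("/proc/loadavg") as f:
        load1 = f.read().split()[0]
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
          f"load1={load1} python.d={count_python_d()}", flush=True)
    time.sleep(10)
```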

Any hints or suggestions on how to manage monitoring in a system like this?
