We are running a Docker Swarm and using Monit to watch resource utilisation. The
process memory for dockerd keeps growing over time. This happens on every node that performs at least one Docker action, e.g. docker inspect or docker exec. I suspect it is related to these actions, but I'm not sure how to replicate it. I have a script like this:
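#!/bin/sh
set -eu
# Collect the names of all running containers once, before the loop starts.
containers=$(docker container ls --format '{{.Names}}')
# Loop forever, inspecting every container to generate load on dockerd.
while true; do
    for container in $containers; do
        echo "Running inspect on $container"
        # The result is not used; the call only exercises the daemon API.
        CONTAINER_STATUS="$(docker inspect -f '{{.State}}' "$container")"
    done
done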
but I'm open to other suggestions.
Assuming you can use Ansible to run a command via SSH on all servers:
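As a minimal sketch of something you could run on each node this way, the script below sums the resident memory of every process named dockerd by reading /proc. The function name `rss_kb_by_name` is made up for illustration, and the approach assumes Linux nodes with /proc mounted.

```shell
#!/bin/sh
# Hypothetical helper: print the total resident memory (VmRSS, in kB)
# of all processes whose command name matches the first argument.
rss_kb_by_name() {
    name="$1"
    total=0
    for comm in /proc/[0-9]*/comm; do
        # Skip entries that vanished or that we cannot read.
        [ -r "$comm" ] || continue
        if [ "$(cat "$comm" 2>/dev/null)" = "$name" ]; then
            dir=${comm%/comm}
            # Kernel threads have no VmRSS line, so default to 0.
            rss=$(awk '/^VmRSS:/ {print $2}' "$dir/status" 2>/dev/null)
            total=$((total + ${rss:-0}))
        fi
    done
    echo "$total"
}

# Report dockerd by default; another process name can be passed as $1.
rss_kb_by_name "${1:-dockerd}"
```

If this were saved as dockerd_rss.sh (a placeholder name), something like `ansible swarm -m script -a ./dockerd_rss.sh` should run it on every host in an inventory group called swarm (also a placeholder). Running it periodically gives you a per-node number to compare while the inspect loop is running.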
A more SRE-style solution is containerd + Prometheus + Alertmanager / Grafana: gather metrics from the swarm nodes, then alert when container thresholds are exceeded.
Don't forget you can simply set a resource constraint on Swarm services to limit the amount of memory and CPU that service tasks can consume; a task that exceeds its memory limit is killed and restarted. Then just look for services that keep getting OOM killed.
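A rough sketch of both steps, assuming the standard `docker service update --limit-memory/--limit-cpu` flags and the `.State.OOMKilled` field from `docker inspect`. The function names and the example values are placeholders, not something from your setup.

```shell
#!/bin/sh
# Hypothetical helper: cap a service's memory and CPU.
# Usage: limit_service <service> <memory> <cpus>, e.g. limit_service web 512M 0.5
limit_service() {
    docker service update --limit-memory "$2" --limit-cpu "$3" "$1"
}

# Hypothetical helper: list containers on this node (running or exited)
# that were killed by the OOM killer.
oom_killed_containers() {
    for id in $(docker ps -aq); do
        # Prints e.g. "/web.1.abc true" or "/db.1.xyz false".
        docker inspect -f '{{.Name}} {{.State.OOMKilled}}' "$id"
    done | awk '$2 == "true" {print $1}'
}
```

Calling `oom_killed_containers` on each node after the limits are in place should show which service tasks keep hitting their memory limit. Note this only sees containers that still exist on the node, so old tasks already cleaned up by Swarm will not appear.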