Context/problem description

My team is currently working on replacing our existing production environment for orchestrating and managing containerised applications. Today we have a single VM running Docker with a simple (but very long!) docker-compose.yml file containing around 10 services.

We have neither the time nor the skills to set up a proper Kubernetes cluster (and to manage it day to day afterwards), so we chose to migrate to a Docker Swarm cluster, an intermediate solution between a basic docker-compose.yml file and a full Kubernetes cluster, with 3 managers and at least 3 workers.

Our problems regarding logging are as follows:

  • for some apps, we absolutely need to archive and keep every log line produced. These apps run a single replica (for specific business reasons) but, given how Swarm works, a replica can be moved over time. When a replica is moved, it is in fact stopped/killed/discarded on the original node and recreated on another node, and the logs generated by the first instance are lost. We do not want that.
  • for some others, we will have more than one replica, so we are unsure how to properly manage logs coming from different instances: whether to archive them together or separately, how to name the files, and where to store them (NFS mount, database, third-party app).

Our thinking, things we already tried, and ideas we had (and their limits)

We already considered several options:

  • the docker service logs command correctly merges logs from the standard output of the different running replicas, but does not keep the logs of killed replicas
  • writing the logs of all replicas of the same app to a single file in a shared folder (mounted via NFS or similar) raises the possibility of concurrent writes and file corruption, which we do not want
  • writing the logs of each replica to its own dedicated file in a shared folder (mounted via NFS or similar) solves the concurrency issue, but raises the questions of file naming (replica/task ID?) and of how old log history could be purged automatically (with a configured retention policy); see the sketch after this list
  • setting up a dedicated log-management solution (such as Splunk, Graylog, ...) is a possibility, but it seems overkill given our small infrastructure
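
As an illustration of the per-replica naming idea mentioned above: the log drivers that ship logs off the host (syslog, fluentd, gelf, ...) accept a tag option with templates such as {{.Name}}, and in Swarm a task's container name already embeds the service name, slot, and task ID. A minimal sketch, assuming a hypothetical syslog endpoint; routing into per-replica files and retention would then be handled by the receiver (e.g. rsyslog plus logrotate):

    services:
      web:
        image: nginx:alpine
        deploy:
          replicas: 3
        logging:
          driver: syslog
          options:
            # hypothetical syslog endpoint on your network
            syslog-address: "udp://logs.example.internal:514"
            # expands to <service>.<slot>.<task-id>, one distinct tag per replica
            tag: "{{.Name}}"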

So the question is (or questions are), given all these constraints: are there best practices we are not aware of, or tips and tricks we have missed, that would simplify our log management while addressing concurrent writes, consistency, multiple instances, and the absolute need not to lose any logs for certain critical apps?

Answer by Chris Becke

First, docker service logs <service name or id> will return the logs of all historical and current tasks in no guaranteed (i.e. interleaved) order. So you generally want to produce a list of tasks first, and then query each task individually with something like docker service logs <task id>.
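
A minimal shell sketch of that approach (my_service is a placeholder name):

    # dump the logs of every task (current and historical) of a service
    # into one file per task, named after the task ID
    mkdir -p logs
    for task in $(docker service ps --quiet my_service); do
        docker service logs --timestamps "$task" > "logs/${task}.log"
    done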

Next, Docker has a number of log drivers that can be set up either globally in daemon.json or per service/container. The list can be found via docker info, and it includes several options that can persist logs.
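
For example, a daemon.json setting json-file globally with log rotation (the values are illustrative; this requires a daemon restart on each node, and a logging: section in the compose file overrides it per service):

    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "10m",
        "max-file": "5"
      }
    }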

The only log systems that Docker supports without a third-party component are the local and json-file drivers, which store the logs under /var/lib/docker/containers/<container-id>/. These obviously get reaped when the containers are pruned.
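
If needed, the exact path for a given container can be read from docker inspect:

    # prints something like /var/lib/docker/containers/<id>/<id>-json.log
    # (for the json-file driver)
    docker inspect --format '{{.LogPath}}' <container-id>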

Base Docker does not automatically prune containers. Docker Swarm, on the other hand, tracks active and stopped tasks and their containers, and has a task-history limit that defines how many tasks/containers it keeps. Again, docker info will show this limit, and docker swarm update --task-history-limit can be used to alter it on a running swarm. Each replica will therefore have one active task container plus up to task-history-limit historical containers' worth of logs.
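
Concretely (the limit of 10 below is just an example):

    # show the current retention limit (the default is 5)
    docker info 2>/dev/null | grep -i 'task history'

    # keep the last 10 stopped tasks (and their containers' logs) per slot
    docker swarm update --task-history-limit 10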

Your actual retention of service logs is therefore a function of the task-history limit and of how quickly tasks are being cycled.

Finally, Docker Swarm never spontaneously moves a running task/container to another node. Outside of overt user actions like redeployment, the task's container would have to exit or fail its health check, or the node it is on would itself have to become unhealthy.
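
In other words, rescheduling is driven by failure signals you define yourself. A sketch of the relevant knobs in a compose file (the image name and health endpoint are hypothetical, and the check assumes curl exists in the image):

    services:
      api:
        image: registry.example.internal/api:1.0   # hypothetical image
        healthcheck:
          # if this check fails `retries` times, Swarm replaces the task
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 5s
          retries: 3
        deploy:
          restart_policy:
            condition: on-failure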