Failed to allocate network resources


Docker-CE 19.03.8, swarm initialized. Setup: 1 manager node, nothing more.

We deploy many new stacks per day, and sometimes I see the following line:

level=error msg="Failed to allocate network resources for node sdlk0t6pyfb7lxa2ie3w7fdzr" error="could not find network allocator state for network qnkxurc5etd2xrkb53ry0fu59" module=node node.id=yp0u6n9c31yh3xyekondzr4jc

After 2 to 3 days, no new services can be started because there are no free VIPs. I see the following lines in my logs:

level=error msg="Could not parse VIP address  while releasing"                                                                                                                                       
level=error msg="error deallocating vip" error="invalid CIDR address: " vip.addr= vip.network=oqcsj99taftdu3b0t3nrgbgy1                                                                              
level=error msg="Event api.EventUpdateTask: Failed to get service idid0u7vjuxf2itpv8n31da57 for task 6vnc8jdkgxwxqbs3ixly2i6u4 state NEW: could not find service idid0u7vjuxf2itpv8n31da57" module=node ...
level=error msg="Event api.EventUpdateTask: Failed to get service sbjb7nk0wk31c2ayg8x898fhr for task noo21whnbwkyijnqavseirfg0 state NEW: could not find service sbjb7nk0wk31c2ayg8x898fhr" module=node ...
level=error msg="Failed to find network y73pnq85mjpn1pon38pdbtaw2 on node sdlk0t6pyfb7lxa2ie3w7fdzr" module=node node.id=yp0u6n9c31yh3xyekondzr4jc 

We tried to investigate this using debug mode. Here are some lines that stood out to me:

level=debug msg="Remove interface veth84e7185 failed: Link not found"
level=debug msg="Remove interface veth64c3a65 failed: Link not found"
level=debug msg="Remove interface vethf1703f1 failed: Link not found"
level=debug msg="Remove interface vethe069254 failed: Link not found"
level=debug msg="Remove interface veth2b81763 failed: Link not found"
level=debug msg="Remove interface veth0bf3390 failed: Link not found"
level=debug msg="Remove interface veth2ed04cc failed: Link not found"
level=debug msg="Remove interface veth0bc27ef failed: Link not found"
level=debug msg="Remove interface veth444343f failed: Link not found"
level=debug msg="Remove interface veth036acf9 failed: Link not found"
level=debug msg="Remove interface veth62d7977 failed: Link not found"

and

level=debug msg="Request address PoolID:10.0.0.0/24 App: ipam/default/data, ID: GlobalDefault/10.0.0.0/24, DBIndex: 0x0, Bits: 256, Unselected: 60, Sequence: (0xf7dfeeee, 1)->(0xedddddb7, 1)->(0x77777777, 3)->(0x77777775, 1)->(0x77ffffff, 1)->(0xffd55555, 1)->end Curr:233 Serial:true PrefAddress:<

When the Unselected count reaches 0, no new containers can be deployed; they are stuck in the NEW state. (Bits: 256 is the total size of the /24 pool; Unselected: 60 means only 60 addresses are still free.)

Has anyone experienced something like this, or can someone help me? We believe the problem has something to do with the release of the 10.0.0.0/24 (our ingress) addresses.
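
If the leak really is in the ingress pool, one documented way to reset its IPAM state is to recreate the ingress network while no services are attached to it (the subnet below is just our example value):

$ docker network rm ingress        # prompts for confirmation; fails while services still use it
$ docker network create --driver overlay --ingress --subnet 10.0.0.0/24 ingress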

There are 2 answers below.

Answer 1:

If you see your containers stuck in the NEW state, you are probably affected by this problem: https://github.com/moby/moby/issues/37338, reported by cintiadr:

Docker stack fails to allocate IP on an overlay network, and gets stuck in NEW current state #37338

Reproducing it:

Create a swarm cluster (1 manager, 1 worker). I created AWS t2.large Amazon Linux instances and installed Docker using their docs, version 18.06.1-ce.

# Deploy a new overlay network from a stack (docker-network.yml)
$ ./deploy-network.sh

# Deploy 60 identical services attaching to that network - 3 replicas each - from stacks (docker-network.yml)
$ ./deploy-services.sh
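
The helper scripts themselves are not included in the issue; a minimal sketch of what they plausibly do (the stack names and the docker-service.yml file name are assumptions):

# deploy-network.sh - deploy the stack that creates the overlay network
docker stack deploy -c docker-network.yml network

# deploy-services.sh - deploy 60 identical service stacks on that network
for i in $(seq 1 60); do
  docker stack deploy -c docker-service.yml "service$i"   # hypothetical stack file
done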

You can verify that all services are happily running.

Now let's bring the worker down.

Run:

docker node update --availability drain <node id> && docker node rm --force <node id>

Note: drain is an async operation (something I wasn't aware of), so to reproduce this use case you shouldn't wait for the drain to complete.

Create a new worker (a completely new node/machine) and join it to the cluster. You are going to see that very few services are actually able to start; all the others will be continuously rejected because no IPs are available.
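
The rejection reason is visible per task, e.g. (<service> is a placeholder):

$ docker service ps --no-trunc --format '{{.Name}}: {{.CurrentState}} {{.Error}}' <service>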

In past versions (17.x, I believe), the containers wouldn't be rejected (but would rather get stuck in NEW).

How to avoid that problem?

If you drain and patiently wait for all the containers to be terminated before removing the node, it appears that this problem is completely avoided.
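
A sketch of that safer removal sequence (the node ID is a placeholder and the polling interval is arbitrary):

NODE_ID=<node id>
docker node update --availability drain "$NODE_ID"

# wait until the drain has actually finished, i.e. no tasks are running on the node anymore
while [ -n "$(docker node ps --filter desired-state=running --format '{{.ID}}' "$NODE_ID")" ]; do
  sleep 5
done

docker node rm "$NODE_ID"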

Answer 2:

Did you try to stop and restart the Docker daemon?

sudo service docker stop
sudo service docker start
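
On systemd-based distributions the equivalent is:

sudo systemctl restart docker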

Also, you may find it useful to have a look at the magnificent documentation at https://dockerswarm.rocks/

I usually use this sequence to update a service:

export DOMAIN=xxxx.xxxxx.xxx
docker stack rm $service_name                         # $service_name must be set beforehand
export NODE_ID=$(docker info -f '{{.Swarm.NodeID}}')  # ID of the node we are on
# export environment vars if needed
# update data if needed
docker node update --label-add $service_name.$service_name-data=true $NODE_ID
docker stack deploy -c $service_name.yml $service_name
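
A usage example, assuming the stack file is named myapp.yml (the name is hypothetical):

export service_name=myapp
export DOMAIN=myapp.example.com
# ...then run the sequence above

Note that docker stack rm is asynchronous too: the stack's network can linger for a few seconds after the command returns, so if the immediate redeploy fails with a network error, wait briefly between rm and deploy.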