Vert.x clustered event bus not removing old node on Kubernetes rolling deployment

I have two Vert.x microservices running in a cluster on an on-premise Kubernetes cloud; they communicate with each other through a headless service (link). Whenever I do a rolling deployment I run into connectivity issues between the services. When I analysed the logs I can see that the old node/pod is removed from the cluster member list, but the event bus does not remove it and keeps using it on a round-robin basis.

Below is the member group information before the deployment:

    Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80        //pod 1
    Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
    Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447      //pod 2

When the deployment starts, pod 2 is removed from the member list:

    [192.168.4.54]:5701 [dev] [4.0.2] Could not connect to: /192.168.101.79:5701. Reason: SocketException[Connection refused to address /192.168.101.79:5701]
    Removing connection to endpoint [192.168.101.79]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.101.79:5701}, Error-Count: 5
    Removing Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447

And the new member is added:

    Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80
    Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
    Member [192.168.94.85]:5701 - 1347e755-1b55-45a3-bb9c-70e07a29d55b  //new pod
    All migration tasks have been completed. (repartitionTime=Mon May 10 08:54:19 MST 2021, plannedMigrations=358, completedMigrations=358, remainingMigrations=0, totalCompletedMigrations=3348, elapsedMigrationTime=1948ms, totalElapsedMigrationTime=27796ms)

But when a request is made to the deployed service, even though the old pod has been removed from the member group, the event bus still uses the old pod/service reference (ac0dcea9-898a-4818-b7e2-e9f8aaefb447):

    [vert.x-eventloop-thread-1] DEBUG io.vertx.core.eventbus.impl.clustered.ConnectionHolder - tx.id=f9f5cfc9-8ad8-4eb1-b12c-322feb0d1acd Not connected to server ac0dcea9-898a-4818-b7e2-e9f8aaefb447 - starting queuing

I checked the official documentation for rolling deployments, and my deployment seems to follow the two key points mentioned there: only one pod is removed at a time and then the new one is added.

never start more than one new pod at once

forbid more than one unavailable pod during the process

I am using Vert.x 4.0.3 and hazelcast-kubernetes 1.2.2. My verticle class extends AbstractVerticle and is deployed using:

    Vertx.clusteredVertx(options, res -> {
        // deploy the verticle only once the clustered Vert.x instance is ready
        if (res.succeeded()) res.result().deployVerticle(verticleName, deploymentOptions);
    });

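For context, a minimal sketch of how a clustered Vert.x instance is typically created with the Hazelcast cluster manager is shown below (the class and verticle names are placeholders, not my exact code, and the Kubernetes discovery settings are assumed to come from the Hazelcast configuration on the classpath):

    import io.vertx.core.Vertx;
    import io.vertx.core.VertxOptions;
    import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

    public class ClusteredMain {
        public static void main(String[] args) {
            // The no-arg manager loads the Hazelcast configuration
            // (including the Kubernetes discovery settings) from the classpath.
            HazelcastClusterManager mgr = new HazelcastClusterManager();
            VertxOptions options = new VertxOptions().setClusterManager(mgr);

            Vertx.clusteredVertx(options, res -> {
                if (res.succeeded()) {
                    // "com.example.MainVerticle" is a placeholder verticle name
                    res.result().deployVerticle("com.example.MainVerticle");
                } else {
                    res.cause().printStackTrace();
                }
            });
        }
    }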
Sorry for the long post; any help is highly appreciated.

There is 1 answer below.

One possible reason could be a race condition between Kubernetes removing the pod and updating the endpoints in kube-proxy, as detailed in this extensive article. This race condition leads to Kubernetes continuing to send traffic to the pod being removed after it has terminated.

One TL;DR solution is to add a delay when terminating a pod by either:

  1. Have the service delay shutdown when it receives a SIGTERM (e.g. for 15 seconds) so that it keeps responding to requests as normal during that delay period (see the sketch below this list).
  2. Use the Kubernetes preStop hook to execute a sleep 15 command in the container. This allows the service to continue responding to requests during that 15-second period while Kubernetes updates its endpoints. Kubernetes will send SIGTERM when the preStop hook completes.

Both solutions will give Kubernetes some time to propagate changes to its internal components so that traffic stops being routed to the pod being removed.
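For the first option, a minimal sketch in Java might look like the following. This is an assumption about how such a delay could be wired into the service, not code from the question; the 15-second delay and the helper name installDelayedShutdown are illustrative:

    import io.vertx.core.Vertx;

    public class GracefulShutdown {

        // Register a JVM shutdown hook that keeps the service responding for
        // ~15 seconds after SIGTERM, giving Kubernetes time to update its
        // endpoints before the Vert.x/Hazelcast node actually leaves.
        static void installDelayedShutdown(Vertx vertx) {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    Thread.sleep(15_000);   // keep serving traffic during the drain window
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                vertx.close();              // then shut down the event bus and cluster node
            }));
        }
    }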

A caveat to this answer is that I'm not familiar with Hazelcast clustering and how your specific discovery mode is set up.