"UnknownHostException": Zookeeper 3.5.3 and StatefulSet Kubernetes


Zookeeper 3.5.3-beta does not work for me with GCloud Kubernetes Engine. Using the identical configuration with Zookeeper 3.4.10 works.

When I run a client sanity test, the only exception returned is this:

2017-11-29 14:27:17,597 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273] - Unexpected exception, tries=0, remaining init limit=20000, connecting to zk-2.zk-svc.default.svc.cluster.local:2888
java.net.UnknownHostException: zk-2.zk-svc.default.svc.cluster.local

It has been suggested that this problem is related to kube-dns, as indicated here. However, kube-dns (dns.go:48] version: 1.14.4-2-g5584e04) seems to be working as expected:

/ # nslookup zk-0.zk-svc.default.svc.cluster.local
Server:    10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local

Name:      zk-0.zk-svc.default.svc.cluster.local
Address 1: 10.60.3.3 zk-0.zk-svc.default.svc.cluster.local
/ # nslookup zk-2.zk-svc.default.svc.cluster.local
Server:    10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local

Name:      zk-2.zk-svc.default.svc.cluster.local
Address 1: 10.60.4.3 zk-2.zk-svc.default.svc.cluster.local
/ # nslookup zk-1.zk-svc.default.svc.cluster.local
Server:    10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local

Name:      zk-1.zk-svc.default.svc.cluster.local
Address 1: 10.60.2.5 zk-1.zk-svc.default.svc.cluster.local

And there are no errors in the kube-dns log.
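To repeat the nslookup checks above from a script running inside one of the pods, something like the following could be used. It is only a sketch: the function names `peer_hostnames` and `check_resolution` are made up for illustration, and the default cluster domain `cluster.local` is assumed from the hostnames shown above.

```python
import socket


def peer_hostnames(statefulset, service, namespace, replicas,
                   cluster_domain="cluster.local"):
    """Build the per-pod DNS names that a headless service gives a StatefulSet.

    The pattern <statefulset>-<ordinal>.<service>.<namespace>.svc.<domain>
    matches the names in the nslookup output above.
    """
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


def check_resolution(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None on failure.

    `resolver` is injectable so the function can be exercised without DNS.
    """
    results = {}
    for host in hostnames:
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results
```

Calling `check_resolution(peer_hostnames("zk", "zk-svc", "default", 3))` from inside a pod should return an address for every peer if DNS is healthy, and `None` for any name that cannot be resolved.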

In 3.4.10, the first node also produces UnknownHostExceptions on initialization, but it eventually logs a message like the one below, indicating that the hostname was resolved. This never happens in 3.5.3:

2017-11-29 15:14:39,923 [myid:] - INFO  [main:QuorumPeer$QuorumServer@167] - Resolved hostname: zk-0.zk-svc.default.svc.cluster.local to address: zk-0.zk-svc.default.svc.cluster.local/10.60.4.4
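The "fails at first, resolves later" behaviour seen in 3.4.10 can be checked independently of Zookeeper by retrying the lookup for a while from inside the pod. This is a minimal sketch, not Zookeeper's own logic: `resolve_with_retry` and its parameters are hypothetical names, and the attempt count and delay are arbitrary.

```python
import socket
import time


def resolve_with_retry(host, attempts=10, delay=2.0,
                       resolver=socket.gethostbyname, sleep=time.sleep):
    """Try to resolve `host` up to `attempts` times.

    Returns (address, tries_used) on success, or (None, attempts) if every
    lookup failed. `resolver` and `sleep` are injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host), attempt
        except OSError:  # covers socket.gaierror (unknown host)
            if attempt < attempts:
                sleep(delay)  # wait before the next lookup
    return None, attempts
```

If the name resolves after a few tries here, the DNS record is simply appearing late, which would suggest the difference between the two versions lies in whether the server retries the lookup rather than in kube-dns itself.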

I do not have enough information to file an issue with Zookeeper, so I would appreciate any suggestions on how to debug this.

Best answer

Based on a recent comment in ZOOKEEPER-2343, I have deployed a 3.6.0-SNAPSHOT image. The second and third nodes accept client requests immediately, but the first does not and reports "This ZooKeeper instance is not currently serving requests".

Deleting the first node (so that the StatefulSet recreates it) fixes that problem, because when it restarts it can participate in the quorum.
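To see which node is in that state without running a full client test, each server can be probed with a four-letter-word command on the client port (2181 in the logs above). The sketch below assumes the `srvr` command is enabled on the server; `four_letter_word` and `is_serving` are illustrative helper names, not part of any Zookeeper client library.

```python
import socket

# Message quoted above, returned by a server that has not joined the quorum.
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def four_letter_word(host, port=2181, command="srvr", timeout=5.0):
    """Send a four-letter-word command and return the server's text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_serving(response):
    """Return True if the reply is non-empty and is not the 'not serving' message."""
    return bool(response.strip()) and NOT_SERVING not in response
```

For example, `is_serving(four_letter_word("zk-0.zk-svc.default.svc.cluster.local"))` would be expected to return `False` for the first node until it has been deleted and restarted.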