Zookeeper 3.5.3-beta does not work for me with GCloud Kubernetes Engine. Using the identical configuration with Zookeeper 3.4.10 works.
When I run a client sanity test, the only exception returned is this:
2017-11-29 14:27:17,597 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273] - Unexpected exception, tries=0, remaining init limit=20000, connecting to zk-2.zk-svc.default.svc.cluster.local:2888
java.net.UnknownHostException: zk-2.zk-svc.default.svc.cluster.local
It has been suggested that this problem is kube-dns related, as indicated here. However, kube-dns (dns.go:48] version: 1.14.4-2-g5584e04) seems to be working as expected:
/ # nslookup zk-0.zk-svc.default.svc.cluster.local
Server: 10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local
Name: zk-0.zk-svc.default.svc.cluster.local
Address 1: 10.60.3.3 zk-0.zk-svc.default.svc.cluster.local
/ # nslookup zk-2.zk-svc.default.svc.cluster.local
Server: 10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local
Name: zk-2.zk-svc.default.svc.cluster.local
Address 1: 10.60.4.3 zk-2.zk-svc.default.svc.cluster.local
/ # nslookup zk-1.zk-svc.default.svc.cluster.local
Server: 10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local
Name: zk-1.zk-svc.default.svc.cluster.local
Address 1: 10.60.2.5 zk-1.zk-svc.default.svc.cluster.local
And there are no errors in the kube-dns log.
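Since nslookup succeeds but the exception comes from Java, one thing worth checking is whether the JVM inside the ZooKeeper container can resolve the same names. The following is only a minimal sketch of such a check, not anything ZooKeeper ships: the class name ResolveCheck, the method resolveWithRetry, and the retry count and delay are all hypothetical choices of mine. It uses the standard InetAddress.getByName call, and the delay is set above 10 seconds on the assumption that the JVM's default negative DNS cache (networkaddress.cache.negative.ttl) is in effect.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

// Hypothetical diagnostic helper (not part of ZooKeeper): resolves host names
// through the JVM's own resolver, retrying on UnknownHostException.
public class ResolveCheck {

    // Tries to resolve `host` up to `attempts` times, sleeping `delayMillis`
    // between failed attempts. Returns an empty Optional if every attempt fails.
    public static Optional<InetAddress> resolveWithRetry(String host, int attempts, long delayMillis) {
        for (int i = 1; i <= attempts; i++) {
            try {
                return Optional.of(InetAddress.getByName(host));
            } catch (UnknownHostException e) {
                System.err.println("attempt " + i + ": cannot resolve " + host);
            }
            if (i < attempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Default to the three peer names from this question if none are given.
        String[] hosts = args.length > 0 ? args : new String[] {
            "zk-0.zk-svc.default.svc.cluster.local",
            "zk-1.zk-svc.default.svc.cluster.local",
            "zk-2.zk-svc.default.svc.cluster.local"
        };
        for (String host : hosts) {
            // 11 s delay: assumed to outlast the JVM's default 10 s negative lookup cache.
            Optional<InetAddress> addr = resolveWithRetry(host, 5, 11_000);
            System.out.println(host + " -> "
                    + addr.map(InetAddress::getHostAddress).orElse("UNRESOLVED"));
        }
    }
}
```

If this resolves all three names from inside the pod while ZooKeeper itself keeps failing, that would point at how 3.5.3 handles the lookup rather than at kube-dns.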
In 3.4.10, the first node also produces UnknownHostExceptions on initialization but eventually logs a successful resolution like the one below; in 3.5.3 it never does:
2017-11-29 15:14:39,923 [myid:] - INFO [main:QuorumPeer$QuorumServer@167] - Resolved hostname: zk-0.zk-svc.default.svc.cluster.local to address: zk-0.zk-svc.default.svc.cluster.local/10.60.4.4
I do not have enough information to file an issue with Zookeeper, so I would appreciate any suggestions on how to debug this.
Based on a recent comment in ZOOKEEPER-2343, I have deployed a 3.6.0-SNAPSHOT image. The second and third nodes accept client requests immediately, but the first does not and reports "This ZooKeeper instance is not currently serving requests".
Deleting the first node fixes that problem, because when it starts again it can participate in the quorum.