Why might a gvisor-based node pool fail to bootstrap properly?


I'm trying to provision a new node pool using gvisor sandboxing in GKE. I use the GCP web console to add a new node pool, select the cos_containerd image type, and check the Enable gvisor Sandboxing checkbox, but the node pool provisioning fails each time with an "Unknown Error" in the GCP console notifications. The nodes never join the K8S cluster.
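
For completeness, the gcloud equivalent of what I'm doing should look roughly like this (I actually use the console; the pool name, cluster, region and machine type below are placeholders):

# Rough gcloud equivalent of the console steps above (all values are placeholders)
gcloud container node-pools create sandbox-pool \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --image-type=COS_CONTAINERD \
    --sandbox type=gvisor \
    --machine-type=e2-standard-4 \
    --num-nodes=1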

The GCE VM itself seems to boot fine, and when I look at journalctl on the node I can see that cloud-init finished without problems, but the kubelet never manages to register the node with the cluster. I see error messages like this (one way to collect these logs is sketched after the excerpt):

Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.184163    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.284735    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.385229    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.485626    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.522961    1143 eviction_manager.go:251] eviction manager: failed to get summary stats: failed to get node info: node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz containerd[976]: time="2020-10-12T16:58:07.576735750Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.577353    1143 kubelet.go:2191] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.587824    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.989869    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:08 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:08.090287    1143 
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.296365    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.396933    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz node-problem-detector[1166]: F1012 16:58:09.449446    2481 main.go:71] cannot create certificate signing request: Post https://172.17.0.2/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?timeout=5m0s: dial tcp 172.17.0.2:443: connect: no route 
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz node-problem-detector[1166]: E1012 16:58:09.450695    1166 manager.go:162] failed to update node conditions: Patch https://172.17.0.2/api/v1/nodes/gke-main-sanboxes-dd9b8d84-dmzz/status: getting credentials: exec: exit status 1
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.453825    2486 cache.go:125] failed reading existing private key: open /var/lib/kubelet/pki/kubelet-client.key: no such file or directory
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.543449    1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.556623    2486 tpm.go:124] failed reading AIK cert: tpm2.NVRead(AIK cert): decoding NV_ReadPublic response: handle 1, error code 0xb : the handle is not correct for the use
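
(These logs can be pulled from the node roughly like this; the zone is a placeholder, and since this is a private cluster SSH may need to go through IAP:)

# SSH to the node VM (add --tunnel-through-iap if the node has no external IP)
gcloud compute ssh gke-main-sanboxes-dd9b8d84-dmzz --zone ZONE
# On the node: kubelet and containerd logs since boot
sudo journalctl -u kubelet -b --no-pager
sudo journalctl -u containerd -b --no-pager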

I am not really sure what is causing this, and I'd like to be able to use autoscaling with this node pool, so I don't want to fix it manually on this one node and then have to do the same for every new node that joins. How can I configure the node pool so that the gvisor-based nodes provision correctly on their own?
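
For context on the autoscaling side: both the sandbox setting and autoscaling are node pool level settings, so any fix needs to apply to the pool template rather than to individual nodes. A hypothetical sketch of such a pool in gcloud terms (all values are placeholders):

# Hypothetical: gvisor-sandboxed pool with autoscaling configured on the pool itself,
# so every node the autoscaler creates gets the same settings (values are placeholders)
gcloud container node-pools create sandbox-pool \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --image-type=COS_CONTAINERD \
    --sandbox type=gvisor \
    --enable-autoscaling --min-nodes=0 --max-nodes=3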

My cluster details:

  • GKE version: 1.17.9-gke.6300
  • Cluster type: Regional
  • VPC-native
  • Private cluster
  • Shielded GKE Nodes

1 Answer


You can report issues with Google products through Google's public Issue Tracker:

You will need to choose Create new Google Kubernetes Engine issue under the Compute section.


I can confirm that I stumbled upon the same issue when creating a cluster as described in the question (private, shielded, etc.):

  • Create a cluster with one node pool.
  • Add the node pool with gvisor enabled after the cluster has been created successfully (a gcloud sketch of these steps follows this list).
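
In gcloud terms the repro looks roughly like this (a sketch only; the CIDR, node counts and pool names are placeholders, not the exact commands used):

# 1. Regional, VPC-native, private, shielded cluster with one default node pool
gcloud container clusters create gke-gvisor \
    --region=europe-west3 \
    --cluster-version=1.17.9-gke.6300 \
    --enable-ip-alias \
    --enable-private-nodes \
    --master-ipv4-cidr=172.16.0.32/28 \
    --enable-shielded-nodes \
    --machine-type=e2-medium \
    --num-nodes=2

# 2. Once the cluster is Running, add the gvisor-enabled node pool
gcloud container node-pools create gvisor-pool \
    --cluster=gke-gvisor \
    --region=europe-west3 \
    --image-type=COS_CONTAINERD \
    --sandbox type=gvisor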

Creating a cluster like the above pushes the GKE cluster into the RECONCILING state:

NAME        LOCATION      MASTER_VERSION   MASTER_IP       MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
gke-gvisor  europe-west3  1.17.9-gke.6300  XX.XXX.XXX.XXX  e2-medium     1.17.9-gke.6300  6          RECONCILING
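
(The status above is just the standard cluster listing, assuming the gcloud CLI:)

gcloud container clusters list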

The cluster state changed as follows:

  • Provisioning - creating the cluster
  • Running - the cluster was created
  • Reconciling - the node pool was being added
  • Running - the node pool was added (this lasted about a minute)
  • Reconciling - the cluster stayed in this state for about 25 minutes

GCP Cloud Console (Web UI) reports: Repairing Cluster