New eks node instance not able to join cluster, getting "cni plugin not initialized"


I am pretty new to Terraform and am trying to create a new EKS cluster with a node group and a launch template. The EKS cluster, node group, launch template, and nodes were all created successfully. However, when I changed the desired size of the node group (using Terraform or the AWS Management Console), it would fail: no error was reported in the node group's Health issues tab, but when I dug further I found that new instances were launched by the Auto Scaling group, yet those new instances were not able to join the cluster.

Looking into the troubled instances, I found the following log entries by running "sudo journalctl -f -u kubelet":

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.612322 3168 eviction_manager.go:254] "Eviction manager: failed to get summary stats" err="failed to get node info: node "ip-10-102-21-129.us-east-2.compute.internal" not found"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.654501 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.755473 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.776238 3168 kubelet.go:2352] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.856199 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"

It looked like the issue had something to do with the CNI add-on. Googling it, others suggested checking the logs inside the /var/log/aws-routed-eni directory. I could find that directory and its logs on the working nodes (the ones created initially when the EKS cluster was created), but the same directory and log files do not exist on the newly launched instances (the ones created after the cluster was up, by changing the desired node size).
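For reference, this is roughly what that check looks like on a node (assuming SSH or SSM access to the instance); on the broken nodes the directory simply isn't there:

ls -la /var/log/aws-routed-eni/
sudo tail -n 50 /var/log/aws-routed-eni/ipamd.log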

The image I used for the node-group is ami-0af5eb518f7616978 (amazon/amazon-eks-node-1.24-v20230105)
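As a side check, the recommended EKS-optimized AMI for a given Kubernetes version can be looked up from the public SSM parameter (the version and region below match my setup):

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.24/amazon-linux-2/recommended/image_id \
  --region us-east-2 --query "Parameter.Value" --output text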

Here is what my script looks like:

resource "aws_eks_cluster" "eks-cluster" {
  name = var.mod_cluster_name 
  role_arn = var.mod_eks_nodes_role
  version = "1.24"
  
  vpc_config {
    security_group_ids = [var.mod_cluster_security_group_id]
    subnet_ids = var.mod_private_subnets
    endpoint_private_access = "true"
    endpoint_public_access = "true"
  }
}
resource "aws_eks_node_group" "eks-cluster-ng" {
  cluster_name = aws_eks_cluster.eks-cluster.name
  node_group_name = "eks-cluster-ng"  
  node_role_arn = var.mod_eks_nodes_role
  subnet_ids = var.mod_private_subnets
  #instance_types = ["t3a.medium"]
   
   
  scaling_config {
    desired_size = var.mod_asg_desired_size
    max_size = var.mod_asg_max_size
    min_size = var.mod_asg_min_size
  }
  
  
  launch_template {
    #name   = aws_launch_template.eks_launch_template.name
    id          = aws_launch_template.eks_launch_template.id
    version     = aws_launch_template.eks_launch_template.latest_version
  }
  
  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_launch_template" "eks_launch_template" {
  
  name = join("", [aws_eks_cluster.eks-cluster.name, "-launch-template"])

  vpc_security_group_ids = [var.mod_node_security_group_id]

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size = var.mod_ebs_volume_size 
      volume_type = "gp2"
      #encrypted   = false
    }
  }
  
  lifecycle {
    create_before_destroy = true
  }
  
  image_id = var.mod_ami_id
  instance_type = var.mod_eks_node_instance_type
  
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  
    http_put_response_hop_limit = 2 
  }

  user_data = base64encode(<<-EOF
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
set -ex

exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

B64_CLUSTER_CA=${aws_eks_cluster.eks-cluster.certificate_authority[0].data}

API_SERVER_URL=${aws_eks_cluster.eks-cluster.endpoint}

K8S_CLUSTER_DNS_IP=172.20.0.10


/etc/eks/bootstrap.sh ${aws_eks_cluster.eks-cluster.name} --apiserver-endpoint $API_SERVER_URL --b64-cluster-ca $B64_CLUSTER_CA 

--==MYBOUNDARY==--\
  EOF
  )

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "EKS-MANAGED-NODE"
    }
  }
}
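As a side note, the user data that Terraform actually sends can be rendered locally (assuming the launch template is already in state), which is a quick way to confirm the bootstrap.sh line survives the interpolation:

echo 'base64decode(aws_launch_template.eks_launch_template.user_data)' | terraform console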

Another thing I noticed is that I tagged the instance Name as "EKS-MANAGED-NODE". That tag showed up correctly on the nodes created when the EKS cluster was created. However, on any new nodes created afterward, the Name changed to "EKS-MANAGED-NODEGROUP-NODE".

I wonder if that indicates an issue?

I checked the log and confirmed that the user data was picked up and executed when the instances started up.

sh-4.2$ more user-data.log

  • B64_CLUSTER_CA=LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1ERXlOekU 0TlRrMU1Wb1hEVE16TURFeU5E (deleted the rest)

  • API_SERVER_URL=https://EC283069E9FF1B33CD6C59F3E3D0A1B9.gr7.us-east-2.eks.amazonaws.com

  • K8S_CLUSTER_DNS_IP=172.20.0.10

Using kubelet version 1.24.7 true
Using containerd as the container runtime true
‘/etc/eks/containerd/containerd-config.toml’ -> ‘/etc/containerd/config.toml’
‘/etc/eks/containerd/sandbox-image.service’ -> ‘/etc/systemd/system/sandbox-image.service’
Created symlink from /etc/systemd/system/multi-user.target.wants/containerd.service to /usr/lib/systemd/system/containerd.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/sandbox-image.service to /etc/systemd/system/sandbox-image.service.
‘/etc/eks/containerd/kubelet-containerd.service’ -> ‘/etc/systemd/system/kubelet.service’
Created symlink from /etc/sy

I confirmed that the role being specified has all the required permissions; the same role is used by another EKS cluster, and I am trying to create this new cluster based on that existing one using Terraform.
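For completeness, this is the kind of check I mean (the role name below is a placeholder); the node role normally needs AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly attached:

aws iam list-attached-role-policies \
  --role-name my-eks-node-role \
  --query "AttachedPolicies[].PolicyName" --output table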

I tried removing the launch template and letting AWS use the default one. Then the new nodes had no issue joining the cluster.
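One way to compare the two cases is to dump and diff the user data of the launch template versions involved (EKS keeps its own copy of the template for the node group); the template ID below is a placeholder:

aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0123456789abcdef0 --versions '$Latest' \
  --query "LaunchTemplateVersions[0].LaunchTemplateData.UserData" \
  --output text | base64 --decode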

I looked at my launch template script and at the Terraform registry documentation (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template), and nowhere does it mention that I need to manually add or run the CNI plugin.

So I don't understand why the CNI plugin was not installed automatically and why the instances are not able to join the cluster.
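One more thing that can be checked from the cluster side is whether the VPC CNI (aws-node) DaemonSet pod ever landed on the new instance:

kubectl -n kube-system get daemonset aws-node
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide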

Any help is appreciated.


There is 1 answer below.


My answer might not match the OP's context 100%, but I'm giving it anyway since I also ran into the same error, and hopefully it gives anyone who encounters this issue a small hand in investigating the problem and figuring it out on their own.

My context back then: the EC2 worker node booted up successfully but couldn't join the cluster. The node status in kubectl was "NotReady", and describing the node gave:

Reason                Message
-----                 ----------
KubeletNotReady       container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
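The commands behind that check are the usual ones (the node name is a placeholder):

kubectl get nodes
kubectl describe node <node-name>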

The VPC CNI plugin status of the EKS cluster was also bad: Degraded. The reason for this error is that no CNI network configuration had been defined in /etc/cni/net.d on the worker node machine; the net.d directory on my worker node was completely empty.
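On the worker node itself that is a one-line check; on a healthy EKS node this directory typically contains a 10-aws.conflist written by the VPC CNI:

ls -la /etc/cni/net.d/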

My lifesaver was the AWS EKS CNI plugin troubleshooting guide. The most precise logs about this issue are found in the /var/log/aws-routed-eni/ directory of the worker node.

To check the logs, go to the /var/log/aws-routed-eni/ directory and locate the files named plugin.log and ipamd.log.
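A quick way to surface the interesting lines (assuming shell access to the node):

sudo grep -iE "error|fail" /var/log/aws-routed-eni/ipamd.log | tail -n 20
sudo grep -iE "error|fail" /var/log/aws-routed-eni/plugin.log | tail -n 20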

From this stage on it's context-dependent and you're on your own, depending on what you find in the logs. In my case, it was this error: the ENIConfig was not found.

{"level":"info","ts":"2023-08-30T07:35:44.230Z","caller":"ipamd/ipamd.go:889","msg":"Found ENI Config Name: ap-southeast-1b"}
{"level":"error","ts":"2023-08-30T07:35:44.331Z","caller":"ipamd/ipamd.go:889","msg":"error while retrieving eniconfig: ENIConfig.crd.k8s.amazonaws.com \"ap-southeast-1b\" not found"}

The ENIConfig was not found because I hadn't provisioned an ENIConfig definition. You can inspect the cluster's ENIConfig CRD with kubectl get crd eniconfigs.crd.k8s.amazonaws.com -o yaml
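For reference, the actual ENIConfig objects (as opposed to the CRD) can be listed, and a minimal ENIConfig looks roughly like the sketch below; the subnet and security group IDs are placeholders, and the object name follows the availability-zone naming that ipamd was looking for in my logs:

# list the ENIConfig objects themselves
kubectl get eniconfigs.crd.k8s.amazonaws.com

# minimal sketch of an ENIConfig (placeholder subnet/SG IDs)
cat <<'EOF' | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: ap-southeast-1b
spec:
  securityGroups:
    - sg-0123456789abcdef0
  subnet: subnet-0123456789abcdef0
EOF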

If the CNI plugin logs are not sufficient, you should also check the kubelet log on your worker node instance to see if there's any other clue:

$ journalctl -u kubelet > kubelet.log
$ less kubelet.log

Best of luck!