I am getting the error "Could not join nodes to cluster" when trying to create an EKS Node Group using a Launch Template.
I can create a Node Group if I do not specify a Launch Template and instead choose an AMI directly in the Create Node Group console configuration. All other settings are the same (VPC, private subnets, Security Group, IAM Role, instance type, min/max instances). In that case I choose the AMI ID ami-06be503a3852b6423 and use a g4dn.2xlarge with On-Demand instances.
Launch Template Details:
{
    "LaunchTemplateVersions": [
        {
            "LaunchTemplateId": "TEMPLATEID",
            "LaunchTemplateName": "eks-gpu-optimized-template-engine-primary",
            "VersionNumber": 2,
            "VersionDescription": "adding key",
            "CreateTime": "2024-01-17T21:13:31+00:00",
            "CreatedBy": "arn:aws:iam::USER",
            "DefaultVersion": false,
            "LaunchTemplateData": {
                "BlockDeviceMappings": [
                    {
                        "DeviceName": "/dev/sdb",
                        "VirtualName": "ephemeral0"
                    }
                ],
                "ImageId": "ami-099c85b23bfc2fd16",
                "InstanceType": "g4dn.2xlarge",
                "KeyName": "XXXXXXXXX",
                "SecurityGroupIds": [
                    "sg-XXXXXXXXXXXXX"
                ]
            }
        }
    ]
}
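One detail that may be worth checking in the output above, assuming this is an EKS managed node group: when a launch template sets ImageId, EKS does not merge in its own bootstrap user data, so the template itself has to carry it. The Python sketch below inspects a saved copy of the JSON for those two fields. The function name check_launch_template and the trimmed inline sample are illustrative, not part of the original setup.

```python
import json


def check_launch_template(doc):
    """Return a finding for every version that sets an AMI but no user data."""
    findings = []
    for version in doc.get("LaunchTemplateVersions", []):
        data = version.get("LaunchTemplateData", {})
        number = version.get("VersionNumber")
        # A custom ImageId without UserData means nothing runs the bootstrap step.
        if "ImageId" in data and not data.get("UserData"):
            findings.append(
                f"version {number}: ImageId is set but UserData is missing"
            )
    return findings


# Trimmed copy of the describe-launch-template-versions output shown above.
SAMPLE = json.loads("""
{
  "LaunchTemplateVersions": [
    {
      "VersionNumber": 2,
      "LaunchTemplateData": {
        "ImageId": "ami-099c85b23bfc2fd16",
        "InstanceType": "g4dn.2xlarge"
      }
    }
  ]
}
""")

for finding in check_launch_template(SAMPLE):
    print(finding)
```

If the check reports a missing UserData field, that would be consistent with nodes launching but never registering with the cluster, although it does not prove that this is the cause here.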
IAM Roles and Security Group Details
EKS Cluster IAM Role
{
    "Role": {
        "Path": "/",
        "RoleName": "eksClusterRole",
        "RoleId": "ROLEID",
        "Arn": "arn:aws:iam::XXXXX:role/eksClusterRole",
        "CreateDate": "2023-11-27T17:57:08+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": [
                            "ec2.amazonaws.com",
                            "eks.amazonaws.com"
                        ]
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        },
        "Description": "Allows access to other AWS service resources that are required to operate clusters managed by EKS.",
        "MaxSessionDuration": 3600,
        "RoleLastUsed": {
            "LastUsedDate": "2024-01-17T22:40:41+00:00",
            "Region": "us-east-2"
        }
    }
}
This has the attached policies:
- AmazonEC2ContainerRegistryReadOnly AWS managed 1
- AmazonEKS_CNI_Policy AWS managed 3
- AmazonEKSClusterPolicy AWS managed 3
- AmazonEKSServicePolicy AWS managed 3
- AmazonEKSVPCResourceController AWS managed 3
- AmazonEKSWorkerNodePolicy AWS managed 3
Note - I use this role for both the EKS Cluster and the Node Group. I know it's not best practice and carries redundant permissions, but its scope covers both the cluster and the node group, and it works when I create a Node Group with this role without using a Launch Template.
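As a quick sanity check on the shared role, the attached policy list can be compared against the three AWS managed policies that AWS documents for a node IAM role. This is only a sketch: the helper name missing_node_policies is hypothetical, and the list is copied by hand from the policies above rather than fetched from IAM.

```python
# Managed policies that AWS documents for an EKS node IAM role.
REQUIRED_NODE_POLICIES = {
    "AmazonEKSWorkerNodePolicy",
    "AmazonEC2ContainerRegistryReadOnly",
    "AmazonEKS_CNI_Policy",
}


def missing_node_policies(attached):
    """Return the required node policies that are not in `attached`, sorted."""
    return sorted(REQUIRED_NODE_POLICIES - set(attached))


# Policies attached to eksClusterRole, as listed above.
ATTACHED = [
    "AmazonEC2ContainerRegistryReadOnly",
    "AmazonEKS_CNI_Policy",
    "AmazonEKSClusterPolicy",
    "AmazonEKSServicePolicy",
    "AmazonEKSVPCResourceController",
    "AmazonEKSWorkerNodePolicy",
]

print(missing_node_policies(ATTACHED))
```

An empty result suggests the role's permissions are not the blocker, which matches the observation that the same role works without a Launch Template.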
Security Group
I use the same security group for everything in this environment. I'm aware it's not best practice, but the default allows all traffic inbound and outbound. I'm just trying to narrow down where the error is occurring.
VPC & Subnets
I am creating the Node Group on the same VPC as the EKS Cluster. The node group is assigned 2 private subnets - the EKS Cluster is configured to use all subnets on the VPC.
Note - I have private access enabled in the EKS Cluster's endpoint_access management.
Note - My private subnet(s) have a NAT attached, with access to the public subnet and internet gateway. Again, this works with the same subnets and role configuration, just without using a launch template.
EKS Cluster Details:
{
    "cluster": {
        "name": "us-2-prod-primary-cluster",
        "arn": "arn:aws:eks:us-east-2:XXXXXX:cluster/us-2-prod-primary-cluster",
        "createdAt": "2024-01-11T14:18:07.322000+00:00",
        "version": "1.28",
        "endpoint": "https://XXXXXXXXXXXXX.gr7.us-east-2.eks.amazonaws.com",
        "roleArn": "arn:aws:iam::792342206980:role/eksClusterRole",
        "resourcesVpcConfig": {
            "subnetIds": [
                "subnet-xdd5",
                "subnet-x9bb",
                "subnet-xef7",
                "subnet-x5e7",
                "subnet-x84e",
                "subnet-xcf4",
                "subnet-xbc4",
                "subnet-x785",
                "subnet-x04c"
            ],
            "securityGroupIds": [
                "sg-XXXXX"
            ],
            "clusterSecurityGroupId": "sg-XXX",
            "vpcId": "vpc-0f7294aef4d6f6e6f",
            "endpointPublicAccess": false,
            "endpointPrivateAccess": true,
            "publicAccessCidrs": []
        },
        "kubernetesNetworkConfig": {
            "serviceIpv4Cidr": "10.100.0.0/16",
            "ipFamily": "ipv4"
        },
        "logging": {
            "clusterLogging": [
                {
                    "types": [
                        "api",
                        "audit",
Other Information:
The AMI used in my launch template is a custom AMI.
- I first created an instance from the same AMI that works when I create a node group without a Launch Template (AMI ID: ami-06be503a3852b6423).
- I connected to this newly created instance, created the directories I need for my containerized application, and installed all of the models I need for a DeepGram deployment.
- I then made an AMI from this machine, so that the node group can use this AMI in a launch template and all new nodes will have the models and directories installed.
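Because the custom AMI is derived from an EKS optimized Amazon Linux 2 AMI, it should still contain /etc/eks/bootstrap.sh, but with a custom ImageId in the launch template something has to call that script at boot. A minimal sketch of generating base64-encoded user data for the launch template is below. The helper name build_user_data is hypothetical, and the endpoint and certificate authority values are placeholders that would need to come from the real cluster.

```python
import base64
import shlex


def build_user_data(cluster_name, endpoint, b64_cluster_ca):
    """Build base64-encoded user data that runs the EKS bootstrap script."""
    command = " ".join([
        "/etc/eks/bootstrap.sh",
        shlex.quote(cluster_name),
        # Passing the endpoint and CA explicitly avoids a lookup at boot time.
        "--apiserver-endpoint", shlex.quote(endpoint),
        "--b64-cluster-ca", shlex.quote(b64_cluster_ca),
    ])
    script = "\n".join(["#!/bin/bash", "set -ex", command, ""])
    # Launch templates expect the UserData field to be base64-encoded.
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


# Placeholder values; only the cluster name comes from the details above.
user_data = build_user_data(
    "us-2-prod-primary-cluster",
    "https://EXAMPLE.gr7.us-east-2.eks.amazonaws.com",
    "BASE64_CA_PLACEHOLDER",
)
print(base64.b64decode(user_data).decode("utf-8"))
```

The resulting string would go into the UserData field of a new launch template version. Whether this resolves the join failure is an assumption to test, not a confirmed fix.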
Things I have tried:
- Using a new launch template with no file configuration that just uses the Amazon Linux 2 EKS Optimized GPU-based AMI (ami-06be503a3852b6423).
- Removing all configuration besides the AMI from the Launch Template and defining it in the node group.
- Putting the node group on a public subnet.
- Creating an AMI from the EC2 machines generated by creating a Node Group with no launch template, then making a Launch Template from it.
I cannot figure out why it always fails to create a Node Group when I use a Launch Template.
Any recommendations or new things to try are greatly appreciated.
EDIT
Clarifying that my EKS Cluster / control plane Kubernetes version is 1.28, which matches the EKS Optimized gpu-node AMI I am using (ami-06be503a3852b6423 - amazon-eks-gpu-node-1.28-v20240110).
Also, I constantly see this error in the kube-apiserver logs:
2024-01-17T22:53:08.000-06:00 W0118 04:53:08.826989 9 logging.go:59] [core] [Channel #26636 SubChannel #26637] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.x.x:xxxx", ServerName: "10.x.xx.xx", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.32.16:2379: operation was canceled"
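To see what these warnings are actually dialing, the target address and port can be pulled out of each log line. The sketch below uses a plain regular expression; the helper name dial_target is hypothetical. Port 2379 is the conventional etcd client port, so these warnings may concern the API server's connection to etcd inside the managed control plane rather than the worker nodes, though that reading is an assumption.

```python
import re

# Matches "dial tcp <ipv4>:<port>" as it appears in the gRPC warning above.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dial_target(line):
    """Return (host, port) for the dialed address, or None if not found."""
    match = DIAL_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Tail of the log line shown above.
LINE = (
    'Err: connection error: desc = "transport: Error while dialing: '
    'dial tcp 10.0.32.16:2379: operation was canceled"'
)

print(dial_target(LINE))
```

Lines whose address has been redacted (for example 10.0.xx.xx:xxxx) will not match and return None.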