We were trying to upgrade the kOps version of our Kubernetes cluster. We followed the steps below:
- Download the latest kOps version, 1.24 (the old version was 1.20)
- Make the template changes required for 1.24
- Set the environment variables:
export KUBECONFIG="<< Kubeconfig file >>"
export AWS_PROFILE="<< AWS PROFILE NAME >>"
export AWS_DEFAULT_REGION="<< AWS Region >>"
export KOPS_STATE_STORE="<< AWS S3 Bucket Name >>"
export NAME="<< KOPS Cluster Name >>"
kops get $NAME -o yaml > existing-cluster.yaml
kops toolbox template --template templates/tm-eck-mixed-instances.yaml --values values_files/values-us-east-1.yaml --snippets snippets --output cluster.yaml --name $NAME
kops replace -f cluster.yaml
kops update cluster --name $NAME
kops rolling-update cluster --name $NAME --instance-group=master-us-east-1a --yes --cloudonly
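For reference, whether a rolled master has rejoined can be checked with kops validate and kubectl; the 10-minute wait below is just an example value:

kops validate cluster --name $NAME --wait 10m
kubectl get nodes -o wide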
Once the master was rolled, I noticed that it had not joined the cluster. After a few rounds of troubleshooting, I found the error below in the API server logs.
I0926 09:54:41.220817 1 flags.go:59] FLAG: --vmodule=""
I0926 09:54:41.223834 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/srv/kubernetes/kube-controller-manager/server.crt::/srv/kubernetes/kube-controller-manager/server.key"
unable to load configmap based request-header-client-ca-file: Get "https://127.0.0.1/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 127.0.0.1:443: connect: connection refused
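For anyone hitting the same issue: because the node never joins the cluster, kubectl logs is not usable, so the control-plane logs have to be read directly on the master. Roughly like this, assuming an SSH-capable image and the default kOps log locations:

ssh ubuntu@<< master node IP >>
sudo journalctl -u kubelet --no-pager | tail -n 100
sudo tail -n 100 /var/log/kube-apiserver.log /var/log/kube-controller-manager.log
sudo crictl ps -a    # check whether the control-plane containers are running at all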
I tried to resolve this issue but couldn't find a way, so I decided to roll back using the backup. These are the steps I followed:
- kops replace -f cluster.yaml
- kops update cluster --name $NAME
- kops rolling-update cluster --name $NAME --instance-group=master-us-east-1a --yes --cloudonly
Still, I'm getting the same error on the master node.
Does anyone know how I can restore the cluster using kOps?
After a few more rounds of troubleshooting, I found that whenever we deploy a new version using kOps, it creates a new launch template version in AWS. I manually changed the launch template version used by the Auto Scaling group of every node group, after which the cluster rolled back to its previous state and started working properly. I then reran the upgrade process after adding the missing configuration to the kOps template file.
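For completeness, the same manual rollback can be done with the AWS CLI instead of the console. All names and version numbers below are placeholders; the previous launch template version has to be looked up first:

# list the versions of the launch template used by a node group
aws ec2 describe-launch-template-versions --launch-template-name "<< launch template name >>" --query 'LaunchTemplateVersions[].VersionNumber'

# point the Auto Scaling group back at the previous version
aws autoscaling update-auto-scaling-group --auto-scaling-group-name "<< ASG name >>" --launch-template "LaunchTemplateName=<< launch template name >>,Version=<< previous version number >>"

# terminate the broken instance so the ASG recreates it from the selected version
aws ec2 terminate-instances --instance-ids << broken instance id >>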