How to recover a Kubernetes cluster created on AWS using kOps?


We were trying to upgrade the kOps version of our Kubernetes cluster. These are the steps we followed:

  1. Download the latest kOps version, 1.24 (the old version is 1.20)
  2. Update the cluster template files for 1.24
  3. Set the environment variables:
export KUBECONFIG="<< Kubeconfig file >>"
export AWS_PROFILE="<< AWS profile name >>"
export AWS_DEFAULT_REGION="<< AWS region >>"
export KOPS_STATE_STORE="s3://<< AWS S3 bucket name >>"
export NAME="<< kOps cluster name >>"
  4. kops get $NAME -o yaml > existing-cluster.yaml

  5. kops toolbox template --template templates/tm-eck-mixed-instances.yaml --values values_files/values-us-east-1.yaml --snippets snippets --output cluster.yaml --name $NAME

  6. kops replace -f cluster.yaml

  7. kops update cluster --name $NAME

  8. kops rolling-update cluster --name $NAME --instance-group=master-us-east-1a --yes --cloudonly
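
After the rolling update, it is worth confirming that the new master actually rejoined before moving on; --cloudonly skips kOps' own validation, so this does not happen automatically. A minimal sketch, assuming KUBECONFIG still points at the cluster:

    # Wait up to 10 minutes for kOps to consider the cluster healthy
    kops validate cluster --name $NAME --wait 10m

    # Cross-check from the API side: every master should show Ready
    kubectl get nodes -o wide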

Once the master was rolled, I noticed that it had not joined the cluster. After a few rounds of troubleshooting, I found the error below in the control-plane logs (the kube-controller-manager cannot reach the API server):

I0926 09:54:41.220817 1 flags.go:59] FLAG: --vmodule=""
I0926 09:54:41.223834 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/srv/kubernetes/kube-controller-manager/server.crt::/srv/kubernetes/kube-controller-manager/server.key"
unable to load configmap based request-header-client-ca-file: Get "https://127.0.0.1/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 127.0.0.1:443: connect: connection refused
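
The "connection refused" on 127.0.0.1:443 indicates that kube-apiserver itself is not listening on the master, so the controller-manager has nothing to reach. A few commands along these lines can narrow it down (a sketch, assuming SSH access to the master; kOps 1.24 nodes default to containerd, and kops-configuration is the nodeup bootstrap unit):

    # Is the kube-apiserver container running at all?
    sudo crictl ps -a | grep kube-apiserver

    # The kubelet log usually says why a static pod failed to start
    sudo journalctl -u kubelet --no-pager | tail -n 50

    # The kOps bootstrap (nodeup) log can reveal configuration errors
    sudo journalctl -u kops-configuration --no-pager | tail -n 50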

I tried to resolve this issue but couldn't find a way, so I decided to roll back using a backup. These are the steps I followed for that:

  1. kops replace -f cluster.yaml
  2. kops update cluster --name $NAME
  3. kops rolling-update cluster --name $NAME --instance-group=master-us-east-1a --yes --cloudonly
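
Before kicking off the rolling update again, it can help to confirm that the spec in the state store really matches the pre-upgrade backup (a small sketch reusing the existing-cluster.yaml file from step 4 above):

    # Any output here means the state store still differs from the backup
    kops get $NAME -o yaml | diff existing-cluster.yaml -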

Still, I'm getting the same error on the master node.

Does anyone know how I can restore the cluster using kOps?


1 Answer

Samith Perera (Best Answer)

After a few rounds of troubleshooting, I found that whenever we deploy a new version using kOps, it creates a new version of the launch template in AWS. I manually changed the launch template version used by the Auto Scaling group of each node group. The cluster then rolled back to the previous state and started working properly. After adding the missing configuration to the kOps template file, I reran the upgrade process.
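
For anyone hitting the same problem, that rollback can also be done with the AWS CLI; the sketch below is illustrative only, and the launch template ID, version number, and Auto Scaling group name are placeholders to be read from your own account:

    # List the versions kOps has created for this instance group's launch template
    aws ec2 describe-launch-template-versions \
      --launch-template-id lt-0123456789abcdef0 \
      --query 'LaunchTemplateVersions[].[VersionNumber,CreateTime]'

    # Point the Auto Scaling group back at the known-good version
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name master-us-east-1a.masters.example.k8s.local \
      --launch-template LaunchTemplateId=lt-0123456789abcdef0,Version=1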