Intermittent kubectl apply error when run from terraform after aws_eks_cluster created

My main.tf contains the following, which I run with Terraform 0.12.24 on Ubuntu:

module "eks_cluster" {
  source = "git::https://github.com/cloudposse/terraform-aws-eks-cluster.git?ref=tags/0.20.0"

  namespace             = null
  stage                 = null
  name                  = var.stack_name
  attributes            = []
  tags                  = var.tags
  region                = var.region
  vpc_id                = module.vpc.vpc_id
  subnet_ids            = module.subnets.public_subnet_ids
  kubernetes_version    = var.kubernetes_version
  oidc_provider_enabled = var.oidc_provider_enabled

  workers_role_arns = [
    module.eks_node_group.eks_node_group_role_arn,
    # module.eks_fargate_profile_fg.eks_fargate_profile_role_arn,
  ]
  workers_security_group_ids = []
}

...

resource "local_file" "k8s_service_account_pods_default" {
  filename = "${path.root}/kubernetes-default.yaml"
  content  = <<SERVICE_ACCOUNT
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-for-pods
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: ${var.pod_role_arn}
SERVICE_ACCOUNT

  provisioner "local-exec" {
    # self refers to the enclosing local_file resource
    command = "kubectl apply -f ${self.filename}"
  }
}

This works most of the time, but occasionally I get this error:

Error: Error running command 'kubectl apply -f ./kubernetes-default.yaml': 
  exit status 1. Output: error: unable to recognize "./kubernetes-default.yaml": 
  Get https://<redacted>.us-east-2.eks.amazonaws.com/api?timeout=32s: dial tcp: 
  lookup <redacted>.us-east-2.eks.amazonaws.com on 192.168.2.1:53: no such host

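If I run terraform apply again, even immediately afterwards, the kubectl apply succeeds. I'm guessing there is a delay of about 30 seconds to 1 minute between the two kubectl apply attempts, so the API server endpoint probably just wasn't ready yet (the "no such host" message suggests its DNS name could not be resolved at that point).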

There is a time_sleep resource, but that seems hackish. It also doesn't seem possible to give the local_file a depends_on that points at a resource inside a module (Terraform appears to be working on this).
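For reference, here is a minimal sketch of what the time_sleep route might look like. It assumes the hashicorp/time provider is available and that the cluster module exposes an eks_cluster_endpoint output; neither is shown in my config above, and the 60s delay is just a guess based on the gap I observed.

```hcl
# Sketch only. Assumes the hashicorp/time provider is installed and that the
# cluster module exposes an eks_cluster_endpoint output (an assumption).
resource "time_sleep" "wait_for_cluster" {
  # Arbitrary delay, roughly the gap observed between the two runs.
  create_duration = "60s"

  # Referencing a module output creates an implicit dependency on the cluster,
  # so the delay starts only once the endpoint value is known.
  triggers = {
    cluster_endpoint = module.eks_cluster.eks_cluster_endpoint
  }
}

resource "local_file" "k8s_service_account_pods_default" {
  filename = "${path.root}/kubernetes-default.yaml"
  content  = <<SERVICE_ACCOUNT
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-for-pods
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: ${var.pod_role_arn}
SERVICE_ACCOUNT

  # Hold back the file, and therefore its provisioner, until the delay elapses.
  depends_on = [time_sleep.wait_for_cluster]

  provisioner "local-exec" {
    command = "kubectl apply -f ${self.filename}"
  }
}
```

The depends_on here targets a resource in the root module, so it sidesteps the module limitation, but a fixed delay is still only a guess at how long the endpoint takes to become resolvable.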

Any suggestions? Is time_sleep my only option?
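One alternative I can think of is to retry the kubectl command itself rather than sleep for a fixed time. The following is only a rough sketch: the null_resource name, the 12 attempts, and the 10 second pause are all made up for illustration, it assumes the null provider is available, and it assumes the local-exec provisioner is removed from the local_file resource so the apply does not run twice.

```hcl
# Sketch only. Resource name, attempt count and delay are hypothetical values.
resource "null_resource" "apply_service_account" {
  # Re-run the provisioner whenever the manifest content changes.
  triggers = {
    manifest = local_file.k8s_service_account_pods_default.content
  }

  provisioner "local-exec" {
    # Runs under /bin/sh on Linux. Tries up to 12 times, 10 seconds apart,
    # and fails the Terraform run if every attempt fails.
    command = <<-EOT
      for attempt in $(seq 1 12); do
        kubectl apply -f ${local_file.k8s_service_account_pods_default.filename} && exit 0
        sleep 10
      done
      exit 1
    EOT
  }
}
```

This stops waiting as soon as the endpoint answers, at the cost of putting a shell loop in the Terraform config.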
