GCP AI Platform - Pipelines - Clusters - Does not have minimum availability


I can't create pipelines. I can't even load the samples / tutorials on the AI Platform Pipelines Dashboard because it doesn't seem to be able to proxy to whatever it needs to.

An error occurred
Error occured while trying to proxy to: ... 

I looked into the cluster's details and found 3 components with errors:

Deployment  metadata-grpc-deployment     Does not have minimum availability 
Deployment  ml-pipeline  Does not have minimum availability 
Deployment  ml-pipeline-persistenceagent     Does not have minimum availability 

Creating the clusters involves approx. 3 clicks in GCP Kubernetes Engine, so I don't think I messed up this step.

Anyone have an idea of how to achieve "minimum availability"?

UPDATE 1

Nodes have adequate resources and are Ready. The YAML file looks good. I have 2 clusters in different regions/zones and both have the deployment errors listed above. 2 Pods are not OK.

Name:         ml-pipeline-65479485c8-mcj9x
Namespace:    default
Priority:     0
Node:         gke-cluster-3-default-pool-007784cb-qcsn/10.150.0.2
Start Time:   Thu, 17 Sep 2020 22:15:19 +0000
Labels:       app=ml-pipeline
              app.kubernetes.io/name=kubeflow-pipelines-3
              pod-template-hash=65479485c8
Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ml-pipeline-api-server

Status:       Running
IP:           10.4.0.8
IPs:
IP:           10.4.0.8
Controlled By:  ReplicaSet/ml-pipeline-65479485c8
Containers:
  ml-pipeline-api-server:
    Container ID:   ...
    Image:          ...
    Image ID:       ...
    Ports:          8888/TCP, 8887/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Fri, 18 Sep 2020 10:27:31 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 18 Sep 2020 10:20:38 +0000
      Finished:     Fri, 18 Sep 2020 10:27:31 +0000
    Ready:          False
    Restart Count:  98
    Requests:
      cpu:      100m
    Liveness:   exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      HAS_DEFAULT_BUCKET:                   true
      BUCKET_NAME:
      PROJECT_ID:                           <set to the key 'project_id' of config map 'gcp-default-config'>  Optional: false
      POD_NAMESPACE:                        default (v1:metadata.namespace)
      DEFAULTPIPELINERUNNERSERVICEACCOUNT:  pipeline-runner
      OBJECTSTORECONFIG_SECURE:             false
      OBJECTSTORECONFIG_BUCKETNAME:
      DBCONFIG_DBNAME:                      kubeflow_pipelines_3_pipeline
      DBCONFIG_USER:                        <set to the key 'username' in secret 'mysql-credential'>  Optional: false
      DBCONFIG_PASSWORD:                    <set to the key 'password' in secret 'mysql-credential'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from ml-pipeline-token-77xl8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  ml-pipeline-token-77xl8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ml-pipeline-token-77xl8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From                                               Message
  ----     ------     ----                  ----                                               -------
  Warning  BackOff    52m (x409 over 11h)   kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Back-off restarting failed container
  Warning  Unhealthy  31m (x94 over 12h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Readiness probe failed:
  Warning  Unhealthy  31m (x29 over 10h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  (combined from similar events): Readiness probe failed: cannot exec in a stopped state: unknown
  Warning  Unhealthy  17m (x95 over 12h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Liveness probe failed:
  Normal   Pulled     7m26s (x97 over 12h)  kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Container image "gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/apiserver:1.0.0" already present on machine
  Warning  Unhealthy  75s (x78 over 12h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

And the other pod:

Name:         ml-pipeline-persistenceagent-67db8b8964-mlbmv
Events:
  Type     Reason   Age                   From                                               Message
  ----     ------   ----                  ----                                               -------
  Warning  BackOff  32s (x2238 over 12h)  kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Back-off restarting failed container

SOLUTION

Do not let Google handle any storage. Uncheck "Use managed storage" and set up your own artifact collections manually. You don't actually need to enter anything in these fields, since the pipeline will be launched anyway.

BEST ANSWER

The "Does not have minimum availability" error is generic, and many different issues can trigger it. You need to dig deeper to find the actual problem. Here are some possible causes:

  • Insufficient resources: check whether your Node has adequate resources (CPU/memory). If the Node is OK, then check the Pod's status.

  • Liveness probe and/or readiness probe failure: run kubectl describe pod <pod-name> to check whether they failed and why.

  • Deployment misconfiguration: review your deployment YAML file for errors or leftovers from previous configurations.

  • Timing or location: a deployment can take a while to become available, so it may help to wait a bit. You can also try changing your Region/Zone.
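
The Pod checks above can be scripted. What follows is a minimal sketch, assuming kubectl is installed and already configured for the cluster. The function names unready_containers and main are made up for illustration and are not part of any GKE or Kubeflow tooling. It reads the output of kubectl get pods -o json and lists every container that is not Ready, together with its restart count and last exit code.

```python
import json
import subprocess


def unready_containers(pod_list):
    """Return one summary dict per container that is not Ready.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    problems = []
    for pod in pod_list.get("items", []):
        # Pods that were never scheduled have no containerStatuses yet.
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            if status.get("ready"):
                continue
            # lastState.terminated is only present after a previous crash.
            terminated = status.get("lastState", {}).get("terminated", {})
            problems.append({
                "pod": pod["metadata"]["name"],
                "container": status["name"],
                "restarts": status.get("restartCount", 0),
                "last_exit_code": terminated.get("exitCode"),
                "last_reason": terminated.get("reason"),
            })
    # Most-restarted containers first.
    return sorted(problems, key=lambda p: p["restarts"], reverse=True)


def main(namespace="default"):
    # Needs kubectl on PATH and credentials for the target cluster.
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for p in unready_containers(json.loads(raw)):
        print(f'{p["pod"]}/{p["container"]}: restarts={p["restarts"]} '
              f'exit={p["last_exit_code"]} reason={p["last_reason"]}')
        # Suggest fetching logs from the previous (crashed) container instance.
        print(f'  next: kubectl logs {p["pod"]} -c {p["container"]} --previous')
```

Calling main() checks the default namespace; pass another namespace name if the pipelines were installed elsewhere. A container with a high restart count and a non-zero last exit code, like the ml-pipeline-api-server container in the question, is failing at runtime rather than lacking node resources, and the suggested kubectl logs --previous command is a reasonable place to look for the underlying error.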