AI Platform Pipelines fails randomly and intermittently


I've been using AI Platform Pipelines (v0.2.5) for several months. I rebuilt the Pipelines instance because I found a newer version (v0.5.1) in the Console. Since then, I can't get my pipelines to run to completion.

It's very strange because there doesn't seem to be any pattern to the failures.

  • Pods (components) fail at random. Most pods complete successfully, while some fail. In addition, which pods fail varies from one execution to the next.
  • The failed pods report one of the following two error messages, seemingly at random.
  1. google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
  2. File "", line 3, in raise_from google.auth.exceptions.RefreshError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not recursively fetch uri\n'", <google.auth.transport.requests._Response object at 0x7fe5729c9650>)
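
As a stopgap, I'm considering retrying the credential lookup inside each component. This is only a minimal sketch and assumes the errors are transient, which I haven't confirmed. The names `retry_with_backoff` and `load_default_credentials` are my own hypothetical helpers, not part of google-auth or Pipelines; only `google.auth.default` and the two exception classes come from the library.

```python
import time


def retry_with_backoff(fn, attempts=5, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out.

    The delay doubles after every failed attempt (1s, 2s, 4s, ...).
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def load_default_credentials():
    """Hypothetical helper: fetch Application Default Credentials with retries."""
    # Imported lazily so retry_with_backoff stays usable without google-auth.
    import google.auth
    from google.auth.exceptions import DefaultCredentialsError, RefreshError

    # Returns a (credentials, project_id) tuple on success.
    # Assumption: retrying only helps if the failure is a transient
    # metadata-server problem rather than a misconfiguration.
    return retry_with_backoff(
        google.auth.default,
        retry_on=(DefaultCredentialsError, RefreshError),
    )
```

Even if this hides the symptom, it wouldn't explain why the metadata service returns a 500 in the first place.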

Workload Identity is enabled on the GKE cluster. I have double-checked the setup procedure and the settings, and I believe they are correct: although some pods fail, the other pods run successfully with Workload Identity. The Google Cloud Credentials API is also enabled.

I don't know whether these problems were caused by updating the Pipelines instance.
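
To see whether the metadata server is reachable from a failing pod, I could run a probe like the one below at the start of a component. It is a rough sketch: `probe_metadata` is a hypothetical name, and I'm assuming the standard `service-accounts/default/email` metadata path and the `Metadata-Flavor: Google` header that the metadata server requires.

```python
import urllib.error
import urllib.request

# Assumed endpoint: the email of the service account the pod runs as.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def probe_metadata(url=METADATA_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return (status, body) from the metadata server.

    status is None when the server could not be reached at all.
    """
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code, err.read().decode("utf-8", "replace")
    except OSError as err:
        # Covers URLError, DNS failures, and timeouts.
        return None, str(err)


if __name__ == "__main__":
    print(probe_metadata())
```

Logging the result in both failing and succeeding pods might show whether the failures line up with the metadata server being unavailable.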

Any ideas?
