I've been using AI Platform Pipelines (v0.2.5) for several months. I rebuilt the Pipelines instance because I've found a newer version (v0.5.1) on Console. I'm now stuck in completing Pipelines.
It's very weird because there seems not to be failure patterns.
- Pods(Components) randomly fails. Most of the pods successfully complete, while some fail. In addition, failed pods vary depending on the time of executions.
- Pods tell me the error messages of two below, randomly.
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials.
Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application.
For more information, please see https://cloud.google.com/docs/authentication/getting-started
- File "", line 3, in raise_from google.auth.exceptions.RefreshError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not recursively fetch uri\n'", <google.auth.transport.requests._Response object at 0x7fe5729c9650>)
At GKE Cluster Workload Identity is set. I surely confirm the procedure and the setting is no problem. Though some pods fail, the other pods successfully run with Workload Identity. Of course, Google Cloud Credentials API is enabled.
I don't know these problems are caused by updating Pipelines instance.
Any ideas?