I am writing a RESTful API in Python using Flask that needs to run a Google Cloud Run Job. Having installed the latest version (0.10.1) of google-cloud-run, my code is:
from google.cloud import run_v2

client = run_v2.JobsClient()
job_request = run_v2.RunJobRequest(dict(
    name="projects/my-project/locations/my-location/jobs/my-job",
    overrides=dict(
        container_overrides=[dict(
            args=["some", "command", "line", "arguments"]
        )]
    ),
))
operation = client.run_job(request=job_request)
For a while the code ran well in two different environments:
1. In a pytest unit test running on my local dev machine (authenticated by gcloud auth application-default login).
2. In a Flask app in a staging container running as a Cloud Run Service (where application default credentials are in place, I assume).
Now when I run the code, it consistently blocks on the call client.run_job() for exactly 87 seconds and then returns successfully.
Having enabled debug logging, I see these messages up to when the call blocks:
DEBUG google.auth._default:_default.py:255 Checking None for explicit credentials as part of auth process...
DEBUG google.auth._default:_default.py:228 Checking Cloud SDK credentials as part of auth process...
DEBUG google.auth.transport.requests:requests.py:185 Making request: POST https://oauth2.googleapis.com/token
DEBUG urllib3.connectionpool:connectionpool.py:1052 Starting new HTTPS connection (1): oauth2.googleapis.com:443
Only after 87 seconds do the logs resume with:
DEBUG urllib3.connectionpool:connectionpool.py:546 https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
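For reference, the debug lines above come from Python's standard logging module; enabling them application-wide can be done with a snippet like this (the format string here is chosen to resemble the output above):

```python
import logging

# Emit DEBUG-level records from all libraries (google.auth, urllib3, ...),
# including logger name, source file and line number.
# force=True reconfigures even if a handler already exists (Python 3.8+).
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s %(name)s:%(filename)s:%(lineno)d %(message)s",
    force=True,
)
```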
Having enabled logging for gRPC I can also see the delay in these logs:
D0111 17:55:06.794000000 22720 src/core/ext/filters/client_channel/resolver/dns/dns_resolver_plugin.cc:57] Using native dns resolver
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "rls_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_manager_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_impl_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "cds_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_resolver_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_override_host_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_wrr_locality_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "ring_hash_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/security/certificate_provider/certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
I0111 17:56:32.489000000 22720 src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:216] WorkStealingThreadPoolImpl::Quiesce
D0111 17:56:32.489000000 22720 src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:327] Waiting for lifeguard thread to shut down
D0111 17:56:32.489000000 22720 src/core/lib/surface/init.cc:159] grpc_shutdown starts clean-up now
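The gRPC-internal log lines above are produced by switching on gRPC's tracing via its standard environment variables before starting the process (the app entry point shown is a placeholder):

```shell
# Enable verbose logging from gRPC core; GRPC_TRACE can be "all"
# or a comma-separated list of individual tracers.
export GRPC_VERBOSITY=debug
export GRPC_TRACE=all
python app.py
```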
My question is: Is there something I'm missing in setting up the Cloud Run client, and how can I further investigate the cause of this long delay?
I am answering my own question in case someone else bumps into this issue and can benefit from what I have learned.
Apologies to all who tried to help: the root cause stems from details I did not share in my question because I mistakenly thought they were not relevant.
The detail I omitted is that in my Cloud Run Service, my Flask API application runs inside gunicorn with eventlet as its worker class. eventlet is an async I/O library for Python that can be used as the worker class gunicorn employs to handle requests. I used it because my application includes a python-socketio endpoint, which requires an async worker class.
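For context, the worker class is selected on the gunicorn command line; a typical invocation looks like this ("myapp:app" is an illustrative module:callable, not from my actual project):

```shell
# Serve the Flask/Socket.IO app with a single eventlet worker
gunicorn -k eventlet -w 1 myapp:app
```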
What I've found is that removing the -k eventlet argument consistently makes the issue go away. Comments I've traced online suggest that eventlet does not work well with grpc, which is what the Cloud Run SDK for Python uses under the hood for server communication. What's more, eventlet is now in maintenance mode and is no longer recommended for use in new projects.
As I still need an async worker class for gunicorn, I will now be transitioning to gevent.
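One caveat when moving to gevent: grpc ships an experimental gevent compatibility shim that should be initialised in each worker before any channels are created, or grpc calls can still block the gevent hub. A sketch as a gunicorn config file (assuming grpcio and gevent are installed; the post_fork hook is gunicorn's standard per-worker hook):

```python
# gunicorn.conf.py -- illustrative gevent setup, not a tested production config
worker_class = "gevent"

def post_fork(server, worker):
    # Patch grpc's I/O so it cooperates with gevent's event loop.
    import grpc.experimental.gevent as grpc_gevent
    grpc_gevent.init_gevent()
```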