Why does my call to google.cloud.run_v2.JobsClient() consistently block for 87 seconds?


I am writing a RESTful API in Python with Flask that needs to run a Google Cloud Run Job. I have installed the latest version (0.10.1) of google-cloud-run, and my code is:

    from google.cloud import run_v2

    client = run_v2.JobsClient()
    job_request = run_v2.RunJobRequest(dict(
        name="projects/my-project/locations/my-location/jobs/my-job",
        overrides=dict(
            container_overrides=[dict(
                args=["some", "command", "line", "arguments"]
            )]
        )
    ))
    # This is the call that blocks for 87 seconds.
    operation = client.run_job(request=job_request)
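
(For completeness, and assuming the standard google-api-core long-running-operation interface, the returned operation can then be waited on roughly like this; the 87-second stall happens inside run_job() itself, before this point:)

    # Hypothetical follow-up, not part of the problem: block until the job
    # execution finishes and return the resulting run_v2.Execution.
    execution = operation.result()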

For a while the code ran well in two different environments:

1. In a pytest unit test running on my local dev machine (authenticated by gcloud auth application-default login).

2. In a Flask app in a staging container running as a Cloud Run Service (where application default credentials are in place, I assume).

Now when I run the code, it consistently blocks on the call to client.run_job() for exactly 87 seconds and then returns successfully.

Having enabled debug logging, I see these messages up to when the call blocks:

DEBUG    google.auth._default:_default.py:255 Checking None for explicit credentials as part of auth process...
DEBUG    google.auth._default:_default.py:228 Checking Cloud SDK credentials as part of auth process...
DEBUG    google.auth.transport.requests:requests.py:185 Making request: POST https://oauth2.googleapis.com/token
DEBUG    urllib3.connectionpool:connectionpool.py:1052 Starting new HTTPS connection (1): oauth2.googleapis.com:443

Only after 87 seconds do the logs resume with:

DEBUG    urllib3.connectionpool:connectionpool.py:546 https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
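
(For reference, and as an assumption about my own setup rather than something shown above, the Python-side debug output amounts to nothing more than standard-library logging configuration, roughly:)

    # Turn on DEBUG-level output for all Python loggers, which includes
    # google.auth and urllib3 (the messages quoted above).
    import logging
    logging.basicConfig(level=logging.DEBUG)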

Having enabled logging for gRPC, I can also see the delay in these logs:

D0111 17:55:06.794000000 22720 src/core/ext/filters/client_channel/resolver/dns/dns_resolver_plugin.cc:57] Using native dns resolver
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "rls_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_manager_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_impl_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "cds_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_resolver_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_override_host_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_wrr_locality_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "ring_hash_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/security/certificate_provider/certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"

I0111 17:56:32.489000000 22720 src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:216] WorkStealingThreadPoolImpl::Quiesce
D0111 17:56:32.489000000 22720 src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:327] Waiting for lifeguard thread to shut down
D0111 17:56:32.489000000 22720 src/core/lib/surface/init.cc:159] grpc_shutdown starts clean-up now
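
(Again for reference, and assuming gRPC's standard debugging environment variables rather than anything specific to this client library, trace output like the above can be produced by something along these lines, set before grpc is first imported:)

    # Standard gRPC debugging knobs; they must be in the environment before
    # the gRPC core initializes.
    import os
    os.environ["GRPC_VERBOSITY"] = "debug"
    os.environ["GRPC_TRACE"] = "all"  # or a narrower comma-separated list of tracers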

My question is: am I missing something in how I set up the Cloud Run client? And how can I investigate further what is causing this long delay?

1 Answer

I am answering my own question in case someone else bumps into this issue and can benefit from what I have learned.

Apologies to all who tried to help: the root cause lies in details I did not share in my question because I mistakenly thought they were not relevant.

The detail I omitted is that in my Cloud Run Service, my Flask API application runs inside gunicorn with eventlet as its worker class:

    exec gunicorn -w 1 -k eventlet --bind 0.0.0.0:8080 --forwarded-allow-ips='*' "my_app.app.app"

eventlet is an async I/O library for Python that can be used as the worker class gunicorn runs to handle requests. I chose it because my application includes a python-socketio endpoint, which requires an async worker class.

What I've found is that removing the -k eventlet argument consistently makes the issue go away. Some comments I've come across online suggest that eventlet does not play well with grpc, which is what the Cloud Run SDK for Python uses under the hood for server communication.

What's more, eventlet is now in maintenance mode and is no longer recommended for use in new projects because:

...since several years the maintenance activity of eventlet dicreased [sic] leading to a growing gap between eventlet and the CPython implementation.

This gap is now too high and can lead you to unexpected side effects and bugs in your applications.

As I still need a solution for an async worker in gunicorn, I will be transitioning to gevent now.
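
If it helps anyone, the change I am planning amounts to swapping the worker class in the gunicorn invocation shown above; this is a sketch of the intended replacement (assuming gevent and its gunicorn worker are installed), not something I have verified in production yet:

    exec gunicorn -w 1 -k gevent --bind 0.0.0.0:8080 --forwarded-allow-ips='*' "my_app.app.app"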