Netflix Conductor HTTP tasks stuck in scheduled state for a long time

We have Netflix Conductor deployed on GCP, with a powerful Postgres instance as the persistence store.

Whenever more than 3k workflows start executing in parallel (each workflow has about 4 HTTP tasks), the time it takes for an HTTP task to start executing grows longer and longer.

The task simply sits in the scheduled state, sometimes for many minutes under higher load.

We checked the workload metrics for the Conductor servers and the Postgres DB, and they are far from reaching their resource limits.

We thought about using isolation groups for these HTTP tasks, but that would not help, since 80% of all executed tasks are these HTTP tasks, and we don't want any of them stuck in scheduled.

Which configuration, settings, or setup should I change in order to solve the problem of HTTP tasks getting stuck in the scheduled state?

Thanks

There are 2 best solutions below

1

Are some of your HTTP tasks long-running? Those tasks might be occupying all of your available workers, leaving some of the faster tasks waiting in a queue.

You might consider isolation groups for these longer HTTP tasks so that the fast tasks can run through the regular HTTP workers:

https://conductor.netflix.com/configuration/isolationgroups.html
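
As a rough illustration of the isolation group idea, here is a minimal Python sketch that builds two task definitions, one pinned to an isolation group and one left in the default pool. The `isolationGroupId` field is the one described on the page linked above; the task names, the group name `slow-http`, and the timeout and retry values are made up for the example, so adjust them to your own setup.

```python
import json


def http_task_def(name, isolation_group_id=None):
    """Build a Conductor task definition as a plain dict.

    The name and the timeout/retry values are illustrative assumptions.
    """
    task_def = {
        "name": name,
        "retryCount": 3,
        "timeoutSeconds": 120,
        # Kept below timeoutSeconds so the definition passes validation.
        "responseTimeoutSeconds": 60,
    }
    if isolation_group_id is not None:
        # Tasks with this field are polled from their own queue and executed
        # by a separate pool of system task worker threads.
        task_def["isolationGroupId"] = isolation_group_id
    return task_def


# Hypothetical names: only the slow task gets its own isolation group.
slow = http_task_def("slow_http_call", isolation_group_id="slow-http")
fast = http_task_def("fast_http_call")

# This JSON array is the payload shape used to register task definitions.
print(json.dumps([slow, fast], indent=2))
```

The printed array would typically be registered through the metadata API (`POST /api/metadata/taskdefs`), and the HTTP tasks in the workflow definition would then use those names. Check this against the Conductor version you run.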

0

HTTP tasks do not need a worker; they usually go from scheduled to completed, bypassing the running status. Based on my research, tasks get stuck in scheduled because of duplicates in the dyno queues, which stop items from being popped and pushed to the execution queue.
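
Whatever the root cause turns out to be, it can help to measure how long each HTTP task actually waits before it starts. The sketch below is one way to do that, assuming you have a workflow execution as JSON (for example from `GET /api/workflow/{workflowId}?includeTasks=true`) and that each task carries the `taskType`, `referenceTaskName`, `scheduledTime`, and `startTime` fields (epoch milliseconds). The sample data is invented for the example.

```python
def queue_wait_ms(task):
    """Milliseconds between scheduling and start, or None if not started yet."""
    scheduled = task.get("scheduledTime", 0)
    started = task.get("startTime", 0)
    if not scheduled or not started:
        return None
    return max(0, started - scheduled)


def slowest_waits(workflow, task_type="HTTP", top=5):
    """Return (wait_ms, referenceTaskName) pairs, longest wait first."""
    waits = []
    for task in workflow.get("tasks", []):
        if task.get("taskType") != task_type:
            continue
        wait = queue_wait_ms(task)
        if wait is not None:
            waits.append((wait, task.get("referenceTaskName")))
    return sorted(waits, key=lambda pair: pair[0], reverse=True)[:top]


# Invented sample execution; real timestamps are epoch milliseconds.
sample = {
    "tasks": [
        {"taskType": "HTTP", "referenceTaskName": "fetch_user",
         "scheduledTime": 1000, "startTime": 1500},
        {"taskType": "HTTP", "referenceTaskName": "charge",
         "scheduledTime": 2000, "startTime": 95000},
        {"taskType": "SIMPLE", "referenceTaskName": "notify",
         "scheduledTime": 2000, "startTime": 2100},
        {"taskType": "HTTP", "referenceTaskName": "pending_call",
         "scheduledTime": 3000, "startTime": 0},
    ]
}

for wait, name in slowest_waits(sample):
    print(f"{name}: waited {wait} ms in scheduled")
```

If the waits grow steadily with load while the tasks themselves finish quickly, that points at the queue or polling side rather than at the HTTP calls.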