Airflow spinning up multiple subprocesses for a single task and hanging


Airflow version = 1.10.10

Hosted on Kubernetes, uses the Kubernetes executor.

DAG setup

DAG - Generated dynamically (a single generator script produces around 19 DAGs)

Task - A PythonOperator that pulls some data, runs an inference, and stores the predictions.

Where does it hang? - While running the inference with TensorFlow. A rough sketch of the setup follows.
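
For context, the generation follows the usual pattern of registering generated DAGs in the module's globals. Below is a minimal sketch of what the setup looks like, assuming Airflow 1.10.x; `MODEL_NAMES` and the body of `predict()` are hypothetical placeholders standing in for our real pull-data / inference / store code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

MODEL_NAMES = ["model_a", "model_b"]  # placeholder; ~19 entries in the real generator


def predict(model_name, **context):
    # Placeholder for the real task body:
    # 1. pull the input data for `model_name`
    # 2. run the TensorFlow inference (this is the step that hangs)
    # 3. store the predictions
    print("running inference for", model_name)


for model_name in MODEL_NAMES:
    dag_id = "inference_{}".format(model_name)
    dag = DAG(
        dag_id,
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    )
    PythonOperator(
        task_id="predict",
        python_callable=predict,
        op_kwargs={"model_name": model_name},
        dag=dag,
    )
    # The generated DAG must be visible at module level for the scheduler to pick it up.
    globals()[dag_id] = dag
```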

More details

One of our running tasks, as described above, hung for 4 hours. No amount of restarting helped it recover from that point. We found that the pod had 30+ subprocesses and was using about 40 GB of memory.


That didn't convince us: when run on a local machine, the model doesn't consume more than 400 MB, so there is no way it should suddenly jump to 40 GB of memory.

Another suspicion was that it spins up so many processes because we dynamically generate around 19 DAGs. I changed the generator to produce only one DAG, and the processes didn't vanish; the worker pods still had 35+ subprocesses with the same memory usage.

Here comes the interesting part. To be really sure it wasn't the dynamic DAG, I created an independent DAG (sketched below) that prints 1..100000, pausing 5 seconds between each number. The memory usage was still the same, but the number of processes was not.

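For reference, the independent test DAG is essentially the following (a minimal sketch, assuming Airflow 1.10.x; the dag_id and task_id are placeholders):

```python
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def count_slowly(**context):
    # Print 1..100000, sleeping 5 seconds between prints.
    for i in range(1, 100001):
        print(i)
        time.sleep(5)


dag = DAG(
    "count_test",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # triggered manually
    catchup=False,
)

PythonOperator(
    task_id="count",
    python_callable=count_slowly,
    dag=dag,
)
```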

At this point, I am not sure which direction to take to debug the issue further.

Questions

  1. Why is the task hanging?
  2. Why are there so many subprocesses when using dynamic DAGs?
  3. How can I debug this issue further?
  4. Have you faced this before, and can you help?