Airflow version = 1.10.10
Hosted on Kubernetes, using the Kubernetes executor.
DAG setup
DAG - generated dynamically from a single generator script (a rough sketch of the setup follows this list).
Task - a PythonOperator that pulls some data, runs an inference with a TensorFlow model, and stores the predictions.
Where does it hang? - while running the inference with TensorFlow.
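For context, this is roughly what the dynamic DAG generation looks like. It is a minimal sketch, not our exact code: names such as `MODEL_CONFIGS`, `make_dag`, and `run_inference` are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder configs; in reality ~19 of these come from our own config source.
MODEL_CONFIGS = [{"name": f"model_{i}"} for i in range(19)]

def run_inference(model_name, **context):
    # Pull data, load the TensorFlow model, run inference, store predictions.
    # The hang happens somewhere inside the TensorFlow inference call.
    ...

def make_dag(config):
    dag = DAG(
        dag_id=f"inference_{config['name']}",
        schedule_interval="@daily",
        start_date=datetime(2020, 1, 1),
        catchup=False,
    )
    PythonOperator(
        task_id="predict",
        python_callable=run_inference,
        op_kwargs={"model_name": config["name"]},
        provide_context=True,
        dag=dag,
    )
    return dag

# Register one DAG per config in the module's globals so the scheduler
# picks all of them up from this single file.
for cfg in MODEL_CONFIGS:
    globals()[cfg["name"]] = make_dag(cfg)
```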
More details
One of our running tasks, as described above, hung for 4 hours. No amount of restarting helped it recover past that point. We found that the worker pod had 30+ subprocesses and was using about 40 GB of memory.
This didn't add up: when the same model runs on a local machine, it doesn't consume more than 400 MB, so there is no way it should suddenly jump to 40 GB.
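For what it's worth, this is roughly how the process count and memory can be checked from inside the worker pod. It is only an illustrative sketch (we used plain `ps`-style inspection; this version assumes `psutil` is installed in the pod):

```python
import psutil

# Inspect the Airflow task's process tree: count the spawned subprocesses
# and sum RSS across the tree to see where the ~40 GB is going.
root = psutil.Process()  # or psutil.Process(<pid of the airflow task process>)
children = root.children(recursive=True)

total_rss = root.memory_info().rss + sum(c.memory_info().rss for c in children)
print(f"child processes: {len(children)}")
print(f"total RSS: {total_rss / 1024 ** 3:.1f} GiB")
for c in children:
    print(c.pid, c.name(), c.status())
```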
Another suspicion was that it was spinning up so many processes because we dynamically generate around 19 DAGs. I changed the generator to produce only 1 DAG, but the processes didn't vanish: the worker pod still had 35+ subprocesses and the same memory usage.
Here comes the interesting part. To be really sure it wasn't the dynamic DAGs, I created an independent DAG that prints 1..100000, pausing for 5 seconds after each number (sketch below). The memory usage was still the same, but the extra processes were gone.
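The independent test DAG looks roughly like this (again a sketch; `print_and_sleep` and the DAG id are placeholder names):

```python
import time
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def print_and_sleep():
    # Print 1..100000, pausing 5 seconds after each number, so the task
    # stays busy without touching TensorFlow or any of our data code.
    for i in range(1, 100001):
        print(i)
        time.sleep(5)

dag = DAG(
    dag_id="debug_print_loop",
    schedule_interval=None,
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

PythonOperator(
    task_id="print_loop",
    python_callable=print_and_sleep,
    dag=dag,
)
```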
At this point, I am not sure which direction to take to debug the issue further.
Questions
- Why is the task hanging?
- Why are there so many subprocesses when using dynamic DAGs?
- How can I debug this issue further?
- Have you faced this before, and can you help?