We run Airflow on Kubernetes (EKS) with the Kubernetes executor, which runs each task in its own pod.
Our nodes are spot instances, so any task pod can receive a SIGTERM at any time when its spot node is reclaimed.
In this environment, we occasionally see a task that waits on an external task (an ExternalTaskSensor) get scheduled but never executed.
We checked the Airflow scheduler log and also looked at the base_executor.py code. The scheduler log shows:
[2024-03-20T09:06:01.339+0000] {base_executor.py:279} INFO - queued but still running; attempt=10 task=TaskInstanceKey(dag_id='mart_kids_report_daily_10min', task_id='sensor_json_10min', run_id='scheduled__2024-03-20T08:50:00+00:00', try_number=1, map_index=-1)
[2024-03-20T09:06:02.833+0000] {base_executor.py:282} ERROR - could not queue task TaskInstanceKey(dag_id='mart_kids_report_daily_10min', task_id='sensor_json_10min', run_id='scheduled__2024-03-20T08:50:00+00:00', try_number=1, map_index=-1) (still running after 10 attempts)
The scheduler put the task in the queue, and the task was then removed from the queue for execution, but we could not find an execution record for it in any pod.
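To make the two log lines above easier to reason about, here is a toy model of the behaviour they suggest: the executor keeps a set of task keys it believes are running, and refuses to start a queued task whose key is still in that set. This is only a sketch and not Airflow's actual code. ToyExecutor, MAX_ATTEMPTS and simulate_stale_key are hypothetical names, and the give-up rule is a plain attempt count chosen to reproduce the shape of the log; the real rule in base_executor.py may differ.

```python
from collections import Counter

# Assumption: a plain attempt limit, taken from "attempt=10" in the log above.
MAX_ATTEMPTS = 10


class ToyExecutor:
    """Toy executor that will not start a task it still believes is running."""

    def __init__(self):
        self.queued = {}           # task key -> command waiting to be started
        self.running = set()       # task keys the executor believes are running
        self.attempts = Counter()  # task key -> number of blocked start attempts
        self.log = []

    def trigger_tasks(self):
        """One scheduling pass over the queued tasks."""
        for key in list(self.queued):
            if key not in self.running:
                # Normal path: start the task and remember that it is running.
                self.running.add(key)
                del self.queued[key]
                self.log.append(f"started task={key}")
                continue
            if self.attempts[key] < MAX_ATTEMPTS:
                # Blocked: leave the task queued and check again on the next pass.
                self.attempts[key] += 1
                self.log.append(
                    f"queued but still running; attempt={self.attempts[key]} task={key}"
                )
                continue
            # Give up: the task is dropped from the queue without ever starting.
            self.log.append(
                f"could not queue task {key} (still running after {self.attempts[key]} attempts)"
            )
            del self.attempts[key]
            del self.queued[key]


def simulate_stale_key(key="sensor_json_10min", passes=MAX_ATTEMPTS + 1):
    """Queue a task whose key is already (stale) in the running set."""
    executor = ToyExecutor()
    executor.running.add(key)   # e.g. left over from a pod that was killed
    executor.queued[key] = "run"
    for _ in range(passes):
        executor.trigger_tasks()
    return executor
```

In this model, if the running set is never cleaned up for a key (for example because the pod was killed and its completion was never recorded), the task is dropped from the queue without a pod ever being created, which would match what we observe. Whether that is what actually happens in our cluster is exactly what we are unsure about.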
The Gantt chart for this run shows the same thing: the sensor task never ran, no matter how long we waited.
What's the problem?
Additional Information
Airflow version: 2.7.3
ExternalTaskSensor configuration:
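from airflow.sensors.external_task import ExternalTaskSensor

sensor_source = ExternalTaskSensor(
    task_id='sensor_json_10min',
    dag=dag,  # DAG object defined elsewhere in the file
    external_dag_id='mart_json_10min',
    external_task_id='merge_cluster',
    execution_date_fn=lambda y: y,  # wait on the same logical date as this DAG run
    mode='reschedule',
)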
We are still trying to analyze the logs.
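For going through the scheduler log, a small filter like the one below can pull out every base_executor.py line that mentions a TaskInstanceKey, so affected tasks and run IDs can be listed quickly. It is a sketch that assumes the log lines have exactly the format shown earlier in this post; blocked_tasks, LINE_RE and KEY_RE are hypothetical helper names, not part of Airflow.

```python
import re

# Assumption: lines look like "[timestamp] {base_executor.py:NNN} LEVEL - message".
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \{base_executor\.py:\d+\} (?P<level>\w+) - (?P<msg>.*)$"
)

# Matches the fields inside TaskInstanceKey(...) as printed in the log message.
KEY_RE = re.compile(
    r"dag_id='(?P<dag_id>[^']+)', task_id='(?P<task_id>[^']+)', "
    r"run_id='(?P<run_id>[^']+)', try_number=(?P<try_number>\d+), "
    r"map_index=(?P<map_index>-?\d+)"
)


def blocked_tasks(lines):
    """Yield one dict per base_executor.py log line that contains a TaskInstanceKey."""
    for line in lines:
        line_match = LINE_RE.match(line.strip())
        if line_match is None:
            continue  # not a base_executor.py line in the expected format
        key_match = KEY_RE.search(line_match["msg"])
        if key_match is None:
            continue  # base_executor.py line without a task key
        yield {
            "ts": line_match["ts"],
            "level": line_match["level"],
            **key_match.groupdict(),  # all values are strings
        }
```

Passing an open scheduler log file to blocked_tasks and keeping only the entries whose level is ERROR gives the tasks the executor gave up on.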