I am running Airflow in my AWS EKS cluster. I've deployed it with the User Community Airflow Helm chart, and I am using the KubernetesExecutor.
Some of my DAGs run a task that does ML training once a week in the Airflow worker. By default, each worker is a Kubernetes Pod defined under airflow.kubernetesPodTemplate.* in values.yaml.
The training requires a lot of vCPUs and memory (e.g., 24 vCPUs and 64 GiB of memory), but it doesn't take very long (it finishes in about an hour).
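For context, a stripped-down version of one of these DAGs looks roughly like this (the DAG id, task id, and training callable are simplified placeholders, not my real code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model():
    # Placeholder for the actual ML training routine that needs ~24 vCPUs / 64 GiB.
    ...


with DAG(
    dag_id="weekly_ml_training",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    training_task = PythonOperator(
        task_id="train_model",
        python_callable=train_model,
    )
```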
So I want the KubernetesExecutor to request an EC2 node that meets the above requirements (e.g., m5.8xlarge) when the DAG is triggered, and to de-provision (or terminate) that node after the task finishes.
I don't want an m5.8xlarge instance to stay up in my cluster all the time just for one hour of training per week.
Is this possible?
It would be perfect if I could choose and configure a different operator (or worker pod spec) per DAG, since not all DAGs run ML training tasks, and if I could freely provision and de-provision the nodes on which the worker pods temporarily run; a sketch of what I have in mind follows below.
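To make the question concrete, the kind of per-task override I imagine is something like the executor_config / pod_override below, attached to the training task from the sketch above. The workload=ml-training label and toleration are placeholders I made up, and I don't know whether this is even the right mechanism, which is part of what I'm asking:

```python
from kubernetes.client import models as k8s

# My own sketch, not something I have working: resource requests sized for the
# training, plus a node selector/toleration for a dedicated node group that
# (hopefully) only scales up while this task is running.
ml_training_executor_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # merged onto the KubernetesExecutor worker container
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "24", "memory": "64Gi"},
                    ),
                )
            ],
            # Placeholder label/taint for a hypothetical m5.8xlarge node group.
            node_selector={"workload": "ml-training"},
            tolerations=[
                k8s.V1Toleration(
                    key="workload",
                    operator="Equal",
                    value="ml-training",
                    effect="NoSchedule",
                )
            ],
        )
    )
}

# ...which I would then pass to the operator, e.g.:
#   PythonOperator(task_id="train_model", python_callable=train_model,
#                  executor_config=ml_training_executor_config)
```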
Which parts of values.yaml do I need to change?