I would like to use Airflow for the following use case:
- Compute a daily report for a given website (~150 websites to handle). Each report will be computed as follows:
- A set of tasks that should be run at site level,
- A set of tasks that should be run at page level, each website containing ~10k pages.
- Once both sets of tasks above are performed, a third set of tasks is run to aggregate the results and generate the report.
Note: each Airflow task described here is in fact a simple call to a remote micro-service (a gRPC call). A sketch of the intended shape follows.
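Here is roughly the shape I have in mind for a single site (a minimal sketch in Airflow 1.x style; `grpc_call` and the service names are placeholders for the real micro-service stubs):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def grpc_call(service):
    pass  # placeholder: each task just performs one gRPC call


dag = DAG('daily_report_one_site', start_date=datetime(2017, 1, 1),
          schedule_interval='@daily')

site_tasks = [PythonOperator(task_id='site_task_{}'.format(i),
                             python_callable=grpc_call,
                             op_kwargs={'service': 'site-service'},
                             dag=dag)
              for i in range(3)]

page_tasks = [PythonOperator(task_id='page_task_{}'.format(i),
                             python_callable=grpc_call,
                             op_kwargs={'service': 'page-service'},
                             dag=dag)
              for i in range(10)]  # ~10k in reality, kept small here

aggregate = PythonOperator(task_id='aggregate_and_report',
                           python_callable=grpc_call,
                           op_kwargs={'service': 'report-service'},
                           dag=dag)

# The report is generated only after all site- and page-level tasks.
for task in site_tasks + page_tasks:
    task >> aggregate
```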
The design I have in mind so far:
- I initially wanted to perform all the page-related processing in a single task, in order to have a simple, well-defined DAG with only a few tasks. But the processing that has to be performed per page is complex, with external dependencies and queues (the next step may only be triggered once notifications from external systems arrive, and those notifications may arrive several hours later), so I would like Airflow to orchestrate this process.
- Given the point above, I'm now inclined towards a model whereby all the processes for one website are embedded in one DAG, including the page-level tasks. Ideally I would like to use a subdag for the page-related tasks, but from what I have read so far this feature is not yet stable. Each website would generate a new DAG with its own set of tasks (because the structure of the DAG depends on the number of pages), so the number of tasks per DAG would be relatively large (~10k). A sketch of this model follows.
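Concretely, I was thinking of generating one DAG per website at parse time, along these lines (just a sketch in Airflow 1.x style; `get_websites`, `get_pages` and `grpc_call` are placeholders for our real catalogue lookups and gRPC client code):

```python
# dags/website_reports.py -- generates one DAG per website at parse time.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical helpers: in reality these would query our site catalogue.
from myproject.catalogue import get_websites, get_pages


def grpc_call(service):
    pass  # placeholder for a single call to the named micro-service


for site in get_websites():  # ~150 sites
    dag = DAG('report_{}'.format(site.slug),
              start_date=datetime(2017, 1, 1),
              schedule_interval='@daily')
    # Registering the DAG in the module globals is what makes the
    # scheduler pick up each dynamically generated DAG.
    globals()[dag.dag_id] = dag

    aggregate = PythonOperator(task_id='aggregate_and_report',
                               python_callable=grpc_call,
                               op_kwargs={'service': 'report-service'},
                               dag=dag)

    for page in get_pages(site):  # ~10k page-level tasks per DAG
        PythonOperator(task_id='page_{}'.format(page.id),
                       python_callable=grpc_call,
                       op_kwargs={'service': 'page-service'},
                       dag=dag) >> aggregate
```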
My questions:
- Is Airflow an acceptable framework for this use case (i.e. have you run similar use cases), or do alternative frameworks such as Luigi, Oozie, etc. present clear advantages in this context?
- Is the approach above (one DAG per website, no subdag, page tasks included in the DAG) a sound one? Do you foresee any issues with it?
- Is the web UI still usable with that number of tasks? I did a quick test with a few hundred tasks and got several timeouts; I'm wondering whether this is linked to my configuration or not.
- Is Celery the correct backend for this? I'm wondering whether LocalExecutor would in fact be more appropriate for this use case, given that no computation is performed directly by the Airflow workers (they only call remote services).
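(For reference, I'm switching between the two via the executor setting in airflow.cfg:)

```
[core]
# LocalExecutor runs tasks as subprocesses on the scheduler machine;
# CeleryExecutor distributes them to separate workers via a message broker.
executor = LocalExecutor
```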
Your initial idea is the one I would go with. Having 150 different workflows with 10k tasks each leads to a fully dynamic and unmanageable scenario. On the one hand you say that each task is just a simple gRPC call, but at the same time you mention that the page-level processing is too complex to encapsulate behind a single task, and that there are external dependencies that may cause flow bottlenecks measured in hours.
If I were you I'd redesign the solution and move the page-level reporting to a different layer. For example, a dedicated service that performs all these complex calculations would be a better option than trying to implement them in Airflow. That way you could probably cut down the number of page-level tasks significantly.
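As a rough illustration (Airflow 1.10 import paths; the page-report service, its batch semantics and `example-site` are all assumptions): one task would ask such a service to process every page of a site, and a sensor would then poll it until it reports completion, rather than modelling ~10k page tasks inside Airflow:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


def start_page_batch():
    # Hypothetical: one gRPC call asking the page-report service to process
    # all ~10k pages of the site, including handling the external
    # notifications that may arrive hours later.
    pass


def build_report():
    pass  # placeholder for the final gRPC call that assembles the report


class PageBatchDoneSensor(BaseSensorOperator):
    """Polls the (hypothetical) page-report service until the batch is done."""

    @apply_defaults
    def __init__(self, site_id, *args, **kwargs):
        super(PageBatchDoneSensor, self).__init__(*args, **kwargs)
        self.site_id = site_id

    def poke(self, context):
        # e.g. return page_report_stub.BatchStatus(site_id=self.site_id).done
        return True  # placeholder


dag = DAG('example_site_report', start_date=datetime(2017, 1, 1),
          schedule_interval='@daily')

start = PythonOperator(task_id='start_page_batch',
                       python_callable=start_page_batch, dag=dag)

wait = PageBatchDoneSensor(task_id='wait_for_page_batch',
                           site_id='example-site',
                           poke_interval=300,     # poll every 5 minutes
                           timeout=12 * 60 * 60,  # give up after 12 hours
                           dag=dag)

report = PythonOperator(task_id='aggregate_and_report',
                        python_callable=build_report, dag=dag)

start >> wait >> report
```

This keeps the hours-long waits out of the task graph: Airflow tracks one sensor per site instead of thousands of page tasks.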
Regarding your specific questions:
If I were you I'd have a single workflow for all 150 sites. I'd create a subdag for each website (btw, there is no mention of the word 'unstable' in the official docs) and try to offload the complex calculation operations to a different layer, in order to cut down on the number of page-level tasks as much as possible. A sketch of that layout follows.
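Something along these lines, using the documented SubDagOperator pattern (the `grpc_call` helper, the chunking and the hard-coded site count are illustrative only):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.subdag_operator import SubDagOperator

START = datetime(2017, 1, 1)


def grpc_call(service):
    pass  # placeholder for a single call to the named micro-service


def site_subdag(parent_dag_id, site_id, schedule_interval):
    # A subdag's dag_id must be '<parent_dag_id>.<task_id>'.
    subdag = DAG('{}.site_{}'.format(parent_dag_id, site_id),
                 start_date=START, schedule_interval=schedule_interval)
    report = PythonOperator(task_id='aggregate_and_report',
                            python_callable=grpc_call,
                            op_kwargs={'service': 'report-service'},
                            dag=subdag)
    # Far fewer page-level tasks if the heavy per-page work lives in a
    # dedicated service, as suggested above.
    for chunk in range(10):
        PythonOperator(task_id='page_chunk_{}'.format(chunk),
                       python_callable=grpc_call,
                       op_kwargs={'service': 'page-service'},
                       dag=subdag) >> report
    return subdag


dag = DAG('all_site_reports', start_date=START, schedule_interval='@daily')

for site_id in range(150):  # the real site list would come from a catalogue
    SubDagOperator(task_id='site_{}'.format(site_id),
                   subdag=site_subdag(dag.dag_id, site_id,
                                      dag.schedule_interval),
                   dag=dag)
```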