Airflow - where to place Dataset definition


We are using the official Airflow Helm chart, deployed to Kubernetes with the KubernetesExecutor and git-sync. One core git repository serves as the default location for DAGs; every project gets its own git repository, which is then added as a git submodule to the core repo.
With this setup in mind, where do you place an Airflow Dataset definition to enable data-aware scheduling when the Datasets are shared across projects? For example, if projectA and projectB both use the same Dataset, where should the definition live?
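For reference, here is a minimal sketch of the kind of definition I mean (Airflow 2.4+; the URI, DAG, and task names are made up). As far as I understand, Airflow matches Datasets by their URI string, so both projects need to construct a Dataset with the same URI:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Shared definition both projects need access to (hypothetical URI)
CUSTOMERS = Dataset("s3://data-lake/customers.parquet")

# projectA: producer DAG that marks the Dataset as updated
@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def project_a_producer():
    @task(outlets=[CUSTOMERS])
    def refresh_customers():
        ...  # write the data; outlets tells Airflow the Dataset was updated

    refresh_customers()

project_a_producer()

# projectB: consumer DAG, triggered whenever the Dataset is updated
@dag(schedule=[CUSTOMERS], start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def project_b_consumer():
    @task
    def load_customers():
        ...  # read the data

    load_customers()

project_b_consumer()
```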
I had a few ideas, but each of them has a drawback:

  • module inside the core repo - I would then have to include the core repo in every project's CI/CD pipeline, because we test each project independently (see the import sketch after this list)
  • module inside each project's repo - a lot of copy-pasting; I'm not convinced this is the best idea
  • separate Python package - not sure this would work; I'm concerned about whether Airflow can pick up the Datasets seamlessly. Also, how would I handle a project having a different version of the package than the others?
  • git submodule added to each repo - the same concern about different checked-out commits applies
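To make options 1, 3, and 4 concrete: in each case the project DAGs would end up importing the shared Dataset from somewhere rather than redefining it locally (module and package names below are hypothetical):

```python
# Option 1: module inside the core repo (core repo on every project's path)
# from core_repo.shared.datasets import CUSTOMERS

# Option 3: separate Python package installed into the Airflow image
# from myorg_airflow_datasets import CUSTOMERS

# Option 4: git submodule checked out into each project repo
# from shared_datasets.datasets import CUSTOMERS
```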

I'm not really sure what the best course of action is here; any feedback is appreciated!
