How to generate DBT data lineage graphs in a client's production environment?


Our project runs on the client's infrastructure, where the infra is managed via Kubernetes and Terraform. We automate our jobs using Airflow.

Every DBT job in Airflow runs using the KubernetesPodOperator provided by Airflow. We plan to create data lineage graphs for each client's tables.

I saw this link

How to setup dbt UI for data lineage?

and using the two commands below I can generate the DBT data docs on my local machine.

dbt docs generate
dbt docs serve --port 8081

Now I need to generate the same docs at each client's location, so I have written the DAG tasks shown below:

sync_data_lineage = KubernetesPodOperator(namespace='etl',
                                          image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
                                          cmds=["/usr/local/bin/dbt"],
                                          arguments=['docs', 'generate'],
                                          env_vars=env_var,
                                          name="sync_data_lineage",
                                          configmaps=['awskey'],
                                          task_id="sync_data_lineage",
                                          get_logs=True,
                                          dag=dag,
                                          is_delete_operator_pod=True,
                                          )

deploy_lineage_graph = KubernetesPodOperator(namespace='etl',
                                             image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
                                             cmds=["/usr/local/bin/dbt"],
                                             arguments=['docs', 'serve', '--port', '8081'],
                                             env_vars=env_var,
                                             name="deploy_lineage_graph",
                                             configmaps=['awskey'],
                                             task_id="deploy_lineage_graph",
                                             get_logs=True,
                                             dag=dag,
                                             is_delete_operator_pod=True,
                                             )

sync_data_lineage >> deploy_lineage_graph

Now the first task runs successfully, but when the second one runs it does not find catalog.json, which was created by the first task, 'sync_data_lineage'. The reason is that once the first DBT command has run and generated catalog.json, the pod is destroyed. The second task runs in a new pod and therefore cannot serve the docs, because catalog.json from the first step is missing.

How can I resolve this?


There are 2 best solutions below

gunn:

Try saving the DBT artifacts to S3 or other external storage, so that they survive the pod being destroyed.
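
A minimal sketch of what that could look like, assuming the AWS CLI is available inside the blotout/dbt-analytics image; the task name, bucket and key prefix are placeholders, not part of the original setup:

upload_docs = KubernetesPodOperator(namespace='etl',
                                    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
                                    # run a shell so both steps happen in the same pod:
                                    # generate the docs, then copy target/ to external storage
                                    cmds=["/bin/bash", "-c"],
                                    arguments=["dbt docs generate && aws s3 sync target/ s3://<your-docs-bucket>/dbt-docs/"],
                                    env_vars=env_var,
                                    name="upload_docs",
                                    configmaps=['awskey'],
                                    task_id="upload_docs",
                                    get_logs=True,
                                    dag=dag,
                                    is_delete_operator_pod=True,
                                    )

Because the docs are generated and uploaded in the same pod, nothing is lost when the pod is deleted; any later step can read the files back from the bucket.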

ifoukarakis:

The dbt docs generate command generates the project documentation (including lineage) and stores it under the target/ directory.

dbt docs serve --port 8081 starts an HTTP server that serves those static files. Running it in a KubernetesPodOperator will start a pod that never completes its work.

Since the documentation is a static site, it can be served using any hosting solution. The Airflow operator can be modified to run a script that automates publishing the contents of target/ to such a server, storing them in object storage, etc.
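
A rough sketch of such a publishing script; boto3, the bucket name, and the choice to upload only index.html, manifest.json and catalog.json are assumptions, not something stated in the original setup:

# publish_dbt_docs.py -- push the static docs produced by `dbt docs generate` to S3
import os
import boto3

def publish_docs(target_dir="target", bucket="<your-docs-bucket>", prefix="dbt-docs"):
    s3 = boto3.client("s3")
    # index.html loads manifest.json and catalog.json at runtime,
    # so all three files are needed for the site to render
    for name in ("index.html", "manifest.json", "catalog.json"):
        path = os.path.join(target_dir, name)
        content_type = "text/html" if name.endswith(".html") else "application/json"
        s3.upload_file(path, bucket, f"{prefix}/{name}",
                       ExtraArgs={"ContentType": content_type})

if __name__ == "__main__":
    publish_docs()

The bucket (or any other static host) can then serve the docs permanently, without keeping a dbt docs serve pod running.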

EDIT:

The dbt documentation advises against using dbt docs serve, as it's intended for local/development hosting, and suggests valid alternatives.