How can i see metadata, lineage of data stored in AWS redshift?

5.6k Views Asked by At

I am using solutions like cloudera navigator, atlas and Wherehows

to get Hadoop, HDFS, HIVE, SQOOP, MAPREDUCE metadata and lineage.

Now we have a data warehouse in AWS redshift as well. Is there a way to extract metadata or lineage or both information out of redshift.

So far i have not found anything on this.

Is there a way to integrate the same to wherehows as a crawled solution?

I found only one post which gives some information about how to get some information from redshift assuming it will be similar to postgresql. I am sure someone would have written some open source solution to this problem. Or is it just matter of writing a simple single script to extract this information? I am looking for a enterprise level solution. I hope someone will point me in right direction.

2

There are 2 best solutions below

0
On

You can access metadata by querying the system tables in Redshift:

https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html

The system tables are on the leader node in each cluster (see this guide on the Redshift Architecture that I wrote)

Redshift deletes the content of the system tables on a rolling basis, so you need to store that data in your cluster, or another separate cluster, to get a history. With the data in the system tables, you have a baseline of information about your queries and what tables they are touching.

You can put a dashboard like Kibana or Periscope Data on top of that data to visualize it. Plaid has done a write-up of how they've built an in-house monitoring solution that has some information about data lineage:

https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/

But go get true data lineage, you need to understand how queries relate to your workflows, i.e. for an Airflow DAG. To get that information, you need to "tag" your queries so you can trace them in the context of transformations / workflows, vs. looking at the individual query.

This is something we've built into our product - heads up that it's a commercial solution:

https://www.intermix.io/blog/announcing-query-insights/

Unlike the raw logs from the system tables, we give you the context of what apps / workflows are triggering queries, which users are running them, and what tables they are touching.

  • Lars
0
On

AWS Glue Data catalog is a fully managed metadata management service.It has AWS Glue crawler which automatically crawls through your source(for you its redshift) and creates a centralized metadata repository which can be accessed by other AWS services.

Refer:

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html

https://aws.amazon.com/glue/