I'm new to data governance, so forgive me if the question lacks some information.
Objective
We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS Gen2, Databricks, and Synapse for our ETL processing, data science, ML, and QA activities.
We already have about a hundred input tables and roughly 25 TB of new data per year. We expect more in the future.
The business strongly prefers cloud-agnostic solutions. Still, they are okay with Databricks since it's available on both AWS and Azure.
Question
What is the best Data Governance solution for our stack and requirements?
My workarounds
I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out of the box. AFAIK, Azure Data Catalog is outdated because it doesn't support ADLS Gen2.
After some very quick googling, I found three options:
- Databricks Privacera
- Databricks Immuta
- Apache Ranger & Apache Atlas.
Currently I'm not even sure whether the 3rd option fully supports our Azure stack. Moreover, it would require a much bigger development (infrastructure definition) effort. So are there any reasons I should look in the Ranger/Atlas direction?
What are the reasons to prefer Privacera over Immuta and vice versa?
Are there any other options I should evaluate?
What is already done
From a data governance perspective, we have done only the following:
- Defined data zones inside ADLS
- Applied encryption/obfuscation to sensitive data (due to GDPR requirements)
- Implemented Row-Level Security (RLS) at the Synapse and Power BI layers
- Built a custom audit framework for logging what was persisted and when
Things to be done
- Data lineage and a single source of truth. Even four months in, understanding the dependencies between data sets has become a pain point. The lineage information is stored in Confluence, where it's hard to maintain and has to be updated in multiple places. It is already outdated in some places.
- Security. Business users may do some data exploration in Databricks notebooks in the future, so we need RLS for Databricks.
- Data Life Cycle management.
- Possibly other data governance topics, such as data quality.
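To make the Databricks RLS item above more concrete: one approach I have seen described is a dynamic view that filters rows by group membership using the Databricks SQL function `is_member()`. Below is a minimal sketch that only generates the view definition. The view, table, column, and group names are hypothetical, and I have not verified that this fits our setup.

```python
from typing import Dict


def build_rls_view_sql(
    view_name: str,
    source_table: str,
    filter_column: str,
    group_by_value: Dict[str, str],
) -> str:
    """Build a CREATE VIEW statement for a row-filtering dynamic view.

    A row is visible only if its filter_column value maps to a group the
    querying user belongs to. All identifiers are assumed to come from
    trusted configuration, not from user input.
    """
    predicates = [
        f"({filter_column} = '{value}' AND is_member('{group}'))"
        for value, group in sorted(group_by_value.items())
    ]
    where_clause = "\n   OR ".join(predicates)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT *\n"
        f"FROM {source_table}\n"
        f"WHERE {where_clause}"
    )


# Hypothetical names, for illustration only.
rls_sql = build_rls_view_sql(
    view_name="curated.customers_rls",
    source_table="curated.customers",
    filter_column="region",
    group_by_value={"EMEA": "analysts_emea", "APAC": "analysts_apac"},
)
print(rls_sql)
```

In a Databricks notebook the generated statement could then be passed to `spark.sql(rls_sql)`, with business users granted access to the view rather than to the underlying table.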
Azure Purview is a new service that would fit your data governance needs well. It is currently (2020-12-04) in public preview. It contains the features you are asking about, e.g. data lineage, and works well with the Azure services you are using (Synapse, Databricks, ADLS Gen2).
Purview is not a cloud-agnostic solution. It exposes the Apache Atlas API, so some core capabilities and integrations could be run in any cloud. I would still categorize Purview as an Azure-specific solution.
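To illustrate what the Atlas-compatible API means for lineage, here is a rough sketch of the kind of payload the Atlas v2 entity endpoint accepts for a lineage step. It uses the built-in Atlas base types `Process` and `DataSet`. The qualified names and the job name are hypothetical, and in a real catalog you would most likely reference the concrete asset types registered there instead of the generic `DataSet`. The sketch only builds the JSON body and does not call any service.

```python
import json
from typing import Dict, List


def dataset_ref(qualified_name: str, type_name: str = "DataSet") -> Dict:
    """Reference an existing entity by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(
    name: str,
    qualified_name: str,
    inputs: List[str],
    outputs: List[str],
) -> Dict:
    """Build an Atlas-style entity body describing one lineage step.

    A Process entity links input data sets to output data sets, which is
    what a lineage graph is assembled from.
    """
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": [dataset_ref(q) for q in inputs],
                "outputs": [dataset_ref(q) for q in outputs],
            },
        }
    }


# Hypothetical qualified names, for illustration only.
payload = build_process_entity(
    name="build_customer_dim",
    qualified_name="etl://databricks/build_customer_dim",
    inputs=["adls://raw/crm/customers", "adls://raw/billing/accounts"],
    outputs=["adls://curated/dim_customer"],
)
print(json.dumps(payload, indent=2))
```

Because the body follows the Atlas entity model, the same structure could in principle be sent to a self-hosted Apache Atlas instance, which is the portable part of an otherwise Azure-specific service.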
Purview can manage hybrid data, e.g. data on-premises or in other clouds. In that sense it is agnostic about where your data lives. If you need to keep some data or use cases outside Azure, Purview will be able to manage those data assets too.
I saw that data quality features are on the Purview roadmap and will be available later. Other governance topics, e.g. policies, will also be covered later.
More info on Purview here: https://azure.microsoft.com/en-us/services/purview/