Planning to build a data platform using Google Cloud Dataproc for compute, storing the data in Delta tables (Delta Lake).
Currently exploring the data catalog options in the GCP stack, along with the open-source Hive metastore, and would like to clarify the questions below:
- What is the difference between Google Cloud Data Catalog and Dataproc Metastore (https://cloud.google.com/dataproc-metastore/docs)? Coming from the AWS world, what is the GCP equivalent of the AWS Glue Data Catalog?
- If we migrate the application from GCP to another Spark platform (for example, Databricks), can we port/reuse the Data Catalog entries or Dataproc Metastore already created on GCP?
- Where is the Data Catalog / Dataproc Metastore metadata stored? Is it GCS or some other storage?
- As per the documentation (https://cloud.google.com/data-catalog/docs/concepts/overview), Data Catalog automatically catalogs data in GCS, BigQuery, and Pub/Sub. Does Data Catalog or Dataproc Metastore automatically capture metadata for Delta tables on the Google platform?
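
For context, this is roughly how we plan to stand up the cluster and register Delta tables, sketched under assumptions: the project, region, service and bucket names are placeholders, and the Delta package version is illustrative, not a tested value.

```shell
# Sketch: create a Dataproc cluster attached to a Dataproc Metastore service.
# All names below (project, region, cluster, metastore, bucket) are placeholders.
gcloud dataproc clusters create delta-cluster \
    --region=us-central1 \
    --dataproc-metastore=projects/my-project/locations/us-central1/services/my-metastore \
    --properties='spark:spark.jars.packages=io.delta:delta-core_2.12:2.1.0,spark:spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,spark:spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog'

# Write the table data to GCS as Delta and register it in the attached metastore.
spark-sql -e "CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING DELTA LOCATION 'gs://my-bucket/delta/events'"
```

The question above about automatic metadata capture is essentially whether a table registered this way surfaces in Data Catalog / Dataproc Metastore without the explicit `CREATE TABLE` step.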