Google Cloud Data Catalog - Offerings and Flexibility

I am planning to build a data platform that uses Google Cloud Dataproc for compute and stores the data in Delta tables (Delta Lake).

I am currently exploring the data catalog options available in the GCP stack, along with the open source Hive metastore, and would like to clarify the following questions:

  • What is the difference between Google Cloud Data Catalog and Dataproc Metastore (https://cloud.google.com/dataproc-metastore/docs)? Coming from the AWS world, what is the GCP equivalent of the AWS Glue Data Catalog?
  • If we migrate the application from GCP to another Spark platform (for example, Databricks), can we port or reuse the GCP Data Catalog/Dataproc Metastore already created?
  • Where is the Data Catalog/Dataproc Metastore metadata stored? Is it in GCS or some other storage?
  • As per the documentation (https://cloud.google.com/data-catalog/docs/concepts/overview), Google Data Catalog automatically catalogs data in GCS, BigQuery, and Pub/Sub. Does Data Catalog/Dataproc Metastore automatically capture metadata for Delta tables on the Google platform?

There is 1 best solution below

Difference between catalog and Dataproc metastore:

If we migrate the application from GCP to another Spark platform (for example, Databricks), can we port or reuse the GCP Data Catalog/Dataproc Metastore already created?

  • Ideally, you should be able to reuse the Dataproc Metastore, since it is a managed Hive metastore service.

Where is the data catalog/dataproc metastore metadata stored? Is this GCS or any other storage?

  • Both are proprietary, Google-native services, so to get at the metadata you would need to export it from DPMS (Dataproc Metastore) or Google Cloud Data Catalog.
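
For the DPMS side, the export can be sketched as a `gcloud` invocation. The snippet below only builds and prints the command rather than running it. The service name, region, and bucket are hypothetical placeholders, and the flag names are given as I recall them from the `gcloud metastore services export gcs` reference, so verify them against the current documentation before use.

```python
import shlex


def build_dpms_export_command(service, location, destination_folder):
    """Build the argv list for exporting Dataproc Metastore metadata to GCS.

    All three arguments are caller-supplied; nothing here talks to GCP.
    """
    if not destination_folder.startswith("gs://"):
        raise ValueError("destination_folder must be a gs:// URI")
    return [
        "gcloud", "metastore", "services", "export", "gcs", service,
        f"--location={location}",
        f"--destination-folder={destination_folder}",
    ]


# Hypothetical names, for illustration only.
cmd = build_dpms_export_command(
    "my-metastore", "us-central1", "gs://my-bucket/dpms-export"
)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell where the Cloud SDK is installed and authenticated. The resulting dump in the bucket is what you would then carry over to another Hive-compatible metastore.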

Does Data Catalog/Dataproc Metastore automatically capture metadata for Delta tables on the Google platform?

  • No
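
Since nothing is captured automatically, one common approach (an assumption on my part, not something stated in the answer above) is to register each Delta table explicitly from Spark. The sketch below only generates the SQL statement. The database, table, and bucket path are hypothetical, and it assumes a Spark session that has Delta Lake enabled and is attached to the metastore.

```python
def delta_register_sql(database, table, location):
    """Return a Spark SQL statement that registers an existing Delta table.

    The statement points the metastore entry at data already written to
    `location`; it does not copy or rewrite any files.
    """
    if not location.startswith("gs://"):
        raise ValueError("location must be a gs:// URI")
    return (
        f"CREATE TABLE IF NOT EXISTS `{database}`.`{table}` "
        f"USING DELTA LOCATION '{location}'"
    )


# Hypothetical names, for illustration only.
stmt = delta_register_sql("sales", "orders", "gs://my-bucket/delta/orders")
print(stmt)
```

On a Dataproc cluster you would pass the statement to `spark.sql(stmt)`. Whether the entry then also shows up in Data Catalog depends on how the metastore and catalog are configured, so treat that part as something to verify.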