Function of Dataproc Metastore in a Datalake environment

In a Google Datalake environment, what is the Dataproc Metastore service used for?

I'm watching a Google Cloud Tech video and in this video around the 17:33 mark, the presenter says:

The other thing that is required, in order to make the data accessible, is all of the data that exists within GCS bucket, so on the BigQuery. So recent launched Dataproc Metastore. So this is a Hive-compatible Metastore. It's based on Hive Metastore. So this will allow you to register the data, structured data specifically, so that you can query it using Spark, Presto, or Hive.

As I understand this quote, if I have a BigQuery table called my_bigquery_table, I should be able to run the following Hive query (or similar) and get output:

SELECT * FROM my_bigquery_table;

As far as I know, this would only work if my Metastore is able to extract entries from the Data Catalog regarding my BigQuery tables.

Is my understanding correct? Currently, I am unable to find a way to sync entries in my Data Catalog to the Metastore (syncing data from Metastore to Data Catalog is possible, I know this).

UPDATE 1:

Is this a valid way to load Parquet/CSV files from GCS?

CREATE EXTERNAL TABLE sample_table(<column list>) STORED AS PARQUET LOCATION 'gs://parquet_bucket/parquet_file'; -- table creation

SELECT * FROM sample_table; -- select query

There is 1 answer below.

Dataproc Metastore is a managed Apache Hive Metastore service. It offers 100% OSS compatibility when accessing database and table metadata stored in the service.

For example, you might have a table stored in Parquet files on Google Cloud Storage. You can define a table over those files and store that metadata in a Dataproc Metastore instance. Then you can connect a Dataproc cluster to your Dataproc Metastore service instance and query that table using Hive, Spark SQL, or other query engines.
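
As a rough sketch of that workflow, the HiveQL below registers an external table over Parquet files and then queries it. The bucket name is taken from the question; the `sample_table/` path and the two columns are hypothetical placeholders, so substitute the real schema and location. One assumption worth checking against your data layout: Hive treats `LOCATION` as a directory (an object prefix on GCS) that contains the data files, not as the path of a single file.

```sql
-- Hypothetical path and columns, for illustration only.
-- LOCATION points at the prefix that holds the Parquet files.
CREATE EXTERNAL TABLE IF NOT EXISTS sample_table (
  id   BIGINT,
  name STRING
)
STORED AS PARQUET
LOCATION 'gs://parquet_bucket/sample_table/';

-- Once the definition is stored in the metastore, engines attached
-- to that metastore can resolve the table by name.
SELECT * FROM sample_table LIMIT 10;
```

Because the table is declared `EXTERNAL`, dropping it removes only the metadata from the metastore; the files in the bucket are left in place.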

Dataproc Metastore is also used to provide an OSS-compatible metadata API that lets Spark SQL query data discovered by Dataplex. See this documentation for more info.

Dataproc Metastore does not automatically ingest metadata from Data Catalog or make BigQuery tables automatically queryable from Hive.