I have been looking for cluster configs in JSON format to create a Dataproc cluster (on GCE) with a Dataproc Metastore service and the Spark-BigQuery connector jars, but I am unable to find any reference document that explains how to use those JSON configs.
I have looked through the links below:
https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/contrib/operators/dataproc_operator/index.html
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
https://cloud.google.com/dataproc/docs/reference/rest/v1/MetastoreConfig
but they cover the REST API and GKE cluster configs, not GCE cluster configs. Below are the configs I am trying out to create a Dataproc cluster:
CLUSTER_CONFIG = {
    "gce_cluster_config": {
        "internal_ip_only": True,
        "metadata": {
            "spark-bigquery-connector-version": spark_bq_connector_version
        },
        "service_account_scopes": [
            service_account_scopes
        ],
        "subnetwork_uri": subnetwork_uri,
        "zone_uri": zone_uri
    },
    "initialization_actions": [
        {
            "executable_file": initialization_actions,
            "execution_timeout": execution_timeout
        }
    ],
    "master_config": {
        "disk_config": {
            "boot_disk_size_gb": master_boot_disk_size_gb
        },
        "machine_type_uri": master_machine_type_uri
    },
    "metastore_config": {
        "dataproc_metastore_service": dataproc_metastore
    },
    "software_config": {
        "image_version": software_image_version
    },
    "worker_config": {
        "disk_config": {
            "boot_disk_size_gb": worker_boot_disk_size_gb
        },
        "machine_type_uri": worker_machine_type_uri,
        "num_instances": worker_num_instances
    }
}
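For what it's worth, one way I sanity-check a config like this is to substitute concrete values for the variables and round-trip it through JSON, since the dict must be JSON-serializable to reach the Dataproc API. All the values below (project, zone, subnetwork, machine types, connector version) are purely illustrative placeholders, not known-good settings:

```python
import json

# Hypothetical placeholder values -- substitute your own project/region/network.
CLUSTER_CONFIG = {
    "gce_cluster_config": {
        "internal_ip_only": True,
        "metadata": {"spark-bigquery-connector-version": "0.21.0"},
        "service_account_scopes": ["https://www.googleapis.com/auth/cloud-platform"],
        "subnetwork_uri": "projects/my-project/regions/us-central1/subnetworks/my-subnet",
        "zone_uri": "us-central1-a",
    },
    "initialization_actions": [
        {
            "executable_file": "gs://my-bucket/init/my-init-action.sh",  # hypothetical path
            "execution_timeout": "600s",
        }
    ],
    "master_config": {
        "disk_config": {"boot_disk_size_gb": 500},
        "machine_type_uri": "n1-standard-4",
    },
    "metastore_config": {
        "dataproc_metastore_service": "projects/my-project/locations/us-central1/services/my-metastore"
    },
    "software_config": {"image_version": "2.0-debian10"},
    "worker_config": {
        "disk_config": {"boot_disk_size_gb": 500},
        "machine_type_uri": "n1-standard-4",
        "num_instances": 2,
    },
}

# Round-trip through JSON to confirm the structure is serializable.
roundtripped = json.loads(json.dumps(CLUSTER_CONFIG))
print(roundtripped["worker_config"]["num_instances"])
```

If this round-trips cleanly, the same dict can be passed as-is wherever the operator expects the cluster config.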
Any lead would be really appreciated. Please attach links to full config examples.
Thanks!
As mentioned in this doc, an external Hive metastore (i.e., not a Dataproc Metastore service) needs to be specified through the hive:hive.metastore.uris property. Note the hive: prefix.
When creating the cluster with gcloud, adding --log-http will show you the actual HTTP request.
You can also find the request spec in the Dataproc REST API doc.
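To make the distinction concrete: assuming you point at a self-managed metastore rather than a Dataproc Metastore service, the property would go under software_config.properties instead of metastore_config. The Thrift URI below is a hypothetical example, not a real endpoint:

```python
# Sketch: configuring an external Hive metastore via cluster properties,
# instead of the metastore_config block used for a Dataproc Metastore service.
CLUSTER_CONFIG = {
    "software_config": {
        "image_version": "2.0-debian10",
        "properties": {
            # Note the "hive:" prefix on the Hive property name.
            "hive:hive.metastore.uris": "thrift://my-metastore-host:9083",
        },
    },
    # ... other blocks (gce_cluster_config, master_config, worker_config) as above ...
}

print(CLUSTER_CONFIG["software_config"]["properties"]["hive:hive.metastore.uris"])
```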