Unable to find DataprocCreateClusterOperator configs: Dataproc Metastore


I have been looking for cluster configs in JSON format to create a Dataproc cluster (on GCE) with a Dataproc Metastore service and the Spark-BigQuery connector dependency jars, but I am unable to find any reference document that shows how to use those JSON configs.

I have looked through the links below:
https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/contrib/operators/dataproc_operator/index.html
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
https://cloud.google.com/dataproc/docs/reference/rest/v1/MetastoreConfig

but they describe the REST API and GKE cluster configs rather than GCE cluster configs. Below is the config I am trying out to create a Dataproc cluster:

CLUSTER_CONFIG = {
    "gce_cluster_config": {
        "internal_ip_only": True,
        "metadata": {
            "spark-bigquery-connector-version": spark_bq_connector_version
        },
        "service_account_scopes": [
            service_account_scopes
        ],
        "subnetwork_uri": subnetwork_uri,
        "zone_uri": zone_uri
    },
    "initialization_actions": [
        {
            "executable_file": initialization_actions,
            "execution_timeout": execution_timeout
        }
    ],
    "master_config": {
        "disk_config": {
            "boot_disk_size_gb": master_boot_disk_size_gb
        },
        "machine_type_uri": master_machine_type_uri
    },
    "metastore_config": {
        "dataproc_metastore_service": dataproc_metastore
    },
    "software_config": {
        "image_version": software_image_version
    },
    "worker_config": {
        "disk_config": {
            "boot_disk_size_gb": worker_boot_disk_size_gb
        },
        "machine_type_uri": worker_machine_type_uri,
        "num_instances": worker_num_instances
    }
}
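
For reference, this is roughly how I am passing the dict to the operator; a minimal sketch where the project, region, cluster name and DAG wiring are placeholders:

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

# Sketch only: project_id, region and cluster_name below are placeholders.
create_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-dataproc-cluster",
    cluster_config=CLUSTER_CONFIG,
)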

Any lead would be really appreciated; please share links to full config examples.

Thanks!


There is 1 answer below:


As mentioned in this doc, an external Hive metastore (as opposed to a Dataproc Metastore service) needs to be specified through the hive:hive.metastore.uris property. Note the hive: prefix.
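
In the question's cluster config dict, that property would go under software_config -> properties; a minimal sketch (the thrift URI is a placeholder):

    "software_config": {
        "image_version": software_image_version,
        "properties": {
            "hive:hive.metastore.uris": "thrift://my-metastore:9083"
        }
    },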

When creating the cluster with gcloud, if you add --log-http:

$ gcloud dataproc clusters create ... \
    --properties hive:hive.metastore.uris=thrift://my-metastore:9083 \
    --log-http

it will show you the actual HTTP request:

{
   "clusterName":"...",
   "config":{
      "endpointConfig":{
         "enableHttpPortAccess":true
      },
      "gceClusterConfig":{
         "internalIpOnly":false,
         "serviceAccountScopes":[
            "https://www.googleapis.com/auth/cloud-platform"
         ],
         "zoneUri":"us-west1-a"
      },
      "masterConfig":{
         "diskConfig":{
            
         },
         "machineTypeUri":"e2-standard-2"
      },
      "softwareConfig":{
         "imageVersion":"1.5",
         "properties":{
            "hive:hive.metastore.uris":"thrift://my-metastore:9083"
         }
      },
      "workerConfig":{
         "diskConfig":{
            
         },
         "machineTypeUri":"e2-standard-2"
      }
   },
   "projectId":"..."
}

You can also find the request spec in the Dataproc REST API doc.
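
If you want the managed Dataproc Metastore service instead (as in the question's metastore_config), the dataproc_metastore_service field takes the full resource name of an existing service, as described in the MetastoreConfig doc linked in the question; a sketch with placeholder project, region and service name:

    "metastore_config": {
        "dataproc_metastore_service": "projects/my-project/locations/us-central1/services/my-metastore-service"
    },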