Cloud Data Fusion - Existing Dataproc option missing

According to the documentation, there is an option to use an existing Dataproc cluster in version 6.2 and above.

We use Cloud Data Fusion 6.2.0, but the Existing Dataproc option does not appear when we try to create a new compute profile.

What are we doing wrong? Why does the described option not show up? Do we have to do some additional configurations?

UPDATE 1

When I choose Dataproc, I see the following: [two screenshots of the Dataproc profile settings]

UPDATE 2

When we try to use the Remote Hadoop Provisioner, we get the following error message in the /logs/program.log file. The SSH connection itself is successful, because the run-id folder is created.


2021-06-15 09:40:37,617 - ERROR [main:o.a.z.s.NIOServerCnxnFactory@44] - Thread Thread[main,5,main] died
java.lang.reflect.InvocationTargetException: null
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_282]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_282]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_282]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_282]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteLauncher.main(RemoteLauncher.java:73) ~[launcher.jar:na]
Caused by: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) ~[hadoop-common-3.2.2.jar:na]
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338) ~[hadoop-common-3.2.2.jar:na]
        at io.cdap.cdap.common.conf.CConfigurationUtil.copyTxProperties(CConfigurationUtil.java:100) ~[na:na]
        at io.cdap.cdap.common.guice.ConfigModule.<init>(ConfigModule.java:62) ~[na:na]
        at io.cdap.cdap.common.guice.ConfigModule.<init>(ConfigModule.java:49) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionJobMain.initialize(RemoteExecutionJobMain.java:117) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionJobMain.doMain(RemoteExecutionJobMain.java:98) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionJobMain.main(RemoteExecutionJobMain.java:73) ~[na:na]
        ... 5 common frames omitted
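
The NoSuchMethodError looks like a Guava version conflict on the remote classpath: hadoop-common-3.2.2 calls a Preconditions.checkArgument overload that only exists in newer Guava releases, so an older guava jar is probably being picked up first. A quick way to check which Guava jars are present on the cluster node (user name, host, and search path below are placeholders):

    # List Guava jars that may shadow the version Hadoop 3.2.2 expects
    ssh my-user@cluster-master 'find /usr/lib -name "guava-*.jar" 2>/dev/null'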

There are 2 answers below.

ANSWER 1

I wasn't able to reproduce the exact scenario, since when creating a CDF instance from scratch the closest version I could select was Cloud Data Fusion 6.2.3.

I can confirm that on version 6.2.3 you have the option to choose an existing Dataproc cluster. Therefore, I would recommend upgrading to at least that version. Follow these docs in order to do it in a safe way.
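
For reference, the upgrade can also be triggered from the command line; a minimal sketch, assuming an instance named my-instance in us-central1 (both placeholders) and that the gcloud beta data-fusion surface supports --version for upgrades, as the upgrade docs describe:

    # Upgrade the Cloud Data Fusion instance in place
    gcloud beta data-fusion instances update my-instance \
        --location=us-central1 \
        --version=6.2.3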

As an alternative, there is a method to configure a Cloud Data Fusion pipeline to run against an existing cluster, described here; a sketch of the required profile fields follows. Note that this feature is available only in the Enterprise edition of Cloud Data Fusion.
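
In essence, the Remote Hadoop Provisioner profile just needs SSH access to an edge node of the existing cluster; a sketch of the profile fields (field names approximate, all values placeholders):

    Profile name:     existing-dataproc
    Host:             IP or hostname of the cluster's master/edge node
    User:             OS user allowed to SSH into that node
    SSH Private Key:  private key matching a public key in that user's authorized_keys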

ANSWER 2

For 6.2.0, "Remote Hadoop Provisioner" is the right option to use for an existing Dataproc cluster. The hang you ran into is caused by a rare case where API activation fails to assign the necessary role to the Dataproc-specific service account. It can be solved simply by granting the "Dataproc Service Agent" role to the following service account in your project:

service-${project number}@dataproc-accounts.iam.gserviceaccount.com
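
For example, the role can be granted from the command line (PROJECT_ID and PROJECT_NUMBER are placeholders for your own values):

    # Grant the Dataproc Service Agent role to the Dataproc service account
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
        --role="roles/dataproc.serviceAgent"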