I have set up a Dataproc cluster on Google Cloud. It is up and running, and I can access HDFS and copy files from the SSH in-browser console, so the problem is not on the Dataproc side.
I am now using Pentaho (an ETL tool) to copy files. Pentaho needs to access the master node and the data nodes.
I get the following error message:
456829 [Thread-143] WARN org.apache.hadoop.hdfs.DataStreamer - Abandoning BP-1097611520-10.132.0.7-1611589405814:blk_1073741911_1087
456857 [Thread-143] WARN org.apache.hadoop.hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[10.132.0.9:9866,DS-6586e84b-cdfd-4afb-836a-25348a5080cb,DISK]
456870 [Thread-143] WARN org.apache.hadoop.hdfs.DataStreamer - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/jmonteilx/pentaho-shim-test-file.test could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1819)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2569)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
The IP address used in the log is the internal IP of my first datanode in Dataproc. I need to use the external IP.
My question is the following: is there anything to change in the client-side configuration files to make this work?
I have tried:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
Without success. Many thanks for any help.
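For reference, this is roughly the kind of client-side write the Pentaho shim is attempting; a minimal standalone sketch, with the master hostname and NameNode port as placeholders for my actual cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder: externally reachable hostname of the Dataproc master (NameNode).
        conf.set("fs.defaultFS", "hdfs://<master-external-hostname>:8020");
        // Ask the client to contact datanodes by hostname instead of the
        // internal IP addresses returned by the NameNode.
        conf.set("dfs.client.use.datanode.hostname", "true");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/jmonteilx/pentaho-shim-test-file.test"))) {
            out.writeUTF("connectivity test");
        }
        System.out.println("Write succeeded");
    }
}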
Your ETL tool cannot access the DataNodes via external IPs from your on-premises data center, most likely because your firewall rules block access from the internet, or because you created the Dataproc cluster with internal IPs only.
That said, allowing access to HDFS from the internet is a security risk. By default, Dataproc clusters do not configure secure authentication with Kerberos, so if you decide to open up the cluster to the internet you should at the very least configure secure access to it.
The preferred solution is to establish a secure network connection between the on-premises data center and GCP, and access HDFS over it. You can read more about the options for this in the GCP documentation.
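For example, once the cluster's internal hostnames are reachable from the on-premises network (e.g. over Cloud VPN or Cloud Interconnect), a plain Hadoop client pointed at the master's internal hostname should work without any datanode-address workarounds. A minimal sketch, assuming the default NameNode port 8020 and a placeholder hostname:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectivityCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder: internal hostname of the Dataproc master node,
        // resolvable and routable from on-premises over the private connection.
        conf.set("fs.defaultFS", "hdfs://<cluster-name>-m:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/user"))) {
                System.out.println(status.getPath());
            }
        }
    }
}

The same internal hostname would then go into the Pentaho shim's Hadoop configuration files (core-site.xml / hdfs-site.xml) instead of an external IP.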