Need help to fix org.apache.hadoop.ipc.RemoteException - AWS EMR Spark Scala Application


I am running a Spark/Scala application on a 12-node AWS EMR cluster. The job performs multiple transformations, writing intermediate results to HDFS and reading them back to complete the transformations, and finally writes the output to S3.
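
To give a rough idea of the flow, the pattern looks something like this (the paths, formats, and names below are simplified placeholders, not the actual job code):

    import org.apache.spark.sql.SparkSession

    // Simplified sketch of the pipeline shape; all paths and names are placeholders
    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("emr-pipeline-sketch").getOrCreate()

        // First pass: transform the source data and persist an intermediate result to HDFS
        val source = spark.read.parquet("s3://my-bucket/input/")
        source.write.mode("overwrite").parquet("hdfs:///user/hadoop/working/stage1")

        // Later pass: read the intermediate result back from HDFS, finish, and write to S3
        val stage1 = spark.read.parquet("hdfs:///user/hadoop/working/stage1")
        stage1.write.mode("overwrite").parquet("s3://my-bucket/output/")

        spark.stop()
      }
    }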

During one of these transformations I recently started getting the following error:

2018-08-10 20:05:31,106 [task-result-getter-2] WARN  org.apache.spark.scheduler.TaskSetManager:66 - Lost task 44.0 in stage 30.0 (TID 4300, IP-address-here, executor 5): org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hadoop/working/xml/SD5256.20171030.5251246b-5475.xml.__temp could only be replicated to 0 nodes instead of minReplication (=1).  There are 11 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1735)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2561)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:847)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:790)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2486)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
at org.apache.hadoop.ipc.Client.call(Client.java:1435)
at org.apache.hadoop.ipc.Client.call(Client.java:1345)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy38.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
at sun.reflect.GeneratedMethodAccessor239.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
at com.sun.proxy.$Proxy39.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1838)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1638)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)

Based on some articles and forum comments, I updated hdfs-site.xml by adding the following configuration:

    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.datanode.use.datanode.hostname</name>
      <value>true</value>
    </property>
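
As I understand it, the client-side property can also be set from the Spark application itself (a minimal sketch, assuming a SparkSession named spark); the datanode-side property still has to live in hdfs-site.xml on the datanodes:

    // Sketch: applying the client-side setting at runtime instead of editing hdfs-site.xml.
    // Note: dfs.datanode.use.datanode.hostname is read by the datanodes themselves, so it
    // must still be configured in hdfs-site.xml on each datanode.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("dfs.client.use.datanode.hostname", "true")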

Can someone help me understand why I am getting this error, and what configuration I need to update in hdfs-site.xml to address it? Any help is appreciated.

1 Answer

I think this could be due to one of the following:

1. Because of the multiple transformations, your job needs to open many files, which might exceed the maximum number of open files allowed by the ulimit.
2. Your jobs are trying to write to the same HDFS file concurrently. HDFS does not allow concurrent writes to the same file.

Possible solutions:

1. At any given point in time an HDFS file can have only one writer connection, so make sure you are not writing to the same HDFS file from more than one task (see the sketch below).
2. Consider increasing the ulimit maximum number of open files for the user that runs the job, on all nodes.
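
For point 1, here is a minimal sketch of one way to give every task its own output file so that no two executors ever open the same HDFS path for writing (the object name, sample data, and paths are illustrative only, not taken from your job):

    import java.util.UUID
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    object UniqueWritersSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("unique-writers-sketch").getOrCreate()

        // Illustrative records; in the real job these would be the XML payloads
        val records = spark.sparkContext.parallelize(Seq("<doc id=\"1\"/>", "<doc id=\"2\"/>"))

        records.foreachPartition { iter =>
          // A unique name per partition means no two tasks ever write to the same file
          val path = new Path(s"hdfs:///user/hadoop/working/xml/part-${UUID.randomUUID()}.xml")
          val fs = path.getFileSystem(new Configuration())
          val out = fs.create(path, false) // overwrite = false: fail fast if the name already exists
          try iter.foreach(rec => out.write((rec + "\n").getBytes("UTF-8")))
          finally out.close()
        }

        spark.stop()
      }
    }

For point 2, you can check the current limit with ulimit -n on each node and raise it for the user running the job if it is too low.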