Does a node with the MapR client installed need access to the files I want to copy with distcp?


Situation:
Node 0: MapR client installed, not part of the cluster, no external resources mounted.
Nodes 1 to 10: MapR cluster nodes with the MapR NodeManager installed. Every node has external resources mounted under /mnt/resource/.

If I execute this command on any node from 1 to 10, it works: hadoop distcp file:///mnt/resource/file maprfs:///tmp

When I execute the same command on Node 0, I get an error:

20/11/24 14:08:24 ERROR tools.DistCp: Invalid input: org.apache.hadoop.tools.CopyListing$InvalidInputException: file:///mnt/resource/file doesn't exist

What I expected: Node 0 just submits the distcp job to YARN, which manages its execution. But it looks like distcp tries to access /mnt/resource/file directly from Node 0.

What I want to achieve is to execute the distcp command in a Docker container without mounting the /mnt/resource directory into the container.

I have also tried the distcp -f option, providing a file with /mnt/resource/file on the list, but the result is the same.
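For reference, the -f attempt looks roughly like this. The list-file path /tmp/distcp-sources.txt is a made-up name for illustration, and the guard keeps the snippet harmless on a machine without Hadoop. The error class in the question (CopyListing$InvalidInputException) suggests the source listing is built and validated on the submitting node, which would explain why -f fails the same way:

```shell
# Build a source list file: one URI per line.
# /tmp/distcp-sources.txt is a hypothetical name for the list file.
printf 'file:///mnt/resource/file\n' > /tmp/distcp-sources.txt

# Submit with the list file. On Node 0 this fails just like the direct form,
# apparently because the copy listing is checked on the submitting node.
if command -v hadoop >/dev/null 2>&1; then
  hadoop distcp -f file:///tmp/distcp-sources.txt maprfs:///tmp
fi
```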

Have you got any idea how to execute it? Or a workaround?

1 Answer


By far the easiest thing to do is to land the data onto the MapR file system (now rebranded as HPE Ezmeral Data Fabric) in the first place.

In many cases, that makes tasks like this completely moot.

The second easiest thing, if the task is somehow not amenable to such a solution, is to simply mount /mnt/resource and /mapr in the container (mounting is generally very quick) and use cp or rsync. This is commonly faster than distcp for file sizes up to hundreds of MB. Even if it isn't technically faster, it is often so much easier that a moderate speed difference doesn't matter. Obviously, this approach funnels all of the data through one machine, so it does have throughput limits.
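A minimal sketch of that second approach, assuming both mounts are visible inside the container (e.g. started with docker run -v /mnt/resource:/mnt/resource -v /mapr:/mapr ...). The cluster name my.cluster.com is hypothetical, and the snippet no-ops when the mounts are absent:

```shell
SRC=/mnt/resource/file               # source file from the question
DST=/mapr/my.cluster.com/tmp/        # hypothetical cluster name; adjust to yours

if [ -e "$SRC" ] && [ -d "$DST" ]; then
  # Both mounts are visible: a plain copy needs no YARN job at all.
  rsync -av "$SRC" "$DST"
else
  msg="mounts not available"
  echo "$msg"
fi
```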

Thirdly, distcp should generally work as you expect (that is, I would expect it to execute on whatever machines Hadoop is running on). There are lots of opportunities for gotchas, however. For instance, your Node 0 might think it is a cluster of one and not send the job to the cluster for execution at all. For another, there might be a code path in distcp that assumes it can do argument checks on the invoking node that determine whether execution will succeed. You can check the YARN queues and application logs to figure out what exactly happened.
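Two standard YARN CLI checks can help tell those cases apart, run from Node 0 (guarded so the snippet is a no-op on a machine without the yarn command):

```shell
if command -v yarn >/dev/null 2>&1; then
  # Does the client see the cluster's NodeManagers,
  # or does it think it is a "cluster of one"?
  yarn node -list

  # Was a distcp application ever actually submitted to the cluster?
  yarn application -list -appStates ALL | grep -i distcp
else
  checked="no yarn CLI on this machine"
  echo "$checked"
fi
```

If `yarn node -list` shows no NodeManagers, the client configuration on Node 0 is likely not pointing at the cluster, and distcp may be trying to run locally.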

I should also point out that if you have support, you can always ping the support team. They love helping users. They even help me!