Permission issue using s3-dist-cp to copy data from a non-EMR cluster to S3


To state my problem:
1) I want to back up our CDH Hadoop cluster to S3.
2) We have an EMR cluster running.
3) I am trying to run s3-dist-cp from the EMR cluster, giving the source as the HDFS URL of the remote CDH cluster and the destination as S3.
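For reference, the invocation looks roughly like this (the CDH NameNode host, port, and bucket name here are placeholders, not my real values):

```shell
# Hypothetical sketch of the s3-dist-cp step; host, port, and bucket are placeholders
s3-dist-cp --src hdfs://cdh-namenode:8020/data/to/backup \
           --dest s3://my-backup-bucket/backup/
```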

I get the following error: Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=mapred, access=READ_EXECUTE, inode="/tmp/hadoop-mapred/mapred/staging"

I have the following questions after going through the documentation here:

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html

1) Is this doable? I can see from the s3-dist-cp documentation that any HDFS URL can be given, but I can't find any documentation on how this works for an external cluster.

2) Where is the staging directory mentioned in the documentation (s3-dist-cp copies data to this directory before copying to S3) created: on the remote cluster or on the EMR cluster?

There are 2 best solutions below.

Answer 1:

It is definitely possible, though it's hard to say more without seeing your distcp command. Here's some general info.

We built a fairly sophisticated process that backed up our CDH cluster to S3. We didn't have to do anything special to deal with staging directories. We used the distcp included in the CDH distro, and it worked fine.

This all runs from a shell script. The key command we worked out is:

hadoop distcp $distcp_opts -m 20 -numListstatusThreads 15 -strategy dynamic -update -delete $distcp_source $distcp_target

With these variables set first:

distcp_opts="-Dfs.s3a.multipart.uploads.enabled=false -Dmapreduce.map.memory.mb=5000 -Dmapreduce.task.timeout=2400000 -Dmapreduce.map.maxattempts=8 -Dmapreduce.reduce.maxattempts=8 -Dfs.s3a.access.key=$AWS_ACCESS_KEY_ID -Dfs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY"

distcp_source="hdfs://nameservice1/foo/$table/"
distcp_target="s3a://my-aws-bucket/foo/$table"

The distcp_opts are simply what we eventually found to be reliable for us.
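Put together as a single script, the step above might look like this (the table name and bucket are placeholders, and the AWS credentials are assumed to already be in the environment):

```shell
#!/usr/bin/env bash
# Sketch of the backup step described above; table and bucket names are placeholders.
set -euo pipefail

table="my_table"   # hypothetical table name

# Options we found to be reliable: disable multipart uploads, raise map memory,
# extend the task timeout, allow more attempts, and pass S3 credentials to s3a.
distcp_opts="-Dfs.s3a.multipart.uploads.enabled=false \
  -Dmapreduce.map.memory.mb=5000 \
  -Dmapreduce.task.timeout=2400000 \
  -Dmapreduce.map.maxattempts=8 \
  -Dmapreduce.reduce.maxattempts=8 \
  -Dfs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  -Dfs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY"

distcp_source="hdfs://nameservice1/foo/$table/"
distcp_target="s3a://my-aws-bucket/foo/$table"

# 20 mappers, parallel listing, dynamic strategy; -update -delete keeps the
# target in sync with the source (deleting removed files).
hadoop distcp $distcp_opts -m 20 -numListstatusThreads 15 \
  -strategy dynamic -update -delete "$distcp_source" "$distcp_target"
```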

We have since moved to an EMR-based process and have only a few residual processes on CDH, but it is still working fine. Within the EMR cluster we use the AWS s3-dist-cp command, which is more powerful and capable than the Apache version. It's probably worth making it available on your cluster and trying it.
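For comparison, a minimal s3-dist-cp invocation on EMR looks like this (bucket and path are placeholders):

```shell
# Hypothetical s3-dist-cp run from the EMR master node; bucket and path are placeholders
s3-dist-cp --src hdfs:///foo/my_table \
           --dest s3://my-aws-bucket/foo/my_table
```

Unlike Apache distcp, s3-dist-cp can also aggregate small files during the copy, which is often useful for backups.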

Answer 2:

Distcp prefers to run on the cluster publishing (pushing) data rather than pulling it; if you have Kerberos on one of the clusters, distcp needs to run on that one.

For your task, unless there's a VPN so that the EMR cluster can see the other one, you won't get access. Given it's a permissions error, though, I'd suspect Kerberos or other auth rather than connectivity.
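To narrow down the AccessControlException itself, it may help to inspect the staging path from the error message directly on the CDH side (a diagnostic sketch; adjust paths and users for your cluster, and check the workaround against your security policy first):

```shell
# See who owns the staging directory and what its permissions are
hdfs dfs -ls /tmp/hadoop-mapred/mapred

# A common, permissive workaround: make the staging area world-writable
# with the sticky bit, like /tmp (only if your security policy allows it)
hdfs dfs -chmod -R 1777 /tmp/hadoop-mapred/mapred/staging
```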