To state my problem:
1) I want to back up our CDH Hadoop cluster to S3.
2) We have an EMR cluster running.
3) I am trying to run s3distcp from the EMR cluster, giving the src as the HDFS URL of the remote CDH cluster and the destination as S3.
I get the following error:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=mapred, access=READ_EXECUTE, inode="/tmp/hadoop-mapred/mapred/staging"
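For context, the command I am trying is shaped like this (the namenode host and bucket name are stand-ins for my real ones):

```shell
# Run from the EMR master node; the hostname and bucket name below are
# placeholders, not the real ones.
s3-dist-cp \
  --src hdfs://cdh-namenode.example.com:8020/user/data \
  --dest s3://my-backup-bucket/data
```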
These are my questions after going through the documentation here:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
1) Is this doable? I can see from the s3distcp documentation that any HDFS URL can be given, but I can't find any documentation on how it would work in the case of an external cluster.
2) Where is the staging directory created: on the remote cluster or on the EMR cluster? (The documentation mentions that s3distcp copies data to this directory before copying to S3.)
It is definitely possible. It's hard to say what's wrong without seeing your distcp command. Here's some general info...
We built a fairly sophisticated process that backed up our CDH cluster to S3. We didn't have to do anything special to deal with staging directories. We used the distcp included in the CDH distro, and it works fine.
This all runs from a shell script. The key command we worked out is a plain `hadoop distcp` invocation, with the source, destination, and options set in shell variables first.
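A sketch of what such a wrapper can look like follows; every hostname, path, bucket name, and option value here is a placeholder rather than what we actually ran, and the script echoes the composed command so you can inspect it before dropping the `echo` to run it for real:

```shell
# Hypothetical sketch of a distcp wrapper; all hostnames, paths, bucket
# names, and option values are placeholders.
HADOOP=/usr/bin/hadoop

# Source on the CDH cluster; destination in S3 (s3a:// on recent Hadoop,
# s3n:// on older distros).
distcp_src="hdfs://cdh-namenode.example.com:8020/data/warehouse"
distcp_dest="s3a://example-backup-bucket/warehouse"

# The kind of options you end up tuning for reliability: a longer task
# timeout, a capped mapper count, and -update so reruns copy only
# changed files.
distcp_opts="-Dmapreduce.task.timeout=1200000 -m 20 -update"

# Echo the composed command first; remove the echo to actually run it.
echo "$HADOOP" distcp $distcp_opts "$distcp_src" "$distcp_dest"
```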
The `distcp_opts` are just what we found was eventually reliable for us.

We have now moved to an EMR process, and have only a few residual processes on CDH, but it's still working fine. Within the EMR cluster, we use the AWS
`s3-dist-cp` command, which is more powerful and capable than the Apache version we used. It's probably worth making it available on your cluster and trying it.
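For the EMR side, an invocation is sketched below; the bucket name, HDFS path, and file pattern are placeholders, not values from our setup:

```shell
# Hypothetical example, run on the EMR master node (or as an EMR step);
# bucket name, HDFS path, and pattern are placeholders.
s3-dist-cp \
  --src hdfs:///data/warehouse \
  --dest s3://example-backup-bucket/warehouse \
  --srcPattern '.*\.parquet'
```

Unlike plain distcp, `s3-dist-cp` can also combine many small files into fewer large objects (via `--groupBy`), which helps S3 throughput considerably.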