Moving data from Cloudera to an Amazon S3 bucket

I have data on CDH HDFS and I want to move it to an Amazon S3 bucket so that I can run my code on AWS EMR instead of CDH. How can I move it securely and quickly?

Can I do it with s3a, or is there a more efficient way?

There is 1 solution below

I use hadoop distcp to copy data from S3 to HDFS. It works in the other direction too, so it should fit your case as well. Because it runs as a MapReduce job and copies files in parallel, it's pretty fast. I wrote a script that runs this command over an array of dates, and I run it in the background with nohup. The command syntax is:

hadoop distcp -Dfs.s3n.awsAccessKeyId=$S3NKEYID -Dfs.s3n.awsSecretAccessKey=$S3NKEY s3n://$COPYFROMENV/$TABLE_PATH/$TABLE/$PARTITION_PATH hdfs://$COPYTOENV/$TABLE_PATH/$TABLE/
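
For your direction (HDFS to S3), a rough sketch of the date-loop script could look like the one below. It uses the s3a connector rather than s3n, since s3a is the one still maintained in current Hadoop releases. The NameNode address, bucket name, table path, dates, and the dt=... partition layout are all made-up placeholders, and the sketch assumes s3a credentials are already configured on the cluster (for example in core-site.xml) instead of being passed on the command line. DRY_RUN defaults to 1 so the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: copy date partitions from HDFS to S3 with distcp over s3a.
# Every value below is a hypothetical placeholder; adjust to your cluster.
# Assumes s3a credentials are configured elsewhere (e.g. core-site.xml);
# alternatively they can be passed as -Dfs.s3a.access.key / -Dfs.s3a.secret.key.

SRC_NN="${SRC_NN:-namenode.example.com:8020}"   # hypothetical NameNode host:port
DEST_BUCKET="${DEST_BUCKET:-my-example-bucket}" # hypothetical S3 bucket
TABLE_PATH="${TABLE_PATH:-warehouse/my_table}"  # hypothetical table directory
DATES=("2020-01-01" "2020-01-02")               # hypothetical partition dates
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print commands, 0 = run them

# Copy one date partition; the dt=<date> layout is an assumption.
copy_partition() {
  local dt="$1"
  local src="hdfs://${SRC_NN}/${TABLE_PATH}/dt=${dt}"
  local dest="s3a://${DEST_BUCKET}/${TABLE_PATH}/dt=${dt}"
  if [ "$DRY_RUN" = "1" ]; then
    echo "hadoop distcp ${src} ${dest}"
  else
    hadoop distcp "${src}" "${dest}"
  fi
}

for dt in "${DATES[@]}"; do
  copy_partition "$dt"
done
```

Once the printed commands look right, something like `DRY_RUN=0 nohup ./copy_to_s3.sh > distcp.log 2>&1 &` runs it in the background (the script and log file names are just examples).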