How can I execute an s3-dist-cp command within a spark-submit application?


I have a jar file that is provided to spark-submit. Within a method in the jar, I'm trying to run:

import sys.process._
"s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>".!

I also installed s3-dist-cp on all the slaves as well as the master. The application starts and finishes without error, but it does not move the data to S3.
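For completeness, the call from inside the method looks roughly like this (a sketch; the bucket name is a placeholder, and the exit-code check is only there to make a failed copy visible):

import sys.process._

// Build the command as a Seq so the arguments are passed directly, without shell parsing.
val cmd = Seq("s3-dist-cp", "--src", "hdfs:///tasks/", "--dest", "s3://<destination-bucket>")

// "!" runs the external process, inherits stdout/stderr, and returns its exit code.
val exitCode = cmd.!
if (exitCode != 0) {
  sys.error(s"s3-dist-cp exited with code $exitCode")
}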


There are 2 answers below.

BEST ANSWER

s3-dist-cp is installed by default on the master node of an EMR cluster.

I was able to run s3-dist-cp from within the spark-submit application successfully when the Spark application was submitted in "client" mode.
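For reference, this is roughly the submission that worked for me; the class and jar names are placeholders:

spark-submit \
  --deploy-mode client \
  --class com.example.MySparkJob \
  my-application.jar

In client mode the driver runs on the master node, which is where the s3-dist-cp script is installed, so the sys.process call can find it.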

SECOND ANSWER

This isn't a direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though, once you account for the time taken by the additional write to HDFS that hadoop distcp requires. I'm also very interested in the answer to your question; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
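A minimal sketch of the distcp call I used (the paths and bucket name are placeholders; on EMR the s3:// scheme works through EMRFS, while on a plain Hadoop cluster you would typically use s3a:// with credentials configured):

hadoop distcp hdfs:///tasks/ s3://<destination-bucket>/tasks/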