How can I execute an s3-dist-cp command within a spark-submit application?


I have a jar file that is provided to spark-submit. Within a method in the jar, I'm trying to run:

import sys.process._
"s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>".!

I also installed s3-dist-cp on all the slaves as well as the master. The application starts and finishes without error, but it does not move the data to S3.
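For completeness, the call from inside the method looks roughly like this (a sketch; the bucket name is a placeholder, and the exit-code check is only there to make a failed copy visible):

import sys.process._

// Build the command as a Seq so the arguments are passed directly, without shell parsing.
val cmd = Seq("s3-dist-cp", "--src", "hdfs:///tasks/", "--dest", "s3://<destination-bucket>")

// "!" runs the external process, inherits stdout/stderr, and returns its exit code.
val exitCode = cmd.!
if (exitCode != 0) {
  sys.error(s"s3-dist-cp exited with code $exitCode")
}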


There are 2 answers below.

BEST ANSWER

s3-dist-cp is installed by default on the master node of an EMR cluster.

I was able to run s3-dist-cp from within the spark-submit application successfully when the Spark application was submitted in "client" mode.
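For reference, this is roughly the submission that worked for me; the class and jar names are placeholders:

spark-submit \
  --deploy-mode client \
  --class com.example.MySparkJob \
  my-application.jar

In client mode the driver runs on the master node, which is where the s3-dist-cp script is installed, so the sys.process call can find it.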

SECOND ANSWER

This isn't a direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though, once you account for the time taken by the additional write to HDFS that hadoop distcp requires. I'm also very interested in the answer to your question; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
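A minimal sketch of the distcp call I used (the paths and bucket name are placeholders; on EMR the s3:// scheme works through EMRFS, while on a plain Hadoop cluster you would typically use s3a:// with credentials configured):

hadoop distcp hdfs:///tasks/ s3://<destination-bucket>/tasks/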