How do I run "s3-dist-cp" command inside pyspark shell / pyspark script in EMR 5.x


I had some problems running an s3-dist-cp command from my pyspark script, as I needed to move some data from S3 to HDFS for better performance. So I am sharing this here.


There are 2 answers below.

import os

# Shell out to s3-dist-cp to copy the input from S3 into HDFS, grouping the
# matched files and splitting output into ~64 MiB files with no compression
os.system("/usr/bin/s3-dist-cp --src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/ --dest=/de_pulse/ --groupBy='.*(additional).*' --targetSize=64 --outputCodec=none")

Note: make sure you give the full path to s3-dist-cp (e.g. /usr/bin/s3-dist-cp).
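If you are not sure where the binary lives, here is a quick check you can run from the driver (a sketch assuming Python 3, where shutil.which is available):

import shutil

# Look up the s3-dist-cp executable on the node running the driver;
# returns None if it is not on the PATH
print(shutil.which("s3-dist-cp"))  # typically /usr/bin/s3-dist-cp on EMR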

You can also use the subprocess module instead of os.system, as in the sketch below.
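This is a minimal sketch with the same arguments as above; unlike os.system, subprocess.check_call raises an exception if the copy exits with a non-zero status:

import subprocess

# Run s3-dist-cp and fail loudly if it returns a non-zero exit code
subprocess.check_call([
    "/usr/bin/s3-dist-cp",
    "--src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/",
    "--dest=/de_pulse/",
    "--groupBy=.*(additional).*",  # no shell involved, so no extra quoting needed
    "--targetSize=64",
    "--outputCodec=none",
])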


If you're running a pyspark application, you'll have to stop the Spark application first; otherwise the s3-dist-cp command will hang, because s3-dist-cp launches its own job on the cluster and the still-running pyspark application blocks it.

import os

spark.stop()  # stop the SparkSession (this also stops the underlying SparkContext)
os.system("/usr/bin/s3-dist-cp ...")
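Putting the two answers together, a sketch of the whole pattern (the S3 and HDFS paths below are placeholders, and spark is assumed to be the active SparkSession):

import subprocess
from pyspark.sql import SparkSession

spark.stop()  # release the cluster so the copy job can run

# Hypothetical paths, for illustration only
subprocess.check_call([
    "/usr/bin/s3-dist-cp",
    "--src=s3://my-bucket/input/",
    "--dest=/my-hdfs-dir/",
])

# If more Spark work follows, a fresh session can be created afterwards
spark = SparkSession.builder.getOrCreate()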