Avoid running the "Install Task Runner" step in an EMR cluster


I hope you can help me. I am trying to create an EMR cluster with Hadoop and Spark installed, using Data Pipeline. The problem is that this EMR cluster is private, so it has no internet access to download anything. In the pipeline I specify bootstrap actions that download all the .jars and dependencies, including TaskRunner.jar.

The pipeline activity that launches script.py is:

{
      "name": "DefaultEmrActivity1",
      "maximumRetries" : 0,
      "runsOn": {
        "ref": "EmrClusterId_lKm9y"
      },
      "id": "EmrActivityId_SRjHg",
      "type": "ShellCommandActivity",
      "command": "spark-submit --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=true --py-files s3://emr/script.py"
    },

But this step is not running in my EMR cluster. Instead, I see an "Install TaskRunner" step that tries to install the jars from the internet, so it fails.

The Task Runner step command:

JAR location: s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar
Main class: None
Arguments: s3://datapipeline-eu-west-1/eu-west-1/bootstrap-actions/latest/TaskRunner/install-remote-runner-v2
--workerGroup=df-08684532KKW88TTUXHVS_@EmrClusterId_lKm9y_2021-05-07T07:22:56   
--endpoint=https://datapipeline.eu-west-1.amazonaws.com --region=eu-west-1   
--logUri=s3://aws-logs-351516419540-eu-west-1/pipeline/df-08684532KKW88TTUXHVS/EmrClusterId_lKm9y/@EmrClusterId_lKm9y_2021-05-07T07:22:56/@EmrClusterId_lKm9y_2021-05-07T07:22:56_Attempt=1/ --taskRunnerId=54ec5b53-884b-420d-b3e6-d0e518ddf448   
--zipFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/TaskRunner-1.0.zip   
--mysqlFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/mysql-connector-java-bin.jar   
--hiveCsvSerdeFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/csv-serde.jar   
--proxyHost= --proxyPort=-1 --username= --password= --windowsDomain= --windowsWorkgroup= --releaseLabel=emr-6.2.0   
--jdbcDriverS3Path=s3://datapipeline-eu-west-1/eu-west-1/software/latest/TaskRunner/ --s3NoProxy=false
Action on failure: Terminate cluster

Error:

Connecting to datapipeline-eu-west-1.s3.amazonaws.com (datapipeline-eu-west-1.s3.amazonaws.com)|52.218.108.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16873 (16K) [application/octet-stream]
Saving to: ‘common/csv-serde.jar’

     0K .......... ......                                     100% 26.7M=0.001s

2021-05-07 07:30:44 (26.7 MB/s) - ‘common/csv-serde.jar’ saved [16873/16873]

+ '[' -n emr-6.2.0 ']'
+ sudo echo -e '\nexport HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/mnt/taskRunner/common/mysql-connector-java-bin.jar:/etc/hadoop/hive/lib/hive-exec.jar"'
+ sudo tee -a /etc/hadoop/conf/hadoop-env.sh
+ bash /etc/hadoop/conf/hadoop-env.sh
+ '[' -z emr-6.2.0 ']'
+ unzip -o taskRunner.zip
+ chmod 500 aws-datapipeline-taskrunner-v2.sh
+ '[' -d /usr/share/aws/emr/goodies/lib ']'
+ '[' -n emr-6.2.0 ']'
+ EMR_HADOOP_GOODIES_NAME='emr-hadoop-goodies-*jar'
+ EMR_HIVE_GOODIES_NAME='emr-hive-goodies-*jar'
+ OPEN_CSV_PATH=/usr/lib/hive/lib/
++ find /usr/share/aws/emr/goodies/lib -name 'emr-hadoop-goodies-*jar'
+ emr_goodies_jar=/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar
+ '[' -n /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar ']'
+ open_csv_symlink=/mnt/taskRunner/open-csv.jar
+ emr_goodies_symlink=/mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ emr_hive_goodies_symlink=/mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo rm -f /mnt/taskRunner/open-csv.jar
+ sudo rm -f /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo rm -f /mnt/taskRunner/oncluster-emr-hive-goodies.jar
++ find /usr/share/aws/emr/goodies/lib -name 'emr-hive-goodies-*jar'
+ emr_hive_jar=/usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.1.0.jar
++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar
/usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.1.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'

I don't know why the link can't be created, because the EMR cluster terminates when the step fails and I can't inspect it. But I don't want this step to be executed at all, since these jars are already installed by the bootstrap actions. Any advice on how to prevent this step from running? Thanks


There are 2 answers below


If you look at the open_csv_jar variable (open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar'), you will see that the find picked up two versions of the jar, so ln receives two source files and expects the final operand to be a directory. I don't know why both versions are present, but with emr-6.1.0 this doesn't happen and cluster provisioning works perfectly.


When a Data Pipeline is created with an EmrCluster resource, it launches a cluster with the given configuration and automatically adds a step to install and run Task Runner (reference).

I ran into that error when the step to install Task Runner was running. You can instead create the EMR cluster beforehand, install and run Task Runner on it yourself, and then associate the cluster with the data pipeline by using the workerGroup field in the activity instead of runsOn. This worked for me. An answer explaining how to do this is available here, and the documentation is available here.
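A minimal sketch of the question's activity rewired to use a worker group (my-worker-group is a placeholder; it must match the --workerGroup value Task Runner is started with on the cluster):

```json
{
  "id": "EmrActivityId_SRjHg",
  "name": "DefaultEmrActivity1",
  "type": "ShellCommandActivity",
  "workerGroup": "my-worker-group",
  "command": "spark-submit --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=true --py-files s3://emr/script.py"
}
```

Task Runner itself is started on the cluster (for example on the master node) with something like `java -jar TaskRunner-1.0.jar --config credentials.json --workerGroup=my-worker-group --region=eu-west-1`, so the pipeline dispatches the activity to that worker group instead of provisioning its own runner on a fresh cluster.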