When adding a custom jar step to an EMR cluster, how do you set the classpath so that it includes a dependent jar (a required library)?
Let's say I have my jar file, myjar.jar, but it needs an external jar, dependency.jar, to run. Where do you configure this when creating the cluster? I am not using the command line; I am using the Advanced Options interface.
Thought I would post this after spending a number of hours poking around and reading outdated documentation.
The approach described in the 2.x/3.x documentation, setting HADOOP_CLASSPATH, does not work; the documentation itself says it does not apply to 4.x and above anyway. Somewhere you need to specify a --libjars option. However, specifying that in the arguments list does not work either.
For example:
Step Name: MyCustomStep
Jar Location: s3://somebucket/myjar.jar
Arguments: myclassname option1 option2 --libjars dependentlib.jar
Copy your required jars to /usr/lib/hadoop-mapreduce/ in a bootstrap action. No other changes are necessary. Additional info below:
This command below works for me to copy a specific JDBC driver version:
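As a rough sketch of such a copy: the bucket and jar names here are hypothetical, copy_jar is just an illustrative helper name, and sudo is an assumption in case the target directory is not writable by the default user.

```shell
# Copy one jar from S3 into the Hadoop MapReduce library directory.
# $1: S3 URI of the jar (the AWS CLI expects the s3:// scheme)
copy_jar() {
  # sudo is an assumption: the directory may be owned by root on cluster nodes
  sudo aws s3 cp "$1" /usr/lib/hadoop-mapreduce/
}

# Example call with a hypothetical bucket and jar name, to be run on a cluster node:
# copy_jar s3://somebucket/dependency.jar
```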
I have other dependencies, so I have a bootstrap action for each jar I need copied; of course, you could put all the copies in a single bash script. Below is the .NET code I use to get a bootstrap action to run the copy script. I am using .NET SDK versions 3.3.* and launching the job with release label emr-5.2.0.
Note that the ScriptBootstrapActionConfig Path property uses the protocol "s3n://", but the protocol for the aws s3 cp command should be "s3://".
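If you are not using the .NET SDK, the same bootstrap action can in principle be registered from the AWS CLI through the --bootstrap-actions option of aws emr create-cluster. The sketch below only builds the value for that option; the bucket, script location, action name, and the bootstrap_action_arg helper are all hypothetical, and it assumes the script takes the jar's S3 URI as its single argument.

```shell
# Hypothetical locations: replace the bucket, script, and jar names with your own.
SCRIPT_URI="s3://somebucket/copy-thirdPartyJar.sh"
JAR_URI="s3://somebucket/dependency.jar"

# Build the value for --bootstrap-actions.
# $1: script path, $2: display name, $3: argument passed to the script
bootstrap_action_arg() {
  printf 'Path=%s,Name=%s,Args=[%s]\n' "$1" "$2" "$3"
}

# Print the value for review.
bootstrap_action_arg "$SCRIPT_URI" "CopyDependencyJar" "$JAR_URI"

# It would then be passed along these lines (other cluster options omitted):
# aws emr create-cluster --release-label emr-5.2.0 \
#   --bootstrap-actions "$(bootstrap_action_arg "$SCRIPT_URI" CopyDependencyJar "$JAR_URI")" ...
```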
My script copy-thirdPartyJar.sh contains the following:
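A minimal sketch of what such a script could look like follows. It makes several assumptions: the jar's S3 URI is passed as the first bootstrap argument, the function name copy_third_party_jar is illustrative, and the mkdir is a defensive step in case the directory does not exist yet when the bootstrap action runs.

```shell
#!/bin/bash
# Sketch of a bootstrap script in the spirit of copy-thirdPartyJar.sh.
# Assumption: the jar's S3 URI is passed as the first argument.

DEST_DIR="/usr/lib/hadoop-mapreduce"

copy_third_party_jar() {
  jar_uri="$1"
  # The AWS CLI only accepts s3:// URIs, so reject s3n:// and anything else early.
  case "$jar_uri" in
    s3://*) ;;
    *) echo "expected an s3:// URI, got: $jar_uri" >&2; return 1 ;;
  esac
  # Defensive (assumption): create the directory if it is not there yet.
  sudo mkdir -p "$DEST_DIR" || return 1
  sudo aws s3 cp "$jar_uri" "$DEST_DIR/"
}

# Run only when an argument is passed to the script.
if [ "$#" -gt 0 ]; then
  copy_third_party_jar "$1"
fi
```

Keeping one jar per invocation matches the one-bootstrap-action-per-jar setup described above; looping over "$@" instead would let a single action copy several jars.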