I am trying to set up EMR Serverless, for which I have two options:
- Using a Terraform script, which lets me choose the initial size, max memory, etc.; however, it gives me no option to install JAR files / external libraries.
- Using a Docker image, which doesn't let me select the initial size, max memory, etc.

I am thinking of using the Terraform script together with the Docker image, but I don't know how to install JAR files that way. Can someone please share some thoughts?
Also, my libraries are internal and written in Java / Scala.

Thanks.

I have tried the Docker image as well as the Terraform script.
First, I think it's good to clarify how EMR Serverless is different from other EMR deployment options. There are two main components to EMR Serverless: the application, which defines things like worker sizes and capacity, and the jobs you submit to it.
There is not a cluster to install things onto, and the infra (application) is typically separate from job submission. If you simply have one jar that is your job, you would upload it to S3, include it as the `--entrypoint` to your `start-job-run` command, and specify the main class with `--class`.

That said, it sounds like you want to include additional jars with your job. You have two options:

1. Building a custom image with your dependency jars installed under `/usr/lib/spark`.
2. Uploading the jars to S3 and including them with the `--jars` option or, if they're available in Maven, specifying the dependencies with the `--packages` option.

For option 1, when you create a new application (where you define worker sizes), you include the custom image you want to use. For example, the following CLI command creates an application with a custom image (`--image-configuration`) as well as your pre-initialized worker configuration (`--initial-capacity`):
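A sketch of such a `create-application` call — the application name, ECR image URI, region, and capacity values below are placeholders you would replace with your own:

```shell
# Create an EMR Serverless application with a custom image and
# pre-initialized (always-warm) driver/executor capacity.
# Image URI, sizes, and release label are placeholder values.
aws emr-serverless create-application \
    --name my-custom-image-app \
    --release-label emr-6.9.0 \
    --type SPARK \
    --image-configuration '{
        "imageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-custom-image:latest"
    }' \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu": "2vCPU", "memory": "4GB"}
        },
        "EXECUTOR": {
            "workerCount": 2,
            "workerConfiguration": {"cpu": "4vCPU", "memory": "8GB"}
        }
    }'
```

Note that custom images are only supported from release `emr-6.9.0` onward.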
For option 2, if you just have a single uberjar you want to use with your job, you upload it to S3 and provide it as your entrypoint to the `start-job-run` command:
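A sketch of that `start-job-run` call, with placeholder application ID, execution role, bucket, and main class:

```shell
# Run a Spark job whose entrypoint is an uberjar uploaded to S3.
# Application ID, role ARN, bucket path, and class are placeholders.
aws emr-serverless start-job-run \
    --application-id <application-id> \
    --execution-role-arn arn:aws:iam::123456789012:role/emr-serverless-job-role \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jars/my-uberjar.jar",
            "sparkSubmitParameters": "--class com.example.MyMainClass"
        }
    }'
```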
If you want to specify Maven dependencies, you can use `--packages` in the `sparkSubmitParameters`:
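The same submission with Maven coordinates added via `--packages` (the PostgreSQL driver coordinates are just an example dependency; all other values are placeholders):

```shell
# Submit the job and let Spark resolve an extra dependency from Maven.
aws emr-serverless start-job-run \
    --application-id <application-id> \
    --execution-role-arn arn:aws:iam::123456789012:role/emr-serverless-job-role \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jars/my-uberjar.jar",
            "sparkSubmitParameters": "--class com.example.MyMainClass --packages org.postgresql:postgresql:42.6.0"
        }
    }'
```

Keep in mind that resolving packages from an external Maven repository may require the application to have outbound network access (e.g. a VPC configuration).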
If you upload additional jars to S3, you can also specify those using the `--jars` option.

There's some more info on these options in the emr-serverless-samples GitHub repo.
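For example, jars staged in S3 could be attached like this (bucket and jar names are placeholders):

```shell
# Attach extra jars from S3 with --jars (comma-separated URIs).
aws emr-serverless start-job-run \
    --application-id <application-id> \
    --execution-role-arn arn:aws:iam::123456789012:role/emr-serverless-job-role \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jars/my-uberjar.jar",
            "sparkSubmitParameters": "--class com.example.MyMainClass --jars s3://my-bucket/jars/dep1.jar,s3://my-bucket/jars/dep2.jar"
        }
    }'
```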