EMR Serverless using Docker - how to install JAR files


I am trying to set up EMR Serverless, for which I have two options:

  1. Using a Terraform script - this lets me choose the initial size, maximum memory, etc.; however, it does not give me an option to install JAR files / external libraries.
  2. Using a Docker image - this doesn't let me select the initial size, maximum memory, etc.

I am thinking of using the Terraform script together with Docker, but I don't know how to install the JAR files that way. Can someone please share some thoughts?

Also, my libraries are internal and written in Java / Scala.

TY

I tried Docker as well as Terraform.

1 Answer

First, I think it's good to clarify how EMR Serverless is different from other EMR deployment options. There are two main components to EMR Serverless:

  1. EMR Serverless application - This is the framework type (Hive/Spark), version (EMR 6.9.0 / Spark 3.3.0), and application properties including architecture (x86 or arm64), networking (VPC or not), custom images, and worker sizes.
  2. Jobs - This is the code for your job, including runtime jars or dependencies, as well as an IAM role with permissions scoped to the job itself.

There is no cluster to install things onto, and the infrastructure (the application) is typically separate from job submission. If you simply have one jar that is your job, you upload it to S3, include it as the entryPoint of your start-job-run command, and specify the main class with --class.

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_ROLE_ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<BUCKET>/jars/spark-examples.jar",
            "entryPointArguments": ["1"],
            "sparkSubmitParameters": "--class org.co.YourMainClass"
        }
    }'

That said, it sounds like you want to include additional jars with your job. You have two options:

  1. Build a custom image that includes those jars in a path that Spark picks up, such as /usr/lib/spark (see the sketch after this list).
  2. Include the jars with the job submission by either uploading them to S3 and providing the path to them with the --jars option or, if they're available in Maven, specifying the dependencies with the --packages option.
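
For option 1, here's a rough sketch of the image build itself. It assumes the public EMR Serverless Spark base image for emr-6.9.0, a placeholder jar named my-internal-lib.jar in the Docker build context, an existing ECR repository named my-repository, and <REGION> / <AWS_ACCOUNT_ID> as placeholders for your own values:

# Dockerfile that starts from the EMR Serverless Spark base image and copies
# your internal jars into Spark's jar directory (/usr/lib/spark/jars on the EMR image)
cat > Dockerfile <<'EOF'
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

# Placeholder name for your internal Java/Scala library
COPY my-internal-lib.jar /usr/lib/spark/jars/

# EMR Serverless expects the image to run as the hadoop user
USER hadoop:hadoop
EOF

# Build the image and push it to ECR so EMR Serverless can pull it
aws ecr get-login-password --region <REGION> | \
    docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com
docker build -t <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-repository:latest .
docker push <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-repository:latest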

For option 1, when you create a new application (where you define worker sizes), you include the custom image you want to use. For example, the following CLI command creates an application with a custom image (--image-configuration) as well as your pre-initialized worker configuration (--initial-capacity):

aws emr-serverless create-application \
    --release-label emr-6.9.0 \
    --type SPARK \
    --image-configuration '{
        "imageUri": "aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest"
    }' \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 5,
            "workerConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB"
            }
        },
        "EXECUTOR": {
            "workerCount": 50,
            "workerConfiguration": {
                "cpu": "4vCPU",
                "memory": "8GB"
            }
        }
    }'

For option 2, if you just have a single uberjar you want to use with your job, you upload that to S3 and provide it as the entryPoint of your start-job-run command:

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_ROLE_ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<S3_BUCKET>/code/java-spark/java-demo-1.0.jar",
            "sparkSubmitParameters": "--class HelloWorld"
        }
    }'

If you want to specify Maven dependencies, you can use --packages in the sparkSubmitParameters:

"sparkSubmitParameters": "--packages org.postgresql:postgresql:42.4.0"

If you upload additional jars to S3, you can also specify those using the --jars option.

"sparkSubmitParameters": "--jars s3://<S3_BUCKET>/jars/uber-jar-1.0-SNAPSHOT.jar"

There's some more info on these options in the emr-serverless-samples GitHub repo.