Failing to run Hudi DeltaStreamer on EMR on EKS


I'm trying to run Apache Hudi on Amazon EMR on EKS using the aws emr-containers start-job-run command, but I'm encountering a NoSuchMethodError with the following error message:

pod emr-on-eks-spark.spark-000000033dvo7gou032-driver exited with code 1 Error: 24/02/19 05:09:43 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 9873) (10.3.0.113 executor 1): java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_AVRO_REBASE_MODE_IN_WRITE()Lorg/apache/spark/internal/config/ConfigEntry;
24/02/19 05:10:19 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 9874) (10.3.0.113 executor 1): java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_AVRO_REBASE_MODE_IN_WRITE()Lorg/apache/spark/internal/config/ConfigEntry;
24/02/19 05:10:57 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID 9875) (10.3.1.191 executor 2): java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_AVRO_REBASE_MODE_IN_WRITE()Lorg/apache/spark/internal/config/ConfigEntry;

The command I am using is:

aws emr-containers start-job-run \
--name=orders \
--virtual-cluster-id <clusterId> \
--region us-east-1 \
--execution-role-arn arn:aws:iam::<accountId>:role/execution-role \
--release-label=emr-6.10.0-latest \
--job-driver='{
    "sparkSubmitJobDriver": {
        "entryPoint":"s3://<bucket_location>/hudi-utilities-bundle_2.12-0.12.2.jar",
        "entryPointArguments": [
            "--table-type","COPY_ON_WRITE",
            "--source-ordering-field","created_time",
            "--props","s3://<bucket_location>/config/orders.properties",
            "--source-class","org.apache.hudi.utilities.sources.ParquetDFSSource",
            "--target-table","orders",
            "--target-base-path","s3://<bucket_location>/orders",
            "--transformer-class","org.apache.hudi.utilities.transform.AWSDmsTransformer",
            "--transformer-class","org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
            "--schemaprovider-class","org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
            "--payload-class","org.apache.hudi.payload.AWSDmsAvroPayload",
            "--op","UPSERT"
        ],
        "sparkSubmitParameters": "--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages org.apache.spark:spark-avro_2.12:3.5.0 --jars s3://<bucket_location>/config/hudi-utilities-bundle_2.12-0.12.2.jar,s3://<bucket_location>/config/hudi-spark3.3-bundle_2.12-0.12.2.jar --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.sql.catalogImplementation=hive --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
    }
}' \
--configuration-overrides '{
    "monitoringConfiguration": {
        "s3MonitoringConfiguration": {"logUri": "s3://<bucket_location>/elasticmapreduce/emr-containers"}
    }
}'
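For completeness, the `orders.properties` file referenced by `--props` looks roughly like the sketch below. The paths, field names, and SQL are placeholders standing in for my real values, but the keys are standard Hudi DeltaStreamer configuration properties:

```properties
# Hypothetical sketch of the --props file; field names and paths are placeholders.
hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.partitionpath.field=created_date
hoodie.deltastreamer.source.dfs.root=s3://<bucket_location>/raw/orders
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://<bucket_location>/schemas/orders.avsc
hoodie.deltastreamer.transformer.sql=SELECT * FROM <SRC> a
```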

I suspect the problem is the spark-avro package version I'm passing, but I haven't been able to get it working. Any help would be much appreciated.
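To illustrate what I mean (this is my own reasoning, not a verified fix): per the AWS release notes, `emr-6.10.0` ships Spark 3.3.x, while my `--packages` line pins `spark-avro_2.12:3.5.0`. A quick shell check of the two versions side by side:

```shell
# Compare the Spark version bundled with the EMR release against the
# spark-avro artifact pinned in --packages. The "3.3" value below is taken
# from the AWS emr-6.10.0 release notes, not queried at runtime.
emr_spark_version="3.3"                        # Spark minor version in emr-6.10.0
avro_pkg="org.apache.spark:spark-avro_2.12:3.5.0"
avro_version="${avro_pkg##*:}"                 # strip coordinates -> 3.5.0
avro_minor="${avro_version%.*}"                # strip patch level -> 3.5
if [ "$avro_minor" != "$emr_spark_version" ]; then
  echo "version mismatch: spark-avro $avro_version vs Spark $emr_spark_version"
fi
```

If that mismatch is really the cause, pinning spark-avro to the cluster's Spark minor version (3.3.x here) would be the thing to try.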
