GCS Connector on EMR failing with java.lang.ClassNotFoundException

I have created an EMR cluster following the instructions provided here on how to set up a connection to GCS, and I keep running the hadoop distcp command.
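
The command is essentially of this shape; the bucket names and paths below are placeholders, not the actual ones (the s3a metrics lines at the end of the log suggest an S3 source):

hadoop distcp \
  s3://my-source-bucket/some/path \
  gs://my_bucket/some/path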

It keeps failing with the following error:

2023-07-25 12:00:40,113 INFO mapreduce.Job: Task Id : attempt_1690268608656_0012_m_000002_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2637)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3324)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3356)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3407)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3375)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:163)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:809)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2541)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2635)
    ... 17 more

2023-07-25 12:00:47,139 INFO mapreduce.Job:  map 100% reduce 0%
2023-07-25 12:00:49,149 INFO mapreduce.Job: Job job_1690268608656_0012 failed with state FAILED due to: Task failed task_1690268608656_0012_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

2023-07-25 12:00:49,216 INFO mapreduce.Job: Counters: 12
    Job Counters 
        Failed map tasks=11
        Killed map tasks=20
        Launched map tasks=12
        Other local map tasks=12
        Total time spent by all maps in occupied slots (ms)=5936160
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=61835
        Total vcore-milliseconds taken by all map tasks=61835
        Total megabyte-milliseconds taken by all map tasks=189957120
    Map-Reduce Framework
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
2023-07-25 12:00:49,218 ERROR tools.DistCp: Exception encountered 
java.io.IOException: DistCp failure: Job job_1690268608656_0012 has failed: Task failed task_1690268608656_0012_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

    at org.apache.hadoop.tools.DistCp.waitForJobCompletion(DistCp.java:230)
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:185)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
2023-07-25 12:00:49,225 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2023-07-25 12:00:49,226 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
2023-07-25 12:00:49,226 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

Please comment if you need any more information about this.

I downloaded the latest gcs-connector jar and the GCS service account JSON file, then did the following steps for manual setup:

  1. Updated core-site.xml with the following properties:
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/tmp/service_account.json</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.gs.status.parallel.enable</name>
    <value>true</value>
  </property>

  2. Updated HADOOP_CLASSPATH with the gcs-connector jar location.
  3. Added the gcs-connector jar location to the mapreduce.application.classpath property in mapred-site.xml (a sketch of steps 2 and 3 follows after the Spark properties below).
  4. Added the following properties for Spark:
# The AbstractFileSystem for 'gs:' URIs
spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS

# Optional. Google Cloud Project ID with access to GCS buckets.
# Required only for list buckets and create bucket operations.
spark.hadoop.fs.gs.project.id=

# Whether to use a service account for GCS authorization. Setting this
# property to `false` will disable use of service accounts for authentication.
spark.hadoop.google.cloud.auth.service.account.enable=true

# The JSON keyfile of the service account used for GCS
# access when google.cloud.auth.service.account.enable is true.
spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile
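
For reference, steps 2 and 3 looked roughly like the sketch below. The jar location /opt/gcs-connector/gcs-connector-hadoop3-latest.jar is only an example path, and the existing mapreduce.application.classpath value on EMR may differ from the stock Hadoop default shown here; the point is simply that the connector jar gets appended to it.

# /etc/hadoop/conf/hadoop-env.sh (step 2: client-side classpath)
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/gcs-connector/gcs-connector-hadoop3-latest.jar

<!-- mapred-site.xml (step 3: classpath for the MapReduce task JVMs) -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/opt/gcs-connector/gcs-connector-hadoop3-latest.jar</value>
</property>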

I also made the gcs-connector jar executable, and I have tried both the regular jars linked from the Google docs site and the latest shaded jar.

Also, if I run hadoop fs -ls gs://my_bucket it works fine, and similarly hadoop fs -cp works fine. It only fails during the MapReduce job.
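
That pattern (client-side fs commands work, but the MapReduce tasks throw ClassNotFoundException) suggests the task JVMs on the worker nodes do not see the connector jar. Assuming DistCp accepts the standard Hadoop generic options via ToolRunner, one way to test that theory is to ship the jar with the job explicitly; the jar path here is a placeholder:

hadoop distcp \
  -libjars /opt/gcs-connector/gcs-connector-hadoop3-latest.jar \
  s3://my-source-bucket/some/path \
  gs://my_bucket/some/path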

1 Answer

The issue was that the gcs-connector jar and the gcs_service_account.json were only set up on the master node (namenode) and not on the workers. I had to set everything up on the workers as well. Lesson learnt: always use a bootstrap script when setting up EMR if there are extra dependencies to be used.
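
A minimal sketch of such a bootstrap action, assuming the connector jar is fetched from Google's public hadoop-lib bucket and the key file is staged in an S3 bucket you control (the S3 path and the /opt/gcs-connector directory are placeholders):

#!/bin/bash
# Runs on every node (master and core/task) when registered as an EMR bootstrap action.
set -euo pipefail

# Stage the GCS connector jar at the path referenced by the classpath settings above.
sudo mkdir -p /opt/gcs-connector
sudo curl -fsSL -o /opt/gcs-connector/gcs-connector-hadoop3-latest.jar \
  https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar

# Stage the service account key file referenced by
# google.cloud.auth.service.account.json.keyfile in core-site.xml.
aws s3 cp s3://my-config-bucket/service_account.json /tmp/service_account.json

The script would then be registered at cluster creation, for example with aws emr create-cluster ... --bootstrap-actions Path=s3://my-config-bucket/install-gcs-connector.sh, and the core-site.xml / mapred-site.xml / Spark properties from the question would typically be supplied through the EMR configurations JSON rather than edited by hand on each node.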