I am trying to write a PySpark DataFrame to S3. This is my configuration:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config('spark.master', 'local')
    .config('spark.app.name', 'Demo')
    .config('spark.jars.packages',
            'org.apache.hadoop:hadoop-aws:3.2.4,org.apache.hadoop:hadoop-common:3.2.4')
    .config('spark.hadoop.fs.s3a.aws.credentials.provider',
            'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
    .getOrCreate()
)
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", '')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", '')
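For context, the write itself is the CSV call that appears in the traceback below. Only the write line is taken verbatim from the trace; the s3Path construction here is a simplified sketch of how I build it from the bucket, prefix and file name:

s3Path = f's3a://{bucketName}/{prefixName}/{fileName}'  # illustrative; names taken from the traceback
flightsDelatDfOrgDestMaxDist.write.csv(s3Path, mode='overwrite', header=True)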
When I run my code with python app.py, the script runs fine and the data is uploaded to S3.
But when I run the same script with spark-submit app.py, I get the error below:
Traceback (most recent call last):
File "/home/usr/sparkCode/EndToEndProj1/app.py", line 94, in <module>
s3Interaction(bucketName, prefixName, fileName, write=True)
File "/home/usr/sparkCode/EndToEndProj1/app.py", line 22, in s3Interaction
flightsDelatDfOrgDestMaxDist.write.csv(s3Path, mode='overwrite', header=True)
File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1864, in csv
File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:454)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:530)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:850)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
... 25 more
When I run it through python, I can see the dependencies being downloaded, but when I run it using spark-submit, I do not see any dependencies being downloaded. I believe spark-submit is not downloading the .jar files, and that is what raises this error.
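If that is the case, I would expect to have to pass the packages on the command line instead, something like this (untested sketch, using the same coordinates as in my builder config):

spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.4,org.apache.hadoop:hadoop-common:3.2.4 app.py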
I tried to find the answer on this thread, but it's not clear to me.
I am confused: why is the execution behavior different between the two runs? How do python and spark-submit differ from each other?
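For reference, this is the kind of check I could add to app.py to compare what each launch mode actually sees (sketch; the second argument to get() is just a default for when the key is unset):

conf = spark.sparkContext.getConf()
print('spark.jars.packages =', conf.get('spark.jars.packages', 'not set'))
print('spark.master =', conf.get('spark.master', 'not set'))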