I'm working on a Glue ETL job that reads a DataFrame in PySpark and should output the data in XML format. I've searched extensively for a solution; the code fails at the write statement shown below:
df.write.format('com.databricks.spark.xml').options(rowTag='book', rootTag='books').save('newbooks.xml')
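For context, a minimal sketch of the kind of job that reproduces this (the DataFrame contents below are placeholders, not my actual data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xml-export").getOrCreate()

# Any small DataFrame hits the same failure on .save()
df = spark.createDataFrame(
    [("1984", "George Orwell"), ("Dune", "Frank Herbert")],
    ["title", "author"],
)

df.write.format('com.databricks.spark.xml') \
    .options(rowTag='book', rootTag='books') \
    .save('newbooks.xml')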
The Glue version I'm currently using is Glue 3.0 - Spark 3.1, Scala 2, Python 3. Since I'm trying to use the spark-xml library, I have tried including the following jars as dependencies of the Glue script:
spark-xml_2.10-0.3.5,
spark-xml_2.11-0.7.0,
spark-xml_2.12-0.14.0,
spark-xml_2.13-0.14.0
The different errors I'm seeing with different versions are as follows:
An error occurred while calling o92.save. java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
An error occurred while calling o95.save. java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found
An error occurred while calling o95.save. scala/$less$colon$less
I've found a similar question posted previously by someone else and tried those approaches, but they don't seem to work anymore. Has anyone faced a similar issue recently? If so, can you shed some light on the resolution?
First, check which Scala version your Spark build uses: if it is 2.11, go with spark-xml_2.11-0.7.0; if it is 2.12, go with spark-xml_2.12-0.14.0; and likewise for the rest. Glue 3.0 (Spark 3.1) runs on Scala 2.12, so spark-xml_2.12-0.14.0 is the build that matches your runtime; jars built for other Scala versions typically fail with binary-compatibility errors such as the scala/$less$colon$less one you posted.
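If you are unsure which Scala version the Glue runtime ships, you can print it from inside the job itself; a quick sketch using the py4j gateway (assuming a SparkSession named spark, as in a standard PySpark script):

# Scala version of the running JVM; Glue 3.0 should report 2.12.x
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# Spark version, e.g. 3.1.x on Glue 3.0
print(spark.version)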
Also note that spark-xml has transitive dependencies on other jars; include those alongside your spark-xml jar as well (the exact artifacts are listed in the build.sbt referenced below).
Note: you can try those dependency jars with different versions as well; the versions pinned in build.sbt are the ones suitable for spark-xml_2.12-0.14.0. See the sketch below for one way to wire everything into the job.
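As a rough illustration of wiring the jars in, here is a boto3 sketch that attaches them through the --extra-jars job parameter; the job name, IAM role, bucket, script path, and the dependency jar names/versions are placeholders and should be checked against the build.sbt for your spark-xml release:

import boto3

glue = boto3.client("glue")

# Placeholder S3 paths; verify the dependency artifacts and versions in build.sbt
extra_jars = ",".join([
    "s3://my-bucket/jars/spark-xml_2.12-0.14.0.jar",
    "s3://my-bucket/jars/commons-io-2.11.0.jar",
    "s3://my-bucket/jars/txw2-2.3.4.jar",
    "s3://my-bucket/jars/xmlschema-core-2.2.5.jar",
])

glue.create_job(
    Name="xml-export-job",                           # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # placeholder IAM role
    GlueVersion="3.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/xml_export.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--extra-jars": extra_jars},
)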
Hope this helps.
Reference - https://github.com/databricks/spark-xml/blob/master/build.sbt