How to write a PySpark DataFrame to XML format?

I'm working on a Glue ETL job that reads a DataFrame in PySpark and should output the data in XML format. I've searched a lot for a solution, and the job fails at the write statement shown below:

df.write.format('com.databricks.spark.xml').options(rowTag='book', rootTag='books').save('newbooks.xml')
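
For reference, here is a minimal, self-contained version of what the job does (the sample rows and column names below are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative stand-in for the DataFrame the real job reads.
    df = spark.createDataFrame(
        [("b1", "Some Title"), ("b2", "Another Title")],
        ["id", "title"],
    )

    # The job fails on this write with the errors listed below.
    df.write.format('com.databricks.spark.xml') \
        .options(rowTag='book', rootTag='books') \
        .save('newbooks.xml')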

The Glue version I'm currently using is Glue 3.0 (Spark 3.1, Scala 2, Python 3). Since I'm trying to use the Spark-XML library, I have tried including the following jars as dependencies in the Glue job:

spark-xml_2.10-0.3.5,
spark-xml_2.11-0.7.0,
spark-xml_2.12-0.14.0,
spark-xml_2.13-0.14.0

The different errors I'm seeing with different versions are as follows:

An error occurred while calling o92.save. java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp

An error occurred while calling o95.save. java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found

An error occurred while calling o95.save. scala/$less$colon$less

I found a similar question posted previously and tried those approaches, but they no longer seem to work. Has anyone faced a similar issue recently? If so, can you shed some light on the resolution?

1 Answer

First, check which Scala version your Spark build uses: if it is 2.11, go with spark-xml_2.11-0.7.0; if it is 2.12, go with spark-xml_2.12-0.14.0; and likewise for the rest. Errors like scala/$less$colon$less and the JFunction0 NoClassDefFoundError above are typical symptoms of a Scala binary-compatibility mismatch. Glue 3.0 runs Spark 3.1 built against Scala 2.12, so spark-xml_2.12-0.14.0 is the one to use here.
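
As a quick check, you can print the Scala version from PySpark itself. This goes through the internal _jvm gateway, so treat it as a diagnostic rather than a stable API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # scala.util.Properties.versionString() reports the Scala version the
    # Spark runtime was built against, e.g. "version 2.12.10" on Glue 3.0.
    print(spark.sparkContext._jvm.scala.util.Properties.versionString())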

spark-xml itself depends on a few other jars. Include those alongside your spark-xml jar as well:

  1. commons-io 2.11.0
  2. txw2 3.0.2
  3. xmlschema-core 2.3.0

Note: you can try different versions of the above dependency jars as well; the versions listed here are the ones that match spark-xml_2.12-0.14.0.
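
One way to attach all four jars to the Glue job is the --extra-jars job parameter, which takes comma-separated S3 paths. A minimal sketch using boto3, where the bucket, key names, and job name are placeholders for your own:

    import boto3

    glue = boto3.client("glue")

    # Placeholder S3 locations -- upload the jars and point these at them.
    extra_jars = ",".join([
        "s3://my-bucket/jars/spark-xml_2.12-0.14.0.jar",
        "s3://my-bucket/jars/commons-io-2.11.0.jar",
        "s3://my-bucket/jars/txw2-3.0.2.jar",
        "s3://my-bucket/jars/xmlschema-core-2.3.0.jar",
    ])

    # Arguments passed at run time override the job's default arguments.
    glue.start_job_run(
        JobName="my-xml-etl-job",  # placeholder job name
        Arguments={"--extra-jars": extra_jars},
    )

You can also set the same paths once in the job's default arguments (the "Dependent JARs path" field in the Glue console) instead of passing them on every run.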

Hope this helps.

Reference: https://github.com/databricks/spark-xml/blob/master/build.sbt