repartition not working with xml file in Spark


I have a DataFrame that I want to save as multiple XML files. This is my code:

employees
    .repartition(col("first_name"))
    .write()
    .option("maxRecordsPerFile", 5)
    .mode(SaveMode.Overwrite)
    .partitionBy("first_name")
    .format("xml")
    .save("C:/spark_output/");

I'm expecting to see output like this:

spark_output/
  first_name=Alex
    part-00000.xml
    part-00001.xml
  first_name=Mike
    part-00000.xml
    part-00001.xml
  first_name=Nicole
    part-00000.xml
    part-00001.xml

But the output contains only one file with 10 rows.

I don't understand what is going on. How can I fix this?

Any advice would be highly appreciated. Thanks


There is 1 best solution below

Zach King On

.partitionBy is not supported by spark-xml (Databricks' open-source XML data source), so the writer puts everything into one output instead of creating first_name=... directories. Support does not appear to be on the project's roadmap on GitHub:

https://github.com/databricks/spark-xml/issues/327
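
If the data is small enough to fit on the driver (the question mentions 10 rows), one possible fallback is to build the partition directories yourself. The following is only a minimal sketch in plain Java, not something spark-xml provides: the class PartitionedXmlLayout, its methods, the last_name column, and the ROWS/ROW element names are all assumptions made for illustration. It groups rows by first_name, splits each group into files of at most 5 records, and writes them under first_name=<value>/part-0000N.xml. In a real job the rows could come from something like employees.collectAsList(), which is only reasonable for small datasets.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Spark or spark-xml.
// Each row is {first_name, last_name}; last_name is an assumed example column.
class PartitionedXmlLayout {

    // Groups rows by first_name (index 0) and splits every group into chunks
    // of at most maxPerFile rows. Keys of the result are relative file paths.
    static Map<String, List<String[]>> plan(List<String[]> rows, int maxPerFile) {
        Map<String, List<String[]>> byName = new TreeMap<>();
        for (String[] row : rows) {
            byName.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        Map<String, List<String[]>> files = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> e : byName.entrySet()) {
            List<String[]> group = e.getValue();
            for (int start = 0, part = 0; start < group.size(); start += maxPerFile, part++) {
                int end = Math.min(start + maxPerFile, group.size());
                // Assumes the partition value is safe to use as a directory name.
                String path = String.format(Locale.ROOT,
                        "first_name=%s/part-%05d.xml", e.getKey(), part);
                files.put(path, new ArrayList<>(group.subList(start, end)));
            }
        }
        return files;
    }

    // Writes one small XML document per planned file under outputDir.
    // The partition column lives in the directory name, so it is not repeated
    // inside the file.
    static void write(Path outputDir, Map<String, List<String[]>> files) throws IOException {
        for (Map.Entry<String, List<String[]>> e : files.entrySet()) {
            StringBuilder xml = new StringBuilder("<ROWS>\n");
            for (String[] row : e.getValue()) {
                xml.append("  <ROW><last_name>")
                   .append(escape(row[1]))
                   .append("</last_name></ROW>\n");
            }
            xml.append("</ROWS>\n");
            Path target = outputDir.resolve(e.getKey());
            Files.createDirectories(target.getParent());
            Files.write(target, xml.toString().getBytes(StandardCharsets.UTF_8));
        }
    }

    // Escapes the characters that are not allowed in XML element text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            rows.add(new String[] {"Alex", "Surname" + i});
        }
        rows.add(new String[] {"Mike", "Jones"});

        Map<String, List<String[]>> files = plan(rows, 5);
        Path out = Files.createTempDirectory("spark_output");
        write(out, files);
        files.keySet().forEach(p -> System.out.println(out.resolve(p)));
    }
}
```

This keeps the directory layout from the question, but it gives up Spark's parallel write, so it is not a substitute for real partitionBy support on large data.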