XML Parsing with Spark-XML

I have an XML file like this:

<IdentUebersetzungen>
    <IdentUebersetzung IdentUebersetzungName="ABT">
        <Lables>
            <Lable ServiceShortName="TABROW_OperaAndDisplUnit1SparePartNumbe" LableName="SGIDK2_HW"/>
            <Lable ServiceShortName="TABROW_OperaAndDisplUnit1HardwNumbe" LableName="SGIDK2"/>
            <Lable ServiceShortName="TABROW_OperaAndDisplUnit1AppliSoftwVersiNumbe" LableName="ZIF"/>
            <Lable ServiceShortName="TABROW_OperaAndDisplUnit1HardwVersiNumbe" LableName="BRIF"/>
            <Lable ServiceShortName="TABROW_OperaAndDisplUnit1SeriaNumbe" LableName="SERNR"/>
        </Lables>
    </IdentUebersetzung>
    <IdentUebersetzung IdentUebersetzungName="Batt">
        <Lables>
            <Lable ServiceShortName="Batt_ECUHardwNumbe" LableName="SGIDK2_HW"/>
            <Lable ServiceShortName="Batt_SparePartNumbe" LableName="SGIDK2"/>
            <Lable ServiceShortName="Batt_ApplSwVerCount" LableName="ZIF"/>
            <Lable ServiceShortName="Batt_ECUHardwVersiNumbe" LableName="BRIF"/>
            <Lable ServiceShortName="Batt_SeriaNumbe" LableName="SERNR"/>
        </Lables>
    </IdentUebersetzung>
</IdentUebersetzungen>

I used Spark-XML version com.databricks:spark-xml_2.12:0.15.0

df = (spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "IdentUebersetzung")
    .option("attributePrefix", "")
    .load("xxxxxx"))
df.show()

I got the following output:

+---------------------+------------+
|IdentUebersetzungName|Lables      |
+---------------------+------------+
|ABT                  |{null, null}|
|Batt                 |{null, null}|
+---------------------+------------+

Can someone tell me:

  1. Why does the column "Lables" contain only null values?

  2. I want the XML attribute values of "IdentUebersetzungName", "ServiceShortName" and "LableName" in the DataFrame. Can I do that with Spark-XML?

I tried com.databricks:spark-xml_2.12:0.15.0, and it does not seem to handle nested XML very well.

There is 1 best solution below

When attributePrefix is set to "", the nested attributes are not parsed properly, which may be a bug in Spark-XML. As a workaround, you can read the file with a non-empty attribute prefix and rename the columns afterwards, as in the code below.

# Read with a non-empty attribute prefix; filePath is the path to the XML file.
df = (spark.read.format("com.databricks.spark.xml")
          .option("rowTag", "IdentUebersetzung")
          .option("rootTag", "IdentUebersetzung")
          .option("attributePrefix", "Attr_")
          .load(filePath))

df.printSchema()

# Explode the array of Lable structs so that each <Lable> element becomes its own row.
newDf = df.selectExpr("Attr_IdentUebersetzungName", "explode(Lables.Lable) as Lable")

# Alias the columns to drop the "Attr_" prefix again.
newDf.selectExpr("Attr_IdentUebersetzungName as IdentUebersetzungName",
                     "Lable.Attr_LableName as LableName",
                     "Lable.Attr_ServiceShortName as ServiceShortName").show(truncate=False)
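
To check which rows the flattened DataFrame should contain, independently of Spark, the same flattening can be sketched with Python's standard library. This is only a cross-check under a few assumptions, not part of Spark-XML: it uses xml.etree.ElementTree, a shortened copy of the sample from the question with the root element closed as </IdentUebersetzungen>, and a hypothetical helper name flatten_labels.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (assumption: the root
# element is closed with </IdentUebersetzungen>).
SAMPLE_XML = """
<IdentUebersetzungen>
    <IdentUebersetzung IdentUebersetzungName="ABT">
        <Lables>
            <Lable ServiceShortName="TABROW_OperaAndDisplUnit1HardwNumbe" LableName="SGIDK2"/>
            <Lable ServiceShortName="TABROW_OperaAndDisplUnit1SeriaNumbe" LableName="SERNR"/>
        </Lables>
    </IdentUebersetzung>
    <IdentUebersetzung IdentUebersetzungName="Batt">
        <Lables>
            <Lable ServiceShortName="Batt_SeriaNumbe" LableName="SERNR"/>
        </Lables>
    </IdentUebersetzung>
</IdentUebersetzungen>
"""


def flatten_labels(xml_text):
    """Return one (IdentUebersetzungName, LableName, ServiceShortName) tuple per <Lable>.

    Hypothetical helper for illustration; it mirrors the explode-and-select
    step of the Spark code, but runs on plain Python objects.
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for ident in root.findall("IdentUebersetzung"):
        name = ident.get("IdentUebersetzungName")
        # Each <Lable> under <Lables> becomes its own row, like explode().
        for lable in ident.findall("Lables/Lable"):
            rows.append((name, lable.get("LableName"), lable.get("ServiceShortName")))
    return rows


rows = flatten_labels(SAMPLE_XML)
for row in rows:
    print(row)
```

If the Spark result has the same number of rows and the same values per IdentUebersetzungName, the attribute prefix workaround is parsing the nested attributes as intended.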