I am using spark-xml to parse a large XML document that contains a few user-defined entities. Here is a short snippet from the file:
<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<r_ele>
<reb>ヽ</reb>
</r_ele>
<sense>
<pos>&unc;</pos>
<gloss g_type="expl">repetition mark in katakana</gloss>
</sense>
<sense>
<gloss xml:lang="dut">hitotsuten 一つ点: teken dat herhaling van het voorafgaande katakana-schriftteken aangeeft</gloss>
</sense>
</entry>
</JMdict>
The entities are correctly defined in the inline DTD at the top of the XML document, for example:
<!ENTITY unc "unclassified">
However, parsing fails during the schema inference phase.
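For reference, this is roughly how I load the file (the path and session setup are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// rowTag tells spark-xml which element delimits one record.
val df = spark.read
  .format("xml")
  .option("rowTag", "entry")
  .load("JMdict.xml") // illustrative path

df.printSchema()

The inferred schema is just: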
root
|-- _corrupt_record: string (nullable = true)
The culprit seems to be the user-defined entities: when I escape them, everything works again.
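For example, escaping only the ampersand in the pos element from the snippet above:

<pos>&amp;unc;</pos>

With that change, schema inference succeeds: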
root
|-- ent_seq: string (nullable = true)
|-- r_ele: struct (nullable = true)
| |-- reb: string (nullable = true)
|-- sense: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- gloss: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _g_type: string (nullable = true)
| | | |-- _lang: string (nullable = true)
| | |-- pos: string (nullable = true)
How can I address this?
Yes, spark-xml is not going to do things like read ENTITY directives. The reason is that you can't really throw a regular XML parser at huge amounts of XML; and if you could, there would be no need for Spark or spark-xml in the first place.

What spark-xml does is 'parse' the XML only enough to find the few subsets of it that you are interested in (the row tags), and then it passes each of those fragments on to a full-fledged XML parser (StAX). So, within your row tag, the XML is parsed correctly. However, the ENTITY declarations sit in the DTD at the root of the document, so StAX never sees them.

Indeed, the typical use case here isn't even one big document but many, which could each have different directives.
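If you don't want to hand-escape the references, one workaround is to expand the entities yourself before handing the file to spark-xml, since the declarations are right there in the inline DTD. A rough sketch (file names are placeholders; the regex assumes the simple <!ENTITY name "value"> form JMdict uses, and the whole file is read into memory, which is fine at JMdict's size but not for arbitrarily large inputs):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.matching.Regex

object ExpandEntities {
  def main(args: Array[String]): Unit = {
    // Read the whole document; JMdict declares its entities in an inline DTD.
    val raw = new String(Files.readAllBytes(Paths.get("JMdict.xml")), StandardCharsets.UTF_8)

    // Collect the <!ENTITY name "value"> declarations.
    val decl: Regex = """<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>""".r
    val entities: Map[String, String] =
      decl.findAllMatchIn(raw).map(m => m.group(1) -> m.group(2)).toMap

    // Expand every &name; reference that was declared in the DTD,
    // leaving the predefined XML entities (&amp;, &lt;, ...) untouched.
    val ref: Regex = """&(\w+);""".r
    val expanded = ref.replaceAllIn(raw, m =>
      Regex.quoteReplacement(entities.getOrElse(m.group(1), m.matched)))

    Files.write(Paths.get("JMdict.expanded.xml"), expanded.getBytes(StandardCharsets.UTF_8))
  }
}

Pointing spark.read.format("xml").option("rowTag", "entry") at the expanded file should then infer the same schema you got with manual escaping.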