I am using spark-xml to parse a large XML document that contains a few user-defined entities. Here is a short snippet from the file:
<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<r_ele>
<reb>ヽ</reb>
</r_ele>
<sense>
<pos>&unc;</pos>
<gloss g_type="expl">repetition mark in katakana</gloss>
</sense>
<sense>
<gloss xml:lang="dut">hitotsuten 一つ点: teken dat herhaling van het voorafgaande katakana-schriftteken aangeeft</gloss>
</sense>
</entry>
</JMdict>
The entities are correctly defined in the inline DTD at the top of the XML document, for example:
<!ENTITY unc "unclassified">
However, parsing fails during the schema inference phase.
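For reference, this is roughly how I load the file (the path and session setup are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// rowTag tells spark-xml which element delimits one record.
val df = spark.read
  .format("xml")
  .option("rowTag", "entry")
  .load("JMdict.xml") // illustrative path

df.printSchema()

The inferred schema is just: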
root
|-- _corrupt_record: string (nullable = true)
The culprit seems to be the user-defined entities: when I escape them, everything works again.
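For example, escaping only the ampersand in the pos element from the snippet above:

<pos>&amp;unc;</pos>

With that change, schema inference succeeds: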
root
|-- ent_seq: string (nullable = true)
|-- r_ele: struct (nullable = true)
| |-- reb: string (nullable = true)
|-- sense: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- gloss: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _g_type: string (nullable = true)
| | | |-- _lang: string (nullable = true)
| | |-- pos: string (nullable = true)
How can I address this?
Yes, spark-xml is not going to do things like read ENTITY directives. The reason is that you can't really throw a regular XML parser at huge amounts of XML; and if you could, there would be no need for Spark or spark-xml in the first place.

What spark-xml does is 'parse' the XML only enough to find the few subsets of it that you are interested in (the row tags), and then it passes each of those fragments on to a full-fledged XML parser (StAX). So, within your row tag, the XML is parsed correctly. However, the ENTITY declarations sit in the DTD at the root of the document, so StAX never sees them.

Indeed, the typical use case here isn't even one big document but many, which could each have different directives.
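If you don't want to hand-escape the references, one workaround is to expand the entities yourself before handing the file to spark-xml, since the declarations are right there in the inline DTD. A rough sketch (file names are placeholders; the regex assumes the simple <!ENTITY name "value"> form JMdict uses, and the whole file is read into memory, which is fine at JMdict's size but not for arbitrarily large inputs):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.matching.Regex

object ExpandEntities {
  def main(args: Array[String]): Unit = {
    // Read the whole document; JMdict declares its entities in an inline DTD.
    val raw = new String(Files.readAllBytes(Paths.get("JMdict.xml")), StandardCharsets.UTF_8)

    // Collect the <!ENTITY name "value"> declarations.
    val decl: Regex = """<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>""".r
    val entities: Map[String, String] =
      decl.findAllMatchIn(raw).map(m => m.group(1) -> m.group(2)).toMap

    // Expand every &name; reference that was declared in the DTD,
    // leaving the predefined XML entities (&amp;, &lt;, ...) untouched.
    val ref: Regex = """&(\w+);""".r
    val expanded = ref.replaceAllIn(raw, m =>
      Regex.quoteReplacement(entities.getOrElse(m.group(1), m.matched)))

    Files.write(Paths.get("JMdict.expanded.xml"), expanded.getBytes(StandardCharsets.UTF_8))
  }
}

Pointing spark.read.format("xml").option("rowTag", "entry") at the expanded file should then infer the same schema you got with manual escaping.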