SparkSQL with databricks xml lib: 'Malformed row'/UnboundPrefix on a valid xml

I'm running Spark 1.6.0 on Oracle JDK 1.8 (build 1.8.0_65-b17) in an IPython notebook session started with the following line:

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark --packages com.databricks:spark-xml_2.10:0.3.1

This pulls in the Databricks spark-xml package (https://github.com/databricks/spark-xml). I then run the following code in pyspark:

dmoz = '/Users/user/dummy.xml'
v=sqlContext.read.format('com.databricks.spark.xml').options(rowTag='Topic', failFast=True).load(dmoz)
print v.schema

where dummy.xml contains this tiny fragment of a DMOZ dump (http://rdf.dmoz.org/):

<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <!-- Generated at 2016-01-24 00:05:51 EST from DMOZ 2.0 -->
  <Topic r:id="">
    <catid>1</catid>
  </Topic>
</RDF>

This file validates against every validator I've been able to find. Nevertheless, the result is:

...

Py4JJavaError: An error occurred while calling o82.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.RuntimeException: Malformed row (failing fast): <Topic r:id="">    <catid>1</catid>  </Topic>
    at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:101)
    at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:83)

...

The trace refers to this line of code: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/InferSchema.scala#L101 . That line is evidently the handler for an XMLStreamException thrown by one of the javax.xml.stream classes used above it.

Unfortunately, the handler discards the details of the exception, so I can't tell what exactly is wrong with the row. However, removing the namespace prefix from the attribute (i.e. r:id becomes just id) makes the error go away. I suspect I've hit some common pitfall and just need to know which one.

UPD: I've compiled my own jar of the Databricks library with debug statements, and it turns out the problem is an unbound prefix:

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,16]
Message: http://www.w3.org/TR/1999/REC-xml-names-19990114#AttributePrefixUnbound?Topic&r:id&r
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)

What is the reason and how do I fix this?

Best answer

As you described in https://github.com/databricks/spark-xml/issues/74, this is a bug in the library.

I could reproduce the bug by running the following:

import com.databricks.spark.xml._  // brings the xmlFile extension method into scope

val testFile = "path-for-xml"
sqlContext.xmlFile(testFile, rowTag = "Topic").show()

The console output was:

11:25:32.517 WARN com.databricks.spark.xml.util.InferSchema$: Dropping malformed row: <Topic r:id="">        <catid>1</catid>    </Topic>
root
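
A likely explanation, sketched here with Python's standard library rather than with spark-xml's own parser: the error position [1,16] in the question suggests that each Topic row is parsed as a standalone fragment, so the xmlns:r declaration on the enclosing RDF element is no longer in scope. The names FULL_DOCUMENT, ROW_FRAGMENT and parse_status below are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The whole file from the question: xmlns:r is declared on the root element.
FULL_DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <Topic r:id="">
    <catid>1</catid>
  </Topic>
</RDF>
"""

# The row exactly as it appears in the "Malformed row" message:
# the prefix r is used, but its declaration stayed behind on <RDF>.
ROW_FRAGMENT = b'<Topic r:id="">    <catid>1</catid>  </Topic>'


def parse_status(xml_bytes):
    """Return "ok" if the input parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_bytes)
        return "ok"
    except ET.ParseError as exc:
        return str(exc)


print(parse_status(FULL_DOCUMENT))  # the complete document is well-formed
print(parse_status(ROW_FRAGMENT))   # the isolated row fails on the r: prefix
```

The complete document parses, while the isolated row is rejected for an unbound prefix. That matches the behaviour reported in the question: the file is valid, but the extracted row on its own is not.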

I opened a PR for this: https://github.com/databricks/spark-xml/pull/75.

For now, the PR simply makes the library ignore namespaces, although some options to control this may be needed later.

So it should be possible to read this XML file as of the next release, but I think we should come up with a better way to handle namespaces.
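
Until a release with that fix is available, one workaround consistent with the question's observation (renaming r:id to id makes the error go away) is to strip the attribute prefix from a copy of the file before loading it. This is only a rough sketch: strip_attr_prefix and clean_file are hypothetical helpers, not part of spark-xml. The approach is regex-based, so it would also rewrite matching text inside element content, and it leaves prefixed element names untouched.

```python
import io
import re


def strip_attr_prefix(text, prefix="r"):
    """Rewrite attributes such as r:id="..." to id="...".

    Only names preceded by whitespace are touched, so the namespace
    declaration xmlns:r="..." is left alone.
    """
    pattern = re.compile(r'(\s)' + re.escape(prefix) + r':([A-Za-z_][\w.-]*\s*=)')
    return pattern.sub(r'\1\2', text)


def clean_file(src_path, dst_path, prefix="r"):
    """Write a copy of src_path with the attribute prefix removed, line by line."""
    with io.open(src_path, encoding="utf-8") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_attr_prefix(line, prefix))
```

The cleaned copy can then be passed to sqlContext.read in place of the original path. Dropping the prefix loses the distinction between namespaced and plain attributes, so this only makes sense when the data does not rely on that distinction.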