I am trying to read XML data from Kafka topic using Spark Structured streaming.
I tried using the Databricks spark-xml package, but I got an error saying that this package does not support streamed reading. Is there any way I can extract XML data from Kafka topic using structured streaming?
My current code:
df = spark \
    .readStream \
    .format("kafka") \
    .format('com.databricks.spark.xml') \
    .options(rowTag="MainElement") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option(subscribeType, "test") \
    .load()
The error:
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:234)
You call format twice, and the last one, com.databricks.spark.xml, wins and becomes the streaming source (hiding Kafka as the source). In other words, the above is equivalent to .format('com.databricks.spark.xml') alone.
As you may have experienced, the Databricks spark-xml package does not support streamed reading (i.e. it cannot act as a streaming source). The package is not for streaming.
You are left with accessing and processing the XML yourself, with a standard function or a UDF. There is no built-in support for streaming XML processing in Structured Streaming up to Spark 2.2.0.
That should not be a big deal anyway; the parsing itself is only a few lines of code.
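As a minimal sketch of the UDF approach in Python (the question uses PySpark): a plain parsing function built on the standard library's xml.etree.ElementTree, which could then be wrapped in a UDF and applied to the Kafka value column. The element and field names (MainElement, id, name) are assumptions for the example, not part of the question.

```python
import xml.etree.ElementTree as ET

def parse_main_element(xml_string):
    """Extract a few fields from a <MainElement> payload.

    The element and field names are made up for this example;
    adapt them to the real schema of the messages on the topic.
    """
    if xml_string is None:
        return None
    root = ET.fromstring(xml_string)
    return {
        "id": root.findtext("id"),
        "name": root.findtext("name"),
    }

# In Spark, the function would be wrapped in a UDF and applied to the
# Kafka value column after casting it to a string, e.g. (untested sketch):
#
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import MapType, StringType
#
#   parse_udf = udf(parse_main_element, MapType(StringType(), StringType()))
#   parsed = df.select(parse_udf(col("value").cast("string")).alias("parsed"))

sample = "<MainElement><id>1</id><name>test</name></MainElement>"
print(parse_main_element(sample))  # {'id': '1', 'name': 'test'}
```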
Another possible solution would be to write your own custom streaming Source that handled the XML format inside def getBatch(start: Option[Offset], end: Offset): DataFrame. That is supposed to work.