XML parsing using Spark


I have a table in Hive with two columns: id (int) and xml_column (string). xml_column actually holds XML, but it is stored as a string.

+------+--------------------+
|  id  |      xml_column    |
+------+--------------------+
| 6723 |<?xml version="1....|
| 6741 |<?xml version="1....|
| 6774 |<?xml version="1....|
+------+--------------------+

My question is: I would like to parse this XML and split it into a schema format using Spark (Scala). Can anyone help me with how to handle this? I tried the Databricks spark-xml library, but it works on XML files.

Or, is there any way to convert this string column to JSON? I have a JSON parser that can handle that.
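For reference, newer releases of the spark-xml library (0.5.0 and later, newer than what the answer below targets) expose a `from_xml` function that parses an XML string column directly, with no files involved. A hedged sketch, assuming `df` is the DataFrame read from the Hive table and that the library version in use provides `from_xml` and `schema_of_xml`:

```scala
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.sql.functions.col
import spark.implicits._

// Infer a schema from the XML payloads themselves, then parse every row.
// The resulting struct fields depend entirely on the actual XML content.
val payloadSchema = schema_of_xml(df.select("xml_column").as[String])
val parsed = df.withColumn("parsed", from_xml(col("xml_column"), payloadSchema))
```

After this, `parsed.select("id", "parsed.*")` would flatten the struct into ordinary columns.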

1 Answer


I am using Spark version 2.3.

Prerequisites:

  • Brickhouse UDF jar
  • Databricks spark-xml jar
  • XML schema

You can make use of the below:

    import org.apache.spark.sql._
    import com.databricks.spark.xml._

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Register a Brickhouse UDF to read the array-type variables
    sqlContext.sql("""CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange'""")

    // First pass: let spark-xml infer the schema from the XML file
    val df1 = sqlContext.read.format("com.databricks.spark.xml")
      .option("rowTag", "<parent tag>")
      .load("hdfs:<path to xml file>")
    val schema = df1.schema

    // Second pass: re-read with the inferred schema applied explicitly
    val df2 = sqlContext.read.format("com.databricks.spark.xml")
      .option("rowTag", "<parent tag>")
      .schema(schema)
      .load("hdfs:<path to schema file>")
    df2.registerTempTable("df3")
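If the XML can only be reached as a string column (as in the original question) and upgrading the library is not an option, each payload can also be parsed row-by-row inside a UDF using the JDK's built-in DOM parser. A minimal, Spark-free sketch of the parsing step; the element names are illustrative, and `extractTag` is a hypothetical helper that could be registered as a UDF and applied to `xml_column`:

```scala
import javax.xml.parsers.DocumentBuilderFactory
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

// Parse one XML payload and pull out the text of the first element
// with the given tag name, if present.
def extractTag(xml: String, tag: String): Option[String] = {
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  val doc = builder.parse(
    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
  val nodes = doc.getElementsByTagName(tag)
  if (nodes.getLength > 0) Some(nodes.item(0).getTextContent) else None
}

// extractTag("<?xml version=\"1.0\"?><rec><name>abc</name></rec>", "name")
// returns Some("abc")
```

One such UDF per field is crude but avoids any external dependency; spark-xml remains the better fit when the whole schema is needed.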