Using HXT with an XML document including a specification header

Question

Asked by Rob Stewart on 31 October 2013 at 12:23 (201 views)

I'd like to iron out a bug in the rdf4h library that I currently maintain. Its XmlParser module parses XML/RDF documents into RDF graphs, but it fails on XML/RDF documents that begin with an XML specification header (the XML declaration), e.g.

<?xml version="1.0" encoding="ISO-8859-1"?>

The parser uses the HXT arrow interface, namely the Text.XML.HXT.Core module. I have boiled the problem down to two parsing attempts, made in the functions testSuccess and testFailure. Both use runSLA. The author of hxt tells me that the problem lies in the use of xread, and that I should first extract the XML document from the string before passing it to xread. (Unfortunately, he hasn't responded on the GitHub issue I raised about this.)

Below, there are two strings, both containing the same XML document. The xmlDoc1 string includes a specification header, which trips up the xread arrow in testFailure.

module HXTProblem where

import Text.XML.HXT.Core

data GParseState = GParseState { stateGenId :: Int } deriving(Show)

-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

initState :: GParseState
initState = GParseState { stateGenId = 0 }

-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2

{- output of running testSuccess
(GParseState {stateGenId = 0},[NTree (XTag "shiporder" [NTree (XAttr "orderid") [NTree (XText "889923") []],NTree (XAttr "xmlns:xsi") [NTree (XText "http://www.w3.org/2001/XMLSchema-instance") []],NTree (XAttr "xsi:noNamespaceSchemaLocation") [NTree (XText "shiporder.xsd") []]]) [NTree (XTag "orderperson" []) [NTree (XText "John Smith") []],NTree (XTag "shipto" []) [NTree (XTag "name" []) [NTree (XText "Ola Nordmann") []]]]]
-}

-- | Does not work
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA xread initState xmlDoc1

{- ERROR running testFailure
(GParseState {stateGenId = 0},[NTree (XError 2 "\"string: \"<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1...\"\" (line 1, column 6):\nunexpected xml\nexpecting legal XML name character\n") []])
-}

I should add that I am looking for a solution using runSLA that generates the same XmlTree when parsing either xmlDoc1 or xmlDoc2.
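
One possible reading of the suggestion to extract the XML document from the string before xread is to drop the declaration by hand. The following is only a minimal sketch under that assumption: stripXmlDecl is a hypothetical helper, not part of HXT and not something the hxt author proposed. It uses only base, does no real XML parsing, and ignores the declared encoding, since the input is already a decoded String.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- | Hypothetical helper (not part of HXT): drop a leading XML declaration
-- such as <?xml version="1.0" encoding="ISO-8859-1"?> from a document string.
-- Input that does not start with a declaration is returned unchanged.
stripXmlDecl :: String -> String
stripXmlDecl input
  | "<?xml" `isPrefixOf` input
  , (c : _) <- afterName
  , isSpace c                 -- rules out targets such as <?xml-stylesheet
  = dropDecl afterName
  | otherwise = input
  where
    afterName = drop 5 input  -- everything after the literal "<?xml"

    -- Skip up to and including the closing "?>", then any whitespace after it.
    dropDecl ('?' : '>' : rest) = dropWhile isSpace rest
    dropDecl (_ : rest)         = dropDecl rest
    dropDecl []                 = input  -- unterminated declaration: leave input as is

main :: IO ()
main = putStrLn (stripXmlDecl "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><shiporder/>")
```

With Text.XML.HXT.Core in scope, such a helper could presumably be lifted into the existing pipeline as runSLA (arr stripXmlDecl >>> xread) initState xmlDoc1, so that both documents reach xread in the same form.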

**Rob Stewart** · Accepted Answer · 2013-11-06T17:02:23.013000

Hurray, this has been solved. The author of the HXT library has addressed the GitHub issue and added a new parser, xreadDoc, in this commit. I've fixed the rdf4h library (version 1.2.2 and up) to use this new parser in this commit, so XML/RDF documents with specification and encoding headers can now be parsed with the XmlParser.

Note the new arrow composition in testFailure: (xreadDoc >>> isElem).

module HXTProblem where

import Text.XML.HXT.Core

data GParseState = GParseState { stateGenId :: Int } deriving(Show)

-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

initState :: GParseState
initState = GParseState { stateGenId = 0 }

-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2

-- | Does also now work!
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA (xreadDoc >>> isElem) initState xmlDoc1

testEquality :: Bool
testEquality =
    let (_,x) = testSuccess
        (_,y) = testFailure
    in x == y
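
To see why the isElem filter is part of the composition, it can help to look at everything xreadDoc returns before filtering. The sketch below is an illustration, not code from rdf4h: sampleDoc and describe are made-up names, and it uses the stateless runLA instead of runSLA only to keep the example short. It assumes an hxt version that already provides xreadDoc. Which prolog nodes show up in the unfiltered list is left for the program to print, not asserted here.

```haskell
import Text.XML.HXT.Core

-- Illustrative input: an XML declaration followed by a single root element.
sampleDoc :: String
sampleDoc =
  "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
  "<order id=\"1\"><item>tea</item></order>"

-- | Every top-level node xreadDoc produces, prolog nodes included.
allNodes :: [XmlTree]
allNodes = runLA xreadDoc sampleDoc

-- | Only the element nodes, as in the (xreadDoc >>> isElem) composition.
elementNodes :: [XmlTree]
elementNodes = runLA (xreadDoc >>> isElem) sampleDoc

-- | Hypothetical helper: a coarse label for a node, built from HXT's
-- predicate arrows. A predicate arrow yields its input if it matches
-- and an empty list otherwise.
describe :: XmlTree -> String
describe t
  | matches isElem = "element"
  | matches isPi   = "processing instruction"
  | matches isText = "text"
  | otherwise      = "other"
  where
    matches p = not (null (runLA p t))

main :: IO ()
main = do
  mapM_ (putStrLn . describe) allNodes
  putStrLn ("elements kept: " ++ show (length elementNodes))
```

Filtering with isElem keeps only the root element, which is what makes the result comparable with the tree that xread produces for the declaration-free document.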
