I'd like to iron out a bug in the rdf4h library, which I currently maintain. Its XmlParser module parses XML/RDF documents into RDF graphs, but it fails on XML/RDF documents that include an XML specification header, e.g.
<?xml version="1.0" encoding="ISO-8859-1"?>
The parser uses the HXT arrow interface, namely the Text.XML.HXT.Core module. I have boiled the problem down to two parsing attempts, made in the functions testSuccess and testFailure. Both use runSLA. The author of hxt tells me that the problem lies in the use of xread, and that I should first extract the XML document from the string before passing it to xread. (Unfortunately, he hasn't responded on the GitHub issue I raised about this.)
Below, there are two strings, both containing the same XML document. The xmlDoc1 string includes a specification header, which trips up the xread arrow in testFailure.
module HXTProblem where
import Text.XML.HXT.Core
data GParseState = GParseState { stateGenId :: Int } deriving(Show)
-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
          "<shipto>" ++
          "<name>Ola Nordmann</name>" ++
          "</shipto>" ++
          "</shiporder>"
-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
          "<shipto>" ++
          "<name>Ola Nordmann</name>" ++
          "</shipto>" ++
          "</shiporder>"
initState :: GParseState
initState = GParseState { stateGenId = 0 }
-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2
{- output of running testSuccess
(GParseState {stateGenId = 0},[NTree (XTag "shiporder" [NTree (XAttr "orderid") [NTree (XText "889923") []],NTree (XAttr "xmlns:xsi") [NTree (XText "http://www.w3.org/2001/XMLSchema-instance") []],NTree (XAttr "xsi:noNamespaceSchemaLocation") [NTree (XText "shiporder.xsd") []]]) [NTree (XTag "orderperson" []) [NTree (XText "John Smith") []],NTree (XTag "shipto" []) [NTree (XTag "name" []) [NTree (XText "Ola Nordmann") []]]]]
-}
-- | Does not work
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA xread initState xmlDoc1
{- ERROR running testFailure
(GParseState {stateGenId = 0},[NTree (XError 2 "\"string: \"<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1...\"\" (line 1, column 6):\nunexpected xml\nexpecting legal XML name character\n") []])
-}
I should add that I am looking for a solution using runSLA that generates the same XmlTree when parsing either xmlDoc1 or xmlDoc2.
Hurray, this has been solved. The author of the HXT library has addressed the GitHub issue and added a new parser, xreadDoc, in this commit. I've fixed the rdf4h library in version 1.2.2 and up, using this new parser in this commit, so XML/RDF documents (with spec and encoding headers) can now be parsed with the XmlParser.
Note the new arrow composition in testFailure, as (xreadDoc >>> isElem).
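For anyone who wants to try the composition in isolation, here is a minimal sketch of how I understand it to be used with runSLA. It assumes an hxt release recent enough to export xreadDoc from Text.XML.HXT.Core. The names xmlWithDecl, parseWithDecl and topLevelNames are made up for this example, and the document is a shortened stand-in for xmlDoc1.

```haskell
import Text.XML.HXT.Core

-- Parser state threaded through the SLA arrow (mirrors the type in the question).
data GParseState = GParseState { stateGenId :: Int } deriving (Show)

-- Hypothetical, shortened document that starts with an XML declaration.
xmlWithDecl :: String
xmlWithDecl =
  "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
  "<shiporder orderid=\"889923\">" ++
  "<orderperson>John Smith</orderperson>" ++
  "</shiporder>"

-- xreadDoc parses the whole document, prolog included; isElem then keeps
-- only element nodes, so non-element results (such as the node produced
-- for the declaration) are filtered out.
parseWithDecl :: (GParseState, [XmlTree])
parseWithDecl = runSLA (xreadDoc >>> isElem) (GParseState 0) xmlWithDecl

-- Names of the top-level elements that survive the filter.
topLevelNames :: [String]
topLevelNames =
  snd (runSLA (xreadDoc >>> isElem >>> getName) (GParseState 0) xmlWithDecl)
```

Evaluating topLevelNames in GHCi is a quick way to check that the declaration no longer trips up the parser. Note that isElem also discards XError nodes, so an empty result list is worth checking for explicitly if you need to report parse errors.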