Convert HTML strings in JSON input to XML nodes

I'm trying to parse a JSON file with BaseX 10.7 using the json:parse function.

Some values in my file contain HTML markup, for example the "text" value here:

   "order": 2,
   "page_id": 27,
   "text": "<p><strong>Présentation générale</strong></p>\r\n<p>L’ambon également nommé <em>pulpitium</em> (estrade) est une sorte de tribune élevée d’où sont proclamés les textes saints. Il est placé dans le chœur de l’église, généralement, du côté gauche.</p>\r\n<p>Dès la fin du IV<sup>e</sup> siècle, ce type de tribune, appelé <em>analogium</em>...<em>Bernard Berthod</em></h4>"

But even before I parse the file, when I open it in BaseX, the output window shows that some characters have been replaced by entity references (for example, < becomes &lt;):

<order type="number">2</order><page__id type="number">27</page__id><text>&lt;p&gt;&lt;strong&gt;Présentation générale&lt;/strong&gt;&lt;/p&gt;&#xD;
&lt;p&gt;L’ambon également nommé &lt;em&gt;pulpitium&lt;/em&gt; (estrade) est une sorte de tribune élevée d’où sont proclamés les textes saints. Il est placé dans le chœur de l’église, généralement, du côté gauche.&lt;/p&gt;&#xD;..>

I suppose I have to tell BaseX to accept HTML markup?

I tried changing the parser options (JSON and HTML), but nothing changed.

Answered by Christian Grün

When you use json:parse, strings in the JSON structure…

{ "content": "<p>123</p>" }

…will be kept as string values in the converted XML:

<json type="object">
  <content>&lt;p&gt;123&lt;/p&gt;</content>
</json>

The serialized XML document contains &lt; and &gt; because the characters < and > in text are output as “entity references”. Otherwise, text containing < or > could no longer be distinguished from element markup. The string value stored in the element is unchanged.
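
To check this, one could compare the string value of the element with the original JSON string. This is only a small sketch, reusing the sample JSON from above:

```xquery
let $xml := json:parse('{ "content": "<p>123</p>" }')
(: string() returns the text as stored, without serialization escaping :)
let $value := string($xml/json/content)
return (
  $value = '<p>123</p>',   (: true: the stored string still contains < and > :)
  string-length($value)    (: 10, not the length of the escaped form :)
)
```

The escaping only appears when the element itself is serialized, for example in the output window.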

What you want, I assume, is for XML strings in the JSON to be converted to XML nodes:

<json type="object">
  <content><p>123</p></content>
</json>

This can be done by performing an update on the generated XML document. The text node of the content element is replaced with the parsed XML structure:

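(: parse the JSON; the markup in "content" arrives as a plain string :)
let $xml := json:parse('{ "content": "<p>123</p>" }')
return $xml update {
  (: replace each text node with the nodes parsed from its string value :)
  for $text in json/content/text()
  return replace node $text with parse-xml-fragment($text)
}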

Please note that this requires the string to be a well-formed XML fragment, which is not the case for the snippet in your question: as shown, it ends with a closing </h4> that has no matching opening tag in the visible part.
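
If some strings are well-formed and others are not, one possible approach is to convert only those that parse and leave the rest as plain text. The following is a sketch under that assumption, not a definitive solution. The keys "good" and "bad" are made up for illustration, and the query relies on fn:parse-xml-fragment raising err:FODC0006 for input that is not well-formed:

```xquery
(: hypothetical input: one well-formed string, one with mismatched tags :)
let $xml := json:parse('{ "good": "<p>123</p>", "bad": "<p>123</h4>" }')
return $xml update {
  for $text in .//text()
  (: try to parse; an empty sequence marks a string that is not well-formed :)
  let $nodes := try { parse-xml-fragment($text) } catch err:FODC0006 { () }
  where exists($nodes)
  return replace node $text with $nodes
}
```

With this input, the good element should end up with a p child element, while the bad element keeps its original string. Strings containing a bare & also fail to parse and stay untouched.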