Error parsing HTML with extended Unicode characters in BaseX

I am having trouble parsing HTML that contains extended Unicode characters (code points above U+FFFF) with the BaseX HTML parser. Is it possible to make the parser support these characters?

Code:

let $htmlRaw := '<span class="eqn">&#120746; + &#120747; = &#120748;</span>'
let $htmlParsed := html:parse($htmlRaw, map { 'encoding': 'utf-8'})
return (
  'INPUT', 
  $htmlRaw,
  'OUTPUT',
  $htmlParsed
)

Output:

INPUT
<span class="eqn">𝞪 + 𝞫 = 𝞬</span>
OUTPUT
<html>
  <body>
    <span class="eqn">?? + ?? = ??</span>
  </body>
</html>

The bug seems to be related to the output-encoding parameter of the TagSoup library, which BaseX does not support.

For example:

$ echo '<span class="eqn">&#120746; + &#120747; = &#120748;</span>' | java -jar tagsoup-1.2.1.jar --html

<html><body><span class="eqn">&#55349;&#57258; + &#55349;&#57259; = &#55349;&#57260;</span>
</body></html>

$ echo '<span class="eqn">&#120746; + &#120747; = &#120748;</span>' | java -jar tagsoup-1.2.1.jar --html --output-encoding=utf-16

<html><body><span class="eqn">𝞪 + 𝞫 = 𝞬</span>
</body></html>

Best answer:

If I add opt(writer, "encoding", Strings.UTF8); as line 156 in HtmlParser.java (https://github.com/martin-honnen/basex/commit/4711a390e4069d363243f48c95456544916f40f7) of BaseX, the problem seems to go away. I am not sure, however, that this is the right way to fix it.

The root of the problem seems to be two issues. One is that TagSoup, unless the output encoding of its XMLWriter is set to a Unicode encoding such as UTF-8 or UTF-16, outputs two numeric character references (one per UTF-16 surrogate) for each Unicode character outside the BMP.

So you have to set UTF-8 or UTF-16 as the output encoding of TagSoup's XMLWriter. It then switches to Unicode mode and outputs the characters themselves instead of character references. With either encoding, the XMLWriter of TagSoup seems to feed the right characters to the StringWriter that BaseX sets up.
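The two references per character in the TagSoup output above are the UTF-16 surrogate halves of the original code point. A minimal sketch in plain Java (the class and method names are made up for illustration, and this is not TagSoup's code) reproduces the numbers:

```java
public class SurrogateDemo {
    // Writes one decimal character reference per UTF-16 code unit,
    // which is what happens when a writer treats each char separately.
    static String perCodeUnitReferences(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append("&#").append((int) unit).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 120746 is U+1D7AA, which lies outside the BMP.
        System.out.println(Character.isSupplementaryCodePoint(120746)); // true
        System.out.println(perCodeUnitReferences(120746)); // &#55349;&#57258;
        System.out.println(perCodeUnitReferences(120747)); // &#55349;&#57259;
    }
}
```

The printed references match the ones in the first TagSoup run, so those references are surrogate code units rather than real code points.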

Furthermore, BaseX's internal String to byte[] conversion seems to expect UTF-8 encoded strings. I am not sure why that is the case on the Java platform, but the token function delegates the work to a utf8 function.
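As a rough illustration of why surrogate halves that get separated on the way to a byte[] can end up as "??", here is a sketch using only the standard Java library. It is not BaseX's token or utf8 code, and the class and method names are hypothetical. It only shows that Java's own UTF-8 encoder needs to see both halves of a pair together:

```java
import java.nio.charset.StandardCharsets;

public class Utf8PairDemo {
    // Renders bytes as space-separated uppercase hex, e.g. "F0 9D".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the whole string at once, so surrogate pairs stay together.
    static String encodeWhole(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes each UTF-16 code unit on its own, so surrogate pairs are split.
    static String encodePerCodeUnit(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(hex(s.substring(i, i + 1).getBytes(StandardCharsets.UTF_8)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String alpha = new String(Character.toChars(120746)); // U+1D7AA
        System.out.println(alpha.length());           // 2 (one surrogate pair)
        System.out.println(encodeWhole(alpha));       // F0 9D 9E AA
        System.out.println(encodePerCodeUnit(alpha)); // 3F 3F
    }
}
```

0x3F is the question mark, the replacement byte Java substitutes for a lone surrogate, which is consistent with the two question marks per character in the question's output.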

So the fix in HtmlParser seems to be to set opt(writer, "encoding", Strings.UTF8).

Another answer:

Martin Honnen’s answer described the issue very well. A new snapshot with the bug fix is available (https://files.basex.org/releases/latest/).

If you pass the HTML input as a string, it is already encoded in UTF-8; the encoding option is helpful, however, if you have binary input:

let $data := file:read-binary('my.html')
return html:parse($data, map { 'encoding': 'CP1252'})
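To see why the encoding matters for binary input but not for strings, here is a small plain Java sketch (not BaseX code; the class and method names are hypothetical, and it assumes the JDK provides the windows-1252 charset, which is not among the charsets every JVM is required to support). The same byte decodes to different characters depending on the charset chosen:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes a single byte with the given charset and returns the
    // resulting code point as lowercase hex.
    static String decodeToHex(byte b, Charset charset) {
        String s = new String(new byte[] { b }, charset);
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        byte euro = (byte) 0x80; // the euro sign in CP1252
        // Correct charset: U+20AC EURO SIGN.
        System.out.println(decodeToHex(euro, Charset.forName("windows-1252"))); // 20ac
        // Wrong charset: 0x80 is malformed UTF-8, so U+FFFD is substituted.
        System.out.println(decodeToHex(euro, StandardCharsets.UTF_8)); // fffd
    }
}
```

A string has already been through this decoding step, which is why the option only makes a difference for raw bytes such as the result of file:read-binary.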