Error parsing html with extended unicode characters with basex

Question

130 Views Asked by Sandeep S At 17 August 2025 at 21:47

I am having trouble parsing HTML that contains extended Unicode characters with the BaseX HTML parser. Is it possible to make the parser support these characters?

Code:

let $htmlRaw := '<span class="eqn">&#120746; + &#120747; = &#120748;</span>'
let $htmlParsed := html:parse($htmlRaw, map { 'encoding': 'utf-8'})
return (
  'INPUT', 
  $htmlRaw,
  'OUTPUT',
  $htmlParsed
)

Output:

INPUT
<span class="eqn">𝞪 + 𝞫 = 𝞬</span>
OUTPUT
<html>
  <body>
    <span class="eqn">?? + ?? = ??</span>
  </body>
</html>

The bug seems to be related to the output-encoding parameter of the TagSoup library, which BaseX does not support.

For example:

$ echo '<span class="eqn">&#120746; + &#120747; = &#120748;</span>' | java -jar tagsoup-1.2.1.jar --html

<html><body><span class="eqn">&#55349;&#57258; + &#55349;&#57259; = &#55349;&#57260;</span>
</body></html>

$ echo '<span class="eqn">&#120746; + &#120747; = &#120748;</span>' | java -jar tagsoup-1.2.1.jar --html --output-encoding=utf-16
<html><body><span class="eqn">𝞪 + 𝞫 = 𝞬</span>
</body></html>

There are 2 best solutions below

Christian Grün On 23 June 2021 at 14:52

Martin Honnen’s answer described the issue very well. A new snapshot with the bug fix is available (https://files.basex.org/releases/latest/).

If you pass the HTML input as a string, it is already encoded in UTF-8; the encoding option is helpful, however, if you have binary input:

let $data := file:read-binary('my.html')
return html:parse($data, map { 'encoding': 'CP1252'})

**Martin Honnen** · Accepted Answer

If I add opt(writer, "encoding", Strings.UTF8); as line 156 in HtmlParser.java (https://github.com/martin-honnen/basex/commit/4711a390e4069d363243f48c95456544916f40f7) of BaseX, the problem seems to go away. I am not sure, however, that this is the right way to fix it.

The root of the problem seems to be two issues. Unless the output encoding of its XMLWriter is set to a Unicode encoding such as UTF-8 or UTF-16, TagSoup outputs a Unicode character outside of the BMP as two numeric character references, one for each UTF-16 surrogate.

So you have to set UTF-8 or UTF-16 as the output encoding of TagSoup's XMLWriter. It then switches to Unicode mode and outputs the characters themselves instead of character references. With either encoding, the XMLWriter of TagSoup seems to feed the right characters to the StringWriter that BaseX sets up.
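To see where the two references in the default TagSoup output come from, here is a small sketch in plain JDK code. It is not TagSoup's or BaseX's implementation, and the class and method names (SurrogateDemo, perCodeUnitReferences, combine) are made up for illustration. It only shows that writing one numeric character reference per UTF-16 code unit turns &#120746; into &#55349;&#57258;, and that the pair can be recombined into the original code point.

```java
// Illustrative only: plain JDK code, not taken from TagSoup or BaseX.
class SurrogateDemo {
    /** Writes one decimal character reference per UTF-16 code unit of the code point. */
    static String perCodeUnitReferences(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields two chars (a surrogate pair) for code points above U+FFFF.
        for (char unit : Character.toChars(codePoint)) {
            sb.append("&#").append((int) unit).append(';');
        }
        return sb.toString();
    }

    /** Recombines a high/low surrogate pair into the code point it encodes. */
    static int combine(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        System.out.println(perCodeUnitReferences(120746)); // &#55349;&#57258;
        System.out.println(combine(55349, 57258));         // 120746
    }
}
```

The first printed line matches what the tagsoup-1.2.1.jar call in the question emits without --output-encoding.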

Furthermore, BaseX's internal String to byte[] conversion seems to expect UTF-8 encoded strings. I am not sure why that is the case on the Java platform, but the token function delegates its work to a utf8 function.

So the fix in HtmlParser seems to be to set opt(writer, "encoding", Strings.UTF8).
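As a side note on the String to byte[] conversion mentioned above, the following sketch uses only the standard JDK, not BaseX's token or utf8 functions. The names Utf8Demo and utf8Hex are hypothetical. It shows that a character outside the BMP takes up two chars in a Java String but becomes a single four-byte sequence once the string is encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: standard JDK conversion, not BaseX's internal code.
class Utf8Demo {
    /** Encodes a string as UTF-8 and returns the bytes as space-separated uppercase hex. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the string from the code point used in the question (&#120746;).
        String alpha = new String(Character.toChars(120746));
        System.out.println(alpha.length());  // 2 (one surrogate pair)
        System.out.println(utf8Hex(alpha));  // F0 9D 9E AA
    }
}
```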
