`-:55: HTML parser error : htmlParseEntityRef: expecting ';'`: clean up HTML file with xmllint?

Question

612 Views Asked by user1424739 At 01 November 2019 at 23:09

http://journals.im.ac.cn/cjbcn/ch/reader/view_abstract.aspx?file_no=gc19010159&flag=1

I'd like to clean up the file from the above URL, but xmllint gives the following error. Does anybody know how to fix the problem? Thanks.

$ xmllint -html -xmlout file.html
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225413001&menu_id
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
on_item.aspx?parent_id=20070610225413001&menu_id=20070610225740001&is_three_menu
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225449001&menu_id
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
on_item.aspx?parent_id=20070610225449001&menu_id=20171222045531778&is_three_menu
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225428001&menu_id
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
...

**imhotap** · Answer 1 · 2019-11-01T23:49:06.557000

The problem appears to be the ampersand character in URLs with query parameters. xmllint tries to interpret each '&' as the start of an entity reference, and then complains because entity references in XML must be terminated by a semicolon (unlike in SGML, where the semicolon is required only if the following character is a name character). You could try xmllint's "--noent" option, but I don't believe xmllint can be told to ignore entity references. I suggest using another tool to convert HTML into XML, such as "sgmlproc", as described in my Parsing HTML tutorial. Handling of ampersand characters is discussed in detail there; the approach uses an HTML DTD in which href and other URL-typed attributes are declared such that no entity references are recognized in them.
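
To make the diagnosis concrete, here is a minimal Python sketch of what the parser expects in the attribute value. It uses only the standard library's `html` module. The URL is the leading part of one href from the error output in the question (the remainder is truncated there), and the variable names are illustrative only.

```python
import html

# Leading part of an href from the question's error output; the rest of the
# query string is truncated in the original and is deliberately left out.
raw_url = "../common_item.aspx?parent_id=20070610225413001&menu_id=20070610225740001"

# html.escape turns '&' into '&amp;' (and also escapes <, > and quotes),
# which is the form the attribute value should have in the markup.
attr_value = html.escape(raw_url, quote=True)
print(attr_value)

# A parser decodes the reference again, so the link target is unchanged.
assert html.unescape(attr_value) == raw_url
```

The escaped form is only how the URL is written in the HTML source; browsers and parsers resolve `&amp;` back to `&` before following the link.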

Sorry for the long answer and self-promotion, but I know of no better solution for your problem. I originally intended this to be a comment but ran out of space.

**Tom O'Hara** · Answer 2 · 2024-02-16T22:02:28.727000

This is for future reference: it turns out that encoding each bare '&' as the entity '&amp;' resolves the htmlParseEntityRef problem in the particular HTML file from the Chinese journal.

A simple example follows illustrating a workaround via perl:

$ cat bad-simple.html 
<!DOCTYPE HTML>
<html lang="en">
  <head>
    <title>bad URL links </title>
  </head>
  <body>
    <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
  </body>
</html>
$ 
$ xmllint --html --noout bad-simple.html
bad-simple.html:7: HTML parser error : htmlParseEntityRef: expecting ';'
    <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
                                            ^
bad-simple.html:7: HTML parser error : htmlParseEntityRef: expecting ';'
    <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
                                                  ^
$ perl -pe 's/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g' bad-simple.html >| better-simple.html
$ xmllint --html --noout better-simple.html
$ 
$ diff bad-simple.html better-simple.html 
7c7
<     <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
---
>     <a href="http://www.fubar.com?fubar=1&amp;fu=0&amp;bar=0">fubar</a>
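
For cases where perl is not at hand, the same substitution can be sketched in Python. This is an illustration, not part of the original answer: the names `BARE_AMP` and `escape_bare_ampersands` are made up here, and the pattern assumes that any `&` followed by a well-formed named, decimal, or hexadecimal reference (such as `&lt;`, `&#38;`, or `&#x26;`) is intentional and should be left alone.

```python
import re

# An '&' that does NOT start a well-formed reference:
# named (&name;), decimal (&#123;) or hexadecimal (&#x1F;).
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text: str) -> str:
    """Replace every bare '&' in text with '&amp;'."""
    return BARE_AMP.sub("&amp;", text)


if __name__ == "__main__":
    line = '<a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>'
    print(escape_bare_ampersands(line))
```

Note that this operates on raw text rather than parsed HTML, so it also rewrites ampersands inside `<script>` blocks and comments; for a file like the one in the question, which only trips over query strings in href attributes, that is probably acceptable.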
