http://journals.im.ac.cn/cjbcn/ch/reader/view_abstract.aspx?file_no=gc19010159&flag=1
I'd like to clean up the file from the above URL, but xmllint reports the following errors. Does anybody know how to fix the problem? Thanks.
$ xmllint -html -xmlout file.html
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225413001&menu_id
^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
on_item.aspx?parent_id=20070610225413001&menu_id=20070610225740001&is_three_menu
^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225449001&menu_id
^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
on_item.aspx?parent_id=20070610225449001&menu_id=20171222045531778&is_three_menu
^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225428001&menu_id
^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
...
That seems to be a problem with the ampersand characters used in URLs with query parameters. xmllint tries to interpret each one as the start of an entity reference and then complains, because entity references in XML must be terminated by a semicolon (unlike in SGML, where a semicolon is required only if the following characters are name characters).
You could try xmllint's "-noent" option, but I don't believe xmllint can be told to ignore entity references. I suggest using another tool to convert HTML into XML instead, such as "sgmlproc", as described in my Parsing HTML tutorial. Dealing with ampersand characters is discussed in detail there; it involves using an HTML DTD in which href and other URL-typed attributes are declared such that no entity references are recognized in them.
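
A different workaround from the DTD-based approach above, given here only as a rough sketch: pre-process the HTML so that every bare ampersand becomes "&amp;" before the file is handed to xmllint. The function name escape_bare_ampersands and the sample URL are made up for illustration. The regular expression is an assumption about what counts as a well-formed reference (a name or a numeric code followed by a semicolon), and because it is not a real parser it will also rewrite ampersands inside scripts and comments.

```python
import re

# An '&' that is NOT followed by a complete reference such as
# "amp;", "#38;" or "#x26;" is treated as a bare ampersand.
# This pattern is an illustrative assumption, not libxml2's own rule.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(html: str) -> str:
    """Replace bare ampersands with '&amp;', leaving complete references alone."""
    return BARE_AMP.sub("&amp;", html)


if __name__ == "__main__":
    # Hypothetical link in the style of the ones xmllint complains about.
    sample = '<A href="page.aspx?parent_id=1&menu_id=2">News &amp; events</A>'
    print(escape_bare_ampersands(sample))
```

The cleaned text could then be written to a new file and passed to xmllint as before. Whether this is acceptable depends on the page: it changes the input rather than the parser's behaviour, so it is worth checking the result on pages that contain inline scripts.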
Sorry for the long answer and self-promotion, but I know of no better solution for your problem. I originally intended this to be a comment but ran out of space.