I want to run some XQueries on a number of HTML5 documents using Qt (5.8 at the time of writing). Unlike XHTML/XHTML5, HTML/HTML5 documents are generally not well-formed XML: they contain constructs that an XML parser cannot handle as-is (special characters/entities found only in HTML, tags that are never closed, and so on).
I tried a number of HTML "tidy" utilities, including online services and the well-known htmltidy.org binaries. They did tidy the markup, but the result was still not well-formed XML!
So the questions are:
- Is there an alternative dedicated HTML parser I'm missing here?
- Are there any proven HTML5->XML converters? (I don't care if the XML drops the "problematic" characters/tags. I just need the information.)
- Can HTML/HTML5 files be parsed with Qt/QXmlPatterns at all, or is this a lost cause?
- Any external tools that may help?
Thanks!
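Running html-tidy with the right command-line options does work!
Download html-tidy from: http://binaries.html-tidy.org/
Use the following command line (-q suppresses nonessential output, -b strips smart quotes and similar characters, -asxml converts the HTML to well-formed XHTML):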
tidy.exe -q -b -asxml test.html > test.xml
With the resulting XML file, QXmlQuery now works fine.