Parsing HTML5 with QXmlQuery

I want to run some XQueries on a number of HTML5 documents using Qt (currently 5.8). HTML/HTML5 documents, unlike XHTML/XHTML5, are not well-formed XML. HTML contains a number of constructs that XML parsers cannot handle directly (special characters found only in HTML, self-closed tags, and so on).

I tried a number of HTML "tidy" utilities, including online services and the well-known htmltidy.org binaries. They did tidy the markup, but the result was still not well-formed XML!

So the questions are:

  1. Is there an alternative dedicated HTML parser I'm missing here?
  2. Are there any proven HTML5-to-XML converters? (I don't mind if the XML omits the "problematic" characters/tags; I just need the information.)
  3. Can HTML/HTML5 files be parsed with Qt/QXmlPatterns at all, or is this a lost cause?
  4. Are there any external tools that may help?

Thanks!

There is 1 best solution below

Using html-tidy with the right command-line options does work!

Download html-tidy from: http://binaries.html-tidy.org/

Use the following command line:

tidy.exe -q -b -asxml test.html > test.xml
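If you need to run this over many files from your own program, a small helper along these lines might do. This is only a sketch: `quoted`, `tidyCommand` and `describeTidyExit` are hypothetical names, the executable path is assumed to contain no spaces, and the exit-code meanings (0 clean, 1 warnings, 2 errors) are the ones documented for tidy.

```cpp
#include <string>

// Hypothetical helper: wrap a file name in double quotes so that
// spaces in the path survive the shell.
std::string quoted( const std::string &path )
{
    return "\"" + path + "\"";
}

// Builds the same command line as above:
// -q (quiet), -b (bare), -asxml (emit well-formed XHTML).
// Assumption: tidyExe itself contains no spaces, so it is left unquoted.
std::string tidyCommand( const std::string &tidyExe,
                         const std::string &inputHtml,
                         const std::string &outputXml )
{
    return tidyExe + " -q -b -asxml " + quoted( inputHtml ) + " > " + quoted( outputXml );
}

// Maps tidy's documented exit codes to a short label.
std::string describeTidyExit( int exitCode )
{
    switch( exitCode )
    {
    case 0:  return "ok";
    case 1:  return "warnings";
    case 2:  return "errors";
    default: return "unknown";
    }
}
```

The resulting string can be handed to std::system or QProcess. How the raw return value maps to tidy's exit code depends on the platform, so check that before trusting describeTidyExit.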

With the resulting XML file, the following code runs QXmlQuery without problems:

#include <QAbstractMessageHandler>
#include <QBuffer>
#include <QCoreApplication>
#include <QException>
#include <QFile>
#include <QSourceLocation>
#include <QUrl>
#include <QXmlFormatter>
#include <QXmlQuery>
#include <iostream>

using namespace std;

// Forwards QXmlQuery compile-time and run-time messages to stdout.
class ParserMsgHandler: public QAbstractMessageHandler
{
protected:
    void handleMessage( QtMsgType type, const QString &description, const QUrl &identifier, const QSourceLocation &sourceLocation ) override
    {
        QString mt;
        switch( type )
        {
        case QtDebugMsg:
            mt = "DBG: ";
            break;
        case QtWarningMsg:
            mt = "WRN: ";
            break;
        case QtCriticalMsg:
            mt = "CRT: ";
            break;
        case QtFatalMsg:
            mt = "FTL: ";
            break;
        case QtInfoMsg:
            mt = "INF: ";
            break;
        }

        // line() and column() return qint64, so convert them to text explicitly.
        QString msg = "\r\n" +
                      mt +
                      "Line: " + QString::number( sourceLocation.line() ) +
                      ", Column: " + QString::number( sourceLocation.column() ) +
                      ", Id: " + identifier.toString() +
                      ", Desc: " + description;
        cout << msg.toUtf8().constData();
    }
};

ParserMsgHandler gHandler;

int main(int argc, char *argv[])
{
    QCoreApplication a( argc, argv );

    try
    {
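        // Expect two arguments: the tidied XML file and the XQuery file.
        if( a.arguments().size() < 3 )
        {
            cout << "\r\nUsage: <program> <tidied-xml-file> <xquery-file>\r\n";
            return -3;
        }
        QString htmlFilename = a.arguments()[1];
        QString xqueryFilename = a.arguments()[2];
        QFile queryFile( xqueryFilename );
        QFile sourceDocument( htmlFilename );
        if( !queryFile.open( QIODevice::ReadOnly ) || !sourceDocument.open( QIODevice::ReadOnly ) )
        {
            cout << "\r\nError: Cannot open the input files!\r\n";
            return -4;
        }
        const QString queryStr( QString::fromUtf8( queryFile.readAll() ) );
        QXmlQuery query;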

        QByteArray outArray;
        QBuffer buffer( &outArray );
        buffer.open( QIODevice::ReadWrite );

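        // Install the handler before setQuery() so that compile errors are reported too.
        query.setMessageHandler( &gHandler );
        // The variable must be bound before the query is set.
        query.bindVariable( "inputDocument", &sourceDocument );
        query.setQuery( queryStr );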
        if( !query.isValid() )
        {
            cout << "\r\nError: Bad Query or Document!\r\n";
            return -1;
        }

        QXmlFormatter formatter( query, &buffer );
        if( !query.evaluateTo( &formatter ) )
        {
            cout << "\r\nError: Evaluation Failed!\r\n";
            return -2;
        }

        buffer.close();
        cout << "\r\nOutput:\r\n" << outArray.constData() << "\r\n";
        // No event loop is needed here; a.exec() would block without ever quitting.
        return 0;
    }
    catch( const QException &e )
    {
        cout << "\r\nException: " << e.what() << "\r\n";
    }
    catch( ... )
    {
        cout << "\r\nException: Big one...\r\n";
    }

    return 0;
}
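
One thing to watch for: the XML that tidy writes with -asxml puts every element in the XHTML namespace (http://www.w3.org/1999/xhtml), so a query such as doc($inputDocument)//title can come back empty unless the query declares that namespace. A minimal sketch of a helper that prepends the declaration to the query text (`withXhtmlDefaultNamespace` is a hypothetical name, not part of Qt):

```cpp
#include <string>

// Namespace URI used for elements in XHTML output.
const std::string kXhtmlNamespace = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: prepends an XQuery prolog declaration so that
// unprefixed element names in queryBody match XHTML elements.
std::string withXhtmlDefaultNamespace( const std::string &queryBody )
{
    return "declare default element namespace \"" + kXhtmlNamespace + "\";\n" + queryBody;
}
```

For example, withXhtmlDefaultNamespace( "doc($inputDocument)//title/text()" ) yields a query string that can be passed to query.setQuery() via QString::fromStdString(). Alternatively, put the declare line at the top of the XQuery file itself.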