How to parse HTML source code

I'm trying to parse the HTML source of a web page. The page I'm trying to parse is the one requested in the code below. I have written the following code, but it fails at the last step, where I want to pull out the content attribute of a meta tag:

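#include <QApplication>
#include <QDebug>
#include <QEventLoop>
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QNetworkRequest>
#include <QUrl>
#include <QXmlStreamReader>

int main(int argc, char *argv[])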
{
    QApplication a(argc, argv);
    QNetworkAccessManager manager;
    QNetworkReply *reply = manager.get(QNetworkRequest(QUrl("https://www.instagram.com/p/BTwnRykl6EM/")));
    QEventLoop event;
    QObject::connect(reply, SIGNAL(finished()), &event, SLOT(quit()));
    event.exec();
    QString me = reply->readAll();
    QString x;
    //-------------------------------------------------------------------------------------------------------
    //qDebug()<<me;
    //-------------------------------------------------------------------------------------------------------
    QXmlStreamReader reader(me);
    if(reader.readNextStartElement()){
        if(reader.name()=="html"){
            while (reader.readNextStartElement()) {
                if(reader.name()=="head"){
                    while (reader.readNextStartElement()) {
                        if(reader.name()=="meta" && reader.attributes().hasAttribute("property") && reader.attributes().value("property").toString()=="og:image")
                            x = reader.attributes().value("content").toString();
                        else{
                            qDebug()<<"why?";
                            reader.skipCurrentElement();
                        }
                    }
                }
                else
                    reader.skipCurrentElement();
            }
        }
        else
            reader.skipCurrentElement();
    }
    qDebug()<<x;
    return 0;
}

This is the part that doesn't work:

if(reader.name()=="meta" && reader.attributes().hasAttribute("property") && reader.attributes().value("property").toString()=="og:image")
    x = reader.attributes().value("content").toString();
else{
    qDebug()<<"why?";
    reader.skipCurrentElement();
}

It prints:

why?

What is wrong with my code?

There is 1 answer below:

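HTML is not, in general, valid XML, so a strict XML parser such as QXmlStreamReader cannot be relied on to read it. For example, HTML allows void elements such as <meta> to be left unclosed, which XML treats as an error. You can find options for parsing HTML on this wiki page. In short, you can use Qt's Scribe framework or QtWebKit to parse and render HTML automatically, or an external library to parse it manually: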

libxml2 and libhtml are C libraries; htmlcxx is a C++ library that lets you build a DOM tree and iterate through it.
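
If only a single meta tag is needed, as in the question, a tolerant text scan may be enough and avoids the XML well-formedness problem entirely. The following is a minimal sketch using only the C++ standard library, not a real HTML parser and not any of the libraries named above. The helper names attributeValue and findMetaContent are hypothetical, and the sketch assumes quoted attribute values that do not themselves contain a '>' character. For anything beyond one tag, a proper HTML library is likely the better choice.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: looks up attribute `name` inside the text of one tag.
// Handles name="value" and name='value'; unquoted values are not supported.
static bool attributeValue(const std::string &tag, const std::string &name,
                           std::string &out)
{
    const std::regex attr(
        "\\s" + name + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
        std::regex::ECMAScript | std::regex::icase);
    std::smatch m;
    if (!std::regex_search(tag, m, attr))
        return false;
    // Group 1 is the double-quoted value, group 2 the single-quoted one.
    out = m[1].matched ? m[1].str() : m[2].str();
    return true;
}

// Hypothetical helper: finds the first <meta> tag whose "property" attribute
// equals `property` and stores its "content" attribute in `out`.
// Returns false if no such tag (or no content attribute) is found.
bool findMetaContent(const std::string &html, const std::string &property,
                     std::string &out)
{
    // Match each <meta ...> tag up to its '>', closed or not.
    const std::regex metaTag("<meta\\b[^>]*>",
                             std::regex::ECMAScript | std::regex::icase);
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), metaTag);
         it != end; ++it) {
        const std::string tag = it->str();
        std::string prop;
        if (attributeValue(tag, "property", prop) && prop == property)
            return attributeValue(tag, "content", out);
    }
    return false;
}
```

With the code from the question, the downloaded page could be passed in as reply->readAll().toStdString(), and findMetaContent(html, "og:image", x) would then fill x when a matching tag exists.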