I'm trying to parse the HTML source of a web page; the address is in the code below. I have written the following code, but it fails at the last step, where I want to pull out the content of a meta tag:
int main(int argc, char *argv[])
{
    QApplication a(argc, argv);
    QNetworkAccessManager manager;
    QNetworkReply *reply = manager.get(QNetworkRequest(QUrl("https://www.instagram.com/p/BTwnRykl6EM/")));
    QEventLoop event;
    QObject::connect(reply, SIGNAL(finished()), &event, SLOT(quit()));
    event.exec();
    QString me = reply->readAll();
    QString x;
    //-------------------------------------------------------------------------------------------------------
    //qDebug()<<me;
    //-------------------------------------------------------------------------------------------------------
    QXmlStreamReader reader(me);
    if(reader.readNextStartElement()){
        if(reader.name()=="html"){
            while (reader.readNextStartElement()) {
                if(reader.name()=="head"){
                    while (reader.readNextStartElement()) {
                        if(reader.name()=="meta" && reader.attributes().hasAttribute("property") && reader.attributes().value("property").toString()=="og:image")
                            x = reader.attributes().value("content").toString();
                        else{
                            qDebug()<<"why?";
                            reader.skipCurrentElement();
                        }
                    }
                }
                else
                    reader.skipCurrentElement();
            }
        }
        else
            reader.skipCurrentElement();
    }
    qDebug()<<x;
    return 0;
}
This is the part that doesn't work:
    if(reader.name()=="meta" && reader.attributes().hasAttribute("property") && reader.attributes().value("property").toString()=="og:image")
        x = reader.attributes().value("content").toString();
    else{
        qDebug()<<"why?";
        reader.skipCurrentElement();
    }
It prints
why?
What is wrong with my code?
HTML is not valid XML, so you can't rely on an XML parser such as QXmlStreamReader to read it. For example, elements like <meta> are usually left unclosed in HTML, which a strict XML reader treats as an error. The options for HTML are listed on this wiki page. In short, you can use Qt's Scribe framework or QtWebKit, which parse and render HTML automatically, or an external library for manual parsing:
libxml2 and libhtml are C libraries; htmlcxx is a C++ library that lets you build a DOM tree and iterate through it.
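If all you need is a single meta tag, a full HTML parser may be more than the job requires. The sketch below is not one of the libraries listed above; it swaps in a plain std::regex scan of the <meta> tags instead. It assumes attribute values are double-quoted and contain no '>' character, and findMetaContent is a hypothetical helper name invented for this example. For anything beyond this narrow case, a real HTML parser is the safer choice.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of Qt or of any library named above.
// Finds the first <meta> tag whose property attribute equals `property`
// and stores its content attribute in `content`.
// Returns false if no such tag exists.
// Assumption: attribute values are double-quoted and contain no '>'.
bool findMetaContent(const std::string &html, const std::string &property,
                     std::string &content)
{
    // A <meta ...> tag, whether or not it is self-closed.
    static const std::regex metaTag(R"re(<meta\b[^>]*>)re", std::regex::icase);
    // A name="value" attribute with a double-quoted value.
    static const std::regex attribute(R"re(([^\s"=<>/]+)\s*=\s*"([^"]*)")re");

    const std::sregex_iterator end;
    for (std::sregex_iterator tag(html.begin(), html.end(), metaTag); tag != end; ++tag) {
        const std::string text = tag->str();
        std::string foundProperty, foundContent;
        bool hasContent = false;
        // Collect the property and content attributes of this tag.
        for (std::sregex_iterator attr(text.begin(), text.end(), attribute); attr != end; ++attr) {
            const std::string name = (*attr)[1].str();
            if (name == "property") {
                foundProperty = (*attr)[2].str();
            } else if (name == "content") {
                foundContent = (*attr)[2].str();
                hasContent = true;
            }
        }
        if (hasContent && foundProperty == property) {
            content = foundContent;
            return true;
        }
    }
    return false;
}
```

With the code from the question, this could be called as findMetaContent(me.toStdString(), "og:image", result), where result is a std::string that receives the value when the function returns true.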