I am trying to scrape a webpage that uses AngularJS. My understanding is that the only option in R is to use RSelenium to load the page first and then parse the content. However, I find rvest more intuitive than RSelenium for parsing, so I would like to do as little as possible with RSelenium and switch to rvest as soon as I can.
So far I have realized that I probably need at least to use RSelenium to connect to the page and retrieve the rendered HTML, which I then parse with XML::htmlTreeParse (a rough sketch of that step follows the output below). Suppose this is part of my output:
structure(list(name = "div", attributes = structure(c("im_dialog_date",
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
text = structure(list(name = "text", attributes = NULL, children = NULL,
namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name",
"attributes", "children", "namespace", "namespaceDefinitions",
"value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode",
"XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL,
namespaceDefinitions = NULL), .Names = c("name", "attributes",
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode",
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))
How can I pass it to rvest::read_html()?
If you look at the class of your item, it's an XMLNode, which is a class defined by the XML package. That package defines a toString method for it (but, curiously, not as.character) that lets you convert the node to an ordinary string, which can in turn be read in by xml2::read_html:
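A minimal sketch, assuming the dput() output above is stored in a variable I'll call node:

library(XML)    # supplies the toString() method for XMLNode objects
library(xml2)   # read_html()
library(rvest)

# `node` is assumed to hold the XMLNode shown in the question
doc <- read_html(toString(node))

# from here it's ordinary rvest; given the node above this should return "6:52 PM"
doc %>%
  html_node(".im_dialog_date") %>%
  html_text()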
That said, I normally just use RSelenium::remoteDriver's getPageSource() method to grab all the HTML, which is then easily parsed with rvest.
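Roughly like this, assuming a Selenium server is already running; the connection details, URL, and selector are placeholders:

library(RSelenium)
library(rvest)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate("https://example.com/angular-page")   # placeholder URL

# getPageSource() returns a one-element list containing the full rendered HTML
page <- read_html(remDr$getPageSource()[[1]])

# plain rvest from here on
page %>%
  html_nodes(".im_dialog_date") %>%
  html_text()

remDr$close()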