From chromote to rvest

With chromote, I should be able to obtain the HTML source of a website like this (if I'm not mistaken):

library(chromote)
b <- ChromoteSession$new()
b$view()
b$Page$navigate("https://www.ooir.org/")
x <- b$DOM$getDocument()

Is it possible to then use rvest for basic web-scraping tasks, such as:

x %>%
   html_nodes("a")

Of course, the two lines above do not work, since x is a nested list describing the DOM rather than a parsed HTML document.

What I want to achieve is to open a webpage in chromote and then extract information from it with rvest.

There is 1 best solution below

One way to retrieve the HTML is to evaluate JavaScript through chromote:

library(chromote)
library(rvest)
b <- ChromoteSession$new()
{
  b$Page$navigate("https://www.ooir.org/")
  b$Page$loadEventFired()
} 
#> $timestamp
#> [1] 73090.44

# evaluate JS in chromote and work with the returned string
b$Runtime$evaluate("document.querySelector('html').outerHTML")$result$value %>% 
  read_html() %>% 
  html_elements("a") %>% 
  head()
#> {xml_nodeset (6)}
#> [1] <a href="index.php">\n                        <div class="float-o1">O</di ...
#> [2] <a href="index.php" class="active_menu">Trending Research</a>
#> [3] <a href="journals.php">Journal Rankings</a>
#> [4] <a href="about.php">About</a>
#> [5] <a href="#" class="clicksmall" onclick="show()"><b>Field of Research</b>: ...
#> [6] <a href="index.php?field=Agricultural+Sciences" class="clicksmall">Agricu ...

You could also work with b$DOM; the missing link between that and rvest looks something like this:

x <- b$DOM$getDocument()

x$root$nodeId %>% 
  b$DOM$querySelector("html") %>% 
  `[[`(1) %>% 
  b$DOM$getOuterHTML() %>% 
  `[[`(1) %>% 
  read_html() %>% 
  html_elements("a") %>% 
  head()

#> {xml_nodeset (6)}
#> [1] <a href="index.php">\n                        <div class="float-o1">O</di ...
#> [2] <a href="index.php" class="active_menu">Trending Research</a>
#> [3] <a href="journals.php">Journal Rankings</a>
#> [4] <a href="about.php">About</a>
#> [5] <a href="#" class="clicksmall" onclick="show()"><b>Field of Research</b>: ...
#> [6] <a href="index.php?field=Agricultural+Sciences" class="clicksmall">Agricu ...

Created on 2023-05-27 with reprex v2.0.2
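
If you need this more than once, the b$DOM steps above can be wrapped in a small function. This is only a sketch: read_rendered_html() is a hypothetical helper name, not something chromote or rvest provides. It assumes the DevTools results expose the fields nodeId and outerHTML by name (the same values the `[[`(1) calls pick out by position), and it waits for the load event with the promise-based pattern from the chromote README instead of the braces block.

```r
library(chromote)
library(rvest)

# Hypothetical helper (not part of chromote or rvest): render a page in
# Chrome and return it as an xml_document that rvest can query.
read_rendered_html <- function(url, session = ChromoteSession$new()) {
  # Register the load-event promise before navigating so the event
  # cannot fire unnoticed, then block until the page has loaded.
  loaded <- session$Page$loadEventFired(wait_ = FALSE)
  session$Page$navigate(url, wait_ = FALSE)
  session$wait_for(loaded)

  # Same DOM calls as above, with result fields accessed by name
  # (assumed: nodeId, outerHTML) instead of by position.
  root_id <- session$DOM$getDocument()$root$nodeId
  html_id <- session$DOM$querySelector(root_id, "html")$nodeId
  html    <- session$DOM$getOuterHTML(html_id)$outerHTML

  read_html(html)
}
```

With an existing session b, read_rendered_html("https://www.ooir.org/", session = b) %>% html_elements("a") should give the same node set as above. Passing your own session lets you reuse it across pages and call b$close() when you are done; the default argument opens a new session on every call and leaves it open.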