I was wondering whether I could get some thoughts on the following issue:
When scraping multiple elements from multiple websites with rvest, it can easily happen that a requested html_element doesn't exist on one of the sites. In that case rvest returns, at least in my example below, character(0).
Including such a character(0) element in a tibble does not turn the corresponding column/value into NA. Instead, it turns the entire tibble into a zero-row tibble (if the tibble would otherwise have only one row).
I hope the example below makes this clearer.
library(tidyverse)
library(rvest)
# The URLs to scrape:
vec_urls <- c("https://www.noe.gv.at/noe/Achau.html", "https://www.noe.gv.at/noe/Aderklaa.html")
# The scraping function, applied to one URL at a time:
fn_get_address <- function(municip_url) {
  # Read the page once and reuse it for both extractions
  page <- rvest::read_html(municip_url)
  # Name of the municipality (text of the h1 heading)
  municip_name <- page %>%
    rvest::html_elements(xpath = "//h1[@data-cms-title='Name der Gemeinde']/text()") %>%
    rvest::html_text()
  # Website link in the contact block; character(0) if the page has none
  municip_web <- page %>%
    rvest::html_elements(xpath = "//div[contains(@class, 'col-xs-12') and .//*[contains(text(), 'Kontakt:')]]") %>%
    rvest::html_elements("a") %>%
    rvest::html_attr("href") %>%
    stringr::str_subset("www")
  tibble(municip_name = municip_name, municip_web = municip_web)
}
fn_get_address(vec_urls[1]) #works
#> # A tibble: 1 × 2
#> municip_name municip_web
#> <chr> <chr>
#> 1 Achau http://www.achau.gv.at
fn_get_address(vec_urls[2]) # doesn't work; returns 0 row tibble
#> # A tibble: 0 × 2
#> # ℹ 2 variables: municip_name <chr>, municip_web <chr>
vec_urls[2] returns a zero-row tibble because municip_web is character(0): the link doesn't exist in the page source. When the two vectors are combined into a tibble, the length-1 name is recycled to length 0, so the entire tibble becomes a zero-row tibble.
municip_web_2 <- vec_urls[2] %>%
  rvest::read_html() %>%
  rvest::html_elements(xpath = "//div[contains(@class, 'col-xs-12') and .//*[contains(text(), 'Kontakt:')]]") %>%
  rvest::html_elements("a") %>%
  rvest::html_attr("href") %>%
  stringr::str_subset("www")
municip_web_2 #character(0)
#> character(0)
municip_name_2 <- vec_urls[2] %>%
  rvest::read_html() %>%
  rvest::html_elements(xpath = "//h1[@data-cms-title='Name der Gemeinde']/text()") %>%
  rvest::html_text()
municip_name_2 #ok
#> [1] "Aderklaa"
# Combining the two leads to a zero-row tibble
tibble(municip_name = municip_name_2, municip_web = municip_web_2)
#> # A tibble: 0 × 2
#> # ℹ 2 variables: municip_name <chr>, municip_web <chr>
So, here's my question: since we can never rule out that an html element returns character(0), there's always a chance that a tibble row gets lost. How do you guard against this? Unless I am mistaken, this would also mean we should never call tibble(...) directly on scraped values, since we can never exclude the absence of some html element on some page?
My current approach is the following, which works, but I was wondering whether there is a more straightforward solution.
list(municip_name = municip_name_2, municip_web = municip_web_2) %>%
  modify_if(.p = is_empty, \(x) NA) %>%
  enframe() %>%
  pivot_wider() %>%
  unnest(cols = everything())
#> # A tibble: 1 × 2
#> municip_name municip_web
#> <chr> <lgl>
#> 1 Aderklaa NA
I am very curious to learn how others are approaching this issue. Many thanks!
You can use identical() to check whether that part of your function returns character(0), and then replace it with e.g. NA.
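A minimal sketch of that idea, in case it helps. The helper name na_if_empty is made up for illustration (it is not part of rvest or tibble), and the two input vectors are hard-coded stand-ins for the scraped values so the snippet runs without a network connection. It uses NA_character_ rather than plain NA on the assumption that you want the column to stay <chr>.

```r
library(tibble)

# Hypothetical helper (not from any package): swap an empty character
# vector for a single missing string, leave anything else untouched.
na_if_empty <- function(x) {
  if (identical(x, character(0))) NA_character_ else x
}

# Stand-ins for the scraped values from the Aderklaa page
municip_name_2 <- "Aderklaa"
municip_web_2  <- character(0)

res <- tibble(
  municip_name = na_if_empty(municip_name_2),
  municip_web  = na_if_empty(municip_web_2)
)

nrow(res)  # 1: the row survives, with municip_web set to NA
```

Inside fn_get_address() you would wrap each scraped vector in the helper just before the tibble() call. If a value could be empty but of another type, testing length(x) == 0 instead of identical() would be the more general check.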