How to control for character(0)?

I was wondering whether I could get some thoughts on the following issue:

When scraping multiple elements from multiple websites with rvest, it can easily happen that a requested HTML element does not exist on one of the pages. In that case rvest returns, at least in my example below, character(0).

Including such a character(0) value in a tibble does not turn the corresponding column value into NA. Instead, the entire tibble collapses to a zero-row tibble (if it would otherwise have only one row).
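The effect can be reproduced without any scraping. Below is a minimal sketch with hard-coded stand-in values (the variable names are only illustrative): tibble() recycles the length-one vector to the size of the zero-length one, so no rows remain.

```r
library(tibble)

# Stand-in values, not scraped: one element found, one element missing
found_name  <- "Aderklaa"
missing_web <- character(0)

# The length-1 vector is recycled to length 0, so the tibble ends up with no rows
res <- tibble(municip_name = found_name, municip_web = missing_web)
nrow(res)
#> [1] 0
```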

I hope the example below makes this clearer.

library(tidyverse)
library(rvest)

#Here my urls:
vec_urls <- c("https://www.noe.gv.at/noe/Achau.html", "https://www.noe.gv.at/noe/Aderklaa.html")

#Here the scraping function iterating over the urls:

fn_get_address <- function(municip_url) {

  municip_name <- municip_url %>%
    rvest::read_html() %>%
    rvest::html_elements(xpath = "//h1[@data-cms-title='Name der Gemeinde']/text()") %>%
    rvest::html_text()

  municip_web <- municip_url %>%
    rvest::read_html() %>%
    rvest::html_elements(xpath = "//div[contains(@class, 'col-xs-12') and .//*[contains(text(), 'Kontakt:')]]") %>%
    rvest::html_elements("a") %>%
    rvest::html_attr("href") %>%
    stringr::str_subset(regex("www"))

  tibble("municip_name" = municip_name, "municip_web" = municip_web)

}

fn_get_address(vec_urls[1]) #works
#> # A tibble: 1 × 2
#>   municip_name municip_web           
#>   <chr>        <chr>                 
#> 1 Achau        http://www.achau.gv.at

fn_get_address(vec_urls[2]) # doesn't work; returns 0 row tibble
#> # A tibble: 0 × 2
#> # ℹ 2 variables: municip_name <chr>, municip_web <chr>

The reason why vec_urls[2] returns a 0 row tibble is the fact that municip_web returns character(0). It doesn't exist in the source. When combined to a tibble, the entire tibble becomes a zero-row tibble.

municip_web_2 <- vec_urls[2] %>%
  rvest::read_html() %>%
  rvest::html_elements(xpath = "//div[contains(@class, 'col-xs-12') and .//*[contains(text(), 'Kontakt:')]]") %>%
  rvest::html_elements("a") %>%
  rvest::html_attr("href") %>%
  stringr::str_subset(regex("www"))
municip_web_2 #character(0)
#> character(0)

municip_name_2 <- vec_urls[2] %>%
  rvest::read_html() %>%
  rvest::html_elements(xpath = "//h1[@data-cms-title='Name der Gemeinde']/text()") %>%
  rvest::html_text()
municip_name_2 #ok
#> [1] "Aderklaa"

#combining leads to zero row tibble
tibble("municip_name"=municip_name_2, "municip_web"=municip_web_2)
#> # A tibble: 0 × 2
#> # ℹ 2 variables: municip_name <chr>, municip_web <chr>

So, here's my question: since we can never rule out that an HTML element returns character(0), there is always a chance that a tibble gets lost. How do you control for this? Unless I am mistaken, this would also mean we should never use tibble(...) when scraping, since we can never rule out that some HTML element is absent on some page.

My current approach is the following, which works, but I was wondering whether there isn't another solution which is more straightforward.

list("municip_name" = municip_name_2, "municip_web" = municip_web_2) %>%
  modify_if(.p = is_empty, \(x) NA) %>%
  enframe() %>%
  pivot_wider() %>%
  unnest(cols = everything())
#> # A tibble: 1 × 2
#>   municip_name municip_web
#>   <chr>        <lgl>      
#> 1 Aderklaa     NA

I am very curious to learn how others are approaching this issue. Many thanks!

There are 2 best solutions below

Answer by Jost (best answer)

You can use identical() to check whether that part of your function returns character(0), and then change the value to e.g. NA:

fn_get_address <- function(municip_url) {
  
  municip_name <- municip_url %>%
    rvest::read_html() %>%
    rvest::html_elements(xpath = "//h1[@data-cms-title='Name der Gemeinde']/text()") %>%
    rvest::html_text()
  
  municip_web <- municip_url %>%
    rvest::read_html() %>%
    rvest::html_elements(xpath = "//div[contains(@class, 'col-xs-12') and .//*[contains(text(), 'Kontakt:')]]") %>%
    rvest::html_elements("a") %>%
    rvest::html_attr("href") %>%
    stringr::str_subset(regex("www"))
  
  # Replace an empty result with a typed NA so the tibble keeps its row
  # and the column stays <chr> instead of <lgl>
  if (identical(municip_web, character(0)))
    municip_web <- NA_character_
  
  tibble("municip_name" = municip_name, "municip_web" = municip_web)
  
}
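
If several fields need the same guard, the check could be wrapped in a small helper. The following is only a sketch: na_if_empty() is a hypothetical name, not part of rvest or the tidyverse, and the input values are hard-coded stand-ins rather than scraped results.

```r
library(tibble)

# Hypothetical helper: return a character NA for an empty vector,
# otherwise return the vector unchanged
na_if_empty <- function(x) {
  if (length(x) == 0) NA_character_ else x
}

# Stand-in values, not scraped
municip_name <- "Aderklaa"
municip_web  <- character(0)

res <- tibble(
  municip_name = na_if_empty(municip_name),
  municip_web  = na_if_empty(municip_web)
)
```

Testing length(x) == 0 instead of identical(x, character(0)) also covers empty vectors of other types, at the cost of always substituting a character NA.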

Answer by margusl

To handle cases where document structure does not include all elements for every entity, you could iterate over container elements, collect data into a list of named lists or vectors and pass it to bind_rows(). One such example can be found here - https://stackoverflow.com/a/76619623/646761
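
As a rough offline sketch of that pattern (the records are hard-coded stand-ins, and the field name homepage is only an assumption for illustration): when the named vectors do not all carry the same names, bind_rows() fills the gaps with NA instead of dropping the row.

```r
library(dplyr)

# Stand-in records; the second one has no homepage entry at all
records <- list(
  c(municip = "Achau", homepage = "http://www.achau.gv.at"),
  c(municip = "Aderklaa")
)

# Each named vector becomes one row; names missing from a record become NA
res <- bind_rows(records)
```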

But in this particular case, it's not really an issue, as all contact details are in a well-structured table. We can still use a quite similar pattern to avoid a call to tibble() while cutting the number of requests in half:

library(rvest)
library(dplyr)

vec_urls <- c("https://www.noe.gv.at/noe/Achau.html", "https://www.noe.gv.at/noe/Aderklaa.html")

fn_get_contacts <- function(municip_url) {
  html <- read_html(municip_url)
  
  municip <- 
    html_element(html, xpath="//h1[@data-cms-title='Name der Gemeinde']/text()") |>
    html_text() |>
    setNames("municip")
  
  cont <- 
    html_element(html, xpath="//div[contains(@class, 'col-xs-12') and .//*[contains(text(), 'Kontakt:')]]/table") |>
    html_table() |>
    tibble::deframe()
  
  c(municip, cont) # return a named vector
}
  
fn_get_contacts(vec_urls[2])
#>                   municip                  Telefon:                      Fax: 
#>                "Aderklaa"         "(0 22 47) 22 90"                        "" 
#>                   E-Mail:                 Homepage: 
#> "[email protected]"                        ""

lapply(vec_urls, fn_get_contacts) |>
  bind_rows()
#> # A tibble: 2 × 5
#>   municip  `Telefon:`       `Fax:`                `E-Mail:`          `Homepage:`
#>   <chr>    <chr>            <chr>                 <chr>              <chr>      
#> 1 Achau    (0 22 36) 715 83 "(0 22 36) 715 83 33" [email protected] "www.achau…
#> 2 Aderklaa (0 22 47) 22 90  ""                    gemeinde@aderklaa… ""
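
In the result above, missing entries show up as empty strings rather than NA. If NA is preferred, one possible follow-up step is dplyr::na_if(). This is a sketch on a hard-coded stand-in tibble (the column names and values are illustrative, not the scraped output):

```r
library(dplyr)

# Stand-in for the combined result, with "" marking a missing homepage
contacts <- tibble(
  municip  = c("Achau", "Aderklaa"),
  homepage = c("www.achau.gv.at", "")
)

# Turn empty strings into NA in every column
cleaned <- contacts |>
  mutate(across(everything(), \(x) na_if(x, "")))
```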