On this website https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste, there are tables of species that I'd like to extract.
library(rvest)
sp.list = "https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste"
# Get website
wp.list = read_html(sp.list)
# Extract name of sections
headers = wp.list %>% html_elements("h3") %>% html_text2() %>% .[1:24]
# Get tables
tab = read_html(sp.list) %>% html_table(header = TRUE)
# Name tables
names(tab) = headers
# Combine tables
tab.gr = dplyr::bind_rows(tab, .id = "group")
Which gives:
tab.gr
# A tibble: 180 × 3
group Espèce `Nom latin`
<chr> <chr> <chr>
1 Mollusques Anodonte du gaspareau Utterbackiana implicata
2 Mollusques Obovarie olivâtre Obovaria olivaria
3 Insectes Bourdon à tache rousse Bombus affinis
4 Insectes Coccinelle à neuf points Coccinella novemnotata
I was able to get the section headers (h2), but I'm not able to associate them with each h3 section:
get.section = wp.list %>% html_nodes('.frame, .frame-default, .frame-type-textmedia, .frame-layout-0')
pas.dans.cette.page = !grepl(pattern = "Dans cette", x = get.section)
subset.listes = get.section[pas.dans.cette.page]
sections.tables = subset.listes[grep(pattern = "Liste des esp", x = subset.listes)]
sections.tables %>% html_elements("h2") %>% html_text2()
[1] "Liste des espèces menacées"
[2] "Liste des espèces vulnérables"
[3] "Liste des espèces susceptibles d’être désignées comme menacées ou vulnérables"
How then could I get the header (e.g., "Liste des espèces menacées") and its groups (e.g., "Mollusques") with their tables?
rvest is built on top of xml2, so knowing some XPath and a few (somewhat unintuitive) xml2 tricks can be handy here. For example, we can build a vector of sections that matches our list of table elements by searching for the <h2> element that precedes each of those tables, basically using the table elements as anchor points and traversing back up the HTML tree from each of them. As the page structure changes for the last section, we need to adjust tactics a bit for those last tables, but the same strategy still applies.
Another option would be to iterate through each section (i.e. processing only the tables in that specific section), but because of that structural change it's a bit less suitable here.
Result:
Created on 2024-03-14 with reprex v2.1.0
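The table-as-anchor strategy described above can be sketched roughly as follows. This is a minimal illustration on a made-up miniature page, not the original answer's code: the HTML string, the "Section A" / "Section B" headings and the object names (page, tables, sections, groups, result) are all assumptions, and the real page's markup (in particular its differently structured last section) would likely need an adjusted XPath.

```r
library(rvest)
library(xml2)

# Hypothetical miniature page; the real page's markup may differ.
page <- minimal_html('
  <div><h2>Section A</h2>
    <h3>Mollusques</h3>
    <table><tr><th>Nom latin</th></tr>
      <tr><td>Utterbackiana implicata</td></tr>
      <tr><td>Obovaria olivaria</td></tr></table>
    <h3>Insectes</h3>
    <table><tr><th>Nom latin</th></tr>
      <tr><td>Bombus affinis</td></tr></table>
  </div>
  <div><h2>Section B</h2>
    <h3>Insectes</h3>
    <table><tr><th>Nom latin</th></tr>
      <tr><td>Coccinella novemnotata</td></tr></table>
  </div>')

# Use every table as an anchor point
tables <- html_elements(page, "table")

# For each table, take the nearest <h2> / <h3> that precedes it in the document;
# xml_find_first() returns one match per table, so the vectors stay aligned
sections <- xml_find_first(tables, "./preceding::h2[1]") %>% html_text2()
groups   <- xml_find_first(tables, "./preceding::h3[1]") %>% html_text2()

# Parse the tables and prepend the section and group to each one
tab <- Map(
  function(tbl, s, g) dplyr::mutate(tbl, section = s, group = g, .before = 1),
  html_table(tables, header = TRUE), sections, groups
)

# Combine into a single data frame
result <- dplyr::bind_rows(tab)
result
```

Because the lookup starts from each table rather than from each section, the section and group vectors always have the same length as the list of tables, which is what makes the final bind straightforward.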