How to retrieve a multiple tables from a webpage using R

222 Views Asked by At

I want to extract all vaccine tables with the description on the left and their description inside the table using R,

this is the link for the webpage

this is how the first table look on the webpage:

enter image description here

I tried using XML package, but I wasn't succeful, I used:

vup<-readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which=5)

I get an error:


Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: '' 

How to do this?

1

There are 1 best solutions below

3
On BEST ANSWER

This webpage does not use a tables thus the reason for your error. Due to the multiple subsections and hidden text, the formatting on the page is quite complicated and requires finding the nodes of interest individually.

I prefer using the "rvest" and "xml2" package for the easier and more straight forward syntax.
This is not a complete solution and should get you moving in the correct direction.

library(rvest)
library(dplyr)

#find the top of the vacine section
parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()

#find the vacine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")

#find info on each one
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()


#determine vacine type
#Get vacine type
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>% 
   html_node('div.is_h3') %>% html_text()
#dtermine the number of vacines in each category
lengthvector <-parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length() %>% sum()
#make vector of correct length
VaccineType <- rep(vaccinetypes, each=lengthvector)

answer <- data.frame(VaccineType,  company, product, phase)
head(answer)

To generate this code, involved reading the html code and identifying the correct nodes and the unique attributes for the desired information.