R: Scraping multiple tables in URL

I'm learning how to scrape information from websites using httr and XML in R. I'm getting it to work just fine for websites with only a few tables, but I can't figure it out for websites with several tables. I'm using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm

library(httr)
library(XML)

# To get just the boxscore by quarter, which is the first table:
URL = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
resp = GET(URL)
SnapTable = readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)[[1]]

# Return the number of tables:
AllTables = readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)
length(AllTables)
[1] 2

So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.

Is my inability to get the other tables a matter of the website's setup or incorrect code?

Answer:

When it comes to web scraping in R, make heavy use of the rvest package.

While fetching the raw HTML works fine either way, rvest lets you use CSS selectors (as well as XPath). The SelectorGadget browser extension helps you find a styling pattern for a particular table that is hopefully unique, so you can extract exactly the tables you are looking for instead of relying on their position, as in the sketch below.
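
For example, here is a minimal sketch using a CSS selector; the selector table.stats_table is an assumption about pro-football-reference's markup rather than something verified here, so use SelectorGadget to confirm the right selector for your target table:

library(rvest)
library(magrittr)

fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"

# Grab every table matching the (assumed) CSS class and parse them all:
stat_tables = fb_url %>%
    read_html() %>%
    html_nodes("table.stats_table") %>%
    html_table(fill = TRUE)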

To get you started, read the rvest vignette for more detailed information.

# install.packages("rvest")
library(rvest)
library(magrittr)

# Store the web URL
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"

# Read the page, locate the linescore table via its XPath, and parse it
linescore = fb_url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
    html_table()
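
As for why only two tables show up in the first place: that is very likely the website's setup rather than your code. On pro-football-reference (and other sports-reference sites), most of the tables beyond the first couple, including "Starters" and "Officials", appear to be embedded inside HTML comments in the page source and are only injected into the visible page by JavaScript, so an HTML parser never sees them. That reading of the markup is an assumption based on how these pages are commonly built, so verify it against the live source. Here is a sketch of one way to recover the commented-out tables:

library(rvest)
library(magrittr)

fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
page = read_html(fb_url)

# Tables visible in the main document (the two you already get):
visible_tables = page %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)

# Pull the text out of every HTML comment, re-parse it as HTML,
# and extract any tables hidden inside:
hidden_tables = page %>%
    html_nodes(xpath = "//comment()") %>%
    html_text() %>%
    paste(collapse = "") %>%
    read_html() %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)

length(visible_tables) + length(hidden_tables)  # should account for the 20+ tables

From there you can pick out the "Starters" and "Officials" tables by inspecting the resulting list interactively; their positions can vary from page to page, so match on contents rather than a fixed index.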

Hope this helps.