As I'm trying to familiarize myself with rvest and scrape baseball standings, @Cory kindly pointed me to a site with one table per division. (In baseball, 2 leagues x 3 divisions each = 6 tables.)
library("rvest"); library("xml2")
read_html("http://sports.yahoo.com/mlb/standings/") %>%
html_nodes(".yui3-tabview-content") %>%
html_nodes("table") %>% html_table -> standings
But these tables do not include columns for league and division; that information is in the section headings <h4> and <h5> above the tables.
read_html("http://sports.yahoo.com/mlb/standings/") %>%
html_nodes(".yui3-tabview-content") %>%
html_nodes("h4") %>% html_text -> leagues
leagues # [1] "American League" "National League"
read_html("http://sports.yahoo.com/mlb/standings/") %>%
html_nodes(".yui3-tabview-content") %>%
html_nodes("h5") %>% html_text -> divs
divs # [1] "East" "Central" "West" "East" "Central" "West"
I know that I can semi-manually assign the league and division:
for (i in 1:6){
standings[[i]]$League <- as.factor( leagues[ceiling(i/3)])
standings[[i]]$Division <- as.factor(divs[i])
}
standings <- do.call(rbind, standings) # desired output
I'm fine with manual assignment because I doubt this structure will change, but it got me thinking: is there a clever way to have each table inherit (look back to) the most recent values of <h4> and <h5> and store them as columns?
TYVM
If you look at the xml_children we are working with, the headers aren't "parents" of the tables. So looping through that structure in order should still hold if they switch the order of AL and NL or something.
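A minimal sketch of that loop, assuming the <h4>, <h5> and <table> nodes are all direct children of one container (I have not verified this against the live page, so the inline HTML below is a made-up stand-in for it). The idea is to walk the children in document order, remember the most recent league and division heading, and stamp them onto each table as it comes up:

```r
library(xml2)   # read_html(), xml_find_first(), xml_children(), xml_name(), xml_text()
library(rvest)  # html_table()

# Hypothetical miniature of the page; team names and numbers are invented.
# Assumption: headings and tables are siblings inside the same container.
html <- paste0(
  '<div class="yui3-tabview-content">',
  '<h4>American League</h4>',
  '<h5>East</h5>',
  '<table><tr><th>Team</th><th>W</th></tr><tr><td>A</td><td>10</td></tr></table>',
  '<h5>Central</h5>',
  '<table><tr><th>Team</th><th>W</th></tr><tr><td>B</td><td>9</td></tr></table>',
  '<h4>National League</h4>',
  '<h5>East</h5>',
  '<table><tr><th>Team</th><th>W</th></tr><tr><td>C</td><td>8</td></tr></table>',
  '</div>'
)

container <- xml_find_first(read_html(html),
                            "//div[@class='yui3-tabview-content']")
kids <- xml_children(container)   # element children, in document order

league    <- NA_character_        # most recent <h4> seen so far
division  <- NA_character_        # most recent <h5> seen so far
standings <- list()

for (i in seq_along(kids)) {
  node <- kids[[i]]
  tag  <- xml_name(node)
  if (tag == "h4") {
    league <- xml_text(node)
  } else if (tag == "h5") {
    division <- xml_text(node)
  } else if (tag == "table") {
    tbl <- html_table(node)
    tbl$League   <- league        # inherit the last headings seen
    tbl$Division <- division
    standings[[length(standings) + 1]] <- tbl
  }
}

standings <- do.call(rbind, standings)
```

Because the headings are picked up as they are encountered, nothing here depends on there being exactly 2 leagues x 3 divisions or on their order. If the real page nests the headings or tables more deeply than this, the xml_children step would need adjusting.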