R - Scraping an HTML table with rvest when there are missing <tr> tags

Question

Asked by jonahshai on 22 June 2015 at 20:48 · 5k views

I'm trying to scrape an HTML table from a website using rvest. The only problem is that the table I'm trying to scrape doesn't have <tr> tags, except on the first row. It looks like this:

<tr> 
  <td>6/21/2015 9:38 PM</td>
  <td>5311 Lake Park</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was causing a disturbance in the area.</td>
  <td>Name checked; no further action</td>
  <td>No</td>
</tr>

  <td>6/21/2015 10:37 PM</td>
  <td>5200 S Blackstone</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was observed fighting in the McDonald's parking lot</td>
  <td>Warned; released</td>
  <td>No</td>
</tr>

And so on. So, using the following code, I'm only able to get the first row into my data frame:

library(rvest)
mydata <- html_session("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") %>%
    html_node("table") %>%
    html_table(header = TRUE, fill=TRUE)

How can I alter this to get html_table to understand that the rows are rows, even if they don't have an opening <tr> tag? Or is there a better way to go about this?

There are 3 best solutions below

**user227710** · Answer 1 · 2015-06-22T21:11:09.067000

library(rvest)

url_parse<- read_html("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") 

col_name<- url_parse %>%
  html_nodes("th") %>%
  html_text()

mydata <- url_parse %>%
  html_nodes("td") %>%
  html_text()

finaldata <- data.frame(matrix(mydata, ncol=7, byrow=TRUE))

names(finaldata) <- col_name

finaldata

                         Incident                                  Location        Reported                              Occurred
1                           Theft       1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM 5/31/15 to 6/1/15 8:00 PM to 12:00 PM
2                     Information                          5835 S. Kimbark   6/1/15 3:57 PM                        6/1/15 3:55 PM
3                     Information                  1025 E. 58th St. (Swift)  6/2/15 2:18 AM                        6/2/15 2:18 AM
4 Non-Criminal Damage to Property                850 E. 63rd St. (Car Wash)  6/2/15 8:48 AM                        6/2/15 8:00 AM
5     Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure)  6/2/15 7:32 PM             6/2/15 6:45 PM to 7:30 PM
                                                                                                                   Comments / Nature of Fire Disposition
1                                                                                       Bicycle secured to bike rack taken by unknown person        Open
2             Unknown person used staff member's personal information to file a fraudulent claim with U.S. Social Security Admin. / CPD case         CPD
3 Three unaffiliated individuals reported tampering with bicycles in bike rack / Subjects were given trespass warnings and sent on their way      Closed
4                                                                      Rear wiper blade assembly damaged on UC owned vehicle during car wash      Closed
5                                                           Unknown person(s) spray painted graffiti on north concrete wall of the structure        Open
  UCPDI#
1 E00344
2 E00345
3 E00346
4 E00347
5 E00348
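
The reshaping trick above can be tried offline. The following is a minimal sketch on a made-up three-column table (the inline HTML and its column names are assumptions for illustration, not the real page). It derives the column count from the headers instead of hard-coding 7, and checks that the cells divide evenly into rows before reshaping.

```r
library(rvest)

# Hypothetical sample: only the first body row has an opening <tr>.
html_txt <- "
<table>
  <thead><tr><th>Date</th><th>Location</th><th>Agency</th></tr></thead>
  <tbody>
  <tr><td>6/21/2015 9:38 PM</td><td>5311 Lake Park</td><td>UCPD</td></tr>
  <td>6/21/2015 10:37 PM</td><td>5200 S Blackstone</td><td>UCPD</td></tr>
  </tbody>
</table>"

doc <- read_html(html_txt)

# Collect header and cell text independently of the (broken) row structure.
headers <- doc %>% html_nodes("th") %>% html_text()
cells   <- doc %>% html_nodes("td") %>% html_text()

# The approach only works if every row has the same number of cells.
stopifnot(length(cells) %% length(headers) == 0)

# Fill the matrix row by row, one column per header.
df <- as.data.frame(
  matrix(cells, ncol = length(headers), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(df) <- headers
df
```

If a row were missing a cell, the `stopifnot()` guard would fail rather than silently shifting values into the wrong columns.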

**hrbrmstr** · Answer 2 · 2015-06-22T21:35:41.273000

Slightly different approach than @user227710, but generally the same. This, similarly, exploits the fact that the number of TDs is uniform.

However, this also grabs all the incidents (rbinds each page into one incidents data frame).

pblapply just gives you a progress bar, since this takes a few seconds. It is only useful in an interactive session; plain lapply works the same way otherwise.

library(rvest)
library(stringr)
library(dplyr)
library(pbapply)

url <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
pg <- read_html(url)

pg %>% 
  html_nodes("li.page-count") %>% 
  html_text() %>% 
  str_trim() %>% 
  str_split(" / ") %>%
  unlist %>% 
  as.numeric %>% 
  .[2] -> total_pages

pblapply(1:(total_pages), function(j) {

  # get "column names"
  # NOTE that you get legit column names for use with "regular" 
  # data frames this way

  pg %>% 
    html_nodes("thead > tr > th") %>% 
    html_text() %>% 
    make.names -> tcols

  # get all the TDs

  pg %>% 
    html_nodes("td") %>%
    as_list() -> tds

  # how many rows do we have? (should be 5, but you never know)

  trows <- length(tds) / 7

  # the basic idea is to grab all the TDs for each row
  # then cbind them together and then rbind the whole thing
  # while keeping decent column names

  bind_rows(lapply(seq_len(trows), function(i) {
    # k indexes the cell within row i (renamed so it does not shadow the page index j)
    setNames(cbind.data.frame(lapply(1:7, function(k) { 
      html_text(tds[[(i-1)*7 + k]])
    }), stringsAsFactors=FALSE), tcols)
  })) -> curr_tbl

  # get next url

  pg %>% 
    html_nodes("li.next > a") %>% 
    html_attr("href") -> next_url

  if (j < total_pages) {
    pg <<- read_html(sprintf("https://incidentreports.uchicago.edu/%s", next_url))
  }

  curr_tbl

}) %>% bind_rows -> incidents

incidents

## Source: local data frame [62 x 7]
## 
##                            Incident                                  Location        Reported
## 1                             Theft       1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM
## 2                       Information                          5835 S. Kimbark   6/1/15 3:57 PM
## 3                       Information                  1025 E. 58th St. (Swift)  6/2/15 2:18 AM
## 4   Non-Criminal Damage to Property                850 E. 63rd St. (Car Wash)  6/2/15 8:48 AM
## 5       Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure)  6/2/15 7:32 PM
## 6  Information / Aggravated Robbery                4701 S. Ellis (Public Way)  6/3/15 2:11 AM
## 7                     Lost Property           5800 S. University  (Main Quad)  6/3/15 8:30 AM
## 8       Criminal Damage to Property         5505 S. Ellis (Parking Structure) 5/29/15 5:00 PM
## 9       Information / Armed Robbery        6300 S. Cottage Grove (Public Way)  6/3/15 2:33 PM
## 10                    Lost Property                1414 E. 59th St. (I-House)  6/3/15 2:28 PM
## ..                              ...                                       ...             ...
## Variables not shown: Occurred (chr), Comments...Nature.of.Fire (chr), Disposition (chr), UCPDI. (chr)
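
The page-count step in the code above relies on the "li.page-count" element containing text of the form "current / total". As a small offline sketch of that parsing, using base R instead of stringr and a made-up sample string (the value "1 / 13" is an assumption, not taken from the page):

```r
# Hypothetical text of the page-count element, with stray whitespace.
page_count_text <- "  1 / 13  "

# Trim, split on the literal " / " separator, and keep the second part.
parts <- strsplit(trimws(page_count_text), " / ", fixed = TRUE)[[1]]
total_pages <- as.numeric(parts[2])
total_pages
```

If the site changes the separator or the element is missing, `parts[2]` will be NA, so it may be worth checking `is.na(total_pages)` before looping.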

**jonahshai** · Answer 3 · 2015-06-23T21:19:11.913000

Thanks everyone! I ended up getting some help offline from another R user, who suggested the following solution. It downloads the HTML, saves it, adds a <tr> after every </tr> (much like @Bram Vanroy suggested), and turns the result back into an HTML object, which can then be scraped into a data frame.

library(rvest)
myurl <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
download.file(myurl, destfile="myfile.html", method="curl")
myhtml <- readChar("myfile.html", file.info("myfile.html")$size)
myhtml <- gsub("</tr>", "</tr><tr>", myhtml, fixed = TRUE)
mydata <- read_html(myhtml)  # html() was deprecated in rvest 0.3.0 in favour of read_html()

mydf <- mydata %>%
  html_node("table") %>%
  html_table(fill = TRUE)

mydf <- na.omit(mydf)

The last line is to omit some weird NA rows that show up with this method.
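
A variation on this idea, sketched on a made-up inline table rather than the real page: insert the opening <tr> only where a </tr> is followed directly by a <td>, so no empty rows are created and na.omit() should not be needed. The regular expression and the sample HTML (including its column names) are assumptions for illustration.

```r
library(rvest)

# Hypothetical sample: the second body row lacks its opening <tr>.
html_txt <- "
<table>
  <thead><tr><th>Date</th><th>Location</th><th>Agency</th></tr></thead>
  <tbody>
  <tr><td>6/21/2015 9:38 PM</td><td>5311 Lake Park</td><td>UCPD</td></tr>
  <td>6/21/2015 10:37 PM</td><td>5200 S Blackstone</td><td>UCPD</td></tr>
  </tbody>
</table>"

# Add "<tr>" only between a closing </tr> and a <td> that follows it.
fixed_html <- gsub("(</tr>\\s*)<td", "\\1<tr><td", html_txt, perl = TRUE)

# With well-formed rows, html_table() can parse the table directly.
fixed_df <- read_html(fixed_html) %>%
  html_node("table") %>%
  html_table()
fixed_df
```

This assumes cells never appear anywhere except directly after a closing </tr>; a page with other kinds of breakage would need a different pattern.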
