I'm trying to scrape an HTML table from a website using rvest. The only problem is that the table I'm trying to scrape doesn't have opening <tr> tags, except on the first row. It looks like this:
<tr>
<td>6/21/2015 9:38 PM</td>
<td>5311 Lake Park</td>
<td>UCPD</td>
<td>African American</td>
<td>Male</td>
<td>Subject was causing a disturbance in the area.</td>
<td>Name checked; no further action</td>
<td>No</td>
</tr>
<td>6/21/2015 10:37 PM</td>
<td>5200 S Blackstone</td>
<td>UCPD</td>
<td>African American</td>
<td>Male</td>
<td>Subject was observed fighting in the McDonald's parking lot</td>
<td>Warned; released</td>
<td>No</td>
</tr>
And so on. So, using the following code, I'm only able to get the first row into my data frame:
library(rvest)
mydata <- html_session("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") %>%
  html_node("table") %>%
  html_table(header = TRUE, fill = TRUE)
How can I alter this to get html_table to understand that the rows are rows, even if they don't have an opening <tr> tag? Or is there a better way to go about this?
Slightly different approach than @user227710, but generally the same. This, similarly, exploits the fact that the number of TDs per row is uniform. However, this also grabs all the incidents (it rbinds each page into one incidents data frame). The pblapply just gives you progress bars, since this takes a few seconds. It's totally unnecessary unless you're in an interactive session.
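A minimal sketch of the reshaping idea, not the code from this answer: since every incident has the same number of TDs, the flat vector of cell texts can be folded into rows of 8. The helper name cells_to_df is hypothetical, and the two character vectors stand in for the cells scraped from two pages, which in practice would come from something like html_text(html_nodes(page, "td")).

```r
# Hypothetical helper: fold a flat vector of <td> texts into a data frame,
# relying on every incident having the same number of cells.
cells_to_df <- function(cells, n_cols = 8) {
  # Fail early if the page does not split evenly into rows
  stopifnot(length(cells) %% n_cols == 0)
  # byrow = TRUE puts each run of n_cols cells on its own row
  as.data.frame(matrix(cells, ncol = n_cols, byrow = TRUE),
                stringsAsFactors = FALSE)
}

# Stand-ins for the cell texts of two scraped pages (values from the question)
page1 <- c("6/21/2015 9:38 PM", "5311 Lake Park", "UCPD", "African American",
           "Male", "Subject was causing a disturbance in the area.",
           "Name checked; no further action", "No")
page2 <- c("6/21/2015 10:37 PM", "5200 S Blackstone", "UCPD", "African American",
           "Male", "Subject was observed fighting in the McDonald's parking lot",
           "Warned; released", "No")

# One data frame per page, then rbind them all into one incidents data frame
incidents <- do.call(rbind, lapply(list(page1, page2), cells_to_df))
```

Swapping lapply for pbapply::pblapply should only add a progress bar; the columns come out named V1 to V8 because the header row is not handled here.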