I'm trying to programmatically search a website, but the submit button's functionality seems to be powered primarily by JavaScript. I'm not overly familiar with how this works, though, so I could be wrong.
Here is the code I'm using:
library(rvest)
BASE_URL = 'https://mdocweb.state.mi.us/otis2/otis2.aspx'
PARAMS = list(txtboxLName='Smith',
              drpdwnGender='Either',
              drpdwnRace='All',
              drpdwnStatus='All',
              submit='btnSearch')
# rvest approach
s = html_session(BASE_URL)
form = html_form(s)[[1]]
form = set_values(form, PARAMS)
resp = submit_form(s, form, submit='btnSearch') # This gives an error
# httr approach
resp = httr::POST(BASE_URL, body=PARAMS, encode='form')
html = httr::content(resp) # This just returns that same page I was on
The HTML for the button looks like this:
<input type="submit" name="btnSearch" value="Search" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions('btnSearch', '', true, '', '', false, false))" language="javascript" id="btnSearch" style="width:100px;">
Given the onclick attribute, my uneducated assumption is that the use of JavaScript is what is interfering with my approach. But again, I don't fully understand how all this works, so I could be wrong.
Either way, how do I achieve my goal, if at all, using rvest or httr, but not RSelenium? Also, if this is achievable in Python, I'll accept that as well.
We first need to get the original search page, since this is an ASP.NET WebForms site (or at least acts like one) and we need some of its hidden form fields to use later on:
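Here's a sketch of that first step (not verified against the live site; `__VIEWSTATE` and `__EVENTVALIDATION` are the standard ASP.NET hidden inputs, but inspect the page to confirm what it actually uses):

```r
library(httr)
library(rvest)

URL <- "https://mdocweb.state.mi.us/otis2/otis2.aspx"

# Grab the blank search page and harvest every hidden <input>
# (__VIEWSTATE, __EVENTVALIDATION, etc.) into a named list so we
# can echo them back when we POST the form.
res <- GET(URL)
pg  <- content(res, as = "parsed")

hidden <- html_nodes(pg, "input[type='hidden']")
fields <- setNames(
  as.list(html_attr(hidden, "value")),
  html_attr(hidden, "name")
)
```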
Now, we need to act like the form and use an HTTP POST to submit it. We're also going to need a small helper function in a minute:
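Something along these lines (a sketch, assuming `fields` is the named list of hidden inputs scraped from the blank search page; the `txt_of()` helper is my own stand-in, not the original one):

```r
# POST the search, echoing back the hidden ASP.NET fields plus the
# visible form values; including btnSearch mimics clicking the button.
search_res <- POST(
  "https://mdocweb.state.mi.us/otis2/otis2.aspx",
  encode = "form",
  body = c(fields, list(
    txtboxLName  = "Smith",
    drpdwnGender = "Either",
    drpdwnRace   = "All",
    drpdwnStatus = "All",
    btnSearch    = "Search"
  ))
)

# Helper: trimmed text of the first node matching a CSS selector.
txt_of <- function(node, css) {
  trimws(html_text(html_node(node, css)))
}
```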
Now, we need the HTML from the results page:
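Assuming `search_res` holds the response from the POST above:

```r
# Parse the response body into an xml2 document we can query with rvest.
results_pg <- content(search_res, as = "parsed")
```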
Unfortunately, the "table" is really a set of <div>s. But it's programmatically generated and pretty uniform, and we don't want to type much, so let's first get the column names we'll be using later on.
The site is pretty nice in that it accommodates folks with disabilities by providing screen-reader hints. Unfortunately, this puts a kink in scraping, since we would either have to be verbose in targeting the tags with values or clean up the text later on. Thankfully, the xml2 package now has the ability to remove nodes. With those hints stripped, we can collect all the offender-record <div> "rows" and, succinctly, get them into a data frame:
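A rough version of those steps (every CSS selector below is a guess for illustration; inspect the real results page and substitute the actual class names; `results_pg` is the parsed results document):

```r
library(dplyr)
library(purrr)

# 1) Drop the screen-reader-only spans so html_text() stays clean.
#    ("sr-only" is a guessed class name.)
xml2::xml_remove(html_nodes(results_pg, "span.sr-only"))

# 2) Build tidy column names from the header <div>s.
col_names <- results_pg %>%
  html_nodes("div.results-header div") %>%   # guessed selector
  html_text(trim = TRUE) %>%
  tolower() %>%
  gsub("[^a-z0-9]+", "_", .)

# 3) Each record is a "row" of <div>s; turn each into a one-row data frame.
#    (Assumes each row has exactly length(col_names) cell <div>s.)
rows <- html_nodes(results_pg, "div.offender-row")  # guessed selector

df <- map_df(rows, ~setNames(
  as.data.frame(t(html_text(html_nodes(.x, "div"), trim = TRUE)),
                stringsAsFactors = FALSE),
  col_names
))

df <- readr::type_convert(df)
```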
I had hoped that type_convert would provide better transforms, especially for the date column(s), but it didn't, and it can likely be eliminated. Now, you'll need to do some more work with the results page, since the results are paginated. Thankfully, you know the page info:
You'll have to do the "hidden" dance again:
(follow the above reference for what to do with that) and rejigger a new POST call that only has those hidden fields and one more form element: btnNext = 'Next'. You'll need to repeat this over all the individual pages in the paginated result set, then finally bind_rows() everything. I should add that, as you figure out the pagination workflow, you should start with a fresh, blank search-page grab. The server seems to be configured with a pretty small viewstate session-cache timeout, and the code will break if you wait too long between iterations.
UPDATE
I kinda wanted to make sure that last bit of advice worked so there's this:
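I don't have the original snippet to hand, but the workflow looks roughly like this (a sketch: `btnNext = 'Next'` comes from the text above, `extract_records()` stands in for the row-scraping code, and the stopping test will need adjusting to the real markup):

```r
library(httr)
library(rvest)

scrape_all_pages <- function(last_name = "Smith") {

  URL <- "https://mdocweb.state.mi.us/otis2/otis2.aspx"

  # Reusable "hidden dance": pull the ASP.NET hidden fields from a page.
  get_hidden <- function(pg) {
    hidden <- html_nodes(pg, "input[type='hidden']")
    setNames(as.list(html_attr(hidden, "value")), html_attr(hidden, "name"))
  }

  # Always start from a fresh search page (the viewstate cache times out fast).
  pg  <- content(GET(URL), as = "parsed")
  res <- POST(URL, encode = "form",
              body = c(get_hidden(pg),
                       list(txtboxLName = last_name, drpdwnGender = "Either",
                            drpdwnRace = "All", drpdwnStatus = "All",
                            btnSearch = "Search")))

  out <- list()
  repeat {
    pg <- content(res, as = "parsed")
    out[[length(out) + 1]] <- extract_records(pg)  # your row-scraping code

    # Stop when there is no "Next" button left to click.
    if (length(html_nodes(pg, "input[name='btnNext']")) == 0) break

    res <- POST(URL, encode = "form",
                body = c(get_hidden(pg), list(btnNext = "Next")))
  }

  dplyr::bind_rows(out)
}
```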
Hopefully you can follow along, but that should get all the pages for you for a given search term.