Harvesting data from webpage in R - accessing multiple pages


I am following up on my question from yesterday - harvesting data via drop down list in R.

First, I need to obtain all 50k strings with the details of all doctors from this page: http://www.lkcr.cz/seznam-lekaru-426.html#seznam I know how to obtain them from a single page:

library(httr)  # provides POST() and content()

oborID <- "48"
okresID <- "3702"
web <- "http://www.lkcr.cz/seznam-lekaru-426.html"

extractHTML <- function(oborID, okresID) {
  query <- list(filterObor = oborID,
                filterOkresId = okresID,
                'do[findLekar]' = 1)
  html <- POST(url = web, body = query)
  html <- content(html, "text")
  html
}


IDfromHTML <- function(html) {
  starting <- unlist(gregexpr("filterId", html))
  ending <- unlist(gregexpr("DETAIL", html))
  # keep every second occurrence of "filterId"
  starting <- starting[seq(2, length(starting), 2)]

  # gregexpr() returns -1 when there is no match; compare the first
  # element only, since && expects scalar conditions
  if (starting[1] != -1 && ending[1] != -1) {
    strings <- character(length(starting))
    for (i in 1:length(starting)) {
      strings[i] <- substr(html, starting[i] + 9, ending[i] - 18)
    }
    list(strings)
  }
}
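
Putting the two functions together, a run for one speciality/district combination might look like this (a sketch; it performs a live request against the site):

    library(httr)

    html <- extractHTML(oborID, okresID)  # fetch the listing page
    ids <- IDfromHTML(html)               # extract the doctor IDs from it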

Still, I am aware that downloading the whole page for only a few lines of text is quite inefficient (but it works! :) Could you give me a tip on how to make this process more efficient?
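
One way to make the extraction more robust (though the page still has to be downloaded) is to parse the response as HTML instead of locating substrings by hand. This is a sketch using rvest; the CSS selector is an assumption - inspect the page to confirm how the detail links are structured:

    library(httr)
    library(rvest)

    resp <- POST(url = web, body = list(filterObor = oborID,
                                        filterOkresId = okresID,
                                        'do[findLekar]' = 1))
    doc <- read_html(content(resp, "text"))
    # assumption: detail links carry the filterId in their href
    hrefs <- html_attr(html_nodes(doc, "a[href*='filterId']"), "href")
    ids <- sub(".*filterId=([0-9]+).*", "\\1", hrefs)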

I have also encountered some pages with more than 20 doctors listed (e.g. the combination of "Brno-město" and "chirurgie"). Such data are listed and accessed via a hyperlink list at the end of the form. I need to access each of these pages and run the code presented above on them. But I guess I have to pass some cookies along.
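
To carry cookies from the initial search over to the follow-up page requests, one option in httr is to reuse a single handle, so the session cookies set by the first POST are sent automatically with later requests. A minimal sketch:

    library(httr)

    h <- handle("http://www.lkcr.cz")
    # the first request establishes the session and sets any cookies
    first <- POST(handle = h, path = "seznam-lekaru-426.html",
                  body = list(filterObor = "107", filterOkresId = "3702",
                              'do[findLekar]' = 1))
    # subsequent requests through the same handle carry those cookies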

Other than that, the combination of "Praha" and "chirurgie" is problematic as well, because there are more than 200 records, so the page applies some script and I need to click the "další" ("next") button and use the same method as in the previous paragraph.
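
If the "další" links turn out to pass a paging parameter in the query string, the pages could be walked in a loop. The parameter name `'paginator-page'` below is a guess - check the hrefs of the "další" links in the returned HTML for the real key:

    library(httr)

    all_ids <- c()
    page <- 1
    repeat {
      resp <- POST(url = web,
                   body = list(filterObor = "107", filterOkresId = "3702",
                               'do[findLekar]' = 1, 'paginator-page' = page))
      ids <- IDfromHTML(content(resp, "text"))
      if (is.null(ids)) break          # no more matches on this page
      all_ids <- c(all_ids, unlist(ids))
      page <- page + 1
    }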

Can you help me please?
