I am trying to crawl and scrape a website's tables. I have an account with the website, and I found that Rcrawler could help me get parts of the table based on specific keywords, etc. The problem is that the GitHub page makes no mention of how to crawl a site with account/password protection.
An example for signing in would be below:
login <- list(username = "username", password = "password")
Do you have any idea if Rcrawler has this functionality? For example something like:
Rcrawler(Website = "http://www.glofile.com" +
  list(username = "username", password = "password") +
  no_cores = 4, no_conn = 4,
  ExtractCSSPat = c(".entry-title", ".entry-content"),
  PatternsNames = c("Title", "Content"))
I'm confident my code above is wrong, but I hope it gives you an idea of what I want to do.
To crawl or scrape password-protected websites in R (more precisely, ones with HTML-based authentication), you need to use a web driver to simulate a login session. Fortunately, this has been possible since Rcrawler v0.1.9, which implements the PhantomJS web driver (a headless browser, i.e. a browser without a graphical interface).
In the following example, we will try to log in to a blog website.
Download and install the web driver:
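A minimal sketch of this step, using Rcrawler's helper that downloads PhantomJS:

```r
library(Rcrawler)

# Download and install the PhantomJS web driver used by Rcrawler
install_browser()
```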
Run the browser session:
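Something like the following starts a headless browser process and returns a handle to it; keep the returned object, as it is passed to the login and scraping functions later:

```r
library(Rcrawler)

# Start a PhantomJS browser session
br <- run_browser()
```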
If you get an error, then disable your antivirus or allow the program in your system settings.
Run an automated login action; it returns a logged-in session if successful:
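A sketch of the login step. The login URL, credentials, and CSS selectors below are placeholders assuming a WordPress-style login form; adapt them to the target site's actual form fields:

```r
# Log in through the headless browser session started earlier;
# the selectors (#user_login, #user_pass, #wp-submit) are placeholders
# for a WordPress-style login page
br <- LoginSession(Browser = br,
                   LoginURL = "http://www.glofile.com/wp-login.php",
                   LoginCredentials = c("username", "password"),
                   cssLoginFields = c("#user_login", "#user_pass"),
                   cssLoginButton = "#wp-submit")
```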
Finally, if you already know the private pages you want to scrape/download, use:
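A sketch using ContentScraper with the logged-in session; the URL is a placeholder, and the CSS patterns are the ones from the question:

```r
# Scrape a known private page, reusing the logged-in browser session;
# the Url below is a placeholder
DATA <- ContentScraper(Url = "http://www.glofile.com/private-page/",
                       CssPatterns = c(".entry-title", ".entry-content"),
                       PatternsName = c("Title", "Content"),
                       browser = br)
```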
Or, simply crawl/scrape/download all pages:
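A sketch of a full crawl, passing the logged-in session so that password-protected pages are reachable; the extraction patterns are taken from the question:

```r
# Crawl the whole site with the logged-in session
Rcrawler(Website = "http://www.glofile.com/",
         no_cores = 1, no_conn = 1,
         LoggedSession = br,
         ExtractCSSPat = c(".entry-title", ".entry-content"),
         PatternsNames = c("Title", "Content"))
```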
Don't use multiple parallel no_cores/no_conn, as many websites reject multiple sessions from one user. Stay legit and honor robots.txt by setting Obeyrobots = TRUE.
You can access the browser functions, like:
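For example, the session object exposes the underlying webdriver methods; the calls below are illustrative:

```r
# Inspect the state of the headless browser session
br$session$getUrl()                           # current page URL
br$session$getTitle()                         # current page title
br$session$takeScreenshot(file = "page.png")  # save a screenshot to disk
```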