Rcrawler - How to crawl account/password protected sites?


I am trying to crawl and scrape a website's tables. I have an account with the website, and I found that Rcrawler could help me get parts of the table based on specific keywords, etc. The problem is that the GitHub page makes no mention of how to crawl a site with account/password protection.

An example for signing in would be below:

login <- list(username = "username", password = "password")

Do you have any idea if Rcrawler has this functionality? For example something like:

Rcrawler(Website = "http://www.glofile.com",
         login = list(username = "username", password = "password"),
         no_cores = 4, no_conn = 4,
         ExtractCSSPat = c(".entry-title", ".entry-content"),
         PatternsNames = c("Title", "Content"))

I'm confident my code above is wrong, but I hope it gives you an idea of what I want to do.

1 Answer

Answered by SalimK:

To crawl or scrape password-protected websites in R (more precisely, sites using HTML-based authentication), you need a web driver to simulate a login session. Fortunately, this is possible since Rcrawler v0.1.9, which implements the PhantomJS web driver (a headless browser without a graphical interface).

In the following example we will log in to a blog website.

 library(Rcrawler)

Download and install the web driver:

install_browser()

Run the browser session

br <- run_browser()

If you get an error, disable your antivirus or allow the program in your system settings.

Run an automated login action; it returns a logged-in session if successful:

 br <- LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
                    LoginCredentials = c('demo', 'rc@pass@r'),
                    cssLoginFields = c('#user_login', '#user_pass'),
                    cssLoginButton = '#wp-submit')

Finally, if you already know the private pages you want to scrape/download, use:

DATA <- ContentScraper(..., browser = br)
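
For instance, here is a rough sketch that reuses the CSS selectors from the question; the post URLs are placeholders, and you should check ?ContentScraper for the exact argument names:

 # Sketch only: scrape specific private pages through the logged-in session.
 # The URLs below are placeholders; the CSS selectors come from the question.
 DATA <- ContentScraper(Url = c("http://glofile.com/?p=1", "http://glofile.com/?p=2"),
                        CssPatterns = c(".entry-title", ".entry-content"),
                        PatternsName = c("Title", "Content"),
                        browser = br)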

Or, simply crawl/scrape/download all pages

Rcrawler(Website = "http://glofile.com/", no_cores = 1, no_conn = 1, LoggedSession = br, ...)

Don't use many parallel cores/connections (no_cores / no_conn), as many websites reject multiple simultaneous sessions from a single user. Stay legit and honor robots.txt by setting Obeyrobots = TRUE, as in the sketch below.
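
Putting it together, a conservative crawl of the logged-in site could look like this sketch (single core and connection, robots.txt honored; the extraction patterns are the ones from the question and may need adjusting for your tables):

 # Sketch: crawl all reachable pages with the logged-in session, politely,
 # and extract titles/contents by CSS selector.
 Rcrawler(Website = "http://glofile.com/",
          no_cores = 1, no_conn = 1,
          LoggedSession = br,
          Obeyrobots = TRUE,
          ExtractCSSPat = c(".entry-title", ".entry-content"),
          PatternsNames = c("Title", "Content"))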

You can also access the browser session's functions directly, for example:

 br$session$getUrl()
 br$session$getTitle()
 br$session$takeScreenshot(file = "image.png")
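
When you are done, it is good practice to close the headless browser process; if I remember the package API correctly, stop_browser() does this:

 # Stop the PhantomJS process and release the session object.
 stop_browser(br)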