How can we extract information from subdomain using Rcrawler in R?


I want to extract the content of a particular page of a website, starting from the site's main URL.

I tried using Rcrawler:

library(Rcrawler)

Rcrawler(Website = "http://www.xbyte-technolabs.com/", no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))

After running this code I got the default INDEX variable, which lists all the URLs of the website. One of them is "http://xbyte-technolabs.com/contact_us.php", and I want to extract the contact details from it.

Can someone please guide me on how to get to this particular URL from the main URL "http://xbyte-technolabs.com/" using Rcrawler in R?


Otto Kässi On BEST ANSWER
library(Rcrawler)
Rcrawler(Website = "http://www.xbyte-technolabs.com/", no_cores = 1, no_conn = 1, ExtractCSSPat = c(".address"))

pageid <- as.numeric(INDEX$Id[INDEX$Url == 'http://xbyte-technolabs.com/contact_us.php'])
DATA[pageid]

According to ?Rcrawler, Rcrawler creates two global variables:

  • INDEX: A data frame in the global environment representing the generic URL index, including the list of fetched URLs and page details (content type, HTTP state, number of out-links and in-links, encoding type, and level), and

  • DATA: A list of lists in the global environment holding the scraped contents.

The Id variable in INDEX corresponds to the position of the list element in DATA. The code snippet above looks up the Id of the URL you are interested in and uses it to index DATA.
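To make the lookup concrete without running a crawl, here is a minimal sketch using toy stand-ins for INDEX and DATA. The names toy_index and toy_data, the about_us.php URL, and the page contents are all made up for illustration; only the Id and Url columns and the contact_us.php URL come from the question. The real elements of DATA may be shaped differently.

```r
# Toy stand-in for INDEX: one row per fetched page (values are invented).
toy_index <- data.frame(
  Id  = c("1", "2", "3"),
  Url = c("http://xbyte-technolabs.com/",
          "http://xbyte-technolabs.com/about_us.php",   # hypothetical URL
          "http://xbyte-technolabs.com/contact_us.php"),
  stringsAsFactors = FALSE
)

# Toy stand-in for DATA: a list of lists, one element per page.
toy_data <- list(
  list("home page content"),
  list("about page content"),
  list("contact page content")
)

# Find the Id of the page of interest, then use it as a position in the list.
target  <- "http://xbyte-technolabs.com/contact_us.php"
page_id <- as.numeric(toy_index$Id[toy_index$Url == target])

# Guard against a URL that is missing or listed more than once.
stopifnot(length(page_id) == 1)

contact <- toy_data[[page_id]]
print(contact)
```

Note that the URL has to match the string stored in INDEX exactly. In the question the crawl starts at the "www." address while the URL listed in INDEX has no "www.", so it is worth copying the URL from INDEX rather than typing it.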

Sidenote: since you already know the URL you are looking for, crawling through the whole website seems like overkill.
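One way to skip the crawl is to scrape the single page directly. The sketch below assumes your installed version of Rcrawler provides ContentScraper with Url and CssPatterns arguments, as its help page describes; scrape_address is a hypothetical wrapper name, not part of the package, and I have not checked the result against the live site.

```r
# Hypothetical helper (not part of Rcrawler): scrape one known page
# without crawling the rest of the site.
scrape_address <- function(url, css = ".address") {
  # ContentScraper fetches a single page and extracts the content
  # matched by each CSS pattern.
  Rcrawler::ContentScraper(Url = url, CssPatterns = c(css))
}

# Example call (needs the Rcrawler package and network access):
# contact <- scrape_address("http://xbyte-technolabs.com/contact_us.php")
# contact
```

The ".address" selector is the one used in the question; if the contact page marks up its details differently, the pattern would need to change.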

Premal On
library(Rcrawler)

Rcrawler("http://www.xbyte-technolabs.com/",no_cores = 4,no_conn = 4)

for (i in length(INDEX)) {
  for (j in nrow(INDEX)) {

    Rcrawler(Website = INDEX[[i]][j], no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))

  }

}
#Rcrawler(Website = INDEX[[i]][23], no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))
class(DATA)
head(DATA)

ad <- DATA[[1]]
ad <- as.character(ad)
cat(ad)

Sorry, I think something is wrong with this code. Running it gives the following error:

Error in strsplit(gsub("http://|https://|www\.", "", Website), "/")[[c(1, : subscript out of bounds
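A possible cause, offered as a guess rather than a confirmed diagnosis: for (i in length(INDEX)) does not loop over 1, 2, 3, ... but runs once with i equal to the number of columns, and for (j in nrow(INDEX)) likewise runs once with the last row. INDEX[[i]][j] is then a single cell from the last column, which is probably not a URL, so Rcrawler may fail while parsing it. The base R sketch below shows the loop behaviour on a made-up data frame (loop_index and its values are invented for illustration).

```r
# Toy data frame standing in for INDEX; the last column is not a URL.
loop_index <- data.frame(
  Id    = c("1", "2"),
  Url   = c("http://example.com/", "http://example.com/contact"),
  Level = c(0, 1),
  stringsAsFactors = FALSE
)

# length() of a data frame is its column count, so this loop body
# runs exactly once, with i equal to 3.
visited <- c()
for (i in length(loop_index)) {
  visited <- c(visited, i)
}
print(visited)

# Looping over the Url column itself visits every fetched URL.
urls <- c()
for (u in loop_index$Url) {
  urls <- c(urls, u)
}
print(urls)
```

If a loop over all rows is really wanted, seq_len(nrow(INDEX)) or iterating over INDEX$Url directly avoids the single-iteration problem.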